Tomato Pedicel Picking-Point Localization via Improved YOLOv8n-EED-Seg and RGB-D Fusion

Wu, Liping; Liu, Lilin; Teng, Dongdong

doi:10.3390/agriculture16111197


by Liping Wu ¹, Lilin Liu ¹,* and Dongdong Teng ²

¹ State Key Laboratory of Optoelectronic Materials and Technologies, School of Electronics and Information Technology, Sun Yat-sen University, No.135 Xin’gang West Road, Guangzhou 510275, China

² School of Physics, Sun Yat-sen University, No.135 Xin’gang West Road, Guangzhou 510275, China

* Author to whom correspondence should be addressed.

Agriculture 2026, 16(11), 1197; https://doi.org/10.3390/agriculture16111197

Submission received: 23 April 2026 / Revised: 16 May 2026 / Accepted: 20 May 2026 / Published: 29 May 2026

(This article belongs to the Special Issue Artificial Intelligence in Precision Agriculture: Applications in Crop Management)


Abstract

Accurate and rapid localization of tomato pedicel picking points presents a significant challenge for automated harvesting, due to factors such as occlusion by dense foliage, overlapping fruits, variable lighting conditions, and the slender morphology of pedicels. To address these, we propose an integrated picking decision system combining enhanced instance segmentation with RGB-D fusion. In this study, a lightweight detection model named YOLOv8n-EED-seg is introduced. An optimized EfficientRep backbone is integrated to enhance computational efficiency, while the EMAttention mechanism and a refined DynamicHead module strengthen multi-scale feature representation for slender pedicels. The model further incorporates the Zhang–Suen algorithm for skeleton extraction and a large-neighborhood mean method for depth restoration, enabling precise 3D localization. Experiments are conducted on a dataset of 3310 images collected in a greenhouse environment. Compared with the baseline YOLOv8n-seg, our model improves precision, recall, F1 score, and mAP₅₀ by 5.09%, 2.78%, 3.63%, and 4.31%, respectively. The system achieves an inference speed of 4.8 ms per frame, enabling real-time performance, while attaining a 93.88% success rate in 3D picking-point localization. Furthermore, the proposed model demonstrates superior robustness in complex environments compared with common segmentation models, effectively balancing accuracy, speed, and model complexity. This study provides a reliable technical pathway for high-precision, vision-based tomato-harvesting robots.

Keywords:

tomato picking; pedicel segmentation; picking point positioning; RGB-D camera; information fusion

1. Introduction

As a globally cultivated high-value crop, tomato production remains heavily dependent on manual labor [1]. While automated harvesting systems exist, they typically rely on one-time harvesting. The selective batch-picking approach based on fruit ripeness has yet to be fully automated. Labor scarcity and rising costs underscore the urgency for tomato-picking robots to ensure sustainable production. Central to such robots is a visual system capable of real-time fruit recognition and precise 3D localization [2]. However, field conditions such as leaf occlusion, variable environmental lighting, and fruit overlap pose significant challenges to reliable perception [3]. Traditional machine vision methods (e.g., color, texture, or shape-based thresholding) show limited robustness and generalization in unstructured environments, where pedicels often adhere to foliage or structures [4]. Dynamic illumination conditions significantly compromise the accuracy of RGB-based segmentation in greenhouse settings, with recent studies reporting performance degradation of up to 7% under varying light intensities [5], while single-feature approaches like Hu moments achieved only 65% accuracy in cross-cultivar tasks [6]. Moreover, cost-effective depth sensors often suffer from data loss; specifically, the largest missing regions average approximately 2.23% of the total image area [7], which can compromise positioning precision.

In information-based agriculture, deep learning-driven object detection is now the mainstream approach to fruit recognition, surpassing early limited-feature methods [8,9]. The emergence of CNN-based detectors (the YOLO and R-CNN series) has greatly enhanced both accuracy and speed. Yan et al. [10] adapted YOLOv5s for real-time apple detection, achieving an mAP of 86.75% on apple targets. Gai et al. [11] introduced TL-YOLOv8 for blueberries, integrating attention and reparameterization to accelerate training and enrich features, achieving 84.6% precision and 94.1% mAP. However, these methods target fruit-level detection and do not address the finer-grained challenge of pedicel localization, which requires sub-centimeter precision. Recent research has shifted toward instance segmentation for precise robotic picking. Wang et al. [12] used Mask R-CNN for overlapping apple segmentation, achieving 96.5% precision and 97.4% recall, yet its two-stage design requires 270 ms per image, making it unsuitable for real-time harvesting. Yuan et al. [13] developed a lightweight SSD variant for cherry tomato segmentation, balancing speed and accuracy, yet its single-shot design struggles with small, slender targets. Liu et al. [14] proposed YOLACTFusion, an attention-guided RGB-NIR fusion method for tomato stem detection, improving mAP from 39.20% to 46.29%; however, it relies on multimodal input not always available in greenhouses. Attention mechanisms in YOLO models further improved robustness under occlusion and varying light [15,16], exemplified by Song et al. [17], who integrated SegNext-attention into YOLOv8-seg for tomato segmentation and maturity classification, achieving 86.9% precision and 84.8% mAP. Nevertheless, these attention mechanisms focus on backbone enhancement rather than multi-scale fusion across detection heads, leaving room for improvement in pedicel segmentation.

Although instance segmentation provides richer 2D shape information, precise 3D spatial positioning remains a key challenge for automated picking. Yoshida et al. [18] used an RGB-D camera to obtain 3D tomato point clouds, applied region growing for clustering, and determined optimal picking points by integrating voxel connectivity, Mahalanobis distance, and pedicel geometry, achieving 90% success in 15 s, which limits real-time application. Zhang et al. [19] proposed Completion-BiPy-Disp, fusing bilateral filtering with pyramid models to restore missing depth in disparity maps. Yet over 15% of regions lacked reliable depth, with RMSE reaching 3–5 pixels (4–7 mm) on texture-sparse surfaces. Zheng et al. [20] combined RAFT-Stereo with improved YOLOv5 to segment and crop point clouds via masks, then fitted spheres to compute centroids and radii, yielding a mean absolute radius error of 2.4 mm. However, the systematic depth deviated up to 3.7 mm due to point cloud holes and noise. Rong et al. [21] fused RGB-D data for dynamic tomato cluster tracking using YOLOv5 and ByteTrack (mAP 94.5%), yet failed to identify pedicel cutting or grasping points.

Skeletonization of segmentation masks enables precise extraction of picking points by converting instance masks into one-pixel-wide topological skeletons that preserve the morphological characteristics of pedicels. This approach has been effectively applied to branch-like structure analysis in agricultural and forestry contexts, including tree branch reconstruction [22,23], root system phenotyping [24], and plant phenotyping [25], demonstrating improved localization accuracy over centroid-based methods. For depth recovery, we adopt a neighborhood-based compensation strategy to reconstruct missing depth data caused by sensor limitations by interpolating invalid regions using valid depth values from surrounding pixels. Learning-based depth completion methods typically require extensive training data, which limits their practicality for rapid deployment [26]. In contrast, conventional interpolation and filtering approaches are computationally efficient but often struggle with complex occlusions and texture-sparse surfaces, producing over-smoothed results that blur fine-grained pedicel structures [27]. To balance these trade-offs, neighborhood-based depth compensation strategies offer a lightweight alternative suitable for real-time harvesting applications while maintaining local depth consistency [28]. Table 1 summarizes the critical analysis of existing methods and the corresponding improvements proposed in this study.

Building on these insights, current research has advanced fruit recognition and positioning, but core challenges remain in incomplete 3D perception and limited accuracy. Most methods rely on idealized point cloud fitting, which is sensitive to depth loss and noise, or only perform cluster-level detection without providing precise operation points. Despite the progress reviewed above, three core scientific gaps remain unaddressed: lack of sub-centimeter pedicel localization, insufficient cross-scale feature fusion across detection heads, and poor depth completion robustness on weakly textured, elongated pedicel surfaces. To address these gaps, this work is built upon two design considerations: (1) the incorporation of multiple attention mechanisms along with an efficient backbone enhances pedicel segmentation accuracy without compromising real-time inference speed; (2) skeletonization of segmentation masks enables precise extraction of picking points, while a neighborhood-based depth compensation strategy effectively reconstructs missing depth data caused by sensor limitations. Guided by these considerations, we propose a tomato picking-point localization method that integrates a modified YOLOv8n-seg model with RGB-D fusion. Based on YOLOv8n-seg, we incorporate an optimized EfficientRep backbone, the EMAttention mechanism, and a refined DynamicHead module, tailored to pedicel morphology and model efficiency [29,30,31], yielding the YOLOv8n-EED-seg model. The pedicel mask is skeletonized to extract its main structure and derive picking-point coordinates [32]. The core contribution of this study lies in a systematic redesign of the detection pipeline for pedicel perception, rather than a simple assembly of existing components. In addition, specific improvements have been introduced to the existing EfficientRep and DynamicHead modules to better suit the characteristics of slender pedicel targets.

Specifically, the 8-direction shift convolution in the EfficientRep backbone enlarges the receptive field without adding parameters. The enriched features then feed into the EMAttention module for cross-scale fusion across detection heads. Finally, the refined DynamicHead decouples classification, regression, and segmentation tasks, avoiding gradient interference.

Beyond segmentation, we further address the challenge of missing depth data caused by sensor limitations on weakly textured, elongated pedicel surfaces. A large-neighborhood mean method is introduced to compensate for invalid depth values, enabling accurate 3D localization through RGB-D fusion. This depth compensation strategy, together with the cascaded feature refinement architecture, forms a complete perception pipeline from 2D segmentation to 3D localization. This problem-driven module reorganization has been validated through greenhouse harvesting experiments, bridging the gap between laboratory research and field applications. The main contributions of this work are as follows:

(1) This study integrates the EMAttention mechanism into the YOLOv8n-seg model to enhance the recognition and segmentation of small pedicels via cross-dimensional interaction and multi-scale feature calibration.

(2) To address the trade-off between inference speed and feature representation, this work introduces an improved EfficientRep lightweight backbone network. Furthermore, an improved DynamicHead module is employed to replace the original detection head. These modifications are expected to enhance feature representation and detection performance, thereby making the model more suitable for embedded deployment.

(3) A 3D positioning system integrating image segmentation, skeletonization analysis, and depth restoration algorithms is designed. This system achieves stable and high-precision localization of tomato picking points, providing a reliable visual perception solution for picking robots.

2. Materials and Methods

2.1. Image Dataset

Image data were collected at the Tianhe Smart Agriculture Park (Guangzhou, China) between 10 December 2024 and 26 March 2025, daily from 10:00 to 12:00. In the park's controlled glass greenhouse, two cherry tomato cultivars, ‘Israel Red Cluster’ (red-fruit) and ‘Yuekeda’ (yellow-fruit), are grown on a trellis system. To simulate real harvesting conditions, images were captured with a vivo S19 mobile phone held by a human operator walking along the greenhouse pathway. The main rear camera features a 1/1.55-inch CMOS sensor with 50 MP resolution. In total, 3310 tomato cluster images at a resolution of 3468 × 4624 pixels were collected. The operator held the phone at chest height (1.2–1.5 m) with shooting angles of 30°–60° relative to the horizontal plane, oriented to capture either front or side views of the pedicel axis. Working distances ranged from 10 to 70 cm. According to the EXIF metadata, the ISO was set to 200, shutter speed to 1/80 s, and white balance to 5000 K to maintain natural color reproduction under greenhouse illumination. No additional lighting or filters were used, ensuring the images reflect genuine greenhouse conditions.

To enhance model robustness, the dataset encompasses diverse illumination scenarios: direct sunlight, diffuse light, and backlighting. Following Lalonde et al. [33], illumination is classified by contrast and shadow patterns. We adapted these criteria to greenhouse conditions, where visible sky and reliable vertical shadows are absent. Illumination categories are quantified via statistical image analysis: direct sunlight is characterized by high contrast (standard deviation > 60) and high mean brightness (> 120); diffuse light by low contrast (standard deviation < 50) and uniform illumination; backlighting by a center-to-periphery brightness ratio < 0.77, calculated as the mean brightness of the central region (25% to 75% of image dimensions) divided by that of the peripheral region. This classification is performed via grayscale histogram analysis, as illustrated in Figure 1e. Detailed statistical validation of these thresholds (including mean brightness, standard deviation, confidence intervals, and p-values) is provided in Appendix A (Table A1).
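The threshold rules above can be read as a short decision procedure. The following is a minimal sketch in pure Python, not the authors' implementation: the function names, the order in which the rules are tested, the use of the population standard deviation as the contrast measure, and the "unclassified" fallback are assumptions made for illustration.

```python
from statistics import mean, pstdev


def center_periphery_ratio(gray):
    """Mean brightness of the central 25%-75% window divided by the rest.

    gray is a 2D list of brightness values (assumed 0-255).
    """
    h, w = len(gray), len(gray[0])
    r0, r1 = h // 4, (3 * h) // 4
    c0, c1 = w // 4, (3 * w) // 4
    center, periphery = [], []
    for r, row in enumerate(gray):
        for c, value in enumerate(row):
            inside = r0 <= r < r1 and c0 <= c < c1
            (center if inside else periphery).append(value)
    periphery_mean = mean(periphery)
    if periphery_mean == 0:
        return float("inf")  # avoid division by zero on an all-black border
    return mean(center) / periphery_mean


def classify_illumination(gray):
    """Apply the thresholds quoted in the text; rule order is an assumption."""
    pixels = [value for row in gray for value in row]
    mu, sigma = mean(pixels), pstdev(pixels)
    if center_periphery_ratio(gray) < 0.77:
        return "backlighting"
    if sigma > 60 and mu > 120:
        return "direct sunlight"
    if sigma < 50:
        return "diffuse light"
    return "unclassified"  # hypothetical fallback, not described in the text
```

Images whose contrast falls between the two standard-deviation thresholds match none of the stated rules, which is why the sketch needs an explicit fallback.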

Viewing angles are categorized based on the orientation of the primary tomato pedicel relative to the camera: front view (pedicel axis faces the camera) and side view (pedicel axis perpendicular to the camera). As summarized in Table 2, the dataset includes 1432 direct sunlight, 1091 diffuse light, and 787 backlighting images, comprising 1745 front views and 1565 side views.

Prior to manual annotation, a color-based pre-filtering step in the HSV color space was applied to identify images containing mature fruit [34]. Fruit regions were segmented using saturation and value thresholds (S > 60, V > 60) [34], while green leaves (hue 60°–80°) [35] and white background (S < 30, V > 200) were removed. Based on statistical analysis of randomly selected samples (103 red-fruit, 101 yellow-fruit images), red fruit is defined by hue in the range 0°–25°, and yellow fruit by hue in the range 33°–45°. Detailed statistical results, including means, standard deviations, confidence intervals, percentiles, and adopted ranges, are provided in Appendix A (Figure A1, Table A2). Images with a mature fruit area ratio below 0.5% were excluded from pedicel annotation, as they lack harvestable fruit. Following pre-filtering, 3230 valid images were retained for annotation. The remaining images were not discarded from the dataset; instead, they were retained to preserve dataset diversity and to enable the model to learn to identify images without harvestable pedicels.
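The pre-filtering rule can be illustrated with a small pure-Python sketch. It is not the authors' code: the function names are hypothetical, hue is assumed to be expressed in degrees, and saturation and value are assumed to lie on a 0–255 scale (the text does not state the scale; note that OpenCV, for example, stores 8-bit hue as degrees divided by two).

```python
RED_HUE = (0.0, 25.0)      # degrees, range adopted in the text for red fruit
YELLOW_HUE = (33.0, 45.0)  # degrees, range adopted in the text for yellow fruit


def is_mature_fruit(h, s, v):
    """True if one HSV pixel passes the S/V thresholds and a fruit hue range."""
    if s <= 60 or v <= 60:
        return False
    in_red = RED_HUE[0] <= h <= RED_HUE[1]
    in_yellow = YELLOW_HUE[0] <= h <= YELLOW_HUE[1]
    return in_red or in_yellow


def mature_fruit_ratio(hsv_image):
    """Fraction of pixels classified as mature fruit.

    hsv_image is a 2D list of (h, s, v) tuples.
    """
    pixels = [p for row in hsv_image for p in row]
    return sum(is_mature_fruit(*p) for p in pixels) / len(pixels)


def has_harvestable_fruit(hsv_image, min_ratio=0.005):
    """Apply the 0.5% area-ratio criterion from the text."""
    return mature_fruit_ratio(hsv_image) >= min_ratio
```

For instance, a 10 × 10 image of leaf-colored pixels with a single red pixel has a mature fruit ratio of 1%, which passes the 0.5% criterion.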

Pedicels were annotated at the instance level using ISAT v2 software. A pickable pedicel is defined as a pedicel connected to at least one mature fruit with a sufficiently visible structure to enable human annotation (i.e., the pedicel can be identified and its midline traced) under partial occlusion. Fully occluded or completely invisible pedicels were not annotated. After annotation, the RGB images were randomly split into training, validation, and test sets in an 8:1:1 ratio with a fixed random seed (42) to ensure reproducibility. Dataset statistics are summarized in Table 3. Representative annotation examples are provided in Appendix A (Figure A3).
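A reproducible 8:1:1 split with a fixed seed can be sketched as follows. This is an illustrative sketch only: the function name and the file names in the usage note are hypothetical, and the authors' tooling may shuffle or round differently.

```python
import random


def split_dataset(items, ratios=(8, 1, 1), seed=42):
    """Shuffle with a fixed seed, then cut into train/validation/test lists."""
    rng = random.Random(seed)  # local generator, leaves global state untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```

Calling `split_dataset([f"img_{i:04d}.jpg" for i in range(100)])` yields 80, 10, and 10 items, and repeated calls with the same seed return the same partition.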

2.2. The Proposed YOLOv8n-EED-Seg Model

To address the challenges in segmenting slender tomato pedicels, we propose YOLOv8n-EED-seg, an improved instance segmentation model based on YOLOv8n-seg. Three enhancements are introduced: an improved EfficientRep backbone for better hardware efficiency, an EMAttention mechanism for adaptive multi-scale feature fusion, and an improved DynamicHead module for detection accuracy. The overall model architecture is shown in Figure 2a. The C2f, Bottleneck, and CBS modules are illustrated in Figure 2b,c; detailed descriptions are provided in the figure caption. Appendix A (Table A4) provides a layer-by-layer architectural comparison between the baseline YOLOv8n-seg and the proposed YOLOv8n-EED-seg model.

2.3. Improved EfficientRep Network Architecture

The standard stacked convolutional backbone of YOLOv8n-seg suffers from restricted receptive field, limited multi-scale fusion, and high computational cost, which is problematic for segmenting slender tomato pedicels with low pixel proportion and sparse features.

To address these limitations, we propose an improved EfficientRep architecture. Specifically, we replace RepConv with 8-direction shift convolution [30], which achieves directional feature extraction by directly shifting pixels of the input feature map in memory space across eight spatial directions (Figure 3a). This yields the S-RepConv module, in which the multi-branch structure of the original RepConv is substituted with an 8-direction shift convolution-based multi-branch reparameterization structure [8]. During training, parallel multi-branch reparameterization is adopted, comprising 8-direction shift convolution branches with diverse shift scales and a 1 × 1 convolution branch, with branch outputs aggregated for feature fusion. SiLU is used as the activation function due to its smoother gradient and improved performance. During inference, all branches are fused into a single equivalent 8-direction shift convolution structure via the reparameterization strategy (Figure 3b), eliminating the nine weight multiplications per pixel required by a 3 × 3 convolution and maintaining the lightweight inference property.

The SPPF module (replacing SimSPPF), as shown in Figure 3c, consists of three sequential max-pooling layers with kernel sizes of 5 × 5, 9 × 9, and 13 × 13. Each pooling layer preserves the spatial dimensions through appropriate padding, and the outputs from each stage are concatenated along the channel dimension to achieve efficient multi-scale fusion, enhancing slender-target perception while lowering computational cost. The improved EfficientRep network architecture is illustrated in Figure 3d. Each block uses a Conv-BN-SiLU structure to strengthen feature robustness and generalization.
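To make the shift operation concrete, the sketch below moves each channel of a feature map by one pixel in one of eight directions, filling vacated positions with zeros. It is a simplified illustration under stated assumptions, not the S-RepConv module itself: the cyclic assignment of channels to directions, the zero padding, and the function names are all hypothetical, and the learned 1 × 1 branch and reparameterization are omitted.

```python
# Eight (dy, dx) offsets: the 3 x 3 neighborhood without the center.
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
              (0, 1), (1, -1), (1, 0), (1, 1)]


def shift_channel(channel, dy, dx):
    """Shift a 2D list by (dy, dx); positions shifted in from outside are 0."""
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx  # source pixel that lands on (y, x)
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = channel[sy][sx]
    return out


def eight_direction_shift(feature_map, step=1):
    """Shift channel i in direction i mod 8 (assumed assignment).

    feature_map is a list of channels, each a 2D list (C x H x W).
    """
    shifted = []
    for i, channel in enumerate(feature_map):
        dy, dx = DIRECTIONS[i % 8]
        shifted.append(shift_channel(channel, dy * step, dx * step))
    return shifted
```

Because the shift only relocates values, it introduces no weights; any learnable mixing would come from the convolution applied afterwards.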

2.4. EMAttention Mechanism

Tomato pedicels, typically slender in images, are often occluded by foliage and affected by lighting variations. Robust detection therefore requires models capable of capturing both local details and global contextual information across multiple scales. Accordingly, this study integrates the lightweight cross-spatial attention module EMAttention [29] into the feature fusion layer, as illustrated in Figure 3e. The EMA module captures both long- and short-range dependencies through multi-scale and cross-spatial learning without dimensionality reduction, thereby enhancing feature representation while preserving computational efficiency. Detailed mathematical derivations are provided in Appendix B.

2.5. Improved DynamicHead Module

In tomato pedicel segmentation, multi-scale and small-scale targets pose a challenge. The YOLOv8-seg detection head uses a decoupled architecture with parallel branches for box and mask prediction, processing three-scale FPN features independently. This design lacks adaptive multi-scale interaction, limiting its ability to capture fine-grained details of slender pedicels (typically smaller than 64 × 64 pixels) and fuse cross-scale context, leading to suboptimal segmentation for small or occluded targets.

To address this, an improved DynamicHead (DyHead) module (Figure 4b) replaces the original detection head. DyHead includes three core attention modules: scale-aware (π_L), spatial-aware (π_S), and task-aware (π_C). π_L dynamically fuses multi-scale features based on semantic relevance, while π_S enhances attention sparsity via deformable convolution. The core optimization focuses on π_C: traditional fully connected (FC) layers are replaced with an Efficient Channel Attention (ECA) module (Figure 4a). The ECA module operates on the principle of local cross-channel interaction without dimensionality reduction. It first performs global average pooling (GAP) on the input feature map to compress spatial information into channel-wise statistics, producing a 1 × 1 × C vector where C represents the number of input channels (C = 256 in this study). Subsequently, a 1D convolution kernel of size K is applied to capture local cross-channel interactions. The kernel size K is adaptively determined by the channel dimension according to the equation:

K = \left\lfloor \frac{\log_{2}(C)}{2} + \frac{1}{2} \right\rfloor_{odd}

(1)

where C is the number of input channels and the subscript “odd” indicates rounding to the nearest odd number. For the 256-channel feature maps used in tomato pedicel segmentation, this calculation yields K = 3.
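A short worked computation of Equation (1) may help here. For C = 256 the bracketed term is 8/2 + 1/2 = 4.5, whose floor is 4; since 4 is even, the neighboring odd values 3 and 5 are equally close, and the reported K = 3 corresponds to taking the lower one. The sketch below is illustrative only: the function name and the `tie` parameter are hypothetical, and the choice between the two odd neighbors is an assumption rather than something the text specifies (some ECA implementations instead move an even value up, which would give 5).

```python
import math


def eca_kernel_size(channels, gamma=2, b=1, tie="down"):
    """Adaptive 1D kernel size following Equation (1).

    tie selects which odd neighbor replaces an even result:
    "down" gives t - 1 (matches the reported K = 3 for C = 256),
    "up" gives t + 1.
    """
    t = math.floor(math.log2(channels) / gamma + b / gamma)
    if t % 2 == 1:
        return t
    k = t - 1 if tie == "down" else t + 1
    return max(k, 1)  # guard against non-positive sizes for tiny channel counts


if __name__ == "__main__":
    print(eca_kernel_size(256))            # lower odd neighbor of 4
    print(eca_kernel_size(256, tie="up"))  # upper odd neighbor of 4
```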

Compared with the original π_C module using two fully connected layers (256 × 256 each, totaling 65,536 parameters), the proposed ECA-based replacement reduces the number of parameters to 768 (K × C / 2), achieving a 98.8% reduction. This is accomplished by replacing dense FC layers with local 1D convolution and removing bias terms, capturing local dependencies via a sliding window. The original hard sigmoid in DyHead is replaced with standard sigmoid, as hard sigmoid’s piecewise linear approximation causes fine-grained channel weight loss, impairing weak feature capture for small-scale pedicels. Standard sigmoid’s smooth nonlinearity enables more precise adaptive channel weight learning, preserving these fragile features. Furthermore, a batch normalization layer is added after the ECA block to stabilize training. A residual fusion mechanism with learnable initial weights ([1, 0, 0, 0]) ensures stable initial output, prevents gradient vanishing, and enables adaptive weight allocation across classification, box regression, and segmentation tasks.

2.6. Skeletonization Processing

In the visual system of tomato-picking robots, the picking point is defined as a pixel on the pedicel midline. To accurately extract this key coordinate from the segmentation mask, our system employs the Zhang–Suen algorithm, which is simple to implement and computationally efficient [32]. This parallel thinning method iteratively removes boundary points meeting specific criteria via two alternating sub-iterations until convergence, yielding a single-pixel-width pedicel midline skeleton.

As illustrated in Figure 5, the algorithm evaluates each foreground pixel P_1(i, j) (marked in red) and its eight surrounding neighborhood pixels P_2 to P_9. In each iteration step, a pixel can be deleted only if it meets the following conditions simultaneously: the number of foreground pixels in the 8-neighborhood satisfies 2 ≤ B ≤ 6; the number of 0 → 1 transitions is one during sequential traversal of the neighborhood; there is at least one background pixel in the specific neighborhood combination.

The neighborhood-combination conditions for the first sub-iteration are:

\begin{cases} P_{2} \times P_{4} \times P_{6} = 0 \\ P_{4} \times P_{6} \times P_{8} = 0 \end{cases}

(2)

The conditions for the second sub-iteration are as follows:

\begin{cases} P_{2} \times P_{4} \times P_{8} = 0 \\ P_{2} \times P_{6} \times P_{8} = 0 \end{cases}

(3)

Equation (2) is used to remove pixels on the lower and right boundaries, while Equation (3) is used to remove pixels on the upper and left boundaries.
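The two sub-iterations can be sketched in a few lines of pure Python. This is a minimal illustration of Zhang–Suen thinning under stated assumptions, not the system's actual code: the function name is hypothetical, the mask is a 2D list of 0/1 values, and P_2 to P_9 are taken clockwise starting from the pixel directly above P_1. A production pipeline would more likely use a vectorized or library implementation.

```python
def zhang_suen_thinning(mask):
    """Thin a binary mask (2D list, nonzero = foreground) to a skeleton."""
    h, w = len(mask), len(mask[0])
    # Pad with a one-pixel background border so every pixel has 8 neighbors.
    img = [[0] * (w + 2)]
    img += [[0] + [1 if v else 0 for v in row] + [0] for row in mask]
    img += [[0] * (w + 2)]

    def neighbors(y, x):
        # P2..P9, clockwise from north.
        return [img[y - 1][x], img[y - 1][x + 1], img[y][x + 1],
                img[y + 1][x + 1], img[y + 1][x], img[y + 1][x - 1],
                img[y][x - 1], img[y - 1][x - 1]]

    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_delete = []
            for y in range(1, h + 1):
                for x in range(1, w + 1):
                    if img[y][x] == 0:
                        continue
                    n = neighbors(y, x)
                    p2, _, p4, _, p6, _, p8, _ = n
                    b = sum(n)  # foreground neighbors
                    # 0 -> 1 transitions in the cycle P2, P3, ..., P9, P2
                    a = sum(1 for i in range(8)
                            if n[i] == 0 and n[(i + 1) % 8] == 1)
                    if step == 0:   # Equation (2)
                        cond = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                    else:           # Equation (3)
                        cond = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                    if 2 <= b <= 6 and a == 1 and cond:
                        to_delete.append((y, x))
            # Deletions are applied only after the full pass (parallel thinning).
            for y, x in to_delete:
                img[y][x] = 0
            changed = changed or bool(to_delete)
    return [row[1:-1] for row in img[1:-1]]
```

Collecting deletions before applying them keeps each sub-iteration parallel in the sense of the original algorithm: every pixel is judged against the same snapshot of the image.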

2.7. Fusion of Depth Information for Picking-Point Localization

Depth information extraction is critical for acquiring 3D coordinates of picking points. Due to the slender morphology of tomato pedicels and the limited resolution of depth cameras, direct depth sampling from raw depth maps frequently leads to missing or invalid values. Conventional hole-filling strategies, such as region-growing algorithms, can only repair small, localized missing regions, failing to handle large-area depth loss caused by uneven illumination or specular reflection, as visualized in Figure 6. To overcome such drawbacks, this study adopts a large-neighborhood mean filtering strategy for depth completion, with the complete procedure detailed in Algorithm A1. In contrast to complex inpainting networks and variational optimization methods that demand high computational costs and extensive training data, the large-neighborhood mean method robustly estimates missing values using valid neighboring depth information. The depth completion process proceeds as follows. First, the binary mask obtained from pedicel segmentation is used to extract the depth set of the pedicel region via element-wise multiplication with the depth map. The mean depth z_1 of the pedicel region is then calculated. Let z be the original depth value at the picking point coordinates. The final depth z_P is determined by:

z_{P} = \begin{cases} z_{1}, & \text{if } z = 0 \\ z, & \text{if } z \neq 0 \text{ and } |z_{1} - z| \leq k \\ z_{1}, & \text{if } z \neq 0 \text{ and } |z_{1} - z| > k \end{cases}

(4)

where k is a reference threshold. The optimal value of k was determined through preliminary experiments under the specific harvesting environment, as detailed in Table A3 in Appendix A.
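Equation (4) amounts to a three-way selection between the raw depth and the regional mean. The sketch below illustrates it in pure Python and is not a reproduction of Algorithm A1: the function name is hypothetical, zero-valued (invalid) pixels are assumed to be excluded when computing z_1, and the threshold passed in the usage note is an arbitrary placeholder rather than the value reported in Table A3.

```python
def complete_depth(depth_map, mask, point, k):
    """Return the completed depth z_P at point = (row, col).

    depth_map and mask are 2D lists of equal shape; a depth of 0 marks a
    missing measurement. Returns None if the masked region has no valid depth.
    """
    # Depth set of the pedicel region; zeros are skipped (assumption).
    valid = [d
             for depth_row, mask_row in zip(depth_map, mask)
             for d, m in zip(depth_row, mask_row)
             if m and d > 0]
    if not valid:
        return None
    z1 = sum(valid) / len(valid)  # mean depth of the pedicel region
    row, col = point
    z = depth_map[row][col]
    if z == 0 or abs(z1 - z) > k:
        return z1  # missing or outlying sample: fall back to the mean
    return z       # raw sample agrees with the region: keep it
```

With a depth map `[[500, 0, 510], [0, 505, 0]]`, a mask covering the four pedicel pixels, and a placeholder threshold of 20, the missing sample at (0, 1) is replaced by the regional mean of 505.0, while the valid sample at (0, 0) is kept as 500.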

After depth completion, the RGB and depth images must be aligned to establish pixel-wise correspondence. Alignment is performed by warping the depth image into the RGB coordinate frame using the SDK’s align module, which relies on factory-calibrated intrinsic and extrinsic parameters. For an aligned pixel (u, v) with depth value d (mm), the 3D coordinate (X, Y, Z) (mm) is given by the pinhole model:

X = \frac{(u - c_{x}) \cdot d}{f_{x}}, \quad Y = \frac{(v - c_{y}) \cdot d}{f_{y}}, \quad Z = d

where f_x and f_y are the focal lengths, and (c_x, c_y) is the principal point. Detailed calibration parameters are listed in Appendix B.
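The back-projection is a direct transcription of the pinhole model. In the sketch below the function name is hypothetical and the intrinsic values used in the usage note are placeholders chosen for easy arithmetic, not the calibration parameters of Appendix B.

```python
def pixel_to_camera(u, v, d, fx, fy, cx, cy):
    """Back-project an aligned pixel (u, v) with depth d (mm) to (X, Y, Z) in mm."""
    x = (u - cx) * d / fx
    y = (v - cy) * d / fy
    return x, y, d
```

With placeholder intrinsics fx = fy = 500 and principal point (320, 240), a pixel at (420, 240) with a depth of 500 mm maps to (100.0, 0.0, 500): a 100-pixel horizontal offset at 500 mm corresponds to 100 mm when the focal length is 500 pixels.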

2.8. Experiment Environment and Model Evaluation

The experiments were performed on a Windows 10 system using the PyTorch 2.0+cu118 framework and Python 3.8. Hardware included a 24 GB NVIDIA GeForce RTX 3090 GPU and an Intel Xeon Platinum 8362 CPU with 60 GB RAM. Training was conducted for 300 epochs (determined empirically as the point where validation loss and mAP₅₀ stabilized beyond 250 epochs in preliminary experiments), requiring approximately 5 h. Detailed hyperparameters are listed in Table 4. All model training and comparisons were conducted under identical conditions. The training data was augmented with a consistent pipeline: Mosaic (multi-scale context enrichment), MixUp (linear interpolation with λ ∼ Beta(1.5, 1.5)), and random affine transformations (rotation ±15°, scaling 0.8 to 1.2, translation ±10%, shearing ±10°) to enhance spatial invariance. Model performance (Precision, Recall, F1-score, and mAP) was evaluated using the best checkpoint after 300 epochs of training. For each model, three independent training runs were conducted with different random seeds (42, 123, 456).
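The MixUp step can be illustrated at the pixel level. The sketch below only shows the linear interpolation with a Beta(1.5, 1.5) coefficient; it is not the training pipeline used in the experiments, the function names are hypothetical, and the handling of boxes and masks for the mixed image is left out.

```python
import random


def sample_mixup_lambda(alpha=1.5, beta=1.5, rng=random):
    """Draw the mixing coefficient from Beta(alpha, beta)."""
    return rng.betavariate(alpha, beta)


def mixup(image_a, image_b, lam):
    """Blend two equally sized 2D images: lam * a + (1 - lam) * b."""
    return [[lam * a + (1.0 - lam) * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(image_a, image_b)]
```

A Beta(1.5, 1.5) distribution is symmetric around 0.5 and places less mass near 0 and 1 than a uniform draw, so most mixed images contain a visible contribution from both sources.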

In this study, model performances were evaluated by four key metrics: mean average precision at 50% IoU threshold (mAP₅₀), Precision, Recall, and F1-score. Here, mAP₅₀ specifically denotes the mAP value computed with an Intersection over Union (IoU) threshold of 0.5.

Precision = \frac{TP}{TP + FP},

(5)

Recall = \frac{TP}{TP + FN},

(6)

F_{1}\text{-score} = 2 \times \frac{Precision \times Recall}{Precision + Recall},

(7)

where TP, FP, and FN denote true positives, false positives, and false negatives, respectively, and C represents the number of classes over which mAP is averaged.
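Equations (5)–(7) can be checked with a few lines of code. The sketch below is illustrative; the function name and the counts in the usage note are hypothetical and are not taken from the experiments.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute Precision, Recall, and F1-score from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

For example, 80 true positives with 20 false positives and 20 false negatives give a precision and recall of 0.8 each, and therefore an F1-score of 0.8 (up to floating-point rounding). Note that an F1-score averaged over several training runs need not equal the F1-score computed from the averaged precision and recall.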

3. Results and Analysis

To comprehensively assess the proposed tomato-picking decision system and validate its core improved modules, systematic experiments were conducted on self-constructed tomato pedicel RGB and RGB-D test datasets under real agricultural field conditions. The analysis quantifies each module’s contribution to pedicel recognition and picking-point localization, and elaborates on the integrated system’s operational mechanisms, focusing on the synergistic effects of its key techniques on slender pedicel detection and precise 3D localization in complex unstructured environments.

3.1. Ablation Experiments

This section evaluates the impacts of the improved EfficientRep backbone, the EMAttention module, and the improved Dyhead module through ablation experiments. All reported results are presented as mean values from three independent runs (random seeds: 42, 123, and 456). The stability and reproducibility of the experimental results are validated in Appendix C (Table A6), where all metrics are presented as mean ± standard deviation.

Table 5 summarizes the results of the ablation experiments on the proposed framework. The improved Dyhead alone (Row 4) enables adaptive fusion and task decoupling, increasing mAP₅₀ from 82.70% to 84.27% (+1.57%) with minimal computational overhead (model size: 6.6 MB to 7.1 MB; inference: 4.5 ms to 4.8 ms). EMAttention (Row 3) strengthens cross-scale feature aggregation under occlusion, further improving mAP₅₀ to 86.58% (+3.88%) and F₁-score to 85.46%, but increases model size to 7.8 MB and inference time to 5.0 ms (model size +18.2%, inference +11.1%). The improved EfficientRep backbone with Dyhead (Row 2) enlarges the receptive field without adding parameters, reducing model size to 7.0 MB and inference time to 4.7 ms while achieving an mAP₅₀ of 84.74%, enabling better feature extraction for slender pedicels.
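The figures quoted above mix two kinds of change: the accuracy gains are absolute differences in percentage points, whereas the model-size and inference overheads are relative to the baseline. The short sketch below reproduces both from the values in the text; the helper names are hypothetical.

```python
def point_gain(baseline, value):
    """Absolute difference, e.g. in percentage points of mAP50."""
    return round(value - baseline, 2)


def relative_change(baseline, value):
    """Relative change with respect to the baseline, in percent."""
    return round((value - baseline) / baseline * 100, 1)


if __name__ == "__main__":
    print(point_gain(82.70, 84.27))   # Dyhead-seg mAP50 gain over baseline
    print(point_gain(82.70, 86.58))   # EMAD-seg mAP50 gain over baseline
    print(relative_change(6.6, 7.8))  # EMAD-seg model-size overhead
    print(relative_change(4.5, 5.0))  # EMAD-seg inference-time overhead
```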

The full YOLOv8n-EED-seg model (Row 1) integrates all three enhancements and delivers the best performance: mAP₅₀ reaches 87.01% (+4.31% over baseline), Precision 92.08%, Recall 82.10%, and F₁-score 86.64%, with competitive FLOPs of 9.1 G and inference time of 4.8 ms. Relative to Dyhead-seg (Row 4), the full model achieves a 2.74% higher mAP₅₀ with negligible increases in model size (0.4 MB) and FLOPs (0.2 G). Relative to EMAD-seg (Row 3), it achieves higher accuracy accompanied by a smaller model size of 7.5 MB (compared to 7.8 MB) and faster inference of 4.8 ms (compared to 5.0 ms). Relative to ERD-seg (Row 2), it delivers substantial accuracy gains (mAP₅₀ +2.27%, precision +2.47%, F₁-score +3.07%) with only modest increases in model size (0.5 MB) and FLOPs (0.6 G). These results confirm that the three modules work synergistically: EfficientRep enlarges the receptive field without adding parameters, EMAttention strengthens cross-scale feature fusion under occlusion, and DynamicHead decouples tasks for fine-grained localization.

Figure 7 presents the qualitative ablation results across five model variants. In the first column (Israel Red Cluster pedicel segmentation), the baseline model achieves a confidence score of 0.78, which improves to 0.85 with DyHead, to 0.88 with EMAttention, and the full EED-seg model reaches the highest confidence score of 0.89. In the second column (apical pedicel segmentation), both EED-seg and EMAD-seg achieve 0.92, while ERD-seg, Dyhead-seg, and the baseline incorrectly split the pedicel into two instances due to branch interference, with confidence scores of 0.89, 0.85, and 0.85, respectively. In the third column (two harvestable pedicels), EED-seg achieves the best performance among all compared models. In the fourth column (Yuekeda cultivar with three pickable pedicels), EED-seg achieves the highest confidence scores across all instances (0.93, 0.87, 0.91), outperforming EMAD-seg (0.92, 0.81, 0.90), ERD-seg (0.92, 0.87, 0.82), Dyhead-seg (0.92, 0.85, 0.83), and the baseline (0.89, 0.85, 0.82). These results demonstrate that the full EED-seg model attains the highest overall recognition confidence among all evaluated approaches, exhibiting balanced and robust performance in segmenting both fine-grained and regular pedicels.

3.2. Performance Comparisons of Different Models on Target Detection Tasks

To evaluate the proposed YOLOv8n-EED-seg, we compared it with YOLOv9-seg [36], YOLOv11-seg [37], YOLACT [38], and Seg-rtdetr [39] under identical conditions. All reported results are presented as mean values from three independent runs with different random seeds (42, 123, and 456). The stability and reproducibility of the experimental results are further validated in Appendix C (Table A7), where all metrics are presented as mean ± standard deviation, with the best overall performance per metric across all models highlighted in bold. As summarized in Table 6, the proposed model (Row 1) achieves the best performance across all accuracy metrics: mAP₅₀ of 87.1%, precision of 92.08%, and F₁-score of 86.82%. It outperforms YOLOv9-seg by 4.8% in mAP₅₀, 6.03% in precision, and 4.49% in F₁-score; outperforms YOLOv11-seg by 3.77%, 5.18%, and 3.08%; outperforms YOLACT by 4.3%, 4.94%, and 3.68%; and outperforms Seg-rtdetr by 3.06%, 5.41%, and 3.06%, respectively.

In terms of computational efficiency, the proposed model requires 9.1 G FLOPs (Column 7) and 7.5 MB of parameters (Column 6), with an inference speed of 4.8 ms per frame (Column 8). Compared to YOLOv9-seg (8.7 G, 6.4 MB, 4.9 ms) and YOLOv11-seg (8.5 G, 6.3 MB, 4.7 ms), it has modestly higher computational cost but achieves substantially better accuracy. Compared to YOLACT (32.4 G, 46.5 MB, 20 ms) and Seg-rtdetr (12.8 G, 11.3 MB, 5.1 ms), it is significantly more efficient.

Among the compared models, YOLOv9-seg and YOLOv11-seg are lightweight successors in the YOLO series, designed for efficient deployment but with limited accuracy for fine-grained pedicel segmentation. YOLACT is a real-time instance segmentation model that generates prototype masks, but its large model size (46.5 MB) and high computational cost (32.4 G FLOPs, 20 ms per frame) make it unsuitable for real-time harvesting applications. Seg-rtdetr adopts a Transformer-based architecture with multi-head self-attention and a hybrid encoder, achieving competitive accuracy (84.04% mAP₅₀) but at the cost of larger model size (11.3 MB) and slower inference (5.1 ms) compared to lightweight YOLO variants.

Although YOLOv11-seg offers the fastest inference (4.7 ms) and smallest model size (6.3 MB), its mAP₅₀ is only 83.24%, which is 3.77 percentage points lower than that of the proposed model (87.01%). This trade-off between lightweight deployment and detection accuracy is effectively balanced by the proposed YOLOv8n-EED-seg, which achieves superior accuracy while maintaining competitive efficiency.

Figure 8 qualitatively compares the inference results of different models in real greenhouse scenarios. Rows (a–e) correspond to YOLOv8n-EED-seg (proposed), YOLOv9-seg, YOLOv11-seg, YOLACT, and SEG-RTDETR, respectively. In Column 1 (Israel Red Cluster with two harvestable pedicels), the proposed model achieves the highest confidence scores (0.90, 0.87). In Column 2 (slender pedicel), only EED-seg and YOLACT succeed, with EED-seg achieving superior confidence scores (0.82, 0.85) versus YOLACT (0.72, 0.86), while EED-seg has a much smaller model size. In Column 3 (standard pedicel), EED-seg achieves the highest confidence score (0.86), matching YOLOv11-seg and outperforming others by 0.04–0.05. In Column 4 (Yuekeda with 60% occlusion), EED-seg achieves the best confidence scores for both the upper pedicel (surpassing others by 0.01–0.05) and the lower pedicel (0.93). These results demonstrate that EED-seg outperforms competing models across diverse challenging scenarios while maintaining a compact size (7.5 MB) and achieving an optimal balance between detection accuracy and computational efficiency.

Furthermore, the YOLOv8n-EED-seg model exhibits robust generalization across varying illumination conditions, with detailed results provided in Appendix D (Table A8).

3.3. Results on Picking-Point Localization

Figure 9 illustrates the complete recognition pipeline, including skeletonization and picking-point localization, based on a mobile-acquired test dataset. The process consists of five stages. In the first stage (data acquisition), RGB images $I_{rgb}$ and depth images $I_{depth}$ are synchronously captured using an Intel RealSense D455 RGB-D camera, where the RGB image provides texture and color information of the pedicel while the depth image directly provides the depth value (in mm) for each pixel. In the second stage (semantic segmentation), the RGB image is fed into the trained YOLOv8n-EED-seg model to generate a binary pedicel mask $M_{stem}$, with $M_{stem}(u, v) = 1$ indicating that pixel $(u, v)$ belongs to the pedicel region and $M_{stem}(u, v) = 0$ indicating background. The third stage is depth extraction, in which the depth information of the pedicel region is extracted by pixel-wise multiplication: $I_{depth\_stem}(u, v) = M_{stem}(u, v) \cdot I_{depth}(u, v)$. In the fourth stage (depth completion), the large-neighborhood mean method is applied to compensate for missing depth values, resulting in the completed depth map $I_{depth\_completed}$. In the fifth and final stage (coordinate transformation), for each pixel $(u, v)$ with completed depth value $d = I_{depth\_completed}(u, v)$, the RealSense SDK function rs2_deproject_pixel_to_point() is used to directly convert the pixel coordinates and depth value into 3D world coordinates, obviating the need for manual manipulation of intrinsic and extrinsic matrices.
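The depth-extraction and coordinate-transformation stages can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes an ideal pinhole camera without lens distortion, for which deprojection reduces to the arithmetic shown, and the intrinsic parameters (fx, fy, ppx, ppy) and the helper names mask_depth and deproject are hypothetical placeholders rather than calibrated D455 values or SDK calls.

```python
import numpy as np

def mask_depth(mask, depth_mm):
    """Stage 3: keep depth only where the binary pedicel mask is 1."""
    return mask.astype(depth_mm.dtype) * depth_mm

def deproject(u, v, depth_mm, fx, fy, ppx, ppy):
    """Stage 5 under a distortion-free pinhole model (an assumption).

    Returns (X, Y, Z) in millimetres, expressed in the camera frame.
    """
    x = (u - ppx) / fx
    y = (v - ppy) / fy
    return np.array([depth_mm * x, depth_mm * y, depth_mm], dtype=float)

# Hypothetical intrinsics and a toy 4x4 scene, for illustration only.
fx = fy = 600.0
ppx, ppy = 2.0, 2.0
depth = np.full((4, 4), 250.0)           # every pixel is 250 mm away
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 2] = 1                         # a thin vertical "pedicel"

depth_stem = mask_depth(mask, depth)     # zero outside the pedicel
# Images are indexed [row, column], i.e. [v, u].
point = deproject(u=2, v=1, depth_mm=depth_stem[1, 2],
                  fx=fx, fy=fy, ppx=ppx, ppy=ppy)
```

With a real sensor, the intrinsics would be read from the camera's stream profile instead of being hard-coded, and the distortion model reported by the device would also have to be taken into account.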

The pipeline demonstrates robust performance across diverse scenarios. In Column 1, the pedicel, fragmented by branch breakage, produces two independent skeletons; the picking point is defined as the midpoint between their endpoints, treating the disconnected parts as a single structure. Columns 2 and 3 exhibit accurate localization for the ‘Yuekeda’ cultivar despite morphological challenges—forward curvature (Column 2) and limited visibility (Column 3). Column 4 highlights the method’s superiority under favorable conditions. Collectively, these results validate the robustness and adaptability of the proposed approach across challenging greenhouse scenarios.

As emphasized in Section 2, depth information extraction is critical for acquiring the 3D coordinates of picking points. This paper introduces the large-neighborhood mean method to robustly estimate missing values using valid neighboring depth information. To determine the optimal threshold k for the large-neighborhood mean method, a comparative experiment is conducted. The evaluation is based on two metrics: first, the abnormal depth rejection rate, defined as the proportion of outliers correctly identified and replaced by the mean depth $z_{1}$; second, the picking localization error, measured as the distance deviation between the picking point computed from the restored depth and its actual position.
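The role of the threshold k can be illustrated with a small sketch. The exact formulation of the large-neighborhood mean method is given in Section 2 and is not reproduced here, so the code below is only one plausible reading, stated as an assumption: the mean of the valid (non-zero) depths in a large window is computed, and a pixel's depth is replaced by that mean when it is missing or deviates from it by more than k. The window size (half_window) and the function name are hypothetical.

```python
import numpy as np

def neighborhood_mean_depth(depth_cm, u, v, k_cm=2.5, half_window=7):
    """Return a restored depth for pixel (u, v).

    Assumed reading of the method: z1 is the mean of the valid
    (non-zero) depths in a large window around the pixel; the pixel's
    own depth is kept unless it is missing or differs from z1 by more
    than the threshold k.
    """
    h, w = depth_cm.shape
    r0, r1 = max(0, v - half_window), min(h, v + half_window + 1)
    c0, c1 = max(0, u - half_window), min(w, u + half_window + 1)
    window = depth_cm[r0:r1, c0:c1]
    valid = window[window > 0]
    if valid.size == 0:
        return None                      # nothing to restore from
    z1 = float(valid.mean())
    d = float(depth_cm[v, u])
    if d <= 0 or abs(d - z1) > k_cm:
        return z1                        # missing or abnormal: use the mean
    return d

# Toy depth map (cm): a flat surface at 25 cm with one hole and one outlier.
depth = np.full((15, 15), 25.0)
depth[7, 7] = 0.0     # depth hole
depth[3, 3] = 60.0    # abnormal reading, e.g. from the background

restored_hole = neighborhood_mean_depth(depth, u=7, v=7)
restored_outlier = neighborhood_mean_depth(depth, u=3, v=3)
restored_normal = neighborhood_mean_depth(depth, u=10, v=10)
```

In this simplified version the outlier itself still contributes to the window mean, which biases the result slightly. A small k rejects more readings as abnormal, while a large k lets more outliers through, which is the trade-off the threshold experiment examines.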

In the tomato greenhouse, the camera mounted on the robotic arm is typically positioned 20–30 cm from the target pedicel to balance depth measurement accuracy and operational safety. The fixed distance of 25 cm for 3D evaluation falls within this optimal range, ensuring the experimental setup is representative of actual harvesting conditions. This specific distance was determined empirically: the depth camera was fixed perpendicular to the greenhouse rail with an initial offset of 23 cm from the cultivation pot, then incrementally adjusted using real-time depth feedback until the final working distance of 25 cm was set. A total of 53 samples are used in the experiment; this sample size is determined based on the availability of representative pedicel instances with complete depth annotations across varying occlusion levels and lighting conditions. The 53 samples ensure statistical validity while maintaining manual measurement feasibility, covering diverse scenarios including slender pedicels (the bounding boxes of these pedicels are typically smaller than 32 × 32 pixels, following the MS COCO definition of small objects) and different cultivars (Israel Red Cluster and Yuekeda), thereby providing a robust basis for threshold optimization.

Table A3 summarizes the depth completion performance across different threshold values of k, from which k = 2.5 cm is identified as the optimal threshold, balancing a relatively high abnormal depth rejection rate (88.6%) and the lowest average localization error (1.05 cm). Taking the best threshold k = 2.5 cm, the corresponding localization results of picking-point depth information are presented in detail in Figure 10. The workflow begins with acquiring the original RGB image for 2D picking-point localization to accurately locate the target pedicel and its picking position. The original scene depth image is captured synchronously with the RGB image to record the initial depth data for 3D coordinate calculation. Finally, the large-neighborhood mean method is applied to compensate for missing depth values, yielding the precise 3D depth localization result of the picking point. In this randomly selected example, the obtained depth value is approximately 24.175 cm.

To quantify the localization accuracy of the proposed approach, a systematic evaluation was performed on a depth camera-captured test dataset comprising 324 images, which collectively contain 343 manually annotated pickable pedicels. During image acquisition, the distance between the depth camera and prominent primary pickable pedicels was deliberately fixed at 25 cm to establish uniform experimental conditions. A localization was considered successful if the Euclidean distance between the estimated and ground-truth positions was ≤15 mm, a tolerance determined based on the pedicel diameter (3–5 mm) and the end-effector grasping tolerance (∼10 mm). All test images were processed through the picking-point localization pipeline detailed in the preceding sections, including 2D image target detection for pedicel region identification, depth completion via the large-neighborhood mean method for missing depth data compensation, skeleton extraction for picking point derivation, and coordinate transformation for 3D positioning. Experimental results reveal that 322 of the 343 picking points are accurately localized within the 15 mm tolerance, yielding an overall success rate of 93.88%, with depth errors bounded to approximately ±1.2 cm. Detailed uncertainty analysis, including localization variability, depth sensor accuracy, and confidence estimation, is provided in Appendix D.2.
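The success criterion and the reported rate can be checked with a few lines of Python. This is a sketch for illustration: the two example coordinates are made up, and only the 15 mm tolerance and the counts 322 and 343 are taken from the text.

```python
import math

def localization_success(estimated_mm, ground_truth_mm, tolerance_mm=15.0):
    """True if the 3D Euclidean error is within the tolerance."""
    return math.dist(estimated_mm, ground_truth_mm) <= tolerance_mm

def success_rate(num_success, num_total):
    """Success rate as a percentage."""
    return 100.0 * num_success / num_total

# Two made-up picking points (mm) to show the criterion.
ok = localization_success((10.0, 5.0, 250.0), (4.0, 5.0, 242.0))    # error = 10 mm
bad = localization_success((10.0, 5.0, 250.0), (4.0, 5.0, 230.0))   # error > 15 mm

# Counts reported in the text: 322 successes out of 343 picking points.
rate = success_rate(322, 343)
```

Rounded to two decimals, the computed rate agrees with the 93.88% reported above.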

To further validate the effectiveness of the depth completion component, we conducted quantitative comparisons of the proposed large-neighborhood mean method with alternative approaches, including conventional interpolation methods (bilinear and bicubic interpolation) and a learning-based method (BP-Net). As shown in Table 7, our method achieves a localization RMSE of 1.05 cm and MAE of 0.81 cm at a working distance of 25 cm, with an inference time of 3.9 ms. Although BP-Net achieves slightly better accuracy (RMSE of 0.92 cm, MAE of 0.72 cm), it requires extensive training data and has a much higher inference time of 23.0 ms, making it unsuitable for real-time harvesting applications. In contrast, our method requires no training and is roughly six times faster than BP-Net while maintaining competitive accuracy. Compared to bilinear interpolation (RMSE 1.34 cm, MAE 1.27 cm, 3.2 ms) and bicubic interpolation (RMSE 1.28 cm, MAE 1.12 cm, 3.5 ms), our method achieves substantially better accuracy with a modest increase in inference time.
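For readers less familiar with the two error metrics used in Table 7, the following sketch shows how RMSE and MAE are computed from per-sample localization errors. The error values are invented for illustration and are not measurements from this study.

```python
import math

def rmse(errors):
    """Root-mean-square error: penalizes large deviations more strongly."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def mae(errors):
    """Mean absolute error: the average magnitude of the deviations."""
    return sum(abs(e) for e in errors) / len(errors)

# Hypothetical per-sample localization errors in cm.
errors_cm = [0.5, -1.0, 1.5, -0.2]

rmse_cm = rmse(errors_cm)
mae_cm = mae(errors_cm)
```

RMSE is never smaller than MAE on the same data, which matches the pattern in the values quoted above for every method.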

These results demonstrate that our method achieves a favorable balance between accuracy and efficiency, making it particularly suitable for real-time greenhouse harvesting applications where computational resources are limited. The high localization success rate (93.88%), coupled with the stringent depth error tolerance and efficient depth completion, thoroughly validates the robustness and reliability of the proposed method, confirming its suitability for real-world agricultural automation scenarios.

3.4. Real-Time Performance Evaluation on Edge Devices

To validate real-world deployability, we evaluated the proposed model on a Jetson Orin NX edge device (100 TOPS). As shown in Table 8, the baseline YOLOv8n achieves 7.6 ms (132 FPS), while our YOLOv8n-EED-seg achieves 9.3 ms (108 FPS). With full post-processing (mask extraction, skeletonization, and depth completion), the total time increases to 16.2 ms (62 FPS). Since 30 FPS is the standard real-time benchmark for embedded systems, both 108 FPS and 62 FPS far exceed this requirement, validating the feasibility of real-time picking-point localization in greenhouse environments. However, this level of performance requires at least 100 TOPS of computational power, which is provided by the Jetson Orin NX. Future work will focus on model compression techniques to enable deployment on lower-power devices such as the Jetson Nano.
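The frame rates quoted above follow directly from the per-frame latencies, as the short check below shows. The dictionary labels are descriptive names chosen here; the latencies and the 30 FPS benchmark are those stated in the text.

```python
def fps_from_latency(latency_ms):
    """Frames per second for a given per-frame latency in milliseconds."""
    return 1000.0 / latency_ms

latencies_ms = {
    "YOLOv8n baseline": 7.6,
    "YOLOv8n-EED-seg": 9.3,
    "YOLOv8n-EED-seg + post-processing": 16.2,
}

fps = {name: round(fps_from_latency(t)) for name, t in latencies_ms.items()}
meets_real_time = {name: value >= 30 for name, value in fps.items()}
```

The rounded results are 132, 108, and 62 FPS, all above the 30 FPS benchmark.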

4. Discussion

This study proposed a tomato pedicel picking-point localization method based on an improved YOLOv8n-EED-seg model and RGB-D fusion. The main contributions of this work are as follows: (1) The proposed YOLOv8n-EED-seg achieves 92.08% precision and 87.01% mAP₅₀ for pedicel segmentation. (2) The picking-point localization success rate reaches 93.88% with depth error bounded to ±1.2 cm. (3) The inference time is 4.8 ms per image, enabling real-time operation. (4) The model maintains robust performance under partial occlusion (success rate 0.89 with partial occlusion). These results support the effectiveness of our approach: (I) Combining backbone enhancement (EfficientRep), attention-guided fusion (EMAttention), and task decoupling (improved DynamicHead) yields synergistic improvements for pedicel-level perception. (II) Skeletonization of segmentation masks enables precise picking-point extraction, and neighborhood-based depth compensation effectively reconstructs missing depth data. To contextualize the proposed method, we compare it with existing approaches in terms of segmentation accuracy, real-time inference, and localization precision, as detailed below.

4.1. Comparison with Previous Work

Segmentation accuracy: The proposed YOLOv8n-EED-seg model achieves 87.1% mAP₅₀ and 92.08% precision for pedicel segmentation. Direct comparisons are confined to methods that specifically target pedicel or stem segmentation. In the domain of tomato stem detection, Liu et al. [14] reported 46.29% mAP₅₀ using YOLACTFusion, while for tomato main stem and lateral branch segmentation, Ji et al. [39] achieved 79.3% mAP₅₀ with EDI YOLO. The proposed method substantially outperforms both approaches, yielding advantages of 40.81 and 7.8 percentage points, respectively. For broader context, we also compare with fruit-level detection methods, although they address different tasks. Qin et al. [40] achieved 85.3% AP for cherry tomato detection using YOLO-CT, focusing on fruit-level recognition rather than pixel-wise pedicel segmentation. Similarly, Chen et al. [41] reported 87.2% precision for cherry tomato bunch identification, emphasizing ripeness classification rather than fine-grained pedicel segmentation. These reference comparisons are provided solely for contextual purposes and are not intended as strictly fair benchmarks.

Real-time inference: Our model achieves an inference time of 4.8 ms, which is 3125 times faster than Yoshida et al. [18] (15 s per image). This dramatic difference stems from fundamental methodological differences: Yoshida et al. relied on hand-crafted geometric features (region growing, voxel connectivity, and Mahalanobis distance) with time-consuming post-processing, whereas our end-to-end deep learning architecture performs all computations in a single forward pass. Similarly, our model is 56 times faster than Wang et al. [12] (270 ms), whose two-stage Mask R-CNN requires sequential region proposal and classification steps, and 2.8 times faster than Sun et al. [42] (13.5 ms), whose S-YOLO focused on fruit-level detection rather than fine-grained pedicel segmentation. Gao et al. [43] proposed YOLOR-Slim, achieving 87.5% mAP₅₀ with only 1.4 M parameters and 1.9 G FLOPs, yet their inference time on a workstation is 7.4 ms, which is slower than our 4.8 ms. Our inference speed is comparable to Chen et al. [41] (4.9 ms), but their MTD-YOLOv7 focused on ripeness classification rather than pedicel segmentation. Unlike the method of Yoshida et al. [18], which required a dedicated high-performance computing unit, and that of Wang et al. [12], which struggled with embedded deployment, our model was further tested on a Jetson Orin NX edge device (100 TOPS) to evaluate real-world deployability. On this platform, the model achieved 9.3 ms for detection and segmentation (approximately 108 FPS) and 16.2 ms with full post-processing (approximately 62 FPS), both exceeding the 30 FPS real-time requirement for greenhouse harvesting. However, deployment on Jetson Nano, Xavier, or Raspberry Pi has not been verified and is therefore deferred to future work.

Localization accuracy: Our method achieves a localization RMSE of 1.05 cm at a working distance of 25 cm, a typical range for greenhouse picking operations. Zheng et al. [20] reported a maximum average localization error of ±5.0 mm for tomato centroid localization using binocular stereo vision within 280–480 mm, and reported an RMSE of 4–7 mm on texture-sparse surfaces. Note that these comparisons are provided for reference only, as they involve distinct tasks and experimental conditions. For LiDAR-based systems, Wang et al. [44] developed an integrated LiDAR/IMU/GNSS navigation method for orchard robots, achieving a global localization accuracy of 2.215 cm (σ = 1 cm) in field tests. In contrast to stereo matching methods that are sensitive to illumination changes, our RGB-D fusion method provides stable depth information and does not rely on external positioning signals. These advantages are particularly critical for greenhouse environments, where lighting varies considerably throughout the day and LiDAR deployment is constrained by cost and environmental robustness.

Robustness under occlusion: The proposed model demonstrates strong generalization across diverse cultivars, viewing angles, and occlusion levels, achieving a success rate of 0.89 under partial occlusion. Under severe occlusion, the success rate drops to 0.69 in the worst-case scenario, a relative decline of about 22%. This performance is comparable to human performance under identical visual conditions, suggesting the model has reached a practical limit given single-view input. Compared with existing methods, Lu et al. [45] reported an 89.44% success rate for grape pedicel localization under occlusion, Ye et al. [46] achieved an 82% picking success rate for citrus fruit stem estimation, and Burusa et al. [47] achieved an 82.7% detection rate under occlusion using seven optimally planned viewpoints. These comparisons confirm that occlusion remains a fundamental challenge, motivating future work on multi-view fusion to enhance robustness. Representative examples of successful detections under varying occlusion levels are provided in Appendix D (Figure A4).

4.2. Limitations and Future Research Directions

Dataset limitations: The primary limitation is the limited dataset size and diversity. The depth completion threshold was optimized using only 53 samples, and the skeletonization evaluation used 500 samples. The lack of a public pedicel segmentation benchmark limits direct comparison with other methods. This issue is widespread: Jiang et al. [48] noted data scarcity for fine-grained agricultural targets, and Akbar et al. [49] highlighted limited dataset diversity as a key bottleneck in greenhouse vision systems. Our dataset includes only two cultivars under controlled conditions; performance on other cultivars or outdoor settings is unknown. Saiz-Rubio and Rovira-Más [50] emphasized that cross-cultivar and cross-environment data are essential yet rarely available. Image acquisition was also restricted to a narrow time window (10:00–12:00). We acknowledge that model performance under variable lighting has not been systematically evaluated. Images were captured across a wide distance range (10–70 cm), while the target application requires a constant 25–30 cm distance. To address this, we tested the model on 53 samples at 25 cm, achieving 91.87% precision and 82.10% mAP₅₀, consistent with the overall test set (92.08% precision, 87.1% mAP₅₀). Future work will include multi-temporal data collection (Wang et al. [51]) and a dedicated dataset at the target distance.

Occlusion handling: As shown in Figure 11a, the model fails under severe occlusion (success rate 0.69). This is because occluded pedicels lack visible features, making detection impossible from a single viewpoint. This is not a model deficiency but a fundamental limitation of vision-based perception. As reviewed by Xiao et al. [52], occlusion remains a core challenge in agricultural robotics, where fruits and pedicels are often obscured by branches and leaves, leading to reduced recognition accuracy and positioning failures. Thus, the robot needs to adjust its viewpoint or use multi-sensor fusion to resolve occlusions. Burusa et al. [47] proposed a semantics-aware next-best-view planning strategy, achieving an 82.7% detection rate under heavy occlusion using seven optimally planned viewpoints, significantly outperforming non-semantic methods. This demonstrates that single-view perception is inherently insufficient for occlusion-heavy environments. Beyond occlusion, other failure conditions include motion blur, reflective surfaces, and sensor drift. Motion blur from rapid camera or plant motion may degrade segmentation accuracy, though this has not been systematically quantified. Reflective surfaces can cause specular reflections, leading to invalid depth measurements. Sensor drift over extended operation may affect RGB-depth alignment, especially under temperature fluctuations. Future work will explore multi-view fusion, temporal information, and active perception to address these issues.

Depth completion threshold and error analysis: The optimal threshold (k = 2.5 cm) was determined at a fixed distance of 25 cm. The depth estimation error comprises a systematic error and a random error. Our method achieves a systematic error of <0.5 cm and a random error of ±0.3 cm. Depth holes are the primary source of random error that neighborhood-based compensation cannot fully eliminate. Compared with existing methods, our approach performs favorably. A depth completion survey noted that even high-end LiDAR sensors produce sparse and noisy depth maps near object boundaries [53]. SelfReDepth achieves real-time depth restoration but relies on multiple sequential depth frames coupled with color data [54]. In contrast, our method uses a simpler neighborhood-based compensation strategy at 25 cm to achieve high precision (<0.5 cm, ±0.3 cm), meeting millimeter-level picking requirements while maintaining computational efficiency. Validation at 15 cm and 35 cm showed stable performance (average localization error <1.2 cm, see Appendix D, Table A9). However, recalibration may be needed for distances outside this range or for different sensor configurations. As shown in Figure 11c, the depth map exhibits significant depth information loss (holes covering >15% of the pedicel neighborhood), affecting depth measurement reliability. Future work will explore adaptive thresholding and learning-based depth completion for severe depth holes, informed by recent studies [53,54].

In summary, while the proposed method achieves promising performance in pedicel segmentation and picking-point localization, several limitations remain regarding dataset diversity, occlusion robustness, and depth completion accuracy. Addressing these limitations through multi-temporal data collection, multi-view fusion, adaptive threshold mechanisms, and learning-based depth completion will be the focus of our future research.

5. Conclusions

We propose an innovative deep learning method for tomato pedicel picking-point localization based on an improved YOLOv8n-seg model and RGB-D fusion. In a greenhouse environment, we build a tomato planting scene and capture multi-view images using an RGB-D camera. To improve YOLOv8n-seg, we introduce an improved EfficientRep backbone with 8-direction shift convolution, integrate an EMAttention mechanism, and add an improved DynamicHead module. Compared to the original model, the proposed YOLOv8n-EED-seg achieves 92.08% precision (up 5.09%) and 87.01% mAP₅₀ (up 4.31%), while maintaining 7.5 MB and 4.8 ms inference. A skeletonization-based method is proposed to extract picking points from segmentation masks, and a neighborhood-based depth compensation strategy is developed to restore missing depth data. Applied to greenhouse tomato picking, the method achieves a 93.88% picking success rate with depth errors bounded to ±1.2 cm, meeting the speed and accuracy requirements of a tomato-harvesting robot. This method has currently been tested only on two tomato cultivars under controlled conditions; adaptability tests on more cultivars and outdoor scenes are needed. Nevertheless, the proposed method can serve as a reference for other fruit and pedicel recognition tasks in agricultural robotics.

Author Contributions

L.W.: Writing—original draft and editing, Methodology, Investigation; D.T.: Writing—review and editing, Methodology, Formal analysis, Supervision, Funding acquisition; L.L.: Writing—review and editing, Conceptualization, Methodology, Formal analysis, Supervision, Funding acquisition, Data curation. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the Guangdong Province Key R&D projects under Grant 2019B010154002.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgments

The authors would like to thank the anonymous reviewers for the insightful discussions.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1. Detailed Dataset Statistics and Validation

This appendix provides supplementary dataset statistics and validation results. These details support the dataset characterization but are not essential for understanding the core methodology.

Appendix A.1.1. Illumination Classification Validation

To validate the illumination classification thresholds defined in Section 2.1, statistical analysis was performed on 100 randomly sampled images per category. Table A1 summarizes the results.

Table A1. Statistical validation of illumination classification thresholds.

Illumination Category	Metric	Value (Mean ± Std)	95% CI
Direct sunlight	Mean brightness	$142.3 \pm 18.7$	$[138.6, 146.0]$
Direct sunlight	Standard deviation	$68.4 \pm 12.3$	–
Diffuse light	Mean brightness	$98.5 \pm 15.2$	$[95.4, 101.6]$
Diffuse light	Standard deviation	$42.1 \pm 8.5$	–
Backlighting	Center-to-periphery ratio	$0.51 \pm 0.09$	$[0.49, 0.53]$

The center-to-periphery ratio for backlighting (0.51 ± 0.09) is significantly lower than the adopted threshold of 0.77 (p < 0.001, one-sample t-test). These results confirm the robustness of the adopted thresholds.
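The backlighting row of Table A1 can be reproduced approximately from its summary statistics. The sketch below assumes n = 100 (the stated per-category sample size), treats the reported ± value as the sample standard deviation, and uses the normal critical value 1.96 for the interval rather than the exact t quantile; these are assumptions made here, not details confirmed by the source.

```python
import math

def one_sample_t(mean, std, n, mu0):
    """t statistic for testing whether the population mean equals mu0."""
    return (mean - mu0) / (std / math.sqrt(n))

def ci95_normal(mean, std, n):
    """Approximate 95% confidence interval using the normal quantile 1.96."""
    half_width = 1.96 * std / math.sqrt(n)
    return mean - half_width, mean + half_width

# Backlighting: center-to-periphery ratio 0.51 +/- 0.09, threshold 0.77.
t_stat = one_sample_t(mean=0.51, std=0.09, n=100, mu0=0.77)
ci_low, ci_high = ci95_normal(mean=0.51, std=0.09, n=100)
```

The resulting t statistic is strongly negative (about −28.9), consistent with the very small p-value reported, and the interval rounds to [0.49, 0.53] as listed in Table A1.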

Appendix A.1.2. Hue Threshold Statistical Analysis

To determine the optimal hue thresholds for discriminating mature fruit, hue frequency histograms were computed for randomly selected samples (103 red-fruit, 101 yellow-fruit images). Figure A1 shows the distribution of peak hue values for both cultivars.

Table A2 presents the complete statistical results, including sample sizes, means, standard deviations, 95% confidence intervals, percentiles, calculated ranges (μ ± 2σ), and adopted ranges for both cultivars.

Figure A1. HSV frequency histogram analysis of fruit regions. (a) Red-fruit cultivar (‘Israel Red Cluster’): mean dominant peak = 3.11° (σ = 4.95°), 5th/95th percentiles at 1.00° and 19.30°. (b) Yellow-fruit cultivar (‘Yuekeda’): mean dominant peak = 38.89° (σ = 2.92°), 5th/95th percentiles at 35° and 42°.

Table A2. Statistical analysis of fruit region hue peaks with confidence intervals.

Cultivar	Sample Size (n)	Mean ( $μ$ )	Std ( $σ$ )	95% Confidence Interval	5th Percentile	95th Percentile	Calculated Range ( $μ \pm 2 σ$ )	Adopted Range
Israel Red Cluster	103	$3.11 °$	$4.95 °$	$[2.14 °, 4.08 °]$	$1.00 °$	$19.30 °$	$- 6.8 °$ – $13.0 °$	$0 °$ – $25 °$
Yuekeda	101	$38.89 °$	$2.92 °$	$[38.31 °, 39.47 °]$	$35.00 °$	$42.00 °$	$33.0 °$ – $44.7 °$	$33 °$ – $45 °$
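The "Calculated Range" column of Table A2 is the mean plus or minus two standard deviations, which the following sketch recomputes from the tabulated means and standard deviations. The function names are illustrative only.

```python
def two_sigma_range(mean_deg, std_deg):
    """Lower and upper bounds of the mean +/- 2 sigma interval, in degrees."""
    return mean_deg - 2.0 * std_deg, mean_deg + 2.0 * std_deg

def in_range(hue_deg, low_deg, high_deg):
    """True if a hue value falls inside a closed interval."""
    return low_deg <= hue_deg <= high_deg

# Means and standard deviations from Table A2.
red_low, red_high = two_sigma_range(3.11, 4.95)         # Israel Red Cluster
yellow_low, yellow_high = two_sigma_range(38.89, 2.92)  # Yuekeda

# The red cultivar's 95th percentile (19.30 deg) lies outside the calculated
# range but inside the adopted range of 0-25 deg.
p95_in_calculated = in_range(19.30, red_low, red_high)
p95_in_adopted = in_range(19.30, 0.0, 25.0)
```

For the red cultivar the calculated lower bound is negative and the 95th percentile exceeds the calculated upper bound, whereas the adopted range of 0°–25° covers it; for the yellow cultivar the calculated and adopted ranges nearly coincide.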

Appendix A.1.3. HSV Histogram Analysis

Figure A2 presents the HSV histogram analysis and segmentation results for tomato fruit detection. The hue channel shows prominent peaks within the predefined red (0°–25°) and yellow (33°–45°) intervals, corresponding to mature red and ripening yellow tomato regions, respectively. For the two samples, the hue peaks in fruit regions reach approximately 0.045 and 0.08, which are 4.5 and 8 times the global average hue frequency (approximately 0.01), indicating a strong discriminative response to mature fruit.

Appendix A.1.4. Depth Completion Threshold Optimization

Table A3 presents the depth completion performance with different threshold values k. The optimal threshold k = 2.5 cm was selected as it balances a relatively high abnormal depth rejection rate (88.6%) and the lowest average localization error (1.05 cm).

Table A3. Depth completion performance with different k values.

Threshold k (cm)	Abnormal Depth Rejection Rate (%)	Average Localization Error (cm)
1.5	94.3	1.82
2.0	92.4	1.35
2.5	88.6	1.05
3.0	86.7	1.48
3.5	83.0	1.75

Figure A2. HSV frequency histogram and segmentation results for tomato fruit detection. (a) Red-fruit cultivar (‘Israel Red Cluster’): hue peaks within red intervals (0°–25°) reach approximately 0.045. (b) Yellow-fruit cultivar (‘Yuekeda’): hue peak within yellow interval (33°–45°) reaches approximately 0.08.

Appendix A.1.5. Annotation Examples

Figure A3 presents representative examples of pedicel annotations using ISAT software.

Figure A3. Representative examples of pedicel annotations using ISAT software. (a) Standard annotation of a single ‘Israel Red Cluster’ pedicel; (b) annotation of two ‘Israel Red Cluster’ pedicels; (c) annotation of a ‘Yuekeda’ pedicel partially occluded by fruits.

Appendix A.2. Layer-by-Layer Architectural Comparison

This appendix provides a layer-by-layer architectural comparison between the baseline YOLOv8n-seg and the proposed YOLOv8n-EED-seg model. These details support the reproducibility of the proposed method but are not essential for understanding the core methodology.

Table A4. Layer-by-layer architectural comparison between baseline YOLOv8n-seg and proposed YOLOv8n-EED-seg.

Stage	Baseline YOLOv8n-Seg	Proposed YOLOv8n-EED-Seg	Description of Modification
Backbone	Standard C2f modules (stages 1–4)	Improved EfficientRep with 8-direction shift convolution	Expands the receptive field without increasing parameters.
Neck	Concatenation + C2f	EMAttention after P3, P4, P5	Enhances cross-scale feature fusion under occlusion.
Detection Head	Original decoupled head (FC in $π_{C}$ )	Improved DynamicHead (ECA replaces FC)	Decouples tasks while reducing head parameters by 98.8%.
Post-processing	Mask decoding only	Zhang–Suen skeletonization + large-neighborhood depth completion	Enables 3D picking-point localization.

Appendix B

Appendix B.1. Detailed Mathematical Formulation of EMAttention

This appendix provides the complete mathematical derivation of the EMAttention mechanism [29].

Let $X \in \mathbb{R}^{C \times H \times W}$ denote the intermediate feature map, where C is the number of input channels and H and W are the spatial dimensions. The EMAttention module first partitions $X$ into G disjoint groups along the channel dimension:

$$X_{g} = \mathrm{Group}_{g}(X), \quad g = 1, 2, \dots, G \tag{A1}$$

where $X_{g} \in \mathbb{R}^{C/G \times H \times W}$ is the feature map of the g-th group.

The grouped feature maps $X_{g}$ are then fed into a multi-scale parallel subnetwork composed of three pathways. The $3 \times 3$ branch captures local contextual information:

$$F_{local} = \mathrm{Conv}_{3 \times 3}(X_{g}) \tag{A2}$$

The two $1 \times 1$ branches perform 1D global average pooling along the horizontal and vertical directions. For channel c of group g, with h and w indexing rows and columns:

$$P_{x}(h) = \mathrm{GAP}_{x}(X_{g}) = \frac{1}{W} \sum_{0 \leq i < W} x_{c,g}(h, i) \tag{A3}$$

$$P_{y}(w) = \mathrm{GAP}_{y}(X_{g}) = \frac{1}{H} \sum_{0 \leq j < H} x_{c,g}(j, w) \tag{A4}$$
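
To make the two pooling directions in Equations (A3) and (A4) concrete, here is a small sketch on a single-channel feature map stored as a nested list. The function name is illustrative; a real implementation would operate on batched tensors inside the network.

```python
def directional_average_pool(x):
    """Return (P_x, P_y) for a single-channel H x W feature map.

    P_x[h] averages row h over its W columns (horizontal pooling);
    P_y[w] averages column w over its H rows (vertical pooling).
    """
    height, width = len(x), len(x[0])
    p_x = [sum(row) / width for row in x]
    p_y = [sum(x[h][w] for h in range(height)) / height for w in range(width)]
    return p_x, p_y


feature_map = [[1.0, 2.0, 3.0],
               [4.0, 5.0, 6.0]]
p_x, p_y = directional_average_pool(feature_map)
print(p_x)  # one value per row (length H)
print(p_y)  # one value per column (length W)
```

Each output keeps one spatial axis intact, which is what lets the attention weights retain positional information along that axis.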

The pooled features are concatenated and processed by a shared $1 \times 1$ convolution. The output is factorized into two parallel 1D feature encoding vectors, activated by Sigmoid, and aggregated via element-wise multiplication.

The first spatial attention map is generated by:

$$A_{1} = \mathrm{Softmax}(\mathrm{GAP}_{2d}(\mathrm{GN}(P_{x}, P_{y}))) \odot F_{local} \tag{A5}$$

The second spatial attention map is generated by:

$$A_{2} = \mathrm{GAP}_{2d}(F_{local}) \odot (P_{x} \oplus P_{y}) \tag{A6}$$

The two attention maps are aggregated to yield the final attention map:

$$A = \sigma(A_{1} + A_{2}) \tag{A7}$$

Finally, the output feature map is obtained by re-weighting the original input:

$$X_{out} = A \otimes X \tag{A8}$$

The EMA module captures both long- and short-range dependencies through multi-scale and cross-spatial learning without dimensionality reduction, thereby enhancing feature representation while preserving computational efficiency.

Appendix B.2. Detailed Description of Large-Neighborhood Depth Completion

This appendix provides a detailed description of the large-neighborhood depth completion method used in this study.

Appendix B.2.1. Neighborhood Selection

For each pixel $(u, v)$ in the pedicel mask, a square neighborhood $N(u, v)$ of size $w \times w$ ($w = 15$) is defined, centered at $(u, v)$. This window size was empirically determined based on the average width of pedicel structures (approximately 15–20 pixels) and the typical size of depth hole regions (5–10 pixels). If the number of valid depth pixels within $N(u, v)$ is fewer than 5, the window size is incrementally expanded by 2 pixels until sufficient valid samples are obtained or a maximum size of $31 \times 31$ is reached.

Appendix B.2.2. Invalid Depth Identification

A depth value is classified as invalid if it meets one of the following criteria:

The value is equal to 0, which denotes missing data resulting from sensor limitations.
The value exceeds 2500 mm (2.5 m), a distance that is physically unrealistic for greenhouse harvesting operations given the nominal working distance of 25 cm.

Only valid depth values are incorporated into the mean calculation.

Appendix B.2.3. Threshold Parameter Determination

The optimal threshold k in Equation (4) was determined through a grid search on 53 annotated samples (Table A3, Appendix A). The search ranged from 0.5 cm to 5.0 cm with a step of 0.5 cm. The selection criterion minimized the average 3D localization error while maintaining an abnormal depth rejection rate above 85%. The optimal value was found to be k = 2.5 cm, which balances sensitivity to true depth variations and robustness against outliers.
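
The selection rule can be sketched as a constrained minimum over the grid-search results. The example below uses only the five rows reported in Table A3 (the remaining grid points are not listed), and the function name is illustrative.

```python
# Rows of Table A3: (k in cm, rejection rate in %, localization error in cm).
TABLE_A3 = [
    (1.5, 94.3, 1.82),
    (2.0, 92.4, 1.35),
    (2.5, 88.6, 1.05),
    (3.0, 86.7, 1.48),
    (3.5, 83.0, 1.75),
]


def select_threshold(rows, min_rejection=85.0):
    """Pick the row with the lowest error among rows meeting the rejection floor."""
    feasible = [row for row in rows if row[1] > min_rejection]
    if not feasible:
        raise ValueError("no threshold satisfies the rejection-rate constraint")
    return min(feasible, key=lambda row: row[2])


best_k, best_rejection, best_error = select_threshold(TABLE_A3)
print(best_k, best_rejection, best_error)
```

Here k = 3.5 cm is excluded by the 85% constraint, and k = 2.5 cm has the lowest error among the remaining candidates.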

Appendix B.2.4. Pseudocode of the Depth Completion Algorithm

Algorithm A1 Large-neighborhood depth completion

Input: depth map $I_{depth}$ (mm), pedicel mask $M_{stem}$, initial window size $w$, maximum window size $w_{\max}$, minimum valid count $min\_valid$, threshold $k$ (same unit as $I_{depth}$)
Output: completed depth map $I_{completed}$

$I_{completed} \leftarrow I_{depth}$
for each pixel $(u, v)$ where $M_{stem}(u, v) = 1$ do
    $w_{curr} \leftarrow w$
    repeat
        $valid \leftarrow \emptyset$
        define neighborhood $N(u, v)$ of size $w_{curr} \times w_{curr}$ centered at $(u, v)$
        for each $(i, j)$ in $N(u, v)$ do
            if $0 < I_{depth}(i, j) < 2500$ then
                $valid \leftarrow valid \cup \{I_{depth}(i, j)\}$
            end if
        end for
        $w_{curr} \leftarrow w_{curr} + 2$
    until $|valid| \geq min\_valid$ or $w_{curr} > w_{\max}$
    if $|valid| \geq min\_valid$ then
        $z_{1} \leftarrow avg(valid)$
        $z \leftarrow I_{depth}(u, v)$
        if $z = 0$ then
            $I_{completed}(u, v) \leftarrow z_{1}$
        else if $|z - z_{1}| > k$ then
            $I_{completed}(u, v) \leftarrow z_{1}$
        else
            $I_{completed}(u, v) \leftarrow z$
        end if
    end if
end for
return $I_{completed}$
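
A plain-Python rendering of Algorithm A1 is sketched below for readers who want to experiment with it. It is not the authors' implementation: images are nested lists, the window is clipped at the image border (an assumption, since border handling is not described), and depth and k are both expressed in millimetres, so the reported k = 2.5 cm becomes 25 mm. The valid set is rebuilt for each window size so that no pixel is counted twice.

```python
from statistics import mean


def complete_depth(depth, mask, w=15, w_max=31, min_valid=5, k=25.0, z_max=2500.0):
    """Fill holes and replace outliers inside the mask with a neighborhood mean.

    depth: 2D list of depth values in mm (0 marks missing data).
    mask:  2D list of 0/1 flags marking pedicel pixels.
    """
    rows, cols = len(depth), len(depth[0])
    completed = [row[:] for row in depth]
    for u in range(rows):
        for v in range(cols):
            if not mask[u][v]:
                continue
            w_curr, valid = w, []
            while w_curr <= w_max:
                half = w_curr // 2
                # Collect valid depths in the current window, clipped to the image.
                valid = [
                    depth[i][j]
                    for i in range(max(0, u - half), min(rows, u + half + 1))
                    for j in range(max(0, v - half), min(cols, v + half + 1))
                    if 0 < depth[i][j] < z_max
                ]
                if len(valid) >= min_valid:
                    break
                w_curr += 2  # expand the window and try again
            if len(valid) < min_valid:
                continue  # too few samples: leave the pixel unchanged
            z1 = mean(valid)
            z = depth[u][v]
            if z == 0 or abs(z - z1) > k:
                completed[u][v] = z1
    return completed


# Toy example: a 5 x 5 patch at 250 mm with one hole and one outlier.
depth = [[250.0] * 5 for _ in range(5)]
depth[2][2] = 0.0    # missing depth
depth[1][3] = 400.0  # abnormal depth
mask = [[0] * 5 for _ in range(5)]
for u, v in [(2, 2), (1, 3), (0, 0)]:
    mask[u][v] = 1
result = complete_depth(depth, mask, w=3, w_max=5, min_valid=3, k=25.0)
print(result[2][2], result[1][3], result[0][0])
```

In the toy example the small window parameters are chosen only to fit the 5 × 5 patch; the defaults follow the values given in this appendix.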

Appendix B.3. RGB-D Camera Calibration and Coordinate Conversion

This appendix provides the calibration parameters and coordinate conversion details for the RGB-D camera used in this study. The coordinate conversion described in Section 2.7 is implemented using the RealSense SDK, which automates the application of calibration parameters. The specific values used in this study are provided below as a reference for reproducibility. Readers using the same SDK do not need to manually apply these parameters.

Appendix B.3.1. Hardware Configuration

An Intel RealSense D455 RGB-D camera was used for data acquisition. The camera was configured with an RGB resolution of $640 \times 480$ pixels, a depth resolution of $640 \times 480$ pixels, and a frame rate of 30 fps. The working distance was fixed at 25 cm for 3D evaluation, and the depth sensor has a manufacturer-specified accuracy of $\pm 2\%$ at 2 m.

Appendix B.3.2. Calibration Parameters

The intrinsic and extrinsic parameters were obtained using the RealSense SDK 2.0 functions: rs2_get_intrinsics() for intrinsic parameters and rs2_get_extrinsics() for extrinsic parameters between the RGB and depth sensors. Table A5 summarizes the calibration values used in this study.

Table A5. Calibration parameters of the Intel RealSense D455 ($640 \times 480$ resolution).

Parameter	Symbol	Value
RGB focal length (x)	$f_{x}$	642.29 pixels
RGB focal length (y)	$f_{y}$	641.56 pixels
RGB principal point (x)	$c_{x}$	326.30 pixels
RGB principal point (y)	$c_{y}$	240.22 pixels
Depth-to-RGB rotation	$R$	$3 \times 3$ identity matrix
Depth-to-RGB translation	$t$	$[15.0, 0.0, 0.0]$ mm

Appendix B.3.3. RGB-Depth Alignment Procedure

Raw RGB and depth images are not pixel-aligned due to parallax between the two sensors. To achieve per-pixel correspondence, alignment was performed using the RealSense SDK’s align module. The depth image is warped into the RGB coordinate frame through the following steps.

First, for each pixel $(u_{d}, v_{d})$ in the raw depth image $D_{raw}$, its 3D point $P_{d} = (X_{d}, Y_{d}, Z_{d})$ in the depth sensor coordinate system is computed using the depth intrinsic parameters:

$$\begin{bmatrix} X_{d} \\ Y_{d} \\ Z_{d} \end{bmatrix} = \begin{bmatrix} \frac{u_{d} - c_{x}^{depth}}{f_{x}^{depth}} \cdot Z_{d} \\ \frac{v_{d} - c_{y}^{depth}}{f_{y}^{depth}} \cdot Z_{d} \\ Z_{d} \end{bmatrix}, \quad Z_{d} = D_{raw}(u_{d}, v_{d}) \tag{A9}$$

Second, the 3D point is transformed to the RGB camera coordinate system as $P_{rgb} = R \cdot P_{d} + t$, where $R$ and $t$ are the extrinsic parameters from Table A5.

Third, the transformed 3D point $P_{rgb} = (X_{rgb}, Y_{rgb}, Z_{rgb})$ is projected onto the RGB image plane using the RGB intrinsic parameters to obtain the aligned pixel coordinates $(u_{align}, v_{align})$:

$$\begin{bmatrix} u_{align} \\ v_{align} \end{bmatrix} = \begin{bmatrix} \frac{f_{x}^{rgb} \cdot X_{rgb}}{Z_{rgb}} + c_{x}^{rgb} \\ \frac{f_{y}^{rgb} \cdot Y_{rgb}}{Z_{rgb}} + c_{y}^{rgb} \end{bmatrix} \tag{A10}$$

This process produces an aligned depth map $D_{align}$ in which each pixel corresponds to the same spatial location as in the RGB image $I_{rgb}$. The SDK’s align module optimizes this process for real-time performance.
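
The three steps (deprojection, rigid transform, projection) can be traced for a single pixel with the sketch below. It uses the RGB intrinsics and extrinsics from Table A5; the depth intrinsics are not listed there, so the code reuses the RGB values purely as a placeholder assumption. The function names are illustrative and this is not the SDK's implementation, which should be preferred in practice.

```python
# RGB intrinsics (pixels) and depth-to-RGB extrinsics (mm) from Table A5.
RGB_FX, RGB_FY, RGB_CX, RGB_CY = 642.29, 641.56, 326.30, 240.22
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
T_MM = [15.0, 0.0, 0.0]

# Assumption: depth intrinsics are not given in the text; the RGB values
# are reused here only so the example runs.
DEPTH_FX, DEPTH_FY, DEPTH_CX, DEPTH_CY = RGB_FX, RGB_FY, RGB_CX, RGB_CY


def deproject(u, v, z):
    """Equation (A9): depth pixel plus depth value -> 3D point in the depth frame."""
    return ((u - DEPTH_CX) / DEPTH_FX * z, (v - DEPTH_CY) / DEPTH_FY * z, z)


def to_rgb_frame(p):
    """P_rgb = R * P_d + t."""
    return tuple(sum(R[r][c] * p[c] for c in range(3)) + T_MM[r] for r in range(3))


def project(p):
    """Equation (A10): 3D point in the RGB frame -> RGB pixel coordinates."""
    x, y, z = p
    return (RGB_FX * x / z + RGB_CX, RGB_FY * y / z + RGB_CY)


def align_pixel(u_d, v_d, z_mm):
    """Map one depth pixel to RGB pixel coordinates; None if depth is missing."""
    if z_mm <= 0:
        return None
    return project(to_rgb_frame(deproject(u_d, v_d, z_mm)))


# A point on the depth optical axis at the 25 cm working distance.
u_align, v_align = align_pixel(326.30, 240.22, 250.0)
print(round(u_align, 2), round(v_align, 2))
```

With these values, the 15 mm baseline shifts the pixel horizontally by about 38.5 pixels at 250 mm, which illustrates why unaligned RGB and depth images cannot be compared pixel by pixel at close range.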

Appendix C

Appendix C.1. Performance Variability of Ablation Study

This appendix presents the detailed performance variability of all models across three independent runs (random seeds: 42, 123, and 456). All metrics are reported as mean ± standard deviation. Table A6 summarizes the ablation study results for different model variants, with the best performance per metric within each variant bolded. Table A7 summarizes the model comparison results, with the best overall performance per metric across all models bolded.

Table A6. Performance variability of different model variants across three runs (seeds: 42, 123, 456). Results are reported as mean ± std. Best performance per metric within each variant is in bold.

Model	Metric	Seed 42	Seed 123	Seed 456	Mean ± Std
YOLOv8n-EED-seg	Precision (%)	91.98	92.15	92.11	92.08 ± 0.09
	Recall (%)	81.92	82.11	82.26	82.10 ± 0.17
	$F_{1}$ Score (%)	86.40	86.82	86.71	86.64 ± 0.21
	${mAP}_{50}$ (%)	86.92	87.10	87.01	87.01 ± 0.09
YOLOv8n-ERD-seg	Precision (%)	89.69	89.61	89.52	89.61 ± 0.09
	Recall (%)	81.64	81.02	81.86	81.51 ± 0.44
	$F_{1}$ Score (%)	83.83	83.42	83.45	83.57 ± 0.23
	${mAP}_{50}$ (%)	84.70	84.80	84.72	84.74 ± 0.05
YOLOv8n-EMAD-seg	Precision (%)	89.70	90.06	90.03	89.93 ± 0.20
	Recall (%)	82.40	82.73	82.13	82.42 ± 0.30
	$F_{1}$ Score (%)	85.30	85.45	85.62	85.46 ± 0.16
	${mAP}_{50}$ (%)	86.36	86.61	86.77	86.58 ± 0.21
YOLOv8n-Dyhead-seg	Precision (%)	87.76	88.12	88.06	87.98 ± 0.19
	Recall (%)	80.12	80.15	80.36	80.21 ± 0.13
	$F_{1}$ Score (%)	83.72	83.41	83.25	83.46 ± 0.24
	${mAP}_{50}$ (%)	84.21	84.10	84.50	84.27 ± 0.20
YOLOv8n-seg	Precision (%)	86.91	87.04	87.01	86.99 ± 0.07
	Recall (%)	79.43	79.21	79.33	79.32 ± 0.11
	$F_{1}$ Score (%)	82.95	83.06	83.01	83.01 ± 0.06
	${mAP}_{50}$ (%)	82.61	82.80	82.70	82.70 ± 0.10
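
The Mean ± Std column can be reproduced from the three per-seed values. The sketch below assumes the sample standard deviation (n − 1 in the denominator), which matches the tabulated figures for the rows checked here; the helper name is illustrative.

```python
from statistics import mean, stdev


def summarize(values):
    """Format per-seed results as 'mean ± sample standard deviation'."""
    return f"{mean(values):.2f} ± {stdev(values):.2f}"


# Per-seed results for YOLOv8n-EED-seg, copied from Table A6.
precision = [91.98, 92.15, 92.11]
recall = [81.92, 82.11, 82.26]

print("Precision (%):", summarize(precision))
print("Recall (%):", summarize(recall))
```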

Appendix C.2. Performance Variability of Model Comparison

Table A7 presents the performance variability of different models in the comparison experiment. Best overall performance per metric across all models is in bold.

Table A7. Performance variability of different models across three runs (seeds: 42, 123, 456). Results are reported as mean ± std. Best overall performance per metric across all models is in bold.

Model	Metric	Seed 42	Seed 123	Seed 456	Mean ± Std
YOLOv8n-EED-seg	Precision (%)	91.98	92.15	92.11	92.08 ± 0.09
	Recall (%)	81.92	82.11	82.26	82.10 ± 0.17
	$F_{1}$ Score (%)	86.40	86.82	86.71	86.64 ± 0.21
	${mAP}_{50}$ (%)	86.92	87.10	87.01	87.01 ± 0.09
YOLOv9-seg	Precision (%)	89.12	89.01	89.12	89.08 ± 0.06
	Recall (%)	79.64	79.22	79.98	79.61 ± 0.38
	$F_{1}$ Score (%)	82.33	82.11	81.98	82.14 ± 0.18
	${mAP}_{50}$ (%)	82.12	82.30	81.87	82.10 ± 0.22
YOLOv11-seg	Precision (%)	86.11	86.97	86.55	86.54 ± 0.43
	Recall (%)	80.22	80.78	80.47	80.49 ± 0.28
	$F_{1}$ Score (%)	83.52	83.12	83.74	83.46 ± 0.31
	${mAP}_{50}$ (%)	83.14	83.26	83.33	83.24 ± 0.10
YOLACT	Precision (%)	87.16	87.05	87.21	87.14 ± 0.08
	Recall (%)	80.17	80.25	80.32	80.25 ± 0.08
	$F_{1}$ Score (%)	83.14	83.11	83.15	83.13 ± 0.02
	${mAP}_{50}$ (%)	82.27	82.74	82.80	82.60 ± 0.29
SEG-RTDETR	Precision (%)	86.51	86.74	86.13	86.46 ± 0.31
	Recall (%)	80.44	80.09	79.33	79.95 ± 0.57
	$F_{1}$ Score (%)	82.91	83.76	83.11	83.26 ± 0.45
	${mAP}_{50}$ (%)	83.41	84.04	83.74	83.73 ± 0.32

Appendix D

Appendix D.1. Supplementary Results for Occlusion, Illumination, and Depth Completion

Appendix D.1.1. Occlusion Robustness

Figure A4 presents representative examples of successful detections under varying occlusion levels.

Figure A4. YOLOv8n-EED-seg validation under challenging greenhouse conditions. (a) Irregular ‘Israel Red Cluster’ pedicel with lateral fruits: complete segmentation from fruit cluster to main stem. (b) Narrow angle with partial leaf occlusion: harvestable pedicel identified, leaf regions excluded. (c) ‘Yuekeda’ with limited visibility: precise segmentation, no fruit inclusion.

Appendix D.1.2. Illumination Robustness

To evaluate the model’s robustness under different lighting conditions, its performance was analyzed separately on images captured under direct sunlight, diffuse light, and backlighting. As shown in Table A8, the model achieves the highest $\mathrm{mAP}_{50}$ under diffuse light (87.4%), followed by direct sunlight (86.7%) and backlighting (86.1%). The gap between the best and worst conditions is only 1.3 percentage points, indicating robust generalization across illumination variations.

Table A8. Model performance under different lighting conditions.

Illumination Condition	Precision (%)	Recall (%)	${mAP}_{50}$ (%)
Direct Sunlight	92.1	81.5	86.7
Diffuse Light	92.5	82.8	87.4
Backlighting	91.9	81.3	86.1

Appendix D.1.3. Depth Completion Performance at Different Working Distances

Table A9 presents the depth completion performance at different working distances. Among the three distances tested, the 25 cm working distance gives the highest abnormal depth rejection rate (88.6%) and the lowest average localization error (1.05 cm).

Table A9. Experimental results of depth completion at different working distances.

Distance	Number of Samples	Abnormal Depth Rejection Rate (%)	Average Localization Error (cm)
15 cm	48	86.2	1.12
25 cm	53	88.6	1.05
35 cm	49	87.1	1.09

Appendix D.2. Uncertainty Analysis for Robotic Harvesting

This appendix provides detailed uncertainty analysis to support the results reported in Section 3.3.

Appendix D.2.1. Localization Uncertainty

To quantify the variability of picking point localization, we repeated the 3D localization experiment 10 times on the same 53 validation samples. The Euclidean error between the estimated and ground-truth 3D positions was computed for each run. The standard deviation across runs was 0.8 mm, indicating that the localization result is stable and the uncertainty is low.

Appendix D.2.2. Depth Uncertainty

The Intel RealSense D455 depth sensor has a specified accuracy of $\pm 2\%$ at 2 m. At our working distance of 25 cm, the theoretical depth uncertainty is approximately $\pm 5$ mm. In practice, after applying our large-neighborhood mean depth completion, the empirical depth error RMSE was 1.05 cm with a standard deviation of $\pm 0.3$ cm (computed from the 53 validation samples). These values provide a quantitative bound for the robot’s motion planning.
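
The ±5 mm figure follows from scaling the relative accuracy to the working distance. As a one-line check, assuming the ±2% specification applies proportionally at 25 cm (an assumption, since the specification is quoted at 2 m):

```latex
\delta Z \approx 0.02 \times 250\ \mathrm{mm} = 5\ \mathrm{mm}
```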

Appendix D.2.3. Confidence Estimation

The segmentation model outputs a confidence score $C_{seg}$ for each detected pedicel instance. We define the picking confidence $C_{pick}$ as:

$$C_{pick} = C_{seg} \times R_{depth} \tag{A11}$$

where $C_{seg}$ is the segmentation confidence from YOLOv8n-EED-seg and $R_{depth}$ is the proportion of valid depth pixels in the pedicel region. In our test set (324 images, 343 pedicels), the average picking confidence was 0.94 with a standard deviation of 0.05. Based on this distribution, a confidence threshold of 0.85 is suggested to reject uncertain detections, reducing the risk of failed grasps.
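
Equation (A11) and the suggested threshold can be combined into a simple gating step, sketched below. The sketch assumes that a depth pixel counts as valid under the criteria of Appendix B.2.2 (non-zero and below 2500 mm); the function names are illustrative.

```python
def valid_depth_ratio(depth_values, z_max=2500.0):
    """R_depth: proportion of depth readings (mm) that are non-zero and below z_max."""
    if not depth_values:
        return 0.0
    valid = sum(1 for z in depth_values if 0 < z < z_max)
    return valid / len(depth_values)


def picking_confidence(c_seg, r_depth):
    """Equation (A11): C_pick = C_seg * R_depth."""
    return c_seg * r_depth


def accept(c_pick, threshold=0.85):
    """Keep a detection only if its picking confidence reaches the threshold."""
    return c_pick >= threshold


# Example: a confident mask whose region has a few missing depth readings.
region_depths = [248.0, 251.0, 0.0, 250.0, 249.0, 252.0, 250.0, 251.0, 249.0, 250.0]
r_depth = valid_depth_ratio(region_depths)
c_pick = picking_confidence(0.96, r_depth)
print(r_depth, round(c_pick, 3), accept(c_pick))
```

Because the two factors are multiplied, a confident mask over a region with many depth holes is still rejected, which is the intended safeguard against grasping at an unreliable 3D position.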

References

1. FAO. The Future of Food and Agriculture: Trends and Challenges; Food and Agriculture Organization of the United Nations (FAO): Rome, Italy, 2017.
2. Zhang, Q.; Su, W.-H. Real-Time Recognition and Localization of Apples for Robotic Picking Based on Structural Light and Deep Learning. Smart Cities 2023, 6, 3393–3410.
3. Ma, X.; Chang, J.; Chai, X.; Liu, Y.; Li, J. Evaluating the Canopy Light Environment, Photosynthesis, and Fruit Comprehensive Performance of Greenhouse Tomato under Different Mechanized Planting Layouts. HPJ, 2025; in press.
4. Mao, W.; Wang, Y.; Wang, Y. Real-Time Detection of Between-row Weeds Using Machine Vision. In 2003 ASAE Annual Meeting; American Society of Agricultural and Biological Engineers: St. Joseph, MI, USA, 2003; Paper No. 031004.
5. Wang, Q.; Hua, Y.; Lou, Q.; Kan, X. SWMD-YOLO: A Lightweight Model for Tomato Detection in Greenhouse Environments. Agronomy 2025, 15, 1593.
6. Terven, J.; Córdova-Esparza, D.; Romero-González, J. A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716.
7. Jeon, M.C.; Yoo, P.S.; Choi, S. Hole-Filling of RealSense Depth Images Using a Color Edge Map. IEEE Access 2020, 8, 53901–53914.
8. Parrish, E.A., Jr.; Goksel, A.K. Pictorial Pattern Recognition Applied to Fruit Harvesting. Trans. ASAE 1977, 20, 822–827.
9. Zhao, J.; Tow, J.; Katupitiya, J. On-Tree Fruit Recognition Using Texture Properties and Color Data. In Proceedings of the 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Edmonton, AB, Canada, 2–6 August 2005; pp. 263–268.
10. Yan, B.; Fan, P.; Lei, X.; Liu, Z.; Yang, F. A Real-Time Apple Targets Detection Method for Picking Robot Based on Improved YOLOv5. Remote Sens. 2021, 13, 1619.
11. Gai, R.; Liu, Y.; Xu, G. TL-YOLOv8: A Blueberry Fruit Detection Algorithm Based on Improved YOLOv8 and Transfer Learning. IEEE Access 2024, 12, 86378–86390.
12. Wang, D.; He, D. Fusion of Mask RCNN and Attention Mechanism for Instance Segmentation of Apples Under Complex Background. Comput. Electron. Agric. 2022, 196, 106864.
13. Yuan, T.; Li, H.; Liu, Y.; Wang, Z.; Zhang, W.; Li, W. Robust Cherry Tomatoes Detection Algorithm in Greenhouse Scene Based on SSD. Agriculture 2020, 10, 160.
14. Liu, C.; Li, W.; Yuan, T.; Wang, Z.; Zhang, W.; Li, H. YOLACTFusion: An Instance Segmentation Method for RGB-NIR Multimodal Image Fusion Based on an Attention Mechanism. Comput. Electron. Agric. 2023, 213, 108186.
15. Liang, X.; Wei, Z.; Chen, K. A Method for Segmentation and Localization of Tomato Lateral Pruning Points in Complex Environments Based on Improved YOLOV5. Comput. Electron. Agric. 2025, 229, 109731.
16. Shen, Q.; Wang, L.; Zhang, Y.; Liu, X. Multi-Scale Adaptive YOLO for Instance Segmentation of Grape Pedicels. Comput. Electron. Agric. 2025, 229, 109712.
17. Song, D.; Li, H.; Wang, Z.; Zhang, W.; Liu, Y. FGS-YOLOv8s-seg: A Lightweight and Efficient Instance Segmentation Model for Detecting Tomato Maturity Levels in Greenhouse Environments. Agronomy 2025, 15, 1687.
18. Yoshida, T.; Fukao, T.; Hasegawa, T. Cutting Point Detection Using a Robot with Point Clouds for Tomato Harvesting. J. Robot. Mechatron. 2020, 32, 437–444.
19. Zhang, L.; Wang, Z.; Li, S.; Liu, Y.; Zhang, W. Beyond Trade-Off: An Optimized Binocular Stereo Vision Based Depth Estimation Algorithm for Designing Harvesting Robot in Orchards. Agriculture 2023, 13, 1117.
20. Zheng, S.; Liu, Y.; Weng, W.; Jia, X.; Yu, S.; Wu, Z. Tomato Recognition and Localization Method Based on Improved YOLOv5n-seg Model and Binocular Stereo Vision. Agronomy 2023, 13, 2339.
21. Rong, J.; Wang, Z.; Li, S.; Liu, Y.; Zhang, W. Tomato Cluster Detection and Counting Using Improved YOLOv5 Based on RGB-D Fusion. Comput. Electron. Agric. 2023, 207, 107741.
22. Lin, C.; Fan, T.; Wang, W.; Nießner, M. Point2Skeleton: Learning Skeletal Representations from Point Clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 123–132.
23. Qiu, T.; Zoubi, A.; Spine, N.; Cheng, L.; Jiang, Y. (Real2Sim)⁻¹: 3D Branch Point Cloud Completion for Robotic Pruning in Apple Orchards. In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates, 14–18 October 2024; pp. 23–30.
24. Xu, C.; Huang, T.; Niu, Z.; Sun, X.; He, Y.; Qiu, Z. A Skeleton-Based Method of Root System 3D Reconstruction and Phenotypic Parameter Measurement from Multi-View Image Sequence. Agriculture 2025, 15, 343.
25. Wang, Y.; Liu, Q.; Yang, J.; Ren, G.; Wang, W.; Zhang, W.; Li, F. A Method for Tomato Plant Stem and Leaf Segmentation and Phenotypic Extraction Based on Skeleton Extraction and Supervoxel Clustering. Agronomy 2024, 14, 198.
26. Zhang, J.; Wu, Y.; Jiang, H. Survey on Monocular Metric Depth Estimation. Computers 2025, 14, 502.
27. Hu, J.; Bao, C.; Ozay, M.; Fan, C.; Gao, Q.; Liu, H.; Lam, T.L. Deep Depth Completion From Extremely Sparse Data: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 8244–8264.
28. Calderero, F.; Henríquez, P. Depth Completion Enhancement with Morphological Filtering and a Variation of the Infinity Laplacian. In Advanced Research in Technologies, Information, Innovation and Sustainability (ARTIIS 2024); Guarda, T., Portela, F., Augusto, M.F., Eds.; Springer: Cham, Switzerland, 2025; Volume 2348, pp. 34–42.
29. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. In Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
30. Weng, K.; Chu, X.; Xu, X.; Huang, J.; Wei, X. EfficientRep: An Efficient RepVGG-style ConvNets with Hardware-aware Neural Network Design. arXiv 2023, arXiv:2302.00386.
31. Dai, X.; Chen, Y.; Xiao, B.; Chen, D.; Liu, M.; Yuan, L.; Zhang, L. Dynamic Head: Unifying Object Detection Heads with Attentions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 7373–7382.
32. Zhang, T.Y.; Suen, C.Y. A Fast Parallel Algorithm for Thinning Digital Patterns. Commun. ACM 1984, 27, 236–239.
33. Lalonde, J.-F.; Efros, A.A.; Narasimhan, S.G. Estimating the Natural Illumination Conditions from a Single Outdoor Image. Int. J. Comput. Vis. 2012, 98, 123–145.
34. Lian, S.; Li, L.; Tan, W.; Tan, L. Research on Tomato Maturity Detection Based on Machine Vision. In Proceedings of the International Conference on Image, Vision and Intelligent Systems (ICIVIS 2021); Yao, J., Xiao, Y., You, P., Sun, G., Eds.; Springer: Singapore, 2022; Volume 813, pp. 679–690.
35. Núñez-Andrés, M.A.; Prades, A.; Buill, F. Vegetation filtering using colour for monitoring applications from photogrammetric data. In Proceedings of the 7th International Conference on Geographical Information Systems Theory, Applications and Management (GISTAM), Online, 23–25 April 2021; pp. 98–104.
36. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In Computer Vision—ECCV 2024; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2024; Volume 15089, pp. 1–21.
37. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725.
38. Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. YOLACT: Real-time Instance Segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9157–9166.
39. Ji, P.; Yang, N.; Lin, S.; Xiong, Y. EDI-YOLO: An Instance Segmentation Network for Tomato Main Stems and Lateral Branches in Greenhouse Environments. Horticulturae 2025, 11, 1260.
40. Qin, J.; Chen, Z.; Zhang, Y.; Nie, J.; Yan, T.; Wan, B. YOLO-CT: A method based on improved YOLOv8n-Pose for detecting multi-species mature cherry tomatoes and locating picking points in complex environments. Measurement 2025, 254, 117954.
41. Chen, W.; Liu, M.; Zhao, C.; Li, X.; Wang, Y. MTD-YOLO: Multi-task deep convolutional neural network for cherry tomato fruit bunch maturity detection. Comput. Electron. Agric. 2024, 216, 108533.
42. Sun, X. Enhanced tomato detection in greenhouse environments: A lightweight model based on S-YOLO with high accuracy. Front. Plant Sci. 2024, 15, 1451018.
43. Gao, G.; Fang, L.; Zhang, Z.; Li, J. Advancing lightweight and efficient detection of tomato main stems for edge device deployment. Artif. Intell. Agric. 2026, 16, 458–479.
44. Wang, W.; Qin, J.; Huang, D.; Zhang, F.; Liu, Z.; Wang, Z.; Yang, F. Integrated Navigation Method for Orchard-Dosing Robot Based on LiDAR/IMU/GNSS. Agronomy 2024, 14, 2541.
45. Lu, J.; Cao, Z.; Wang, J.; Wang, Z.; Zhao, J.; Zhang, M. A Picking Point Localization Method for Table Grapes Based on PGSS-YOLOv11s and Morphological Strategies. Agriculture 2025, 15, 1622.
46. Ye, L.; Ma, J.; Lv, Y.; Guo, Z.; Lai, Z.; Ou, C.; Li, J.; Wu, F. The YOLO-OBB-Based Approach for Citrus Fruit Stem Pose Estimation and Robot Picking. Agriculture 2025, 15, 2330.
47. Burusa, A.K.; Scholten, J.; Wang, X.; Rapado-Rincón, D.; van Henten, E.J.; Kootstra, G. Semantics-aware next-best-view planning for efficient search and detection of task-relevant plant parts. Biosyst. Eng. 2024, 248, 1–14.
48. Jiang, C.; Miao, K.; Hu, Z.; Gu, F.; Yi, K. Image Recognition Technology in Smart Agriculture: A Review of Current Applications, Challenges and Future Prospects. Processes 2025, 13, 1402.
49. Akbar, J.U.M.; Ilyas, Q.M.; Ali, T.; Iqbal, S.; Khan, S. A Comprehensive Review on Deep Learning Assisted Computer Vision Techniques for Smart Greenhouse Agriculture. IEEE Access 2024, 12, 4485–4522.
50. Saiz-Rubio, V.; Rovira-Más, F. From Smart Farming towards Agriculture 5.0: A Review on Crop Data Management. Agronomy 2020, 10, 207.
51. Wang, Z.; Xun, Y.; Wang, Y.; Yang, Q. Review of Smart Robots for Fruit and Vegetable Picking in Agriculture. Int. J. Agric. Biol. Eng. 2022, 15, 33–54.
52. Xiao, X.; Jiang, Y.; Wang, Y. Key Technologies for Machine Vision for Picking Robots: Review and Benchmarking. Mach. Intell. Res. 2025, 22, 2–16.
53. Xie, Z.; Yu, X.; Gao, X.; Li, K.; Shen, S. Recent Advances in Conventional and Deep Learning-Based Depth Completion: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 3395–3415.
54. Duarte, A.; Fernandes, F.; Pereira, J.M.; Moreira, C.; Nascimento, J.C.; Jorge, J. SelfReDepth: Self-Supervised Real-Time Depth Restoration for Consumer-Grade RGB-D Cameras. J. Real-Time Image Process. 2024, 21, 124.

Figure 1. Tomato clusters under different conditions: (a) low-illumination front view with backlighting; (b) strong-illumination front view; (c) low-illumination side view; (d) strong-illumination side view. (e) Grayscale histogram analysis of the scene in (a) yields the following metrics: (Mean: 103.4, Contrast: 72.8, Ratio: 0.51), indicating foreground darkening and confirming backlighting.

Figure 2. Architecture of the proposed YOLOv8n-EED-seg model. (a) Overall structure. (b) C2f module: splits input into two branches, one through residual bottlenecks, then concatenates and refines outputs. (c) Bottleneck and CBS: CBS consists of Conv, BN, and SiLU; Bottleneck stacks two CBS modules with a residual connection.

Figure 3. Schematic diagrams of core modules and improved architectures for tomato pedicel segmentation. (a) 8-direction shift convolution for directional feature extraction via pixel shifts. (b) S-RepConv reparameterization: multi-branch to single convolution during inference. (c) SPPF with multi-scale max-pooling for efficient feature fusion. (d) Improved EfficientRep backbone for slender pedicel segmentation. (e) EMA mechanism using parallel sub-networks for cross-channel and spatial attention.

Figure 4. (a) Structural mechanism of the Efficient Channel Attention (ECA) module for channel-wise feature recalibration; (b) Improved Dynamic Head (Dyhead) module, with the FC layer in its $π_{C}$ attention modules replaced by the ECA attention mechanism.

Figure 5. The 8-neighborhood pixel region considered by the Zhang–Suen algorithm.

Figure 6. Tomato image with missing pedicel depth information in greenhouse scene. Conventional hole-filling methods fail to repair large-area depth loss caused by uneven illumination and specular reflection, as visualized in the tomato pedicel region: (a) RGB image with clearly visible tomato pedicel; (b) Corresponding depth image with severe depth information loss in the pedicel region.

Figure 7. Qualitative comparison of ablation study results across five model variants. Each row corresponds to a different model configuration: full YOLOv8n-EED-seg (Row a, proposed); YOLOv8n-EMAD-seg (Row b); YOLOv8n-ERD-seg (Row c); YOLOv8n-Dyhead-seg (Row d); baseline YOLOv8n-seg (Row e). Columns 1–3 show ‘Israel Red Cluster’ pedicel segmentation; Column 4 shows ‘Yuekeda’ pedicel segmentation with three instances. Red masks indicate correctly segmented pedicels, with accuracy scores overlaid on each instance.

Figure 8. Qualitative comparison of segmentation performance under real greenhouse conditions. Rows (a–e): YOLOv8n-EED-seg (proposed), YOLOv9-seg, YOLOv11-seg, YOLACT, and SEG-RTDETR. Columns show four scenarios: (1) Israel Red Cluster with two harvestable pedicels; (2) slender pedicel; (3) standard pedicel; (4) Yuekeda with 60% occlusion. Numbers indicate confidence scores.

Figure 9. Results of picking-point identification using the proposed YOLOv8n-EED-seg model. (a) Original input image of a tomato cluster. (b) Pedicel segmentation mask generated by the YOLOv8n-EED-seg model, where white regions indicate the segmented pedicel area. (c) Binarized mask after thresholding, extracting the pedicel foreground region. (d) Skeletonization result showing the one-pixel-wide topological skeleton of the pedicel. (e) Final picking point (marked as a red dot) located at the midpoint of the pedicel skeleton.
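Step (e) in the caption above, taking the midpoint of the skeleton, can be sketched as a walk along the skeleton pixels. This is only an illustrative sketch under the assumption that the skeleton is a single unbranched, one-pixel-wide curve; the function name `skeleton_midpoint` is hypothetical, and the source does not state how branches or spurs are handled.

```python
import numpy as np


def skeleton_midpoint(skel: np.ndarray) -> tuple[int, int]:
    """Return (row, col) of the middle pixel of an unbranched skeleton."""
    pixels = {(int(r), int(c)) for r, c in zip(*np.nonzero(skel))}
    if not pixels:
        raise ValueError("empty skeleton")

    def neighbours(p):
        r, c = p
        return [(r + dr, c + dc)
                for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                if (dr or dc) and (r + dr, c + dc) in pixels]

    # Start at an endpoint (exactly one neighbour); fall back to any pixel
    # if the skeleton has no endpoint, e.g. a closed loop.
    start = next((p for p in sorted(pixels) if len(neighbours(p)) == 1),
                 min(pixels))
    path, visited = [start], {start}
    while True:
        candidates = [q for q in neighbours(path[-1]) if q not in visited]
        if not candidates:
            break
        path.append(candidates[0])
        visited.add(candidates[0])
    # The middle element of the ordered path is the candidate picking point.
    return path[len(path) // 2]
```

For a straight horizontal skeleton spanning columns 2 to 10 of row 2, the walk visits nine pixels and returns `(2, 6)`.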

Figure 10. Picking point depth localization results using the optimal threshold (k = 2.5 cm). (a) Original RGB image of a tomato cluster. (b) Picking-point localization result on the RGB image, where the red dot indicates the detected picking point. (c) Corresponding depth image, where redder regions represent closer distances. (d) 3D localization result showing the picking point (red dot) projected onto the depth map. The proposed method achieves a localization error of 1.05 cm RMSE at a working distance of 25 cm.
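Going from a picking point in the RGB image plus its depth value to a 3D position is commonly done with the pinhole camera model. The sketch below illustrates that standard back-projection; it assumes the depth map is already aligned to the RGB image, and the intrinsics (`fx`, `fy`, `cx`, `cy`) used in the example are made-up placeholder values, not the calibration parameters reported in the paper.

```python
def pixel_to_camera(u: float, v: float, depth_cm: float,
                    fx: float, fy: float, cx: float, cy: float):
    """Back-project pixel (u, v) with depth Z into camera coordinates (X, Y, Z).

    Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    Units of X, Y follow the units of the depth value (cm here).
    """
    x = (u - cx) * depth_cm / fx
    y = (v - cy) * depth_cm / fy
    return x, y, depth_cm


# Hypothetical intrinsics for a 640 x 480 stream (placeholders only).
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0

# A picking point at the principal point, 25 cm from the camera,
# lies on the optical axis.
point = pixel_to_camera(320.0, 240.0, 25.0, FX, FY, CX, CY)
```

Because X and Y scale linearly with depth, any error in the restored depth at the picking point propagates directly into the lateral coordinates, which is why the depth completion step matters for the final 3D position.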

Figure 11. Limitations of the proposed method. (a) Severe occlusion causes missed detection of the upper pedicel. (b) RGB image and (c) corresponding depth map showing significant depth information loss that cannot be effectively compensated, affecting 3D positioning accuracy.

Table 1. Summary of critical analysis of existing methods and corresponding improvements in this study.

Reference/Method	Identified Limitation	Supporting Evidence	Proposed Improvement
Yan et al. [10]	Constrained to fruit-level detection; pedicel localization remains unaddressed	Achieves mAP of 86.75% on apple fruit detection using YOLOv5s	Enables pedicel-level segmentation achieving sub-centimeter localization accuracy
Gai et al. [11]	Limited to blueberry fruit detection; lacks a dedicated module for pedicel recognition	Attains 84.6% precision and 94.1% mAP on blueberry fruit detection	Introduces a morphology-specific module tailored for pedicel feature extraction
Wang et al. [12]	Two-stage architecture imposes real-time constraints	Achieves 96.5% precision and 97.4% recall, yet requires 270 ms per image	Employs a one-stage lightweight design achieving 4.8 ms inference time
Yuan et al. [13]	Single-shot design struggles with small and slender targets	Demonstrates lightweight efficiency but yields suboptimal performance on slender structures	Implements enhanced feature extraction specifically designed for slender pedicel morphology
Liu et al. [14]	Relies on RGB-NIR multimodal input, limiting practical deployment	Improves mAP from 39.20% to 46.29% using YOLACTFusion	Adopts single-modal RGB input with attention-based feature fusion
Song et al. [17]	Attention mechanism confined to backbone level	Achieves 86.9% precision and 84.8% mAP with FGS-YOLOv8s-seg incorporating SegNext-attention	Facilitates multi-scale feature fusion across detection heads
Yoshida et al. [18]	Excessive processing latency limits real-time viability	Attains 90% picking success within 15 s	Delivers real-time performance with 4.8 ms inference time per frame
Zhang et al. [19]	Elevated RMSE on texture-sparse surfaces	Reports RMSE of 3–5 pixels (approximately 4–7 mm) on texture-sparse surfaces	Implements adaptive depth compensation reducing RMSE to 3.2 mm
Zheng et al. [20]	Systematic depth deviation induced by point cloud holes and noise	Achieves mean radius error of 2.4 mm with systematic depth deviation up to 3.7 mm	Employs neighborhood-based compensation ensuring local depth consistency
Rong et al. [21]	Fails to identify pedicel cutting points despite cluster-level accuracy	Attains 94.5% mAP for cluster detection without pedicel point identification	Integrates pedicel segmentation with picking point detection
Lin et al. [22]	Designed for point cloud structures; requires complete data	Achieves skeletal representations from point clouds but assumes complete structures	Applies iterative refinement for pedicel skeletons under occlusion
Qiu et al. [23]	Assumes complete point cloud input	Achieves point cloud completion and skeletonization but fails under occlusion	Combines skeletonization with depth completion to handle occlusion
Xu et al. [24]	Designed for root system reconstruction, not applicable to above-ground pedicel structures	Achieves skeleton-based 3D reconstruction from multi-view images but targets root phenotyping	Adapts skeletonization approach for pedicel morphology in above-ground tomato plants
Wang et al. [25]	Focuses on main stem and leaves segmentation	Extracts stem and leaf phenotypes from tomato plants, not pedicel-specific	Introduces pedicel-level skeletonization tailored for picking-point localization
Zhang et al. [26]	Surveys learning-based depth estimation without real-time consideration	Reviews monocular depth estimation methods highlighting data requirements	Adopts training-free neighborhood-based depth compensation
Hu et al. [27]	Surveys learning-based depth completion requiring extensive data	Comprehensive review of deep depth completion methods emphasizing data dependency	Implements training-free neighborhood-based depth compensation
Calderero & Henríquez [28]	Fixed kernel size induces over-smoothing	Morphological filtering exhibits inherent limitations with fixed kernels	Adopts adaptive kernel size based on local depth variance

Table 2. Dataset composition by illumination condition and viewing angle.

Category	Subcategory	Images
Illumination Intensity	Direct sunlight	1432
Illumination Intensity	Diffuse light	1091
Illumination Intensity	Backlighting	787
Viewing Angle	Front View	1745
Viewing Angle	Side View	1565

Table 3. Strategy of dataset division.

Dataset	Number of Images	Number of Pickable Pedicels
Training set	2584	3065
Validation set	322	383
Test set	324	389

Table 4. Training parameters.

Parameter	Configuration
Epoch	300
Batch size	16
Optimizer	AdamW
Weight decay	0.0005
Initial learning rate	0.001
Cosine learning rate scheduling	True
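For readers who want to reproduce a comparable setup, the parameters in Table 4 map onto training arguments of the Ultralytics YOLO package roughly as shown below. This is a sketch under the assumption that the Ultralytics trainer is used; the source does not name the training framework, and the `model_cfg` and `data_yaml` arguments are hypothetical placeholders (the modified architecture would need its own model definition, which is not given here).

```python
# Table 4 expressed as Ultralytics training keyword arguments.
TRAIN_ARGS = dict(
    epochs=300,           # Epoch
    batch=16,             # Batch size
    optimizer="AdamW",    # Optimizer
    weight_decay=0.0005,  # Weight decay
    lr0=0.001,            # Initial learning rate
    cos_lr=True,          # Cosine learning rate scheduling
)


def train(model_cfg: str, data_yaml: str):
    """Launch training; both arguments are placeholder paths supplied by the user."""
    # Imported lazily so the configuration can be inspected without the package.
    from ultralytics import YOLO

    model = YOLO(model_cfg)
    return model.train(data=data_yaml, **TRAIN_ARGS)
```

Any setting not listed in Table 4 (image size, augmentation, warm-up) would fall back to the framework defaults in this sketch, which may differ from what the authors used.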

Table 5. Ablation analyses of different modules’ effects on recognition task.

Model	Precision (%)	Recall (%)	F1 Score (%)	mAP₅₀ (%)	Size (MB)	FLOPs (G)	Processing Time per Image (ms)
YOLOv8n-EED-seg	92.08 ± 0.09	82.10 ± 0.17	86.64 ± 0.21	87.01 ± 0.09	7.5	9.1	4.8
YOLOv8n-ERD-seg	89.61 ± 0.09	81.51 ± 0.44	83.57 ± 0.23	84.74 ± 0.05	7.0	8.5	4.7
YOLOv8n-EMAD-seg	89.93 ± 0.20	82.42 ± 0.30	85.46 ± 0.16	86.58 ± 0.21	7.8	9.7	5.0
YOLOv8n-Dyhead-seg	87.98 ± 0.19	80.21 ± 0.13	83.46 ± 0.24	84.27 ± 0.20	7.1	8.9	4.8
YOLOv8n-seg	86.99 ± 0.07	79.32 ± 0.11	83.01 ± 0.06	82.70 ± 0.10	6.6	8.2	4.5
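The improvements quoted in the abstract can be recovered directly from the first and last rows of Table 5. The short check below uses only the mean values from those two rows; the dictionary names are illustrative, and the differences are absolute percentage points rather than relative gains.

```python
# Mean values from Table 5 (proposed model vs. baseline YOLOv8n-seg).
proposed = {"precision": 92.08, "recall": 82.10, "f1": 86.64, "map50": 87.01}
baseline = {"precision": 86.99, "recall": 79.32, "f1": 83.01, "map50": 82.70}

# Absolute gain in percentage points for each metric.
gains = {name: round(proposed[name] - baseline[name], 2) for name in proposed}

# Cost of the gain: extra model size (MB), FLOPs (G), and latency (ms).
overhead = {
    "size_mb": round(7.5 - 6.6, 1),
    "flops_g": round(9.1 - 8.2, 1),
    "time_ms": round(4.8 - 4.5, 1),
}
```

The resulting gains are 5.09, 2.78, 3.63, and 4.31 points for precision, recall, F1 score, and mAP₅₀, obtained at an overhead of 0.9 MB, 0.9 GFLOPs, and 0.3 ms per image.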

Table 6. Comparison of different models on the test dataset.

Model	Precision (%)	Recall (%)	F1 Score (%)	mAP₅₀ (%)	Size (MB)	FLOPs (G)	Processing Time per Image (ms)
YOLOv8n-EED-seg	92.08 ± 0.09	82.10 ± 0.17	86.64 ± 0.21	87.01 ± 0.09	7.5	9.1	4.8
YOLOv9-seg	89.08 ± 0.06	79.61 ± 0.38	82.14 ± 0.18	82.10 ± 0.22	6.4	8.7	4.9
YOLOv11-seg	86.54 ± 0.43	80.49 ± 0.28	83.46 ± 0.31	83.24 ± 0.10	6.3	8.5	4.7
YOLACT	87.14 ± 0.08	80.25 ± 0.08	83.13 ± 0.02	82.60 ± 0.29	46.5	32.4	20.0
SEG-RTDETR	86.46 ± 0.31	79.95 ± 0.57	83.26 ± 0.45	83.73 ± 0.32	11.3	12.8	5.1

Table 7. Quantitative comparison of different depth completion methods.

Method	RMSE (cm)	MAE (cm)	Inference Time (ms)	Training Required
Bilinear Interpolation	1.34	1.27	3.2	No
Bicubic Interpolation	1.28	1.12	3.5	No
BP-Net (learning-based)	0.92	0.72	23.0	Yes
Ours (large-neighborhood mean)	1.05	0.81	3.9	No
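To make the training-free method in the last row more concrete, the sketch below shows one plausible form of a large-neighborhood mean for restoring a missing depth value at the picking point. It is an assumption-laden illustration, not the authors' algorithm: the window half-size, the use of the median as a reference value, and the interpretation of the threshold k = 2.5 cm as an outlier cut-off around that reference are all guesses made for this example.

```python
from typing import Optional

import numpy as np


def fill_depth_neighbourhood_mean(depth_cm: np.ndarray, row: int, col: int,
                                  half_window: int = 7,
                                  k_cm: float = 2.5) -> Optional[float]:
    """Estimate depth at (row, col) from valid depths in a large square window."""
    r0, r1 = max(row - half_window, 0), min(row + half_window + 1, depth_cm.shape[0])
    c0, c1 = max(col - half_window, 0), min(col + half_window + 1, depth_cm.shape[1])
    patch = depth_cm[r0:r1, c0:c1]

    # Zero or non-finite readings are treated as missing depth.
    valid = patch[np.isfinite(patch) & (patch > 0)]
    if valid.size == 0:
        return None

    # Assumed outlier rule: keep depths within k_cm of the window median,
    # so that background pixels far behind the pedicel do not bias the mean.
    reference = np.median(valid)
    inliers = valid[np.abs(valid - reference) <= k_cm]
    if inliers.size == 0:
        inliers = valid
    return float(inliers.mean())
```

The function returns `None` when the whole window is empty, which corresponds to the failure case where depth loss is too extensive to compensate.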

Table 8. Real-time performance of different pipeline components on edge devices (Jetson Orin NX, 100 TOPS).

Model/Pipeline	Components	Inference Time (ms)	FPS
YOLOv8n (baseline)	Baseline detection only	7.6	132
YOLOv8n-EED-seg	Detection + segmentation	9.3	108
YOLOv8n-EED-seg + Post-processing	Detection + segmentation + post-processing	16.2	62
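The FPS column in Table 8 follows from the per-frame inference time via FPS = 1000 / latency (ms). The snippet below reproduces that conversion from the table values; the helper name `fps_from_latency_ms` is illustrative.

```python
def fps_from_latency_ms(latency_ms: float) -> float:
    """Convert a per-frame latency in milliseconds to frames per second."""
    return 1000.0 / latency_ms


# Inference times from Table 8 (Jetson Orin NX).
latencies_ms = {
    "YOLOv8n (baseline)": 7.6,
    "YOLOv8n-EED-seg": 9.3,
    "YOLOv8n-EED-seg + Post-processing": 16.2,
}
fps = {name: round(fps_from_latency_ms(ms)) for name, ms in latencies_ms.items()}

# Time added by post-processing on top of detection + segmentation.
post_processing_ms = round(16.2 - 9.3, 1)
```

The rounded values match the table (132, 108, and 62 FPS), and the difference between the last two rows attributes about 6.9 ms per frame to post-processing.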


Share and Cite

MDPI and ACS Style

Wu, L.; Liu, L.; Teng, D. Tomato Pedicel Picking-Point Localization via Improved YOLOv8n-EED-Seg and RGB-D Fusion. Agriculture 2026, 16, 1197. https://doi.org/10.3390/agriculture16111197

