1. Introduction
Tomato is a globally cultivated high-value crop whose production remains heavily dependent on manual labor [1]. While automated harvesting systems exist, they typically rely on one-time harvesting. The selective batch-picking approach based on fruit ripeness has yet to be fully automated. Labor scarcity and rising costs underscore the urgency for tomato-picking robots to ensure sustainable production. Central to such robots is a visual system capable of real-time fruit recognition and precise 3D localization [2]. However, field conditions such as leaf occlusion, variable environmental lighting, and fruit overlap pose significant challenges to reliable perception [3]. Traditional machine vision methods (e.g., color, texture, or shape-based thresholding) show limited robustness and generalization in unstructured environments, where pedicels often adhere to foliage or structures [4]. Dynamic illumination conditions significantly compromise the accuracy of RGB-based segmentation in greenhouse settings, with recent studies reporting performance degradation of up to 7% under varying light intensities [5], while single-feature approaches like Hu moments achieved only 65% accuracy in cross-cultivar tasks [6]. Moreover, cost-effective depth sensors often suffer from data loss; specifically, the largest missing regions average approximately 2.23% of the total image area [7], which can compromise positioning precision.
In information-based agriculture, deep learning-driven object detection is now the mainstream approach in fruit recognition, surpassing early limited-feature methods [8,9]. The emergence of CNN-based detectors (the YOLO and R-CNN series) has greatly enhanced both accuracy and speed. Yan et al. [10] adapted YOLOv5s for real-time apple detection, achieving a mAP of 86.75% on apple targets. Gai et al. [11] introduced TL-YOLOv8 for blueberries, integrating attention and reparameterization to accelerate training and enrich features, achieving 84.6% precision and 94.1% mAP. However, these methods target fruit-level detection and do not address the finer-grained challenge of pedicel localization, which requires sub-centimeter precision. Recent research has shifted toward instance segmentation for precise robotic picking. Wang et al. [12] used Mask R-CNN for overlapping apple segmentation, achieving 96.5% precision and 97.4% recall, yet its two-stage design requires 270 ms per image, which is unsuitable for real-time harvesting. Yuan et al. [13] developed a lightweight SSD variant for cherry tomato segmentation, balancing speed and accuracy, yet its single-shot design struggles with small, slender targets. Liu et al. [14] proposed YOLACTFusion, an attention-guided RGB-NIR fusion method for tomato stem detection, improving mAP from 39.20% to 46.29%; however, it relies on multimodal input not always available in greenhouses. Attention mechanisms in YOLO models further improved robustness under occlusion and varying light [15,16], exemplified by Song et al. [17], who integrated SegNext-attention into YOLOv8-seg for tomato segmentation and maturity classification, achieving 86.9% precision and 84.8% mAP. Nevertheless, these attention mechanisms focus on backbone enhancement rather than multi-scale fusion across detection heads, leaving room for improvement in pedicel segmentation.
Although instance segmentation provides richer 2D shape information, precise 3D spatial positioning remains a key challenge for automated picking. Yoshida et al. [18] used an RGB-D camera to obtain 3D tomato point clouds, applied region growing for clustering, and determined optimal picking points by integrating voxel connectivity, Mahalanobis distance, and pedicel geometry, achieving 90% success but taking 15 s, which limits real-time application. Zhang et al. [19] proposed Completion-BiPy-Disp, fusing bilateral filtering with pyramid models to restore missing depth in disparity maps. Yet over 15% of regions lacked reliable depth, with RMSE reaching 3–5 pixels (4–7 mm) on texture-sparse surfaces. Zheng et al. [20] combined RAFT-Stereo with improved YOLOv5 to segment and crop point clouds via masks, then fitted spheres to compute centroids and radii, yielding a mean absolute radius error of 2.4 mm. However, the systematic depth deviation reached up to 3.7 mm due to point cloud holes and noise. Rong et al. [21] fused RGB-D data for dynamic tomato cluster tracking using YOLOv5 and ByteTrack (mAP 94.5%), yet did not identify pedicel cutting or grasping points.
Skeletonization of segmentation masks enables precise extraction of picking points by converting instance masks into one-pixel-wide topological skeletons that preserve the morphological characteristics of pedicels. This approach has been effectively applied to branch-like structure analysis in agricultural and forestry contexts, including tree branch reconstruction [22,23], root system phenotyping [24], and plant phenotyping [25], demonstrating improved localization accuracy over centroid-based methods. For depth recovery, we adopt a neighborhood-based compensation strategy to reconstruct missing depth data caused by sensor limitations by interpolating invalid regions using valid depth values from surrounding pixels. Learning-based depth completion methods typically require extensive training data, which limits their practicality for rapid deployment [26]. In contrast, conventional interpolation and filtering approaches are computationally efficient but often struggle with complex occlusions and texture-sparse surfaces, producing over-smoothed results that blur fine-grained pedicel structures [27]. To balance these trade-offs, neighborhood-based depth compensation strategies offer a lightweight alternative suitable for real-time harvesting applications while maintaining local depth consistency [28].
Table 1 summarizes the critical analysis of existing methods and the corresponding improvements proposed in this study.
Current research has advanced fruit recognition and positioning, but core challenges remain in incomplete 3D perception and limited accuracy. Most methods rely on idealized point cloud fitting, which is sensitive to depth loss and noise, or only perform cluster-level detection without providing precise operation points. In particular, three core scientific gaps remain unaddressed: lack of sub-centimeter pedicel localization, insufficient cross-scale feature fusion across detection heads, and poor depth completion robustness on weakly textured, elongated pedicel surfaces. To address these gaps, this work is built upon two design considerations: (1) the incorporation of multiple attention mechanisms along with an efficient backbone enhances pedicel segmentation accuracy without compromising real-time inference speed; (2) skeletonization of segmentation masks enables precise extraction of picking points, while a neighborhood-based depth compensation strategy effectively reconstructs missing depth data caused by sensor limitations. Guided by these considerations, we propose a tomato picking-point localization method that integrates a modified YOLOv8n-seg model with RGB-D fusion. Based on YOLOv8n-seg, we incorporate an optimized EfficientRep backbone, the EMAttention mechanism, and a refined DynamicHead module, tailored to pedicel morphology and model efficiency [29,30,31], yielding the YOLOv8n-EED-seg model. The pedicel mask is skeletonized to extract its main structure and derive picking-point coordinates [32]. The core contribution of this study lies in a systematic redesign of the detection pipeline for pedicel perception, rather than a simple assembly of existing components. In addition, specific improvements have been introduced to the existing EfficientRep and DynamicHead modules to better suit the characteristics of slender pedicel targets.
Specifically, the 8-direction shift convolution in the EfficientRep backbone enlarges the receptive field without adding parameters. The enriched features then feed into the EMAttention module for cross-scale fusion across detection heads. Finally, the refined DynamicHead decouples classification, regression, and segmentation tasks, avoiding gradient interference.
Beyond segmentation, we further address the challenge of missing depth data caused by sensor limitations on weakly textured, elongated pedicel surfaces. A large-neighborhood mean method is introduced to compensate for invalid depth values, enabling accurate 3D localization through RGB-D fusion. This depth compensation strategy, together with the cascaded feature refinement architecture, forms a complete perception pipeline from 2D segmentation to 3D localization. This problem-driven module reorganization has been validated through greenhouse harvesting experiments, bridging the gap between laboratory research and field applications. The main contributions of this work are as follows:
(1) This study integrates the EMAttention mechanism into the YOLOv8n-seg model to enhance the recognition and segmentation of small pedicels via cross-dimensional interaction and multi-scale feature calibration.
(2) To address the trade-off between inference speed and feature representation, this work introduces an improved EfficientRep lightweight backbone network. Furthermore, an improved DynamicHead module is employed to replace the original detection head. These modifications are expected to enhance feature representation and detection performance, thereby making the model more suitable for embedded deployment.
(3) A 3D positioning system integrating image segmentation, skeletonization analysis, and depth restoration algorithms is designed. This system achieves stable and high-precision localization of tomato picking points, providing a reliable visual perception solution for picking robots.
3. Results and Analysis
To comprehensively assess the proposed tomato-picking decision system and validate its core improved modules, systematic experiments were conducted on self-constructed tomato pedicel RGB and RGB-D test datasets under real agricultural field conditions. The analysis quantifies each module’s contribution to pedicel recognition and picking-point localization, and elaborates on the integrated system’s operational mechanisms, focusing on the synergistic effects of its key techniques on slender pedicel detection and precise 3D localization in complex unstructured environments.
3.1. Ablation Experiments
This section evaluates the impacts of the improved EfficientRep backbone, the EMAttention module, and the improved Dyhead module through ablation experiments. All reported results are mean values from three independent runs (random seeds: 42, 123, and 456). The stability and reproducibility of the experimental results are validated in Appendix C (Table A6), where all metrics are presented as mean ± standard deviation.
Table 5 summarizes the results of the ablation experiments on the proposed framework. The improved Dyhead alone (Row 4) enables adaptive fusion and task decoupling, which increases mAP50 from 82.70% to 84.27% (+1.57%) with minimal computational overhead (model size: 6.6 MB to 7.1 MB; inference: 4.5 ms to 4.8 ms). EMAttention (Row 3) strengthens cross-scale feature aggregation under occlusion, which further improves mAP50 to 86.58% (+3.88%) and F1-score to 85.46%, but increases model size to 7.8 MB and inference time to 5.0 ms (model size +18.2%, inference +11.1%). The improved EfficientRep backbone with Dyhead (Row 2) enlarges the receptive field without adding parameters, which reduces model size to 7.0 MB and inference time to 4.7 ms while achieving an mAP50 of 84.74%, enabling better feature extraction for slender pedicels.
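The deltas quoted for the ablation rows mix two kinds of change: accuracy gains are absolute differences in percentage points, whereas the model-size and inference-time overheads are relative changes. The following is a minimal sketch of that arithmetic using only figures stated in the text; the helper names `point_gain` and `relative_change` are illustrative and not part of the original work.

```python
def point_gain(new_pct, base_pct):
    """Absolute difference between two percentages, in percentage points."""
    return round(new_pct - base_pct, 2)


def relative_change(new, base):
    """Relative change of `new` with respect to `base`, in percent."""
    return round(100.0 * (new - base) / base, 1)


# Values quoted in the text for the baseline and two ablation variants.
baseline = {"map50": 82.70, "size_mb": 6.6, "infer_ms": 4.5}
dyhead = {"map50": 84.27, "size_mb": 7.1, "infer_ms": 4.8}
emad = {"map50": 86.58, "size_mb": 7.8, "infer_ms": 5.0}

print(point_gain(dyhead["map50"], baseline["map50"]))         # mAP50 gain, Dyhead alone
print(point_gain(emad["map50"], baseline["map50"]))           # mAP50 gain with EMAttention
print(relative_change(emad["size_mb"], baseline["size_mb"]))  # model-size overhead
print(relative_change(emad["infer_ms"], baseline["infer_ms"]))  # inference overhead
```

Read this way, "+3.88%" denotes 3.88 percentage points of mAP50, while "+18.2%" denotes a relative growth in model size.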
The full YOLOv8n-EED-seg model (Row 1) integrates all three enhancements and delivers the best performance: mAP50 reaches 87.01% (+4.31% over baseline), Precision 92.08%, Recall 82.10%, and F1-score 86.64%, with competitive FLOPs of 9.1 G and inference time of 4.8 ms. Relative to Dyhead-seg (Row 4), the full model achieves a 2.74% higher mAP50 with negligible increases in model size (0.4 MB) and FLOPs (0.2 G). Relative to EMAD-seg (Row 3), it achieves higher accuracy accompanied by a smaller model size of 7.5 MB (compared to 7.8 MB) and faster inference of 4.8 ms (compared to 5.0 ms). Relative to ERD-seg (Row 2), it delivers substantial accuracy gains (mAP50 +2.27%, precision +2.47%, F1-score +3.07%) with only modest increases in model size (0.5 MB) and FLOPs (0.6 G). These results confirm that the three modules work synergistically: EfficientRep enlarges the receptive field without adding parameters, EMAttention strengthens cross-scale feature fusion under occlusion, and DynamicHead decouples tasks for fine-grained localization.
Figure 7 presents the qualitative ablation results across five model variants. In the first column (Israel Red Cluster pedicel segmentation), the baseline model achieves a confidence score of 0.78, which improves to 0.85 with DyHead, to 0.88 with EMAttention, and the full EED-seg model reaches the highest confidence score of 0.89. In the second column (apical pedicel segmentation), both EED-seg and EMAD-seg achieve 0.92, while ERD-seg, Dyhead-seg, and the baseline incorrectly split the pedicel into two instances due to branch interference, with confidence scores of 0.89, 0.85, and 0.85, respectively. In the third column (two harvestable pedicels), EED-seg achieves the best performance among all compared models. In the fourth column (Yuekeda cultivar with three pickable pedicels), EED-seg achieves the highest confidence scores across all instances (0.93, 0.87, 0.91), outperforming EMAD-seg (0.92, 0.81, 0.90), ERD-seg (0.92, 0.87, 0.82), Dyhead-seg (0.92, 0.85, 0.83), and the baseline (0.89, 0.85, 0.82). These results demonstrate that the full EED-seg model attains the highest overall recognition confidence among all evaluated approaches, exhibiting balanced and robust performance in segmenting both fine-grained and regular pedicels.
3.2. Performance Comparisons of Different Models on Target Detection Tasks
To evaluate the proposed YOLOv8n-EED-seg, we compared it with YOLOv9-seg [36], YOLOv11-seg [37], YOLACT [38], and Seg-rtdetr [39] under identical conditions. All reported results are mean values from three independent runs with different random seeds (42, 123, and 456). The stability and reproducibility of the experimental results are further validated in Appendix C (Table A7), where all metrics are presented as mean ± standard deviation, with the best overall performance per metric across all models highlighted in bold. As summarized in Table 6, the proposed model (Row 1) achieves the best performance across all accuracy metrics: mAP50 of 87.1%, precision of 92.08%, and F1-score of 86.82%. It outperforms YOLOv9-seg by 4.8% in mAP50, 6.03% in precision, and 4.49% in F1-score; outperforms YOLOv11-seg by 3.77%, 5.18%, and 3.08%; outperforms YOLACT by 4.3%, 4.94%, and 3.68%; and outperforms Seg-rtdetr by 3.06%, 5.41%, and 3.06%, respectively.
In terms of computational efficiency, the proposed model requires 9.1 G FLOPs (Column 7) and has a model size of 7.5 MB (Column 6), with an inference speed of 4.8 ms per frame (Column 8). Compared to YOLOv9-seg (8.7 G, 6.4 MB, 4.9 ms) and YOLOv11-seg (8.5 G, 6.3 MB, 4.7 ms), it has modestly higher computational cost but achieves substantially better accuracy. Compared to YOLACT (32.4 G, 46.5 MB, 20 ms) and Seg-rtdetr (12.8 G, 11.3 MB, 5.1 ms), it is significantly more efficient.
Among the compared models, YOLOv9-seg and YOLOv11-seg are lightweight successors in the YOLO series, designed for efficient deployment but with limited accuracy for fine-grained pedicel segmentation. YOLACT is a real-time instance segmentation model that generates prototype masks, but its large model size (46.5 MB) and high computational cost (32.4 G FLOPs, 20 ms per frame) make it unsuitable for real-time harvesting applications. Seg-rtdetr adopts a Transformer-based architecture with multi-head self-attention and a hybrid encoder, achieving competitive accuracy (84.04% mAP50) but at the cost of larger model size (11.3 MB) and slower inference (5.1 ms) compared to lightweight YOLO variants.
Although YOLOv11-seg offers the fastest inference (4.7 ms) and smallest model size (6.3 MB), its mAP50 is only 83.24%, 3.77 percentage points lower than that of the proposed model (87.01%). This trade-off between lightweight deployment and detection accuracy is effectively balanced by the proposed YOLOv8n-EED-seg, which achieves superior accuracy while maintaining competitive efficiency.
Figure 8 qualitatively compares the inference results of different models in real greenhouse scenarios. Rows (a–e) correspond to YOLOv8n-EED-seg (proposed), YOLOv9-seg, YOLOv11-seg, YOLACT, and Seg-rtdetr, respectively. In Column 1 (Israel Red Cluster with two harvestable pedicels), the proposed model achieves the highest confidence scores (0.90, 0.87). In Column 2 (slender pedicel), only EED-seg and YOLACT succeed; EED-seg scores higher on the first instance and is comparable on the second (0.82, 0.85 versus 0.72, 0.86 for YOLACT), with a much smaller model size. In Column 3 (standard pedicel), EED-seg achieves the highest confidence score (0.86), matching YOLOv11-seg and outperforming the others by 0.04–0.05. In Column 4 (Yuekeda with 60% occlusion), EED-seg achieves the best confidence scores for both the upper pedicel (surpassing others by 0.01–0.05) and the lower pedicel (0.93). These results indicate that EED-seg performs at least as well as competing models across diverse challenging scenarios while maintaining a compact size (7.5 MB) and a favorable balance between detection accuracy and computational efficiency.
Furthermore, the YOLOv8n-EED-seg model exhibits robust generalization across varying illumination conditions, with detailed results provided in Appendix D (Table A8).
3.3. Results on Picking-Point Localization
Figure 9 illustrates the complete recognition pipeline, including skeletonization and picking-point localization, based on a mobile-acquired test dataset. The process consists of five stages. In the first stage (data acquisition), RGB images $I_{RGB}$ and depth images $I_{D}$ are synchronously captured using an Intel RealSense D455 RGB-D camera, where the RGB image provides texture and color information of the pedicel while the depth image directly provides the depth value (in mm) for each pixel. In the second stage (semantic segmentation), the RGB image is fed into the trained YOLOv8n-EED-seg model to generate a binary pedicel mask $M$, with $M(u, v) = 1$ indicating that pixel $(u, v)$ belongs to the pedicel region and $M(u, v) = 0$ indicating background. The third stage is depth extraction, in which the depth information of the pedicel region is extracted by pixel-wise multiplication: $D_{p} = I_{D} \odot M$. In the fourth stage (depth completion), the large-neighborhood mean method is applied to compensate for missing depth values, resulting in the completed depth map $\hat{D}_{p}$. In the fifth and final stage (coordinate transformation), for each pixel $(u, v)$ with completed depth value $\hat{D}_{p}(u, v)$, a RealSense SDK function is used to directly convert the pixel coordinates and depth value into 3D world coordinates, obviating the need for manual manipulation of intrinsic and extrinsic matrices.
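The depth-extraction and coordinate-transformation stages can be illustrated with a short sketch. It assumes an ideal pinhole camera without lens distortion, and the intrinsics (`fx`, `fy`, `cx`, `cy`) are placeholder values rather than the calibration of the camera used here. The librealsense SDK provides a deprojection routine (`rs2_deproject_pixel_to_point`) that performs this kind of conversion, but whether it is the function referred to above is an assumption. The result of such a deprojection is expressed in the camera frame.

```python
def mask_depth(depth, mask):
    """Pixel-wise product of a depth map (mm) and a binary mask (0 or 1)."""
    return [[d * m for d, m in zip(d_row, m_row)]
            for d_row, m_row in zip(depth, mask)]


def deproject(u, v, depth_mm, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth in mm to camera-frame XYZ in mm.

    Ideal pinhole model; lens distortion is ignored in this sketch.
    """
    x = (u - cx) * depth_mm / fx
    y = (v - cy) * depth_mm / fy
    return (x, y, depth_mm)


# Toy 2x2 depth map; only masked (pedicel) pixels keep their depth.
depth = [[250, 900], [0, 251]]
mask = [[1, 0], [1, 1]]
print(mask_depth(depth, mask))

# Hypothetical intrinsics for a 640x480 stream.
print(deproject(380, 240, 250.0, fx=600.0, fy=600.0, cx=320.0, cy=240.0))
```

A pixel 60 columns to the right of the principal point at a depth of 250 mm therefore maps to a lateral offset of 25 mm under these placeholder intrinsics.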
The pipeline demonstrates robust performance across diverse scenarios. In Column 1, the pedicel, fragmented by branch breakage, produces two independent skeletons; the picking point is defined as the midpoint between their endpoints, treating the disconnected parts as a single structure. Columns 2 and 3 exhibit accurate localization for the ‘Yuekeda’ cultivar despite morphological challenges—forward curvature (Column 2) and limited visibility (Column 3). Column 4 highlights the method’s superiority under favorable conditions. Collectively, these results validate the robustness and adaptability of the proposed approach across challenging greenhouse scenarios.
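For the fragmented case in Column 1, one way to realize "the midpoint between their endpoints" is sketched below. It is an assumption-laden illustration rather than the authors' implementation: skeleton fragments are given as lists of (row, col) pixels, an endpoint is taken to be a skeleton pixel with at most one 8-connected neighbor, and the pair of endpoints closest to each other across the two fragments is bridged.

```python
def endpoints(skeleton):
    """Skeleton pixels with at most one 8-connected neighbor."""
    pixels = set(skeleton)
    found = []
    for (r, c) in pixels:
        neighbors = sum(
            (r + dr, c + dc) in pixels
            for dr in (-1, 0, 1)
            for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0)
        )
        if neighbors <= 1:
            found.append((r, c))
    return sorted(found)


def bridge_midpoint(fragment_a, fragment_b):
    """Midpoint of the closest endpoint pair between two skeleton fragments."""
    _, a, b = min(
        ((ra - rb) ** 2 + (ca - cb) ** 2, (ra, ca), (rb, cb))
        for (ra, ca) in endpoints(fragment_a)
        for (rb, cb) in endpoints(fragment_b)
    )
    return ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)


# Two vertical skeleton fragments separated by a gap of three pixels.
upper = [(0, 5), (1, 5), (2, 5)]
lower = [(6, 5), (7, 5), (8, 5)]
print(bridge_midpoint(upper, lower))
```

Choosing the closest endpoint pair treats the two fragments as one broken structure, which matches the description of handling disconnected parts as a single pedicel.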
As emphasized in the Methods (Section 2), depth information extraction is critical for acquiring 3D coordinates of picking points. This paper introduces the large-neighborhood mean method to robustly estimate missing values using valid neighboring depth information. To determine the optimal threshold k for the large-neighborhood mean method, a comparative experiment is conducted. The evaluation is based on two metrics: first, the abnormal depth rejection rate, defined as the proportion of outliers correctly identified and replaced by the mean depth; second, the picking localization error, measured as the distance deviation between the picking point computed from the restored depth and its actual position.
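The idea behind the large-neighborhood mean method can be sketched as follows. The details are assumptions made for illustration only: the threshold k is interpreted as the maximum allowed deviation from the neighborhood mean, depth is handled in mm, the window `radius` is a hypothetical parameter, and the center pixel is excluded from the mean so that an outlier does not bias its own replacement value.

```python
def large_neighborhood_mean(depth, mask, row, col, radius, k_mm):
    """Return a compensated depth (mm) for pixel (row, col).

    The mean is taken over valid (non-zero, masked) depths in a square
    window around the pixel, excluding the pixel itself. The pixel's own
    depth is kept only if it is valid and within k_mm of that mean.
    """
    values = []
    for r in range(max(0, row - radius), min(len(depth), row + radius + 1)):
        for c in range(max(0, col - radius), min(len(depth[0]), col + radius + 1)):
            if (r, c) != (row, col) and mask[r][c] and depth[r][c] > 0:
                values.append(depth[r][c])
    if not values:
        return None  # no valid neighbors: depth cannot be compensated
    mean = sum(values) / len(values)
    d = depth[row][col]
    if d <= 0 or abs(d - mean) > k_mm:
        return mean  # missing or abnormal depth is replaced by the mean
    return d


mask = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
hole = [[250, 250, 250], [250, 0, 250], [250, 250, 250]]
spike = [[250, 250, 250], [250, 900, 250], [250, 250, 250]]
print(large_neighborhood_mean(hole, mask, 1, 1, radius=1, k_mm=50))
print(large_neighborhood_mean(spike, mask, 1, 1, radius=1, k_mm=50))
```

Under this reading, a smaller k rejects more pixels as abnormal, while a larger k keeps more of the raw sensor values, which is the trade-off the threshold experiment explores.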
In the tomato greenhouse, the camera mounted on the robotic arm is typically positioned 20–30 cm from the target pedicel to balance depth measurement accuracy and operational safety. The fixed distance of 25 cm for 3D evaluation falls within this optimal range, ensuring the experimental setup is representative of actual harvesting conditions. This specific distance was determined empirically: the depth camera was fixed perpendicular to the greenhouse rail with an initial offset of 23 cm from the cultivation pot, then incrementally adjusted using real-time depth feedback until the final working distance of 25 cm was set. A total of 53 samples are used in the experiment; this sample size is determined based on the availability of representative pedicel instances with complete depth annotations across varying occlusion levels and lighting conditions. The 53 samples ensure statistical validity while maintaining manual measurement feasibility, covering diverse scenarios including slender pedicels (the bounding boxes of these pedicels are typically smaller than 32 × 32 pixels, following the MS COCO definition of small objects) and different cultivars (Israel Red Cluster and Yuekeda), thereby providing a robust basis for threshold optimization.
Table A3 summarizes the depth completion performance across different threshold values of k, from which the optimal threshold (expressed in cm) is identified, balancing a relatively high abnormal depth rejection rate (88.6%) and the lowest average localization error (1.05 cm). Using this optimal threshold, the corresponding localization results of picking-point depth information are presented in detail in Figure 10. The workflow begins with acquiring the original RGB image for 2D picking-point localization to accurately locate the target pedicel and its picking position. The original scene depth image is captured synchronously with the RGB image to record the initial depth data for 3D coordinate calculation. Finally, the large-neighborhood mean method is applied to compensate for missing depth values, yielding the precise 3D depth localization result of the picking point. In this randomly selected example, the obtained depth value is approximately 24.175 cm.
To quantify the localization accuracy of the proposed approach, a systematic evaluation was performed on a depth camera-captured test dataset comprising 324 images, which collectively contain 343 manually annotated pickable pedicels. During image acquisition, the distance between the depth camera and prominent primary pickable pedicels was deliberately fixed at 25 cm to establish uniform experimental conditions. A localization was considered successful if the Euclidean distance between the estimated and ground-truth positions was ≤15 mm, a tolerance determined based on the pedicel diameter (3–5 mm) and the end-effector grasping tolerance (∼10 mm). All test images were processed through the picking-point localization pipeline detailed in the preceding sections, including 2D image target detection for pedicel region identification, depth completion via the large-neighborhood mean method for missing depth data compensation, skeleton extraction for picking point derivation, and coordinate transformation for 3D positioning. Experimental results reveal that 322 of the 343 picking points are accurately localized within the 15 mm tolerance, yielding an overall success rate of 93.88%, with depth errors bounded to approximately
cm. Detailed uncertainty analysis, including localization variability, depth sensor accuracy, and confidence estimation, is provided in Appendix D.2.
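The success criterion and the reported rate can be reproduced with a few lines. This is a minimal sketch: the function names are illustrative, positions are assumed to be 3D points in mm, and the counts are those stated in the text.

```python
import math


def is_success(estimate_mm, truth_mm, tol_mm=15.0):
    """True if the Euclidean distance between the two 3D points is within tolerance."""
    return math.dist(estimate_mm, truth_mm) <= tol_mm


def success_rate(n_success, n_total):
    """Success rate in percent, rounded to two decimals."""
    return round(100.0 * n_success / n_total, 2)


# A point displaced by (9, 12, 0) mm lies exactly on the 15 mm boundary.
print(is_success((0.0, 0.0, 250.0), (9.0, 12.0, 250.0)))

# 322 of the 343 annotated picking points were localized within tolerance.
print(success_rate(322, 343))
```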
To further validate the effectiveness of the depth completion component, we conducted quantitative comparisons of the proposed large-neighborhood mean method with alternative approaches, including conventional interpolation methods (bilinear and bicubic interpolation) and a learning-based method (BP-Net). As shown in Table 7, our method achieves a localization RMSE of 1.05 cm and MAE of 0.81 cm at a working distance of 25 cm, with an inference time of 3.9 ms. Although BP-Net achieves slightly better accuracy (RMSE of 0.92 cm, MAE of 0.72 cm), it requires extensive training data and has a much higher inference time of 23.0 ms, making it less suitable for real-time harvesting applications. In contrast, our method requires no training and is considerably faster than BP-Net while maintaining competitive accuracy. Compared to bilinear interpolation (RMSE 1.34 cm, MAE 1.27 cm, 3.2 ms) and bicubic interpolation (RMSE 1.28 cm, MAE 1.12 cm, 3.5 ms), our method achieves better accuracy with a modest increase in inference time.
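For reference, the two error metrics used in this comparison can be computed as sketched below. The error values in the example are made up purely to exercise the functions and are not measurements from this study.

```python
import math


def mae(errors):
    """Mean absolute error of a list of signed errors."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root-mean-square error of a list of signed errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Hypothetical signed localization errors in cm.
errors_cm = [3.0, -4.0]
print(mae(errors_cm))
print(rmse(errors_cm))
```

Because RMSE weights large deviations more heavily, it is never smaller than MAE on the same errors, which is consistent with every row quoted from Table 7.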
These results demonstrate that our method achieves a favorable balance between accuracy and efficiency, making it particularly suitable for real-time greenhouse harvesting applications where computational resources are limited. The high localization success rate (93.88%), coupled with the stringent depth error tolerance and efficient depth completion, thoroughly validates the robustness and reliability of the proposed method, confirming its suitability for real-world agricultural automation scenarios.
3.4. Real-Time Performance Evaluation on Edge Devices
To validate real-world deployability, we evaluated the proposed model on a Jetson Orin NX edge device (100 TOPS). As shown in Table 8, the baseline YOLOv8n achieves an inference time of 7.6 ms per frame (132 FPS), while our YOLOv8n-EED-seg achieves 9.3 ms (108 FPS). With full post-processing (mask extraction, skeletonization, and depth completion), the total time increases to 16.2 ms (62 FPS). Taking 30 FPS as the real-time benchmark for embedded systems, both 108 FPS and 62 FPS far exceed this requirement, validating the feasibility of real-time picking-point localization in greenhouse environments. However, this level of performance requires at least 100 TOPS of computational power, which is provided by the Jetson Orin NX. Future work will focus on model compression techniques to enable deployment on lower-power devices such as the Jetson Nano.
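The frame rates quoted above follow directly from the per-frame latencies. A minimal sketch of the conversion, with an illustrative function name and the 30 FPS benchmark taken from the text:

```python
def fps(latency_ms):
    """Frames per second for a given per-frame latency in milliseconds."""
    return 1000.0 / latency_ms


REAL_TIME_FPS = 30  # real-time benchmark used in the text

for label, latency_ms in [
    ("YOLOv8n baseline", 7.6),
    ("YOLOv8n-EED-seg", 9.3),
    ("EED-seg with post-processing", 16.2),
]:
    rate = fps(latency_ms)
    print(label, round(rate), rate >= REAL_TIME_FPS)
```

Equivalently, the 30 FPS benchmark corresponds to a budget of about 33.3 ms per frame, so the full pipeline at 16.2 ms uses roughly half of it on this device.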