1. Introduction
With the ongoing modernization and scaling of agricultural production in China, traditional manual harvesting and conventional production practices are increasingly insufficient to meet the demands of large-scale fruit cultivation. Statistical reports indicate that labor costs for fruit harvesting account for approximately 35–40% of the total production cost, while the overall mechanized harvesting rate remains as low as 2.33% [
1]. Lychee, which originated in China, is predominantly cultivated in Guangdong Province. According to official agricultural statistics and industry forecasts, China’s total lychee production reached 1.78 million tons in 2024 [
2], as shown in
Figure 1. As a time-sensitive operation, harvesting represents a crucial stage in the fruit production cycle and directly affects yield, product quality, and economic returns. Therefore, the development of efficient and intelligent harvesting technologies has become an urgent need in modern orchard management.
The development of automated lychee-picking robots is of considerable practical significance, as such systems have the potential to substantially improve harvesting efficiency while reducing labor dependency and production costs. In robotic harvesting, accurate fruit detection, spatial localization, and obstacle avoidance constitute essential perception tasks that enable precise grasping and reliable execution of the harvesting operation. To date, most vision-based fruit-harvesting research has focused on single fruits or densely clustered fruit groups characterized by distinct visual features such as color, shape, and texture, as well as relatively regular growth patterns or clearly defined contours. However, comparatively limited attention has been given to fruit clusters with dispersed spatial distribution and picking points located on fruit-bearing branches. Existing studies addressing such scenarios are scarce and often restricted to individual fruit clusters or semi-mechanized vibration-based harvesting approaches [
3], which may cause mechanical damage to both the fruit and the tree structure. Meanwhile, extensive research efforts—both in China and internationally—have been devoted to robotic harvesting of apples, citrus fruits, lychees, tomatoes, cucumbers, and other crops by integrating advances in computer vision, robotics, and intelligent control technologies [
4]. The accurate recognition and localization of multiple fruit clusters in complex natural environments have become a prominent research focus. Typical cluster fruits such as lychee and longan generally command higher prices than many other fruits in subtropical regions. The research group led by Zou Xiangjun at South China Agricultural University investigated the operational behavior of a lychee-harvesting robot through virtual-reality-based simulation analysis and, based on this study, developed a prototype lychee-picking robotic system, as illustrated in
Figure 2. The system integrates a binocular vision module mounted on a six-degree-of-freedom industrial robotic arm produced by Guangzhou CNC, enabling fruit detection and spatial localization. An innovative end-effector mechanism, composed of dual gripping fingers and an eccentrically actuated cutting blade, was designed and driven by a motor to accomplish the harvesting action. Field experiments conducted in outdoor orchard environments reported a picking success rate of 78%, demonstrating the feasibility of the proposed robotic harvesting approach [
5,
6].
Target detection and localization technologies form the foundation of vision-based fruit positioning systems. However, their performance is highly susceptible to environmental variability in orchard settings, such as fluctuating illumination conditions, complex canopy occlusion, and substantial variations in fruit color, size, and morphology [
7,
8]. Existing studies have explored various strategies to improve detection and segmentation performance. Traditional image processing methods, such as double Otsu thresholding combined with k-means clustering, have been applied for lychee fruit and stem segmentation in field images. With the advancement of deep learning, object detection algorithms such as YOLO have been introduced into robotic harvesting systems for fruit localization. Furthermore, RGB-D-based neural networks, such as DaSNet-v1, have been proposed for simultaneous fruit and branch segmentation in orchard environments. For small-target detection in UAV imagery [
9], improved SSD-based models have been developed to enhance lychee recognition performance. Although these approaches have achieved encouraging results, several limitations remain. First, most studies primarily focus on single fruits or densely distributed clusters, while dispersed fruit clusters with irregular spatial structures have received limited attention [
10]. Second, detection and segmentation tasks are typically treated independently, without integrating fruit identification and branch segmentation for accurate cutting-point localization. Therefore, a unified framework that integrates robust cluster detection with precise branch segmentation for cutting-point computation remains lacking [
11]. Addressing these limitations is essential for enabling reliable robotic harvesting of dispersed lychee clusters.
Branch segmentation plays a crucial role in robotic harvesting, as cutting-point localization relies on accurate extraction of fruit-bearing branches. However, segmenting thin and irregular branch structures in natural orchard environments remains challenging due to illumination variation, occlusion, and background clutter. Existing semantic segmentation networks, such as U-Net, DeepLab v3+, and transformer-based architectures, have been applied in agricultural scenarios, but they often struggle with thin-structure continuity and small-scale object preservation [
12]. Therefore, enhancing feature representation for fine-grained branch extraction remains an open problem in robotic perception.
This study focuses on the visual perception stage of robotic harvesting systems, with lychee clusters selected as the target objects. Two challenges dominate this setting: severe occlusion and overlapping fruit clusters cause unstable detection and cluster grouping, and complex, thin branch structures are difficult to segment reliably against cluttered backgrounds. To address these issues, a lightweight YOLO-SCM architecture incorporating SimAM attention and CMUNeXt modules is proposed to enhance feature representation while maintaining efficiency [
13], and the MPDIoU loss is integrated to improve bounding-box regression under occlusion and irregular fruit distribution [
14]. A complete pipeline connects detection, clustering, segmentation, and picking-point localization tailored to robotic harvesting scenarios. By integrating object detection for fruit identification and semantic segmentation for branch extraction, the cutting points are systematically determined to support cutting-based harvesting, thereby improving fruit integrity and operational efficiency.
The main contributions of this study can be summarized as follows:
A unified perception framework is proposed for dispersed lychee clusters in natural orchard environments, integrating fruit detection, density-based clustering, branch semantic segmentation, and cutting-point localization into a complete robotic harvesting pipeline.
A lightweight YOLO-SCM detection architecture is developed by incorporating SimAM attention and CMUNeXt modules, together with MPDIoU loss, to enhance feature representation and improve robustness under occlusion and irregular fruit distribution.
A density-based clustering strategy is introduced to analyze the spatial distribution of detected fruits and automatically determine an adaptive harvesting sequence.
A semantic segmentation approach tailored for thin fruit-bearing branch extraction is designed to enable accurate cutting-point computation, supporting cutting-based harvesting operations and improving fruit integrity.
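The perception pipeline outlined in these contributions can be summarized as a minimal orchestration skeleton. All four stage functions below are hypothetical stand-ins (the actual system uses YOLO-SCM, MeanShift clustering, DeepLab v3+, and a geometric cutting-point computation); the sketch only illustrates how the stage outputs connect.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    x: float  # box centre, image coordinates
    y: float
    w: float
    h: float

# All four stage functions are hypothetical stand-ins; the real system
# uses YOLO-SCM, MeanShift, DeepLab v3+, and a geometric computation.
def detect_fruits(image):                      # YOLO-SCM stand-in
    return [Detection(100, 200, 30, 30), Detection(120, 210, 30, 30)]

def cluster_fruits(detections, bandwidth=50):  # density clustering stand-in
    return {0: detections}                     # one cluster for this demo

def segment_branches(image):                   # DeepLab v3+ stand-in
    return [[0] * 4 for _ in range(4)]         # dummy binary branch mask

def locate_picking_point(cluster, branch_mask):  # geometric stage stand-in
    xs = [d.x for d in cluster]
    return (sum(xs) / len(xs), min(d.y for d in cluster))

def harvest_pipeline(image):
    """Detection -> clustering -> segmentation -> picking-point localization."""
    detections = detect_fruits(image)
    clusters = cluster_fruits(detections)
    branch_mask = segment_branches(image)
    return [locate_picking_point(c, branch_mask) for c in clusters.values()]
```

The dataflow, not the stub bodies, is the point: each cluster produced by the grouping stage is paired with the branch mask to yield one picking point per cluster.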
3. Experimental Results and Discussion
3.1. YOLO-SCM Network Performance Evaluation
Experimental results show that the enhanced YOLO-SCM model performs strongly in lychee detection, achieving a precision of 84.3%, a recall of 73.2%, and a mean average precision (mAP) of 81.6%. A comparison of the precision, recall, mAP, and loss curves of YOLO11s and YOLO-SCM (as shown in
Figure 18) indicates clear improvements in all three key metrics (precision, recall, and mAP) for the enhanced model. These improvements validate the stronger detection capability and practical value of YOLO-SCM for lychee object detection tasks.
As illustrated in
Figure 19, both the training and validation losses decrease steadily throughout training and converge to stable values without noticeable divergence in the later epochs. These results indicate that the proposed model achieves stable convergence and effective overfitting control despite the relatively limited dataset size. The absence of significant performance degradation on the validation set suggests that the model maintains acceptable generalization capability within the collected orchard scenarios.
To evaluate the detection performance of the YOLO11s and YOLO-SCM models on lychee fruits under natural orchard conditions, 521 images under varying lighting conditions and 279 images under different occlusion scenarios were randomly selected from the original dataset for testing. The 521 lighting-condition images contain 5463 annotated lychee fruits: 122 low-light images with 1387 fruits, 235 strong-light images with 2456 fruits, and 164 shadowed images with 1620 fruits. The occlusion set consists of 154 images with fruit occlusion (638 annotated fruits) and 125 images with branch-and-leaf occlusion (287 annotated fruits). The models’ performance under varying lighting conditions is summarized in
Table 2, while their performance under different occlusion conditions is presented in
Table 3. The experimental results show that the improved YOLO-SCM model consistently achieves higher precision, recall, and F1-score under all lighting and occlusion conditions, indicating stronger robustness and fewer false detections.
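The F1-scores reported in Tables 2 and 3 are the harmonic mean of precision and recall; for the overall figures above (P = 84.3%, R = 73.2%) this works out to roughly 78.4%, which can be checked with a one-liner:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (percentages in, percentage out)."""
    return 2 * precision * recall / (precision + recall)

# f1_score(84.3, 73.2) -> about 78.4
```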
To evaluate the statistical robustness of the proposed model, we conducted multiple independent training runs using different random seeds (0, 42, and 100). The mean and standard deviation of key metrics are reported in
Table 4. The lower standard deviation of YOLO-SCM indicates improved training stability compared to the baseline model.
To further assess the impact of the proposed enhancements to YOLO-SCM, ablation experiments were performed on the test set. These experiments provide a clearer understanding of the specific contributions of each module to the overall performance enhancement. The results of the ablation study are presented in
Table 5.
3.2. Clustering Performance Evaluation
The results of the k-means algorithm based on different k values and centroid selections are shown in
Table 6 and the effect diagrams are presented in
Figure 20.
The statistical results show that the highest average ARI is achieved at k = 8; increasing k further causes the ARI to decrease. However, the algorithm depends heavily on the choice of k and on the distribution of image features, so for some images in the table the ARI is higher at k = 4 than at k = 8. At k = 16, categories that were correctly grouped at k = 8 are split in two, which lowers the ARI. For the k-means algorithm, therefore, the choice of k and of the initial centroids is particularly important.
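The Adjusted Rand Index used throughout this comparison can be computed directly from the pair-counting contingency table. The sketch below is a minimal pure-Python version (in practice scikit-learn's `adjusted_rand_score` is the usual choice); it also reproduces the over-splitting effect described above, where dividing a correct cluster in two lowers the ARI even though no fruit is assigned to a wrong group.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index computed from the pair-counting contingency table."""
    n = len(labels_true)
    contingency = Counter(zip(labels_true, labels_pred))
    sum_ij = sum(comb(c, 2) for c in contingency.values())      # agreeing pairs
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / comb(n, 2)   # chance-level agreement
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:               # degenerate trivial partitions
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

For example, `adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])` returns 1.0 (ARI is invariant to label permutation), while splitting one of two true clusters of four fruits in half yields roughly 0.70.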
To analyze the sensitivity of clustering performance to the bandwidth parameter in the MeanShift algorithm, experiments were conducted with bandwidth values ranging from 20 to 60 pixels at intervals of 10 pixels. The results show that ARI remains stable within the range of 30–50 pixels, indicating that the clustering performance is not highly sensitive to moderate bandwidth variations. To quantitatively evaluate clustering performance, ground-truth cluster labels were manually constructed based on spatial proximity and structural continuity of fruit-bearing branches within each image. A fruit cluster was defined as a group of lychee fruits connected by visible branches or exhibiting dense spatial aggregation. Two authors independently assigned cluster membership labels according to the spatial distribution of detected fruit centers. Discrepancies were resolved through discussion to obtain consensus annotations. To assess annotation consistency, Cohen’s Kappa coefficient was calculated. The obtained Kappa value of 0.87 indicates strong agreement between annotators, demonstrating the reliability of the cluster ground truth.
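The flat-kernel MeanShift grouping evaluated here can be sketched in a few lines: every detected fruit centre is shifted to the mean of its neighbours within the bandwidth until it converges on a mode, and modes closer than the bandwidth are merged into one cluster. This is an illustrative pure-Python version, not the implementation used in the experiments.

```python
def mean_shift_2d(points, bandwidth, max_iter=50, tol=1e-3):
    """Flat-kernel MeanShift on 2D points (e.g., detected fruit centres)."""
    modes = [list(p) for p in points]
    for m in modes:
        for _ in range(max_iter):
            # Neighbours of the current mode within the bandwidth radius.
            nbrs = [p for p in points
                    if (p[0] - m[0]) ** 2 + (p[1] - m[1]) ** 2 <= bandwidth ** 2]
            nx = sum(p[0] for p in nbrs) / len(nbrs)
            ny = sum(p[1] for p in nbrs) / len(nbrs)
            if abs(nx - m[0]) < tol and abs(ny - m[1]) < tol:
                break                      # converged on a density mode
            m[0], m[1] = nx, ny
    # Merge converged modes closer than the bandwidth into one cluster.
    centers, labels = [], []
    for m in modes:
        for i, c in enumerate(centers):
            if (c[0] - m[0]) ** 2 + (c[1] - m[1]) ** 2 <= bandwidth ** 2:
                labels.append(i)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return centers, labels
```

Unlike k-means, the number of clusters is not specified in advance; it emerges from the bandwidth, which is why the sensitivity analysis above varies only that single parameter.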
3.3. Comparative Experiment of Five DeepLab v3+ Networks
The DeepLab v3+ model is built with the PyTorch framework, with Xception and ResNet selected as backbone networks for comparison, and trained on the AutoDL platform using an NVIDIA V100 GPU. During training, the input image size is uniformly set to 512 × 512, Stochastic Gradient Descent (SGD) is chosen as the optimizer, and cross-entropy is used as the loss function. The initial value of the base learning rate (
base_lr) is set to 0.01, and its adjustment formula is shown in Formula (16).
Here, the iter parameter denotes the current iteration number and max_iter the maximum number of iterations. As Formula (16) indicates, the power parameter controls the shape of the learning-rate curve, as shown in
Figure 21.
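Formula (16) itself is not reproduced in this excerpt; however, the description (base_lr, iter, max_iter, and a power exponent shaping the curve) matches the standard poly decay schedule commonly used with DeepLab, sketched below under that assumption.

```python
def poly_lr(base_lr, it, max_iter, power=0.9):
    """Poly learning-rate decay: lr = base_lr * (1 - it / max_iter) ** power.
    power = 1 gives linear decay; power < 1 keeps the rate higher early on
    and drops it faster as training approaches max_iter."""
    return base_lr * (1.0 - it / max_iter) ** power

# With base_lr = 0.01 as in the text:
# poly_lr(0.01, 0, 10000)     -> 0.01 (starts at base_lr)
# poly_lr(0.01, 10000, 10000) -> 0.0  (decays to zero at max_iter)
```

PyTorch exposes the same schedule as `torch.optim.lr_scheduler.PolynomialLR`, so the helper above is only needed when implementing the update by hand.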
Five DeepLab v3+ networks were compared in this study, differing in the choice of backbone and loss function: Xception, ResNet-CE, DenseNet121, ResNet-Focal, and ResDense-Focal. Their performance is presented in
Table 7.
Figure 22a, Figure 23a and Figure 24a show the original sample images; Figure 22b, Figure 23b and Figure 24b show the segmentation results of Xception; Figure 22c, Figure 23c and Figure 24c those of DenseNet; Figure 22d, Figure 23d and Figure 24d those of ResNet-CE; Figure 22e, Figure 23e and Figure 24e those of ResNet-Focal; and Figure 22f, Figure 23f and Figure 24f those of ResDense-Focal.
From the visualization results of simple samples, the target objects in the images are relatively clear, and the performance of all models is largely comparable. For medium-complexity samples, the Xception model performs slightly worse. In contrast, DenseNet and ResNet yield results that are more consistent with the original image branches, which may be attributed to the fact that the training of Xception has not yet fully converged. In the visualization results of complex samples, the background becomes more cluttered and the lychee branches appear thinner and more challenging to segment. Under these conditions, the Xception, DenseNet, and ResNet-CE models lose a substantial amount of detail. The ResNet-Focal model shows moderate improvement; however, compared with the other networks, the ResDense-Focal model produces predictions with more complete structural details and significantly better segmentation performance.
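The advantage of the Focal-loss variants on thin branches is consistent with how focal loss down-weights well-classified pixels: with the background class dominating branch images, cross-entropy is swamped by easy negatives, whereas the (1 − p_t)^γ factor concentrates the loss on hard branch pixels. A minimal per-pixel binary form is sketched below, with α and γ at their common defaults (the exact settings used in the experiments are not given in this excerpt).

```python
from math import log

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).
    p: predicted foreground probability; y: ground-truth label (0 or 1).
    gamma = 0 reduces to alpha-weighted cross-entropy."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * log(p_t)
```

A confidently classified branch pixel (p = 0.9) contributes orders of magnitude less loss than a hard one (p = 0.1), which is exactly the re-weighting that helps thin, minority-class branch structures survive against a cluttered background.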
3.4. Pre-Location of Lychee Picking Points on Images
The result obtained in
Figure 16 is used to determine the maximum circumscribed rectangle of the lychee cluster. This rectangle is moved upward pixel by pixel and intersected with the branches obtained through semantic segmentation, and the horizontal distance between the intersection points on the inner sides of the branches is calculated, as shown in
Figure 25.
In
Figure 25a, the blue dotted line indicates the horizontal distance between the points where the circumscribed rectangle of the lychee cluster intersects the inner sides of the branches. As the rectangle moves vertically upward, this distance gradually decreases to 0, meaning the intersection points coincide; this location is taken as the picking point of the lychee cluster. When the horizontal distance does not decrease to 0 as the rectangle moves vertically upward, as shown in
Figure 25b, the highest point of the branch detected in the image is then regarded as the picking point.
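The row-scanning procedure described above can be sketched as follows. Here `mask` is a binary branch mask from segmentation and `box` gives the column range and top edge of the cluster's circumscribed rectangle; both the data layout and the fallback behaviour are simplifying assumptions for illustration, not the paper's exact implementation.

```python
def find_picking_point(mask, box):
    """Scan upward from the top edge of the cluster's bounding rectangle.
    At each row, measure the horizontal gap between branch pixels inside
    the rectangle's column range; the first row where the gap closes is
    the picking point. If it never closes, fall back to the highest
    branch pixel in the image (the Figure 25b case).
    mask: 2D list of 0/1 (1 = branch); box: (x_min, y_top, x_max)."""
    x_min, y_top, x_max = box
    for row in range(y_top, -1, -1):       # move upward one pixel at a time
        cols = [c for c in range(x_min, x_max + 1) if mask[row][c]]
        if len(cols) >= 2 and max(cols) > min(cols):
            continue                       # branch sides still separated
        if cols:                           # intersection points coincide
            return (row, cols[0])
    for row in range(len(mask)):           # fallback: highest branch pixel
        cols = [c for c, v in enumerate(mask[row]) if v]
        if cols:
            return (row, cols[0])
    return None

# A synthetic inverted-V branch whose sides meet at (row 2, col 10):
mask = [[0] * 21 for _ in range(15)]
mask[2][10] = 1
for r in range(3, 13):
    mask[r][10 - (r - 2)] = mask[r][10 + (r - 2)] = 1
```

On the synthetic mask, scanning upward from row 12 the gap shrinks by two pixels per row until the sides meet at the apex, which the function returns as the picking point.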
Although the proposed method performs well in most scenarios, several failure cases were observed. These failures primarily occur under conditions of severe branch occlusion, incomplete segmentation of thin branch structures, or complex background interference. In such situations, the estimated branch geometry may deviate slightly from the actual structure, leading to displacement of the computed cutting point.
Figure 25 illustrates representative examples of successful and failed localization cases. The resulting variation in the computed picking-point coordinates remained within 4.2 pixels on average, indicating that the geometric computation method is relatively robust to moderate segmentation noise. The quantitative evaluation results are presented in
Table 8. The results demonstrate that the proposed geometric picking-point computation method achieves high localization accuracy, with the majority of predicted points falling within the acceptable tolerance range for robotic harvesting operations.
3.5. The Performance Analysis
To evaluate the real-time capability of the proposed framework, inference speed was measured on the AutoDL platform using an NVIDIA V100 GPU with batch size set to 1. The average inference time per image was 18.6 ms, corresponding to approximately 53.7 FPS. These results indicate that the proposed YOLO-SCM model satisfies real-time requirements for robotic harvesting applications. The model’s complexity is shown in
Table 9. Considering that the NVIDIA Jetson Orin Nano provides up to 40 TOPS of AI computing power while the proposed model requires approximately 23 GFLOPs per inference, the proposed framework is theoretically feasible for deployment on embedded robotic platforms in future applications. Even accounting for practical efficiency losses, the estimated runtime satisfies real-time harvesting requirements.
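The feasibility argument above is simple arithmetic: a rough upper-bound frame rate is the accelerator throughput divided by the per-inference workload, scaled by a practical efficiency factor. This deliberately ignores that TOPS usually refers to INT8 throughput while GFLOPs counts floating-point operations, so the result is only an order-of-magnitude estimate.

```python
def theoretical_fps(tops, gflops_per_inference, efficiency=1.0):
    """Upper-bound inference rate: accelerator throughput (ops/s) divided
    by per-inference workload (FLOPs), scaled by an efficiency factor."""
    return tops * 1e12 * efficiency / (gflops_per_inference * 1e9)

# theoretical_fps(40, 23)       -> about 1739 FPS ceiling
# theoretical_fps(40, 23, 0.1)  -> about 174 FPS at 10% efficiency
```

Even at an assumed 10% practical efficiency, the estimate comfortably exceeds the ~53.7 FPS measured on the V100, supporting the embedded-deployment claim.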
To quantify the contribution of each component in the proposed framework, an ablation study was conducted by progressively integrating the detection, clustering, and branch segmentation modules. The results in
Table 10 show that the YOLO-SCM detection module significantly improves fruit detection accuracy compared with the baseline model. The introduction of density-based clustering enables effective grouping of dispersed fruit clusters, while branch segmentation further enables accurate picking-point localization. These results demonstrate that each module contributes to the overall system performance, confirming the effectiveness of the proposed integrated perception framework.
3.6. Limitations and Future Work
Although the proposed framework achieves satisfactory performance in lychee detection, cluster analysis, and picking point localization, several limitations should be acknowledged.
First, the current study is conducted primarily in the two-dimensional image domain. Although the theoretical procedure for mapping image coordinates to three-dimensional space is described, no depth-sensing hardware (e.g., RGB-D cameras or stereo vision systems) was employed for experimental validation. Therefore, the 3D spatial accuracy of picking point localization and its compatibility with real robotic manipulation remain to be verified [
30,
31].
Second, the dataset was collected from a single orchard within one harvesting season. Although diverse illumination conditions, occlusion scenarios, and multiple cultivars were included, variations in orchard structure, camera devices, and broader environmental conditions were not comprehensively covered. Therefore, the generalization ability of the proposed model may be limited when directly applied to significantly different agricultural environments. Future work will focus on cross-location and cross-device validation to further evaluate and enhance the model’s robustness.
Third, the clustering-based priority harvesting strategy relies on detection outputs and spatial distribution characteristics. Although MeanShift achieved superior ARI performance compared to k-means, density-based clustering may introduce additional computational overhead when scaling to large orchard scenes or real-time robotic systems.
Finally, this study focuses on the visual perception stage of robotic harvesting. The integration of the proposed perception framework with motion planning, manipulator control, and end-effector force regulation has not yet been experimentally implemented. Future research will therefore concentrate on: (1) Integrating depth sensing to achieve accurate 3D localization and coordinate transformation; (2) Constructing a larger multi-source dataset to enhance model robustness and generalization; (3) Optimizing computational efficiency for real-time deployment; and (4) Validating the complete perception–planning–execution pipeline on a physical lychee harvesting robot.
4. Conclusions
The current framework performs 2D visual perception. In practical robotic systems, depth sensing (e.g., RGB-D or stereo vision) would be integrated to enable 3D spatial localization of cutting points. The proposed framework can be directly extended by mapping detected cutting points to 3D coordinates through camera calibration and depth estimation. The proposed research framework is extensible to the harvesting of other cluster fruits, such as longan and cherry tomatoes, and provides valuable references and a practical foundation for realizing intelligent harvesting of complex cluster fruits. To a certain extent, this work contributes to accelerating the mechanization, automation, and intelligent transformation of agricultural production.
This study presents an integrated visual perception framework for lychee fruit detection and picking-point localization in natural orchard environments. Addressing the challenges of dispersed cluster distribution, illumination variation, and branch occlusion, an improved object detection model, YOLO-SCM, was developed based on YOLO11s. By incorporating the SimAM attention mechanism, CMUNeXt large-kernel depthwise separable convolution, and MPDIoU loss function, the model demonstrated enhanced feature extraction capability and regression accuracy. Experimental results showed that YOLO-SCM achieved a precision of 84.3%, recall of 73.2%, and mAP of 81.6%, outperforming the baseline model under various lighting and occlusion conditions.
To determine priority harvesting regions, clustering algorithms were employed to group detected fruits. Comparative experiments indicated that density-based clustering (MeanShift) achieved the highest average ARI value (0.768), demonstrating superior adaptability to irregular cluster distributions. For branch segmentation, an improved DeepLab v3+ model incorporating a ResDense-Focal backbone was proposed. The enhanced segmentation framework achieved superior mIoU (0.797248) and fwIoU (0.981818) performance compared with conventional backbone networks, enabling more accurate extraction of fruit-bearing branches in complex backgrounds. By integrating object detection, clustering analysis, and semantic segmentation, a two-dimensional picking point localization strategy was established. The proposed method provides a systematic solution for cluster-based fruit harvesting and offers technical support for the development of intelligent lychee-picking robots.
Overall, this research contributes to improving the accuracy and robustness of visual perception in cluster fruit harvesting and lays a foundation for the practical implementation of automated lychee harvesting systems in complex natural environments.