1. Introduction
With the ongoing modernization and scaling of agricultural production in China, traditional manual harvesting and conventional production practices are increasingly insufficient to meet the demands of large-scale fruit cultivation. Statistical reports indicate that labor costs for fruit harvesting account for approximately 35–40% of the total production cost, while the overall mechanized harvesting rate remains as low as 2.33% [
1]. Lychee, which originated in China, is predominantly cultivated in Guangdong Province. According to official agricultural statistics and industry forecasts, China’s total lychee production reached 1.78 million tons in 2024 [
2], as shown in
Figure 1. As a time-sensitive operation, harvesting represents a crucial stage in the fruit production cycle and directly affects yield, product quality, and economic returns. Therefore, the development of efficient and intelligent harvesting technologies has become an urgent need in modern orchard management.
The development of automated lychee-picking robots is of considerable practical significance, as such systems have the potential to substantially improve harvesting efficiency while reducing labor dependency and production costs. In robotic harvesting, accurate fruit detection, spatial localization, and obstacle avoidance constitute essential perception tasks that enable precise grasping and reliable execution of the harvesting operation. To date, most vision-based fruit-harvesting research has focused on single fruits or densely clustered fruit groups characterized by distinct visual features such as color, shape, and texture, as well as relatively regular growth patterns or clearly defined contours. However, comparatively limited attention has been given to fruit clusters with dispersed spatial distribution and picking points located on fruit-bearing branches. Existing studies addressing such scenarios are scarce and often restricted to individual fruit clusters or semi-mechanized vibration-based harvesting approaches [
3], which may cause mechanical damage to both the fruit and the tree structure. Meanwhile, extensive research efforts—both in China and internationally—have been devoted to robotic harvesting of apples, citrus fruits, lychees, tomatoes, cucumbers, and other crops by integrating advances in computer vision, robotics, and intelligent control technologies [
4]. The accurate recognition and localization of multiple fruit clusters in complex natural environments have become a prominent research focus. Typical cluster fruits such as lychee and longan generally command higher prices than many other fruits in subtropical regions. The research group led by Zou Xiangjun at South China Agricultural University investigated the operational behavior of a lychee-harvesting robot through virtual-reality-based simulation analysis and, based on this study, developed a prototype lychee-picking robotic system, as illustrated in
Figure 2. The system integrates a binocular vision module mounted on a six-degree-of-freedom industrial robotic arm produced by Guangzhou CNC, enabling fruit detection and spatial localization. An innovative end-effector mechanism, composed of dual gripping fingers and an eccentrically actuated cutting blade, was designed and driven by a motor to accomplish the harvesting action. Field experiments conducted in outdoor orchard environments reported a picking success rate of 78%, demonstrating the feasibility of the proposed robotic harvesting approach [
5,
6].
Target detection and localization technologies form the foundation of vision-based fruit positioning systems. However, their performance is highly susceptible to environmental variability in orchard settings, such as fluctuating illumination conditions, complex canopy occlusion, and substantial variations in fruit color, size, and morphology [
7,
8]. Existing studies have explored various strategies to improve detection and segmentation performance. Traditional image processing methods, such as double Otsu thresholding combined with k-means clustering, have been applied for lychee fruit and stem segmentation in field images. With the advancement of deep learning, object detection algorithms such as YOLO have been introduced into robotic harvesting systems for fruit localization. Furthermore, RGB-D-based neural networks, such as DaSNet-v1, have been proposed for simultaneous fruit and branch segmentation in orchard environments. For small-target detection in UAV imagery [
9], improved SSD-based models have been developed to enhance lychee recognition performance. Although these approaches have achieved encouraging results, several limitations remain. First, most studies primarily focus on single fruits or densely distributed clusters, while dispersed fruit clusters with irregular spatial structures have received limited attention [
10]. Second, detection and segmentation tasks are typically treated independently, without integrating fruit identification and branch segmentation for accurate cutting-point localization. Therefore, a unified framework that integrates robust cluster detection with precise branch segmentation for cutting-point computation remains lacking [
11]. Addressing these limitations is essential for enabling reliable robotic harvesting of dispersed lychee clusters.
Branch segmentation plays a crucial role in robotic harvesting, as cutting-point localization relies on accurate extraction of fruit-bearing branches. However, segmenting thin and irregular branch structures in natural orchard environments remains challenging due to illumination variation, occlusion, and background clutter. Existing semantic segmentation networks, such as U-Net, DeepLab v3+, and transformer-based architectures, have been applied in agricultural scenarios, but they often struggle with thin-structure continuity and small-scale object preservation [
12]. Therefore, enhancing feature representation for fine-grained branch extraction remains an open problem in robotic perception.
This study focuses on the visual perception stage of robotic harvesting systems, with lychee clusters selected as the target objects. Two challenges dominate this setting: severe occlusion and overlapping fruit clusters cause unstable detection and cluster grouping, and complex, thin branch structures are difficult to segment reliably against cluttered backgrounds. To address these issues, a lightweight YOLO-SCM architecture incorporating SimAM attention and CMUNeXt modules is proposed to enhance feature representation while maintaining efficiency [
13], and the MPDIoU loss is integrated to improve bounding-box regression under occlusion and irregular fruit distribution [
14]. A complete pipeline connects detection, clustering, segmentation, and picking-point localization tailored to robotic harvesting scenarios. By integrating object detection for fruit identification and semantic segmentation for branch extraction, the cutting points are systematically determined to support cutting-based harvesting, thereby improving fruit integrity and operational efficiency.
The main contributions of this study can be summarized as follows:
A unified perception framework is proposed for dispersed lychee clusters in natural orchard environments, integrating fruit detection, density-based clustering, branch semantic segmentation, and cutting-point localization into a complete robotic harvesting pipeline.
A lightweight YOLO-SCM detection architecture is developed by incorporating SimAM attention and CMUNeXt modules, together with MPDIoU loss, to enhance feature representation and improve robustness under occlusion and irregular fruit distribution.
A density-based clustering strategy is introduced to analyze the spatial distribution of detected fruits and automatically determine an adaptive harvesting sequence.
A semantic segmentation approach tailored for thin fruit-bearing branch extraction is designed to enable accurate cutting-point computation, supporting cutting-based harvesting operations and improving fruit integrity.
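The perception pipeline outlined in these contributions can be summarized as a minimal orchestration skeleton. All four stage functions below are hypothetical stand-ins (the actual system uses YOLO-SCM, MeanShift clustering, DeepLab v3+, and a geometric cutting-point computation); the sketch only illustrates how the stage outputs connect.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    x: float  # box centre, image coordinates
    y: float
    w: float
    h: float

# All four stage functions are hypothetical stand-ins; the real system
# uses YOLO-SCM, MeanShift, DeepLab v3+, and a geometric computation.
def detect_fruits(image):                      # YOLO-SCM stand-in
    return [Detection(100, 200, 30, 30), Detection(120, 210, 30, 30)]

def cluster_fruits(detections, bandwidth=50):  # density clustering stand-in
    return {0: detections}                     # one cluster for this demo

def segment_branches(image):                   # DeepLab v3+ stand-in
    return [[0] * 4 for _ in range(4)]         # dummy binary branch mask

def locate_picking_point(cluster, branch_mask):  # geometric stage stand-in
    xs = [d.x for d in cluster]
    return (sum(xs) / len(xs), min(d.y for d in cluster))

def harvest_pipeline(image):
    """Detection -> clustering -> segmentation -> picking-point localization."""
    detections = detect_fruits(image)
    clusters = cluster_fruits(detections)
    branch_mask = segment_branches(image)
    return [locate_picking_point(c, branch_mask) for c in clusters.values()]
```

The dataflow, not the stub bodies, is the point: each cluster produced by the grouping stage is paired with the branch mask to yield one picking point per cluster.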
3. Experimental Results and Discussion
3.1. YOLO-SCM Network Performance Evaluation
Experimental results show that the enhanced YOLO-SCM model performs strongly in lychee detection, achieving a precision of 84.3%, a recall of 73.2%, and a mean average precision (mAP) of 81.6%. A comparison of the precision, recall, mAP, and loss curves of YOLO11s and YOLO-SCM (as shown in
Figure 18) indicates clear improvements in all three key metrics (precision, recall, and mAP) for the enhanced model. These improvements validate the stronger detection capability and practical value of YOLO-SCM for lychee object detection tasks.
As illustrated in
Figure 19, both the training and validation losses decrease steadily throughout training and converge to stable values without noticeable divergence in the later epochs. These results indicate that the proposed model achieves stable convergence and effective overfitting control despite the relatively limited dataset size. The absence of significant performance degradation on the validation set suggests that the model maintains acceptable generalization capability within the collected orchard scenarios.
To evaluate the detection performance of the YOLO11s and YOLO-SCM models on lychee fruits under natural orchard conditions, 521 images under varying lighting conditions and 279 images under different occlusion scenarios were randomly selected from the original dataset for testing. The 521 lighting-condition images contain 5463 annotated lychee fruits: 122 low-light images with 1387 fruits, 235 strong-light images with 2456 fruits, and 164 shadowed images with 1620 fruits. The occlusion set consists of 154 images with fruit occlusion (638 annotated fruits) and 125 images with branch-and-leaf occlusion (287 annotated fruits). The models’ performance under varying lighting conditions is summarized in
Table 2, while their performance under different occlusion conditions is presented in
Table 3. The experimental results show that the improved YOLO-SCM model consistently achieves higher precision, recall, and F1-score under all lighting and occlusion conditions, indicating stronger robustness and fewer false detections.
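The F1-scores reported in Tables 2 and 3 are the harmonic mean of precision and recall; for the overall figures above (P = 84.3%, R = 73.2%) this works out to roughly 78.4%, which can be checked with a one-liner:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (percentages in, percentage out)."""
    return 2 * precision * recall / (precision + recall)

# f1_score(84.3, 73.2) -> about 78.4
```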
To evaluate the statistical robustness of the proposed model, we conducted multiple independent training runs using different random seeds (0, 42, and 100). The mean and standard deviation of key metrics are reported in
Table 4. The lower standard deviation of YOLO-SCM indicates improved training stability compared to the baseline model.
To further assess the impact of the proposed enhancements to YOLO-SCM, ablation experiments were performed on the test set. These experiments provide a clearer understanding of the specific contributions of each module to the overall performance enhancement. The results of the ablation study are presented in
Table 5.
3.2. Clustering Performance Evaluation
The results of the k-means algorithm based on different k values and centroid selections are shown in
Table 6 and the effect diagrams are presented in
Figure 20.
The statistical results show that the highest average ARI is achieved at k = 8; increasing k further causes the ARI to decrease. However, the algorithm depends heavily on the choice of k and on the distribution of image features, so for some images in the table the ARI is higher at k = 4 than at k = 8. At k = 16, categories that were correctly grouped at k = 8 are split in two, which lowers the ARI. For the k-means algorithm, therefore, the choice of k and of the initial centroids is particularly important.
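The Adjusted Rand Index used throughout this comparison can be computed directly from the pair-counting contingency table. The sketch below is a minimal pure-Python version (in practice scikit-learn's `adjusted_rand_score` is the usual choice); it also reproduces the over-splitting effect described above, where dividing a correct cluster in two lowers the ARI even though no fruit is assigned to a wrong group.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index computed from the pair-counting contingency table."""
    n = len(labels_true)
    contingency = Counter(zip(labels_true, labels_pred))
    sum_ij = sum(comb(c, 2) for c in contingency.values())      # agreeing pairs
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / comb(n, 2)   # chance-level agreement
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:               # degenerate trivial partitions
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

For example, `adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])` returns 1.0 (ARI is invariant to label permutation), while splitting one of two true clusters of four fruits in half yields roughly 0.70.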
To analyze the sensitivity of clustering performance to the bandwidth parameter in the MeanShift algorithm, experiments were conducted with bandwidth values ranging from 20 to 60 pixels at intervals of 10 pixels. The results show that ARI remains stable within the range of 30–50 pixels, indicating that the clustering performance is not highly sensitive to moderate bandwidth variations. To quantitatively evaluate clustering performance, ground-truth cluster labels were manually constructed based on spatial proximity and structural continuity of fruit-bearing branches within each image. A fruit cluster was defined as a group of lychee fruits connected by visible branches or exhibiting dense spatial aggregation. Two authors independently assigned cluster membership labels according to the spatial distribution of detected fruit centers. Discrepancies were resolved through discussion to obtain consensus annotations. To assess annotation consistency, Cohen’s Kappa coefficient was calculated. The obtained Kappa value of 0.87 indicates strong agreement between annotators, demonstrating the reliability of the cluster ground truth.
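The flat-kernel MeanShift grouping evaluated here can be sketched in a few lines: every detected fruit centre is shifted to the mean of its neighbours within the bandwidth until it converges on a mode, and modes closer than the bandwidth are merged into one cluster. This is an illustrative pure-Python version, not the implementation used in the experiments.

```python
def mean_shift_2d(points, bandwidth, max_iter=50, tol=1e-3):
    """Flat-kernel MeanShift on 2D points (e.g., detected fruit centres)."""
    modes = [list(p) for p in points]
    for m in modes:
        for _ in range(max_iter):
            # Neighbours of the current mode within the bandwidth radius.
            nbrs = [p for p in points
                    if (p[0] - m[0]) ** 2 + (p[1] - m[1]) ** 2 <= bandwidth ** 2]
            nx = sum(p[0] for p in nbrs) / len(nbrs)
            ny = sum(p[1] for p in nbrs) / len(nbrs)
            if abs(nx - m[0]) < tol and abs(ny - m[1]) < tol:
                break                      # converged on a density mode
            m[0], m[1] = nx, ny
    # Merge converged modes closer than the bandwidth into one cluster.
    centers, labels = [], []
    for m in modes:
        for i, c in enumerate(centers):
            if (c[0] - m[0]) ** 2 + (c[1] - m[1]) ** 2 <= bandwidth ** 2:
                labels.append(i)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return centers, labels
```

Unlike k-means, the number of clusters is not specified in advance; it emerges from the bandwidth, which is why the sensitivity analysis above varies only that single parameter.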
3.3. Comparative Experiment of Five DeepLab v3+ Networks
The DeepLab v3+ model is built with the PyTorch framework, with Xception and ResNet selected as backbone networks for comparison, and trained on the AutoDL platform using an NVIDIA V100 GPU. During training, the input image size is uniformly set to 512 × 512, Stochastic Gradient Descent (SGD) is chosen as the optimizer, and cross-entropy is used as the loss function. The initial value of the base learning rate (
base_lr) is set to 0.01, and its adjustment formula is shown in Formula (16).
Here, the iter parameter denotes the current iteration number and max_iter the maximum number of iterations. As Formula (16) indicates, the power parameter controls the shape of the learning-rate curve, as shown in
Figure 21.
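Formula (16) itself is not reproduced in this excerpt; however, the description (base_lr, iter, max_iter, and a power exponent shaping the curve) matches the standard poly decay schedule commonly used with DeepLab, sketched below under that assumption.

```python
def poly_lr(base_lr, it, max_iter, power=0.9):
    """Poly learning-rate decay: lr = base_lr * (1 - it / max_iter) ** power.
    power = 1 gives linear decay; power < 1 keeps the rate higher early on
    and drops it faster as training approaches max_iter."""
    return base_lr * (1.0 - it / max_iter) ** power

# With base_lr = 0.01 as in the text:
# poly_lr(0.01, 0, 10000)     -> 0.01 (starts at base_lr)
# poly_lr(0.01, 10000, 10000) -> 0.0  (decays to zero at max_iter)
```

PyTorch exposes the same schedule as `torch.optim.lr_scheduler.PolynomialLR`, so the helper above is only needed when implementing the update by hand.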
Five DeepLab v3+ networks were compared in this study, differing in the choice of backbone and loss function: Xception, ResNet-CE, DenseNet121, ResNet-Focal, and ResDense-Focal. Their performance is presented in
Table 7.
Figure 22a, Figure 23a and Figure 24a show the original sample images; Figure 22b, Figure 23b and Figure 24b show the segmentation results of Xception; Figure 22c, Figure 23c and Figure 24c those of DenseNet; Figure 22d, Figure 23d and Figure 24d those of ResNet-CE; Figure 22e, Figure 23e and Figure 24e those of ResNet-Focal; and Figure 22f, Figure 23f and Figure 24f those of ResDense-Focal.
From the visualization results of simple samples, the target objects in the images are relatively clear, and the performance of all models is largely comparable. For medium-complexity samples, the Xception model performs slightly worse. In contrast, DenseNet and ResNet yield results that are more consistent with the original image branches, which may be attributed to the fact that the training of Xception has not yet fully converged. In the visualization results of complex samples, the background becomes more cluttered and the lychee branches appear thinner and more challenging to segment. Under these conditions, the Xception, DenseNet, and ResNet-CE models lose a substantial amount of detail. The ResNet-Focal model shows moderate improvement; however, compared with the other networks, the ResDense-Focal model produces predictions with more complete structural details and significantly better segmentation performance.
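The advantage of the Focal-loss variants on thin branches is consistent with how focal loss down-weights well-classified pixels: with the background class dominating branch images, cross-entropy is swamped by easy negatives, whereas the (1 − p_t)^γ factor concentrates the loss on hard branch pixels. A minimal per-pixel binary form is sketched below, with α and γ at their common defaults (the exact settings used in the experiments are not given in this excerpt).

```python
from math import log

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t).
    p: predicted foreground probability; y: ground-truth label (0 or 1).
    gamma = 0 reduces to alpha-weighted cross-entropy."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * log(p_t)
```

A confidently classified branch pixel (p = 0.9) contributes orders of magnitude less loss than a hard one (p = 0.1), which is exactly the re-weighting that helps thin, minority-class branch structures survive against a cluttered background.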
3.4. Pre-Location of Lychee Picking Points on Images
The result obtained in
Figure 16 is used to determine the maximum circumscribed rectangle of the lychee cluster. This rectangle is moved upward pixel by pixel and intersected with the branches obtained through semantic segmentation, and the horizontal distance between the intersection points on the inner sides of the branches is calculated, as shown in
Figure 25.
In
Figure 25a, the blue dotted line indicates the horizontal distance between the points where the circumscribed rectangle of the lychee cluster intersects the inner sides of the branches. As the rectangle moves vertically upward, this distance gradually decreases to 0, meaning the intersection points coincide; this location is taken as the picking point of the lychee cluster. When the horizontal distance does not decrease to 0 as the rectangle moves vertically upward, as shown in
Figure 25b, the highest point of the branch detected in the image is then regarded as the picking point.
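The row-scanning procedure described above can be sketched as follows. Here `mask` is a binary branch mask from segmentation and `box` gives the column range and top edge of the cluster's circumscribed rectangle; both the data layout and the fallback behaviour are simplifying assumptions for illustration, not the paper's exact implementation.

```python
def find_picking_point(mask, box):
    """Scan upward from the top edge of the cluster's bounding rectangle.
    At each row, measure the horizontal gap between branch pixels inside
    the rectangle's column range; the first row where the gap closes is
    the picking point. If it never closes, fall back to the highest
    branch pixel in the image (the Figure 25b case).
    mask: 2D list of 0/1 (1 = branch); box: (x_min, y_top, x_max)."""
    x_min, y_top, x_max = box
    for row in range(y_top, -1, -1):       # move upward one pixel at a time
        cols = [c for c in range(x_min, x_max + 1) if mask[row][c]]
        if len(cols) >= 2 and max(cols) > min(cols):
            continue                       # branch sides still separated
        if cols:                           # intersection points coincide
            return (row, cols[0])
    for row in range(len(mask)):           # fallback: highest branch pixel
        cols = [c for c, v in enumerate(mask[row]) if v]
        if cols:
            return (row, cols[0])
    return None

# A synthetic inverted-V branch whose sides meet at (row 2, col 10):
mask = [[0] * 21 for _ in range(15)]
mask[2][10] = 1
for r in range(3, 13):
    mask[r][10 - (r - 2)] = mask[r][10 + (r - 2)] = 1
```

On the synthetic mask, scanning upward from row 12 the gap shrinks by two pixels per row until the sides meet at the apex, which the function returns as the picking point.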
Although the proposed method performs well in most scenarios, several failure cases were observed. These failures primarily occur under conditions of severe branch occlusion, incomplete segmentation of thin branch structures, or complex background interference. In such situations, the estimated branch geometry may deviate slightly from the actual structure, leading to displacement of the computed cutting point.
Figure 25 illustrates representative examples of successful and failed localization cases. The resulting variation in the computed picking-point coordinates remained within 4.2 pixels on average, indicating that the geometric computation method is relatively robust to moderate segmentation noise. The quantitative evaluation results are presented in
Table 8. The results demonstrate that the proposed geometric picking-point computation method achieves high localization accuracy, with the majority of predicted points falling within the acceptable tolerance range for robotic harvesting operations.
3.5. The Performance Analysis
To evaluate the real-time capability of the proposed framework, inference speed was measured on the AutoDL platform using an NVIDIA V100 GPU with batch size set to 1. The average inference time per image was 18.6 ms, corresponding to approximately 53.7 FPS. These results indicate that the proposed YOLO-SCM model satisfies real-time requirements for robotic harvesting applications. The model’s complexity is shown in
Table 9. Considering that the NVIDIA Jetson Orin Nano provides up to 40 TOPS of AI computing power while the proposed model requires approximately 23 GFLOPs per inference, the proposed framework is theoretically feasible for deployment on embedded robotic platforms in future applications. Even accounting for practical efficiency losses, the estimated runtime satisfies real-time harvesting requirements.
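The feasibility argument above is simple arithmetic: a rough upper-bound frame rate is the accelerator throughput divided by the per-inference workload, scaled by a practical efficiency factor. This deliberately ignores that TOPS usually refers to INT8 throughput while GFLOPs counts floating-point operations, so the result is only an order-of-magnitude estimate.

```python
def theoretical_fps(tops, gflops_per_inference, efficiency=1.0):
    """Upper-bound inference rate: accelerator throughput (ops/s) divided
    by per-inference workload (FLOPs), scaled by an efficiency factor."""
    return tops * 1e12 * efficiency / (gflops_per_inference * 1e9)

# theoretical_fps(40, 23)       -> about 1739 FPS ceiling
# theoretical_fps(40, 23, 0.1)  -> about 174 FPS at 10% efficiency
```

Even at an assumed 10% practical efficiency, the estimate comfortably exceeds the ~53.7 FPS measured on the V100, supporting the embedded-deployment claim.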
To quantify the contribution of each component in the proposed framework, an ablation study was conducted by progressively integrating the detection, clustering, and branch segmentation modules. The results in
Table 10 show that the YOLO-SCM detection module significantly improves fruit detection accuracy compared with the baseline model. The introduction of density-based clustering enables effective grouping of dispersed fruit clusters, while branch segmentation further enables accurate picking-point localization. These results demonstrate that each module contributes to the overall system performance, confirming the effectiveness of the proposed integrated perception framework.
3.6. Limitations and Future Work
Although the proposed framework achieves satisfactory performance in lychee detection, cluster analysis, and picking point localization, several limitations should be acknowledged.
First, the current study is conducted primarily in the two-dimensional image domain. Although the theoretical procedure for mapping image coordinates to three-dimensional space is described, no depth-sensing hardware (e.g., RGB-D cameras or stereo vision systems) was employed for experimental validation. Therefore, the 3D spatial accuracy of picking point localization and its compatibility with real robotic manipulation remain to be verified [
30,
31].
Second, the dataset was collected from a single orchard within one harvesting season. Although diverse illumination conditions, occlusion scenarios, and multiple cultivars were included, variations in orchard structure, camera devices, and broader environmental conditions were not comprehensively covered. Therefore, the generalization ability of the proposed model may be limited when directly applied to significantly different agricultural environments. Future work will focus on cross-location and cross-device validation to further evaluate and enhance the model’s robustness.
Third, the clustering-based priority harvesting strategy relies on detection outputs and spatial distribution characteristics. Although MeanShift achieved superior ARI performance compared to k-means, density-based clustering may introduce additional computational overhead when scaling to large orchard scenes or real-time robotic systems.
Finally, this study focuses on the visual perception stage of robotic harvesting. The integration of the proposed perception framework with motion planning, manipulator control, and end-effector force regulation has not yet been experimentally implemented. Future research will therefore concentrate on: (1) Integrating depth sensing to achieve accurate 3D localization and coordinate transformation; (2) Constructing a larger multi-source dataset to enhance model robustness and generalization; (3) Optimizing computational efficiency for real-time deployment; and (4) Validating the complete perception–planning–execution pipeline on a physical lychee harvesting robot.
4. Conclusions
The current framework performs 2D visual perception. In practical robotic systems, depth sensing (e.g., RGB-D or stereo vision) would be integrated to enable 3D spatial localization of cutting points. The proposed framework can be directly extended by mapping detected cutting points to 3D coordinates through camera calibration and depth estimation. The proposed research framework is extensible to the harvesting of other cluster fruits, such as longan and cherry tomatoes, and provides valuable references and a practical foundation for realizing intelligent harvesting of complex cluster fruits. To a certain extent, this work contributes to accelerating the mechanization, automation, and intelligent transformation of agricultural production.
This study presents an integrated visual perception framework for lychee fruit detection and picking-point localization in natural orchard environments. Addressing the challenges of dispersed cluster distribution, illumination variation, and branch occlusion, an improved object detection model, YOLO-SCM, was developed based on YOLO11s. By incorporating the SimAM attention mechanism, CMUNeXt large-kernel depthwise separable convolution, and MPDIoU loss function, the model demonstrated enhanced feature extraction capability and regression accuracy. Experimental results showed that YOLO-SCM achieved a precision of 84.3%, recall of 73.2%, and mAP of 81.6%, outperforming the baseline model under various lighting and occlusion conditions.
To determine priority harvesting regions, clustering algorithms were employed to group detected fruits. Comparative experiments indicated that density-based clustering (MeanShift) achieved the highest average ARI value (0.768), demonstrating superior adaptability to irregular cluster distributions. For branch segmentation, an improved DeepLab v3+ model incorporating a ResDense-Focal backbone was proposed. The enhanced segmentation framework achieved superior mIoU (0.797248) and fwIoU (0.981818) performance compared with conventional backbone networks, enabling more accurate extraction of fruit-bearing branches in complex backgrounds. By integrating object detection, clustering analysis, and semantic segmentation, a two-dimensional picking point localization strategy was established. The proposed method provides a systematic solution for cluster-based fruit harvesting and offers technical support for the development of intelligent lychee-picking robots.
Overall, this research contributes to improving the accuracy and robustness of visual perception in cluster fruit harvesting and lays a foundation for the practical implementation of automated lychee harvesting systems in complex natural environments.