Article

A Vision-Based Robot System with Grasping-Cutting Strategy for Mango Harvesting

College of Mechanical Engineering, Guangxi University, Nanning 530004, China
* Author to whom correspondence should be addressed.
Agriculture 2026, 16(1), 132; https://doi.org/10.3390/agriculture16010132
Submission received: 2 November 2025 / Revised: 19 December 2025 / Accepted: 26 December 2025 / Published: 4 January 2026

Abstract

Mango is the second most widely cultivated tropical fruit in the world. Its harvesting mainly relies on manual labor. During the harvest season, the hot weather leads to low working efficiency and high labor costs. Current research on automatic mango harvesting mainly focuses on locating the fruit stem harvesting point, followed by stem clamping and cutting. However, these methods are less effective when the stem is occluded. To address these issues, this study first acquires images of four mango varieties in a mixed cultivation orchard and builds a dataset. Mango detection and occlusion-state classification models are then established based on YOLOv11m and YOLOv8l-cls, respectively. The detection model achieves an AP0.5–0.95 (average precision at IoU = 0.50:0.05:0.95) of 90.21%, and the accuracy of the classification model is 96.9%. Second, based on the mango growth characteristics, detected mango bounding boxes and binocular vision, we propose a spatial localization method for the mango grasping point. Building on this, a mango-grasping and stem-cutting end-effector is designed. Finally, a mango harvesting robot system is developed, and verification experiments are carried out. The experimental results show that the harvesting method and procedure are well-suited for situations where the fruit stem is occluded, as well as for fruits with no occlusion or partial occlusion. The mango grasping success rate reaches 96.74%, the stem cutting success rate is 91.30%, and the fruit injury rate is less than 5%. The average image processing time is 119.4 ms. The results prove the feasibility of the proposed methods.

1. Introduction

Harvesting is a critical stage in fruit production and is typically labor-intensive. As the second largest tropical fruit in terms of global cultivation area, mango has become an important component of the agricultural economies of many tropical and subtropical regions [1,2,3]. However, mango harvesting still depends largely on manual labor, and labor expenses account for over 25% of the total production cost [4]. Most mango cultivation areas are located in tropical and subtropical climates, and the high temperatures during the fruit ripening period make manual harvesting arduous, with high labor intensity and low operational efficiency. Furthermore, the accelerating pace of urbanization has exacerbated the shortage of agricultural labor, resulting in persistently high harvesting costs. Therefore, developing mechanized mango harvesting technology has become an urgent need for the sustainable development of the industry. Conventional fruit harvesting machines typically rely on strong vibration or mechanical grabbing [5], which risks damaging both the fruit and the tree. Integrating artificial intelligence with mechanical equipment to develop harvesting robots with precise recognition and intelligent operation capabilities is becoming the mainstream direction for the mechanization of mango harvesting.
At present, research on fruit harvesting robots mainly focuses on core areas such as fruit detection, picking point localization and end-effector design [6,7]. Among them, fruit detection and picking point localization are mainly achieved through visual technology. Traditional visual algorithms extract features such as color, geometry, shape and texture, and use clustering, contour analysis, geometric fitting and other steps to segment and locate objects. For locating the picking points of grape clusters, Luo et al. performed image segmentation through improved artificial bee colony fuzzy clustering, delineated the regions of interest of the fruit stems based on morphology, and finally determined the picking points using geometric analysis [8]. Although this method achieved a localization accuracy of 88.33% in experiments, the variability in fruit color makes it difficult to design a unified image segmentation approach. Zhu et al. proposed a rapid detection and picking point localization method based on an improved K-means clustering algorithm and contour analysis. In experiments on locating picking points for grapes of different colors grown under different planting methods, the success rate ranged from 78% to 92% [9]. To address the separation and stem location of overlapping pomelos, Lin et al. proposed a progressive center location method, which achieved an average recognition rate of 94.02% in a test of 50 natural scene images [10]. This method can accurately identify pomelo stems that are not significantly obstructed and are relatively close to the camera.
Although traditional visual algorithms perform well under specific conditions, they rely on hand-crafted features and have limited generalization ability when the appearance of fruits is not uniform or the orchard environment is complex [7]. In recent years, convolutional neural networks based on deep learning have demonstrated significant advantages in fruit detection and localization tasks with their powerful feature learning capabilities and have become the mainstream methods in this field. Among them, the YOLO series, Mask R-CNN and their improved models have been widely applied. Yu et al. proposed a strawberry detection method based on Mask R-CNN, which can accurately segment fruits affected by multi-fruit adhesion, overlapping occlusion and varying illumination conditions [11]. On a test set containing 100 images, this model achieved an average detection accuracy of 95.78%, a recall rate of 95.41%, and a mean intersection over union (mIoU) of 89.85% for instance segmentation. Further experiments on picking point localization indicated that the average prediction error for the picking points of mature fruits was controlled within ±1.2 mm. To address the challenge of distinguishing green peppers from backgrounds with similar colors and mitigating occlusion by branches and leaves, Huang et al. developed an improved model named Pepper-YOLO based on YOLOv8n-Pose, designed for simultaneous detection of green pepper fruits and localization of picking points [12]. By optimizing feature extraction and multi-scale fusion mechanisms, the model achieved a detection accuracy of 82.2% and a picking point localization accuracy of 88.1% in complex scenarios, with an average localization error of less than 12.58 pixels. Chen et al. designed a grape cluster detection and picking point localization approach using an improved YOLOv8n-Pose model, attaining an AP of 89.7% in grape cluster detection and maintaining picking point localization errors within 30 pixels [13]. Wang et al. combined the Litchi-YOSO instance segmentation model with a branch morphology reconstruction algorithm, successfully locating picking points in litchi clusters where branches grow horizontally or are obstructed [14]. The success rate of picking point localization was 91.5%, and the average processing time per image was 120.20 ms. Wang et al. proposed an efficient localization method for apple picking points based on target area segmentation, with an average localization accuracy of 90.80%. In this method, a circle is fitted to the target mask region obtained by segmentation, and the center of the circle is defined as the picking point [15]. In harvesting-robot vision, overlapping fruits and occlusion by branches and leaves alter the apparent shape of the fruits, limiting the accuracy of fruit detection and picking point localization; this problem has received extensive attention from researchers. The current mainstream methods include extracting overlapping fruit contours based on instance segmentation, regressing picking point locations through key point detection, or integrating deep learning with traditional geometric analysis to infer the complete target morphology from local information.
The harvesting end-effector generally grasps the fruit or stem by means of gripping, grasping, suction or a combination of suction and grasping, and then separates the fruit through branch cutting, branch pulling, fruit twisting, fruit pulling or vacuum methods [7]. During the picking process, the choice of fruit separation method is influenced by factors such as the shape, texture and growth environment of the fruits. Researchers have selected separation and grasping methods based on the characteristics of different fruits and designed specialized end-effectors to achieve non-destructive picking. Chen et al. designed a servo-driven soft gripper with a Fin Ray structure. By establishing a mechanical model and proposing a slip detection method integrated with a distance sensor, the damage caused by relative sliding between the fruit and the finger was reduced [16]. In field experiments, although enabling slip detection reduced the grasping success rate by 16% compared with slip detection turned off, no damage to the fruit skin occurred. A three-finger grasping end-effector was designed by Xiao et al. for picking spherical fruits such as citrus. The success rate of picking citrus of different sizes was 95.23%, and the average time for a single fruit was 4.65 s [17]. The end-effector had the advantages of high adaptability and non-destructive operation. Li et al. designed an end-effector with a flip-cutting mode for the universal picking of spherical fruits. The experimental results showed a grasping success rate of 100% and a picking success rate of over 80%; the average picking time was about 9.6 s, and very few fruits were damaged [18]. By analyzing the characteristics of navel oranges and based on the underactuated principle, Jiang et al. designed a clamping end-effector with a two-layer, three-finger mechanical claw for picking navel oranges. The overall picking success rate of the mechanism was 95%, and the picking cycle was 4.3 s [19]. For tomato picking, Yin et al. developed an enveloping-type end-effector capable of grasping individual tomatoes without detaching the stem. The harvesting success rate was 83.3%, and the average harvesting time per fruit was 9.5 s [20]. An end-effector integrating cutting, suction and transporting modules was designed by Park et al. In a picking experiment on 160 tomato clusters, the picking success rate of this system was 80.6%, and the picking time was 15.5 s [21]. Qu et al. proposed a rigid-flexible coupling end-effector with a telescopic pneumatic suction cup, which completed the picking of the target tomato through a combined screw-and-pull motion [22]. In performance tests of this device, the single-fruit grasping time was approximately 5.4 s, and the grasping success rate reached 88%. For pear harvesting, a bionic adaptive end-effector with rope-driven fingers was designed by Li et al., which achieved a grasping success rate of 100% with no damage [23]. The end-effector could adapt to pears of different sizes and shapes. For litchi collection, Yao et al. designed and verified a litchi combing and cutting end-effector based on visual–tactile fusion [24]. The picking success rate in the field test was 62.86%, and the fruit stem sensing rate was 86.7%. For sweet pepper harvesting, Lian et al. proposed a two-stage motion planning (TSMP) method and a novel end-effector with the functions of clamping and cutting stems [25]. The field experiment results demonstrated that the harvesting success rate was 77.16% and the average picking time was about 15 s with a fruit recovery device. Fu et al. designed a kiwifruit picking end-effector capable of automated harvesting by first recognizing and enveloping fruit clusters, then cutting and separating individual stalks [26]. Test results showed that the picking success rate was 89.3%, the average time was 8.8 s, and the picking damage rate was 6%.
In the above studies, the fruit separation mode plays a decisive role in determining the position of the picking point and the grasping mode of the end-effector. Branch-pulling or fruit-twisting methods are suitable for varieties that do not need to retain the fruit stem during picking, or whose stems break easily without causing fruit damage, such as apples and passion fruits. The stem-cutting method can meet more refined picking requirements and is suitable for cases where a certain stem length must be retained, for fruits that grow in clusters (such as mangoes, litchi bunches and grape bunches), and for fruits with relatively fragile peel or flesh (such as kiwifruit and strawberries). In addition, in end-effector design, flexible finger and suction materials and precise control of the grasping or shearing force can reduce damage during the picking operation.
Research on automatic mango harvesting started relatively late and has continued to use the mainstream research methods and technical ideas in the field of intelligent fruit picking. Unlike other fruits, mangoes are often grown in a mixture of varieties, growing either singly or in clusters. The automatic harvesting of mangoes is challenged by significant inter-varietal diversity and specific agronomic practices. Mango varieties exhibit substantial differences in shape (spherical, S-shaped, egg-shaped), size and color. The skin color of unripe fruit is predominantly green, closely resembling foliage, while ripe fruit can present in green, yellow or mixed hues. Furthermore, to prevent skin damage from stem sap and aid preservation, growers typically cut stems, leaving a 3–5 cm stub. These varietal traits and harvesting requirements collectively increase the difficulty of reliable fruit detection, precise picking point localization, and the design of a universally adaptable robot system.
In recent years, research on improving accuracy and robustness based on deep learning, especially improved detection and segmentation networks as well as multi-task learning frameworks, has become the mainstream direction in current mango recognition and picking point location studies. Some studies focus on the detection of mango fruits and stems. Li et al. developed an improved YOLOv3-based detector (ISD-YOLOv3) for mature and near-mature mangoes, achieving 94.91% average precision at a speed of 85 FPS [27]. Xiong et al. used a UAV and the YOLOv2 model to detect green mangoes on tree crowns, achieving rapid fruit count estimation with 96.1% precision, 89.0% recall, and a final estimation error of 1.1% [28]. To tackle occlusion and overlap in natural mango scenes, Chen et al. enhanced their detection system with two improved instance segmentation algorithms [29]. The system achieved an overall segmentation AP of 85.1%. It demonstrated strong environmental adaptability, with AP of 87.9%, 86.3% and 81.1% under challenging conditions of uneven lighting, branch/leaf occlusion and fruit overlap, respectively. For fruit and stem detection, Li et al. developed an enhanced model, MAL-YOLOv10n [30]. The model achieved an overall mAP of 95.5% and a detection speed of 119.6 FPS, outperforming the original YOLOv10 by 2.5% in mAP. Integrated approaches for mango detection and picking point localization have also been explored. Zhang et al. developed the YOLOMS multi-task model, achieving 92.19% fruit detection accuracy, 89.84% picking point localization success, and an average processing time of 58.4 ms, demonstrating effective performance in locating main stem picking points for clustered mango harvesting [31]. Zhang et al. proposed a deep learning method for mango segmentation and stem key-point detection [32]. The approach can effectively locate the picking point of partially occluded mangoes. Li et al. acquired mango images using a mobile phone and developed a mango picking point detection system based on an improved YOLOv8 architecture [33]. The system performed detection and segmentation of mangoes and their stems, achieving a picking point localization success rate of 92.01% and an average processing time of 72.75 ms. Chen et al. investigated mango picking point detection by enhancing the FCIS and Mask R-CNN frameworks [34]. The model attained a detection accuracy of 92.90% with a recall rate of 96.97%. To tackle challenges like the color similarity between green mangoes and the background and the irregular fruit shapes, Gu proposed an enhanced YOLOv8-based detection model [35]. The model incorporates the BRA sparse attention module, a dynamic detection head (DyHead) and a lightweight Slim-neck network, enabling accurate joint detection of fruits and stems. When paired with a segmented localization algorithm, the system achieves efficient picking point localization with a success rate of 96.09% and a processing time of 8.52 ms per frame.
Research on end-effectors and integrated robot systems has accelerated the application of automated mango harvesting. Since sap exuding from mango stems can damage the peel, Ranjan et al. designed a stem-clamping and cutting end-effector. They employed a servo motor to drive the gripper’s opening and closing for stem clamping and cutting [36]. Based on the morphology and mechanical characteristics of fruit stems, Gu et al. designed a scissor-type end effector and constructed a complete vision-robot arm collaborative harvesting system. The system achieved cutting success rates of 90% in laboratory environments and 80% in natural orchard environments, with single-fruit picking cycles consistently controlled within 10 s [35]. Yin et al. developed a dual-arm robot system integrating binocular vision, a dedicated end-effector for stem grasping and cutting, and a collaborative strategy [37]. This scheme, through workspace division and depth-first strategy, reduced the picking time by 48.38% and increased the collision-free harvest rate to 91.68%, demonstrating the feasibility and efficiency of the dual-arm system. Most existing studies on mango picking point localization and end-effector design are predicated on the stem-shearing fruit separation method. This approach requires the end-effector to grip and cut the stem, which consequently demands high precision in detecting and segmenting the fruit stem. While advances in visual detection have improved the identification of fruits and stems under occlusion, a fundamental limitation remains: fruits with entirely obscured stems still face the critical problem of undetectable picking points, rendering them unharvestable by these methods.
To advance automated mango harvesting, this study proposes a novel robot system based on a stepwise fruit-grasping and stem-cutting strategy that does not require precise stem detection. The system comprises three key components: (1) a two-stage visual perception pipeline for fruit detection and pickability classification based on occlusion status; (2) an adaptive picking-point localization method that estimates the grasp position based on predicted fruit size, addressing size variability; and (3) a custom end-effector and coordinated workflow designed to implement the stepwise harvesting strategy, specifically to overcome stem occlusion. Finally, an integrated robot system was constructed and validated through field experiments, demonstrating the effectiveness of the complete solution.

2. Materials and Methods

2.1. Image Acquisition

2.1.1. Analysis of Mango Plantation Conditions

Guangxi is a major mango-producing region of China. Its planting areas are mainly on typical hilly, sloping land and the orchards are widely dispersed, as shown in Figure 1. This study selected two representative mango cultivation areas of Guangxi for image acquisition and field experiments. The sites were a mango orchard in Tiandong County, Baise City (23°40′ N, 107°04′ E) and the Guangxi Subtropical Agricultural Science New City Mango Demonstration Orchard in Chongzuo City (22°55′ N, 108°05′ E).
Field surveys found that dwarf, high-density orchard systems are prevalent in the research area, and mixed planting is practiced, characterized by grafting scions of different mango varieties onto a single rootstock. Figure 2 presents the typical mixed planting model and the primary mango varieties cultivated in the mango orchards of Tiandong County, Baise City. Based on the field surveys, we selected four representative commercial mango cultivars as harvesting targets: Golden, Tainong, Guifei and Renong.
We measured planting parameters for 15 groups of mango trees using a tape measure (Figure 3), obtaining row spacing, plant spacing and the vertical distribution range of fruit growth (highest and lowest heights above the ground). Furthermore, a vernier caliper was employed to conduct morphological measurements on 70 mango samples from the four varieties, capturing key parameters including fruit length, width, thickness and stem diameter. Measurement data and statistical results are summarized in Table 1.
According to Table 1, the average row and plant spacings in the mango orchard are 3.9 m and 3.12 m, respectively, and the vertical distribution range of mango is 0.1–2 m. The morphological measurements show that the average fruit length, width and thickness are 109.1 mm, 65.9 mm and 55.2 mm, respectively, and the average stem diameter is 4.8 mm.

2.1.2. Camera Selection

The mango harvesting season in Guangxi is concentrated from June to August, with strong illumination under sunny conditions. Under high-intensity lighting, depth cameras that rely on active projection are susceptible to illumination interference, whereas binocular cameras based on passive imaging exhibit better environmental adaptability. We compared the fields of view of binocular cameras with 2.1 mm and 4 mm focal lengths, as shown in Figure 4. When the camera was mounted 75 cm above the ground, the 2.1 mm focal length camera obtained a full-tree field of view at 40 cm from the outermost mango, whereas the 4 mm focal length camera still did not achieve the same field of view at 80 cm. During harvesting, if the camera and robot arm are too far from the outermost fruit, the number of fruits within the robot arm’s reachable harvesting range decreases. Meanwhile, a wider field of view is beneficial for autonomous navigation and harvesting path planning. To ensure harvesting efficiency, the 2.1 mm focal length ZED 2i binocular camera (Stereolabs Inc., San Jose, CA, USA) was selected for image acquisition.

2.2. Mango Detection and Harvestability Judgement

2.2.1. Image Acquisition and Dataset Creation

To reflect typical orchard planting patterns, we did not distinguish varieties and adopted a mixed-cultivar mango image acquisition method. Images were captured with a handheld binocular camera. To simulate the harvesting robot’s operation, the distance between the camera and fruit was dynamically adjusted (0.2–1 m) during image acquisition. The collected images covered diverse environmental conditions, including different weather (sunny/cloudy) and lighting variations (intense midday light and oblique evening light), as shown in Table 2. The sample data captured multi-angle features of mango and included various distribution patterns, ranging from unobstructed to partially occluded and heavily occluded.
After image acquisition, a total of 1693 binocular images were obtained, with a resolution of 2560 × 720 pixels. To construct the basic sample dataset, the binocular images were split into left and right images, and only the left images at 1280 × 720 pixels were retained. This step was applied to avoid image redundancy and training overfitting.
To address the challenges posed by partially and heavily occluded targets during mango harvesting, we adopt a two-stage approach. First, object detection was applied to obtain bounding boxes of all mangoes. Subsequently, a classification model was used to judge the harvestable state of each fruit, enabling prioritization of targets for harvesting. To build a high-quality dataset, we performed two rounds of manual annotation using X-Anylabeling software (version 2.4.4).
In the first round of image annotation, mangoes were labeled with horizontal bounding boxes to build an object detection dataset. During annotation, mangoes under the three situations shown in Figure 5 were not labeled: (1) mangoes located at the image edges with more than 20% of their area truncated; (2) mangoes with over 50% of their area occluded; and (3) mangoes at distant positions clearly beyond the working range of the harvesting robot. After annotation, the dataset was randomly divided into training and validation sets at an 8:2 ratio, yielding 1355 images for training and 338 for validation. The training set contained 5891 annotated mango instances, and the validation set contained 1345 instances. To enhance model generalization, data augmentation was applied to the dataset. The augmentation strategies included horizontal flipping (p = 0.5), Gaussian blur (σ = 0.5–2.5), contrast adjustment (±30%) and brightness adjustment (±30%), as sketched below. After augmentation, the number of images in the training and validation sets was 5480 and 1372, respectively.
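The paper does not specify how the augmentation pipeline was implemented; the following is a minimal sketch of how the listed transforms could be assembled with the albumentations library, with bounding boxes transformed alongside the images. The per-transform probabilities (other than the stated p = 0.5 flip), the blur kernel-size range and the example box coordinates are illustrative assumptions.

```python
import albumentations as A
import numpy as np

# Transforms listed above; probabilities other than the stated p = 0.5 flip,
# and the blur kernel-size range, are illustrative assumptions.
transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.GaussianBlur(blur_limit=(3, 7), sigma_limit=(0.5, 2.5), p=0.5),
        A.RandomBrightnessContrast(brightness_limit=0.3, contrast_limit=0.3, p=0.5),
    ],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["class_labels"]),
)

image = np.random.randint(0, 255, (720, 1280, 3), dtype=np.uint8)  # stand-in for a 1280 x 720 left image
boxes = [[412, 233, 508, 371]]                                     # one placeholder mango box (x_min, y_min, x_max, y_max)
out = transform(image=image, bboxes=boxes, class_labels=["mango"])
aug_image, aug_boxes = out["image"], out["bboxes"]                 # augmented image and transformed boxes
```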
In the second round of annotations, the harvestability of each mango was judged based on the bounding boxes generated in the first round. All annotated targets were categorized into two classes: “Pickable” and “Unpickable”. Specifically, “Pickable” refers to fruit that is unobstructed or partially occluded by weeds, leaves or other mangoes (occlusion area less than 20%). “Unpickable” refers to fruit occluded by leaves or other fruit by more than 20% or occluded by the fruit stem. After categorization, each bounding box was cropped as a RoI (region of interest) and saved by class label (Figure 6).
After classification, 6427 “Pickable” samples and 1033 “Unpickable” samples were obtained. And both categories were divided into training and validation sets at an 8:2 ratio. To address class imbalance, the “Unpickable” samples were augmented using the same strategy as in the first annotation round. After augmentation, “Unpickable” samples increased to 4499. And the final classified dataset contained 8918 training images and 2008 validation images.

2.2.2. Object Detection and Classification Networks

In natural orchard environments, the similar colors of mangoes, stems and leaves, the dense growth and mutual occlusion of fruits, and varying lighting conditions pose challenges for vision-based detection algorithms. Mango harvesting requires detection models that can detect targets quickly and accurately, with high-resolution feature extraction and small object detection capabilities to handle complex environmental interference. In recent years, YOLO algorithms have been widely used in agricultural object detection due to their high computational efficiency, strong robustness, excellent performance in small object detection and lightweight structure. In this study, we adopted YOLOv11 as the detection algorithm, with its network architecture shown in Figure 7. Compared with YOLOv8, YOLOv11 enhances multi-scale feature extraction through improvements in the backbone and neck structures. It replaces the C2f module with the C3k2 module, adds a C2PSA module after the SPPF module, and adopts the YOLOv10 head design concept with depthwise separable convolutions to reduce computational redundancy. These improvements enhance the detection accuracy of YOLOv11 and its adaptability to complex scenarios.
For mango harvestable status classification, considering the substantial variation in pixel sizes between close-up and distant mangoes in the images, as well as the small pixel area of fruits, we compared two state-of-the-art lightweight classification models, YOLOv8-cls and YOLOv11-cls, to select the final model.

2.2.3. Model Training

The object detection and classification models were trained on the same hardware platform. The specific environment configurations and training parameters are shown in Table 3, Table 4 and Table 5.
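Tables 3–5 give the actual environment and hyperparameters. Purely as an illustration of the training setup, the sketch below shows how a YOLOv11m detector and a YOLOv8l-cls classifier are typically trained with the Ultralytics Python API; the dataset paths and hyperparameter values here are placeholders, not the values used in this study.

```python
from ultralytics import YOLO

# Detection model (Section 2.2.2): YOLOv11m trained on the mango bounding-box dataset.
det_model = YOLO("yolo11m.pt")                 # start from pretrained weights
det_model.train(
    data="mango_detection.yaml",               # placeholder dataset config (train/val paths, one "mango" class)
    epochs=300, imgsz=640, batch=16, device=0, # illustrative values; see Tables 3-5 for the actual settings
)

# Classification model: YOLOv8l-cls trained on the cropped "Pickable"/"Unpickable" RoIs.
cls_model = YOLO("yolov8l-cls.pt")
cls_model.train(
    data="mango_pickability",                  # placeholder folder containing train/ and val/ class subfolders
    epochs=100, imgsz=224, batch=64, device=0,
)
```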

2.2.4. Evaluation Metrics

The evaluation metrics for the detection model included P (precision), R (recall), F1-score, AP (average precision) and FPS (frames per second). Specifically, P indicates the accuracy of the model’s positive predictions and reflects the false positive rate. R indicates the proportion of true positives correctly detected and is used to assess missed detections. The F1-score is the harmonic mean of precision and recall. AP represents the average precision over multiple IoU thresholds. The formulas for each metric are listed in Equations (1)–(4).
P = \frac{TP}{TP + FP}
R = \frac{TP}{TP + FN}
AP = \int_{0}^{1} P(R)\, dR
F1 = \frac{2PR}{P + R}
The classification model used Top-1 accuracy as the evaluation metric, which measured whether the class with the highest predicted probability matched the ground-truth label. The computation is shown in Equation (5).
Accuracy = \frac{TP + TN}{TP + TN + FP + FN}
where TP is the number of correctly predicted positive samples, TN is the number of correctly predicted negative samples, FP is the number of incorrectly predicted positive samples, and FN is the number of incorrectly predicted negative samples.

2.3. Spatial Positioning of the Mango Grasping Point

2.3.1. Binocular Vision

The binocular vision system simulates the stereoscopic mechanism of human eyes. It uses two cameras with a fixed relative position to capture images of the same object from different viewpoints simultaneously. Based on epipolar matching and triangulation, it then calculates the object’s spatial position in the camera coordinate system.
Binocular vision systems are generally categorized into two structural types: standard parallel and non-parallel. The standard parallel structure is the ideal binocular imaging model, in which the left and right cameras are coplanar with parallel optical axes and identical intrinsic parameters. Through binocular correction, a non-parallel structure can be transformed into a standard parallel configuration, thereby simplifying the computational complexity of spatial localization. The relationship between the left and right image planes before and after correction is shown in Figure 8a. After binocular correction, the two imaging planes are parallel.
The measurement model of the XZ plane in a standard parallel binocular structure is shown in Figure 8b. In this model, f denotes the focal length, and points O_L and O_R represent the optical centers of the left and right cameras, respectively. The camera baseline B refers to the distance between O_L and O_R. The point P(x, y, z) represents the coordinates of the object in the world coordinate system, while P_1(X_l, Y_l) and P_2(X_r, Y_r) are the projections of point P onto the left and right image coordinate systems, respectively. According to the similar triangle relationship, the following expressions can be derived.
\frac{z}{f} = \frac{x}{X_l}, \qquad \frac{z}{f} = \frac{x - B}{X_r}, \qquad \frac{z}{f} = \frac{y}{Y_l} = \frac{y}{Y_r}
Simplifying Equation (6) yields the following:
z = \frac{Bf}{X_l - X_r}, \qquad x = \frac{X_l z}{f} = B + \frac{X_r z}{f}, \qquad y = \frac{Y_l z}{f} = \frac{Y_r z}{f}
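As a worked illustration of Equation (7), the following sketch recovers a point's coordinates in the left camera frame from a rectified stereo match; the baseline, focal length and principal point are placeholder values rather than the calibrated parameters of the ZED 2i used in this study.

```python
import numpy as np

def triangulate(xl, yl, xr, baseline_mm, f_px, cx, cy):
    """Recover (x, y, z) in the left-camera frame from a rectified stereo match.

    xl, yl : pixel coordinates of the point in the left image
    xr     : column coordinate of the same point in the right image
    baseline_mm, f_px, cx, cy : stereo baseline, focal length (pixels) and
                                principal point (placeholder calibration values)
    """
    Xl, Yl = xl - cx, yl - cy            # image coordinates relative to the principal point
    Xr = xr - cx
    disparity = Xl - Xr                  # X_l - X_r in Equation (7)
    z = baseline_mm * f_px / disparity   # z = B * f / (X_l - X_r)
    x = Xl * z / f_px                    # x = X_l * z / f
    y = Yl * z / f_px                    # y = Y_l * z / f
    return np.array([x, y, z])

# Example with illustrative values: 120 mm baseline, 520 px focal length, principal point (640, 360).
print(triangulate(700, 400, 640, baseline_mm=120.0, f_px=520.0, cx=640.0, cy=360.0))
```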

2.3.2. Mango Size Estimation

For stem occlusion in mango harvesting, we adopted a coordinated fruit-grasping and stem-cutting method. The fruit was first grasped with a flexible gripper, and then the stem was cut by a cutting device. Different mango varieties showed large differences in morphological dimensions, especially length, and the length parameter affected the cutting position. Therefore, it is necessary to estimate the mango size parameters to provide key information for the end-effector’s grasping and cutting operations.
Based on the ranging principle of binocular vision, the mango length can be calculated from the depth distance, pixel size and a scale factor, as shown in Equation (8).
L_m = \frac{L_{pixel} \times S}{D}
where L_m represents the mango's actual length, L_pixel is the mango's pixel length in the image, and D is the depth distance between the mango and the camera, obtained from the mango center coordinates in the left and right images using the binocular ranging Equation (7). S is a scale factor determined by a calibration experiment relating the depth distance D and the actual length L_m.
Through the calibration experiment, the scale factor S was determined to be 956 mm/pixel. To verify the accuracy of the length estimation method, we used a vernier caliper to measure 20 mangoes in the field and compared the results with the estimated lengths. The results showed that the error of this method was within ±10 mm.

2.3.3. Harvesting Strategy for Mangoes of Different Sizes

The grasping point is the spatial position at which the end-effector grasps the mango, represented as P_M(X_P, Y_P, Z_P) in the robot arm's base coordinate system. Based on object detection and binocular vision, the center point O_M on the mango surface can be determined in the image, and its spatial coordinates (X_O, Y_O, Z_O) can be calculated. However, considering the fruit's three-dimensional geometry, directly using O_M as the grasping point (i.e., P_M = O_M) may introduce spatial positioning errors, thereby reducing the grasping success rate. Therefore, the spatial position of the grasping point P_M should be determined by combining the coordinates of O_M with morphological feature parameters of the mango to ensure accurate grasp placement.
The end-effector typically approaches and grasps the mango from a direction perpendicular to the mango’s longitudinal axis, so the grasping position is related to the fruit’s thickness or width dimensions. Morphological measurements of four representative mango varieties (Table 1) show that fruit width and thickness are relatively similar. Accordingly, the mean of the average width and the average thickness is defined as T (T = 60.55 mm). Then, in the robot arm’s base coordinate system, with the X-axis aligned to the end-effector’s approach direction, the X-coordinate of the mango grasping point is calculated as follows:
X_P = X_O + T/2
For mangoes of normal and small sizes, the grasping point P_M is approximately located at the geometric center of the fruit. The positional relationship between P_M and O_M is shown in Figure 9, where L_S is the vertical mounting offset of the cutting device relative to the center of the robot's end-flange, and G_M represents the grasping center of the flexible gripper.
In mango harvesting, to prevent juice released during stem cutting from contacting and damaging the fruit surface, a stem length of more than 30 mm is typically reserved. When harvesting large mangoes, the cutting device's installation height L_S is fixed, so the cut may enter the fruit or fail to leave sufficient stem length. Therefore, we adopt the adaptive harvesting strategy shown in Figure 10. For mangoes of normal or small size, the reserved stem length meets the requirement, so we primarily consider grasp stability and set the Z-axis coordinate of the grasping point P_M to Z_P = Z_O. For large mangoes, the gripper can provide sufficient grasp force due to the larger contact area with the fruit, so the computation of the Z-axis coordinate of P_M prioritizes retaining adequate stem length. Accordingly, when the mango length L_m > 2(L_S − 30 mm), we first calculate the upward offset ΔZ of the grasping point and then set Z_P = Z_O + ΔZ. The calculation of ΔZ is shown in Equation (10).
\Delta Z = \frac{L_m}{2} - L_S + 30\ \text{mm}
In this study, the cutting device was mounted at a height of L_S = 110 mm. Therefore, when the mango length L_m ≤ 160 mm, the spatial coordinates of the grasping point are P_M(X_O + T/2, Y_O, Z_O). When L_m > 160 mm, they are P_M(X_O + T/2, Y_O, Z_O + L_m/2 − 80 mm).
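The adaptive rule above can be summarized in a short sketch, assuming the values T = 60.55 mm and L_S = 110 mm given in the text; the function and variable names are illustrative only.

```python
T_MM = 60.55              # mean of average fruit width and thickness (Table 1)
LS_MM = 110.0             # mounting height of the cutting device
STEM_RESERVE_MM = 30.0    # minimum stem length to retain

def grasping_point(x_o, y_o, z_o, length_mm):
    """Compute the grasping point P_M from the fruit surface center O_M (Eqs. (9)-(10)).

    Coordinates are in the robot arm's base frame, with X along the
    end-effector approach direction and Z vertical.
    """
    x_p = x_o + T_MM / 2.0                                   # Eq. (9): offset to the fruit's mid-depth
    if length_mm > 2.0 * (LS_MM - STEM_RESERVE_MM):          # large fruit: > 160 mm with these values
        dz = length_mm / 2.0 - LS_MM + STEM_RESERVE_MM       # Eq. (10): raise the grasp point
    else:
        dz = 0.0                                             # normal/small fruit: grasp at center height
    return (x_p, y_o, z_o + dz)

# Example: a 180 mm mango centered at (600, 50, 320) mm is grasped 10 mm above its center.
print(grasping_point(600.0, 50.0, 320.0, 180.0))
```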

2.3.4. Workflow for Mango Grasping Point Spatial Positioning

By integrating mango detection and classification, binocular vision, and the harvesting strategy, the spatial localization workflow for the mango grasping point is as follows:
(1)
Mango detection and classification. The mango image was acquired in real time using the binocular camera, then split into left and right images and subjected to binocular correction. Object detection was performed on the corrected images to obtain horizontal bounding boxes for mangoes. Each bounding box was treated as a RoI, and the classification model was applied to judge the fruit's harvestability. For mangoes classified as harvestable, the center pixel coordinates O(x, y), pixel length L_pixel and pixel area A of the corresponding bounding box were recorded.
(2)
Target matching and spatial localization. The mangoes were initially sorted in ascending order of the column coordinates of their center points. However, because of the different viewpoints of the left and right cameras, this initial sorting cannot ensure correct matching of all mangoes between the images. Therefore, based on the epipolar geometry constraint of binocular vision, the following matching rules for mangoes between the left and right images were established: (1) the absolute row-coordinate difference |x_L − x_R| < 20; (2) the relative area error |A_L − A_R| / max(A_L, A_R) < 0.2; (3) the pixel areas A_L > 2000 and A_R > 2000; and (4) the column-coordinate difference (disparity) satisfied 20 < (y_L − y_R) < 200. Mangoes in the left and right images that simultaneously satisfied these four rules were regarded as the same target and formed the harvestable target set (see the sketch after this list). Based on triangulation, the spatial coordinates O_C(X_C, Y_C, Z_C) of each mango's surface center point in the left camera coordinate system were calculated from the center-point pixel coordinates O_L(x_L, y_L) and O_R(x_R, y_R) of the matched mango.
(3)
Spatial localization of the grasping point. By combining the transformation matrix H_Cam^Base obtained from hand-eye calibration with the spatial coordinates O_C(X_C, Y_C, Z_C), the mango surface center point O_M(X_O, Y_O, Z_O) in the robot arm's base coordinate system was calculated. Based on the mango size estimation method and harvesting strategy in Section 2.3.2 and Section 2.3.3, the spatial coordinates of the grasping point P_M(X_P, Y_P, Z_P) in the robot arm's base coordinate system can then be calculated.
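The sketch below illustrates the left-right matching rules of step (2), assuming each detection has been reduced to its center pixel coordinates and bounding-box area; the data structures and the greedy pairing loop are illustrative, with the thresholds taken from the rules above.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    x: float      # row coordinate of the bounding-box center (pixels)
    y: float      # column coordinate of the bounding-box center (pixels)
    area: float   # bounding-box area (pixels^2)

def is_match(left: Detection, right: Detection) -> bool:
    """Apply the four epipolar/size matching rules to a left-right detection pair."""
    same_row   = abs(left.x - right.x) < 20                                       # rule (1)
    area_close = abs(left.area - right.area) / max(left.area, right.area) < 0.2   # rule (2)
    large      = left.area > 2000 and right.area > 2000                           # rule (3)
    disparity  = 20 < (left.y - right.y) < 200                                    # rule (4)
    return same_row and area_close and large and disparity

def match_targets(lefts, rights):
    """Greedily pair detections from the left and right images that satisfy all rules."""
    pairs, used = [], set()
    for l in sorted(lefts, key=lambda d: d.y):       # initial sort by column coordinate
        for j, r in enumerate(rights):
            if j not in used and is_match(l, r):
                pairs.append((l, r))
                used.add(j)
                break
    return pairs
```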

2.4. System Integration and Field Experiments

2.4.1. Mango Harvesting Robot System

To verify the feasibility of the proposed mango harvesting methods, we designed and built a complete mango harvesting robot system, as shown in Figure 11. The system mainly comprised a ZED 2i binocular camera with 2.1 mm focal length, an Elite EC612 six-DOF robot arm, a grasping-cutting end-effector and a tracked chassis. The binocular camera was connected via a gimbal and mounted on an adjustable bracket, providing 25 cm of longitudinal and vertical adjustment travel. With a standard mounting height of 75 cm, the binocular camera could be adjusted within a 75–100 cm height range to accommodate the spatial distribution of mangoes in the orchard. The robot arm was selected based on the growth height and canopy size of mango trees, with an effective working radius of 1304 mm. The detailed parameters of the robot arm are shown in Table 6.
For the soft soil and the hilly terrain of the mango orchards, we used a tracked chassis as the locomotion platform. Its shock-absorbing design and climbing capability ensure stable travel under a 100 kg payload, enabling the harvesting robot system to operate smoothly across terrains with varying gradients.
Mango harvesting requires retaining a 3–5 cm stem section, making stem cutting a conventional detachment method. However, stems occupy only a few pixels in images and are often occluded, making it difficult for vision systems to identify them accurately and locate the cutting point. To address this challenge, this study designed a fruit-grasping and stem-cutting end-effector, whose structure is shown in Figure 12. The end-effector consisted of a grasping unit and a cutting unit. The grasping unit employed a servo-driven four-finger soft gripper; its silicone fingers have a maximum grasping load of 3 kg and are suited to smooth, irregularly shaped fruit, and the control system enabled multi-parameter control of grasping force, position and speed. The cutting unit was composed of a 75 mm-stroke linear stage, a push-pull solenoid and a pair of scissors. The linear stage was driven by a stepper motor and performed extension and retraction motions. The solenoid was mounted on the linear stage, and the scissors were attached to the end of the solenoid. The solenoid's de-energized and energized states controlled the opening and closing of the scissors: when the solenoid extended in the de-energized state, the scissors opened; when it retracted upon energization, the scissors closed and performed the cut, ensuring fast response and reliable execution of the cutting process. The overall height of the harvesting robot was about 1.8 m, and the end-effector was 208 mm long. After the end-effector was installed, the robot arm's working range essentially covered the height range of mango growth.
The operating workflow of the end-effector is shown in Figure 13. In the non-harvesting state, the linear stage and scissors remained retracted and open to avoid injuring the fruit during the robot arm’s motion. When the harvesting process began, the soft gripper first grasped the target fruit. The robot arm then pulled downward to place the stem under tension. Next, the linear stage drove the cutting unit forward so that the stem fell into the rear cutting edge region of the open scissors. Finally, the solenoid was energized to close the scissors and cut the stem. In particular, the extension motion of the linear stage ensured that the stem entered the optimal cutting region of the scissors, thereby improving the cutting effect.
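The sequence in Figure 13 can be expressed as a short control sketch. The driver objects below (gripper, arm, stage, solenoid) are hypothetical stand-ins for the servo gripper, robot arm, linear stage and solenoid controllers, and the pull-down distance and dwell time are assumptions, not values reported in the paper.

```python
import time

def harvest_cycle(gripper, arm, stage, solenoid, grasp_point):
    """Sketch of the grasp -> pull down -> extend -> cut sequence of Figure 13.

    gripper/arm/stage/solenoid are hypothetical driver objects, not a real API.
    """
    arm.move_to(grasp_point)       # approach perpendicular to the fruit's long axis
    gripper.close()                # grasp the fruit with the four-finger soft gripper
    arm.move_relative(dz=-30.0)    # pull the fruit down to put the stem under tension (assumed 30 mm)
    stage.extend(75.0)             # drive the open scissors forward over the stem (75 mm stroke)
    solenoid.energize()            # energizing retracts the solenoid and closes the scissors
    time.sleep(0.2)                # brief dwell for the cut to complete (assumed)
    solenoid.deenergize()          # reopen the scissors
    stage.retract()
    arm.move_to_basket()           # place the fruit in the collection basket
    gripper.open()
```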
The proposed harvesting strategy does not require precise, real-time localization of the fruit stem. Its effectiveness is instead based on the natural physical properties of the stem and a fault-tolerant mechanical design. The stem naturally hangs downward under gravity, aligning approximately with the fruit’s central axis. Grasping and pulling down the fruit further constrains its spatial position. This, combined with the deliberate dimensional tolerance in the scissors opening, ensures that through the coordinated sequence of “grasp-pull down-extend-cut”, the stem is reliably guided into the cutting zone and severed.

2.4.2. Hand-Eye Calibration

Before the harvesting operation, hand-eye calibration must be performed between the binocular camera and the robot arm to obtain the pose matrix of the left camera in the robot arm's base coordinate system, H_Cam^Base. The calibration process is shown in Figure 14. The calibration board was mounted on the end flange of the robot arm. By moving the robot arm, the calibration board appeared in the left camera's field of view with varying positions and orientations, left images were captured, and the corresponding robot poses were recorded.
The hand-eye calibration involved transformations between four coordinate systems: the robot arm's base coordinate system (Base), the end-effector coordinate system (End), the camera coordinate system (Cam) and the calibration board coordinate system (Cal). The end-effector pose, denoted as H_End^Base, can be obtained from the robot arm's teach pendant. The pose of the calibration board in the left camera coordinate system, denoted as H_Cal^Cam, can be obtained from the calibration board images together with the camera's intrinsic parameters. Moreover, since the camera was fixed relative to the robot arm base, and the calibration board was fixed relative to the end-effector, both the camera's pose in the robot arm base coordinate system H_Cam^Base and the calibration board's pose in the end-effector coordinate system H_Cal^End remained constant. Therefore, using the pose matrices H_Cal^Cam and H_End^Base, the fitted pose of the left camera in the robot arm's base coordinate system H_Cam^Base can be calculated. The calculation expression is as follows.
H_{Cam}^{Base} = H_{End}^{Base} \cdot H_{Cal}^{End} \cdot H_{Cam}^{Cal}
where H_Cam^Cal is the inverse matrix of H_Cal^Cam, and each pose matrix is a 4 × 4 homogeneous transformation matrix.
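A minimal numpy sketch of the composition in Equation (11), assuming the calibration board's pose in the end-effector frame H_Cal^End has already been estimated; the averaging over the 15 calibration images used in this study is omitted.

```python
import numpy as np

def camera_in_base(H_end_base: np.ndarray,
                   H_cal_end: np.ndarray,
                   H_cal_cam: np.ndarray) -> np.ndarray:
    """Compose Equation (11): H_Cam^Base = H_End^Base * H_Cal^End * (H_Cal^Cam)^-1.

    H_end_base : end-effector pose in the base frame (from the teach pendant)
    H_cal_end  : calibration-board pose in the end-effector frame (fixed, pre-estimated)
    H_cal_cam  : calibration-board pose in the camera frame (from board detection)
    All inputs and the result are 4 x 4 homogeneous transformation matrices.
    """
    H_cam_cal = np.linalg.inv(H_cal_cam)     # invert to obtain the camera pose in the board frame
    return H_end_base @ H_cal_end @ H_cam_cal
```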
Hand-eye calibration was performed separately prior to each of the two harvesting experiments, with each calibration utilizing 15 images. The root mean square (RMS) and maximum errors for translation and rotation parameters are summarized in Table 7. Although the complex orchard background may have limited the calibration precision, the achieved accuracy was sufficient for the harvesting task, as the end-effector’s grasping and cutting mechanism was designed with an operational envelope that could accommodate these spatial uncertainties.
After obtaining the pose matrix H_Cam^Base via hand-eye calibration, the spatial coordinates of the mango grasping point in the robot arm's coordinate system can be calculated by combining it with the grasping point's spatial coordinates in the left camera coordinate system obtained in Section 2.3. Based on the preset harvesting pose, the corresponding six joint angles of the robot arm were then solved through inverse kinematics to control the robot arm's motion.

2.4.3. Harvesting Experiments and Evaluation Metrics

According to the mango harvesting workflow, the component algorithms were integrated into an application in C++ using Visual Studio Community 2017. To evaluate the reliability of the harvesting robot system, we conducted harvesting experiments at the mango orchard in Tiandong County, Baise City. The overall mango harvesting procedure is shown in Figure 15, and the main steps are as follows:
(1)
The application was started, and initialization was performed, which mainly included establishing a standard parallel binocular model, loading the detection and classification models, setting detection and classification parameters, and reading the pose matrix H_Cam^Base. Initialization was executed only once at startup, and its runtime did not affect the program's loop section. After parameter initialization was completed, the robot arm was moved to the waiting-for-harvesting pose, and the program switched to the loop section.
(2)
In the loop section of the program, images were first acquired by the binocular camera. Then, following the procedure in Section 2.3.4, the number of harvestable mangoes n was determined, and the spatial coordinates P_M(X_P, Y_P, Z_P) of each mango grasping point in the robot arm's coordinate system were obtained. If no harvestable mangoes were present (n = 0), the tracked chassis was moved via remote control, and Step (2) continued to execute during movement.
(3)
When the number of harvestable mangoes n > 0, the spatial distance D_M of each grasping point P_M relative to the robot arm was computed, and the mangoes were sorted from nearest to farthest. Then, the mango with index 1 was evaluated. If its spatial distance D_M1 exceeded the preset harvestable distance of the robot arm (1.4 m), Step (2) was resumed; otherwise, the next step was executed.
(4)
The joint angles of the robot arm corresponding to the grasping point P_M were computed by inverse kinematics. The robot arm was then moved to the grasping point, and mango harvesting was performed according to the fruit-grasping and stem-cutting procedures. Next, the robot arm placed the mango into the collection basket, completing one harvesting cycle. The robot arm was returned to the waiting-for-harvesting pose, and the program returned to Step (2). A compact sketch of this loop is given after this list.
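The loop in steps (2)–(4) can be summarized as follows; the camera, vision, kinematics and motion interfaces are placeholders for the modules described above, grasp points are assumed to be expressed in meters in the arm base frame, and only the 1.4 m reach check reflects a value from the text.

```python
import numpy as np

MAX_REACH_M = 1.4   # preset harvestable distance of the robot arm

def harvesting_loop(camera, vision, arm, end_effector):
    """Simplified main loop: detect -> classify -> localize -> sort -> harvest nearest fruit."""
    while True:
        left_img, right_img = camera.grab()                            # step (2): acquire stereo pair
        grasp_points = vision.locate_harvestable(left_img, right_img)  # P_M for each pickable mango (meters)
        if not grasp_points:
            continue                                                   # keep looking while the chassis moves
        grasp_points.sort(key=np.linalg.norm)                          # step (3): nearest target first
        nearest = grasp_points[0]
        if np.linalg.norm(nearest) > MAX_REACH_M:
            continue                                                   # nearest fruit is out of reach
        joints = arm.inverse_kinematics(nearest)                       # step (4): solve joint angles
        arm.move_joints(joints)
        end_effector.grasp_and_cut()                                   # fruit-grasping and stem-cutting sequence
        arm.place_in_basket()
        arm.move_to_waiting_pose()
```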
To validate the performance of the harvesting system, field experiments were conducted in a mango orchard in Tiandong County, Baise City (one of the image acquisition sites). The experiments were carried out in two rounds on 3 July and 10 July 2025. Detailed experimental information is summarized in Table 8. A total of 92 mango fruits of the Golden and Tainong varieties were harvested. No pre-selection was applied, and the sample included both single and clustered fruits. Based on on-site observations, clustered fruits accounted for approximately 10–20% of the total. During the experiments, the following metrics were recorded: the success rate of fruit grasping (R_G), the success rate of stem cutting (R_C), and the number of fruits injured during cutting (N_I). A grasp was considered successful if the grasping unit accurately reached the target position and securely held the fruit. Subsequently, a cut was deemed successful if the cutting unit severed the stem connected to the target fruit after a successful grasp. Any visible peel damage to the target or adjacent fruits, or fruit drop caused during the harvesting process, was recorded as fruit damage. The performance metrics were manually observed and recorded in real time during the field experiments. The calculations of R_G and R_C are given by Equations (12) and (13).
R_G = \frac{\text{number of successfully grasped fruits}}{\text{total number of targets}}
R_C = \frac{\text{number of fruit stems successfully cut off}}{\text{total number of targets}}

3. Results

3.1. Mango Detection Results

To determine the optimal mango detection model, we selected three one-stage detection models, YOLOv8, YOLOv10 and YOLOv11, for performance comparison. All models were trained and compared under the same dataset and hardware conditions. The performance metrics of each model are shown in Table 9. AP0.5 represents the average precision at IoU = 0.5, and AP0.5:0.95 represents the average precision at IoU = 0.50:0.05:0.95. This metric provides a more comprehensive evaluation of the object detection model’s performance across different IoU thresholds.
As shown in Table 9, the models have similar AP values, so both accuracy and detection speed should be considered. The AP0.5, AP0.5–0.95 and F1 score of YOLOv10m are lower than those of the YOLOv11 series, but it is the fastest at 294 FPS, making it suitable for scenarios with high real-time requirements and for deployment on edge devices. YOLOv11l achieves the highest AP0.5 of 94.74% but is relatively slower at 208 FPS, making it suitable for tasks that prioritize detection accuracy. Overall, YOLOv11m offers the best balance of accuracy and speed, with an AP0.5 of 94.45%, an AP0.5–0.95 of 90.21% and 270 FPS, providing strong real-time performance. Therefore, we selected YOLOv11m as the mango detection model.
The detection results of YOLOv11m on images from the validation set are shown in Figure 16. The model can detect both unoccluded and partially occluded mangoes, with occlusions including leaves, other fruits and stems. In addition, the model avoids detecting distant small objects, edge-truncated mangoes and heavily occluded targets, reducing the computational load of the subsequent classification and spatial localization tasks.

3.2. Mango Classification Results

For object classification, we compared the performance of YOLOv8 and YOLOv11 under varying mango occlusion conditions. All models were trained and evaluated under the same hardware configuration. The results are shown in Table 10.
As shown in Table 10, the models exhibit small differences in classification accuracy. The YOLOv8l-cls model achieves the highest validation accuracy of 96.97% and a per-object processing time of 8 ms, satisfying real-time requirements. Therefore, we selected it as the mango classification model. After determining the classification model, we further analyzed its performance on the two mango categories by using YOLOv8l-cls to predict the “Pickable” and “Unpickable” samples in the validation set. The results are shown in Table 11.
As shown in Table 11, the YOLOv8l-cls model achieves classification accuracies of 98.85% and 94.23% for the “Pickable” and “Unpickable” mango categories, respectively, meeting the classification requirements for harvesting. The lower accuracy for the “Unpickable” class may be due to factors such as lighting conditions, color similarity between the fruit and leaves, and varying degrees of stem occlusion. Moreover, ambiguity in human labeling criteria during annotation increased the difficulty of classification.

3.3. Harvesting Experiment Results

With the developed mango harvesting robot system, we conducted harvesting experiments on a total of 92 mangoes (including single and clustered fruits). The harvesting process is shown in Figure 17, and the experimental results are presented in Table 12. R_G is the success rate of fruit grasping, R_C the success rate of stem cutting, and N_I the number of adjacent fruits injured during harvesting.
The complete harvesting pipeline comprises four sequential stages: (1) fruit detection, (2) occlusion-based pickability classification, (3) spatial localization of the fruit grasping point and (4) robot execution (fruit grasping and stem cutting). The harvesting performance metrics reported in Table 12 refer specifically to the robot execution stage, following successful fruit detection and occlusion classification. The rates RG (grasping success) and RC (cutting success) exclusively represent the success of the robot execution. By definition, a successful grasp is recorded when the end-effector securely grasps a fruit that has been previously detected and classified. A successful cut is recorded only if the stem is severed following a successful grasp. Consequently, any failure in grasping is also counted as a failure in cutting. The preceding stages of fruit detection and occlusion classification were rigorously validated offline during model training. In field trials, for every mango deemed pickable by the system, manual observation confirmed that the spatial localization (stage 3) was correct, as the end-effector consistently reached the intended grasping point. Therefore, RG and RC provide a focused and practical assessment of the system’s harvesting capability under field conditions.
According to Table 12, the program's timing function and on-site records, the average time from image acquisition to spatial localization of the grasping point (one full program cycle) was 119.4 ms. In the first round, 50 mangoes were harvested, achieving a fruit-grasping success rate of 96.0% (48/50) and a stem-cutting success rate of 90.0% (45/50), with two fruit damage incidents. The second round involved 42 mangoes, yielding a grasping success rate of 97.6% (41/42), a cutting success rate of 92.9% (39/42), and, similarly, two fruit damage incidents. Aggregating both rounds, the system harvested a total of 92 mangoes. The overall fruit-grasping success rate was 96.74% (89/92), with three failures. The overall stem-cutting success rate was 91.30% (84/92), with eight failures. A total of four fruit damage incidents occurred during the entire process. Statistical comparison using an independent two-sample t-test demonstrated no significant difference between the two experiments for fruit grasping success (p = 0.663), stem cutting success (p = 0.628), or fruit damage rate (p = 0.858), confirming the repeatability of the system's performance. The detailed analysis of the harvesting results is as follows.
(1)
The harvesting method we adopted was a fruit-grasping and stem-cutting strategy. A grasping success rate of 96.74% verified the effectiveness of the vision-based detection and localization method and the four-finger soft gripper. The approach can accurately grasp a single mango and also perform well on clustered mangoes, as shown in Figure 18a,b. Based on on-site observations, the main cause of grasping failures was contact between the robot arm and the branch carrying the target mango or adjacent branches and leaves during motion. The movement of the branches and leaves displaced the mango, leading to grasping failure, as shown in Figure 18c.
(2)
For fruit stem cutting, the success rate was 91.30%, indicating that the workflow of mango grasping, stem pull-down, linear stage extension and stem cutting was well-suited to the growth characteristics of mango. Cutting failures fell into three cases: (i) the stem of the target mango was short, so the scissors had to cut the main stem, which is not directly connected to the fruit. The main stem typically had a larger diameter and higher cutting resistance than the fruit stem, as well as more surrounding leaves and branches, which easily led to cutting failure, as shown in Figure 19a; (ii) numerous stems surrounded the target mango. As the scissors extended with the linear stage in an open state, surrounding stems entered the cutting range, so the scissors had to cut multiple stems simultaneously, increasing resistance and causing cutting failure, as shown in Figure 19b; (iii) clustered mangoes had short stems, and the main stem was often offset from the fruit's center. When the offset was large, the main stem did not fall within the scissors' cutting range, as shown in Figure 19c.
(3)
During the harvesting experiment, the end-effector did not damage the target fruit, but it did injure four adjacent mangoes. A detailed breakdown of these incidents aligns with the two primary failure modes observed: (i) in three instances, surrounding fruits were close to the target fruit; when the scissors moved forward, they scratched the adjacent fruits or caught them within the scissors' working range, injuring them, as shown in Figure 20a; (ii) in one instance, the stems of surrounding fruits were close to the target fruit and aligned with the scissors' motion direction. The end-effector grasped only the target fruit, but the scissors cut both stems simultaneously, causing the adjacent fruit to fall to the ground, as shown in Figure 20b.
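As referenced above, the following minimal sketch illustrates the per-fruit bookkeeping behind the success rates and the independent two-sample t-test used for the round-to-round comparison. It is an illustration only: the binary outcome encoding, the SciPy-based test, and the variable names are our own assumptions, not the analysis script actually used in this study.

```python
# Sketch of the success-rate bookkeeping and the independent two-sample t-test.
# The 0/1 encoding of each attempted fruit and the use of SciPy are assumptions.
import numpy as np
from scipy import stats

# 1 = success, 0 = failure, one entry per attempted fruit.
grasp_r1 = np.array([1] * 48 + [0] * 2)   # round 1: 48/50 grasped
grasp_r2 = np.array([1] * 41 + [0] * 1)   # round 2: 41/42 grasped
cut_r1   = np.array([1] * 45 + [0] * 5)   # round 1: 45/50 cut (grasp failures counted as cut failures)
cut_r2   = np.array([1] * 39 + [0] * 3)   # round 2: 39/42 cut

def rate(outcomes: np.ndarray) -> float:
    """Success rate over all attempted fruits."""
    return float(outcomes.mean())

print(f"R_G overall: {rate(np.concatenate([grasp_r1, grasp_r2])):.2%}")  # 89/92
print(f"R_C overall: {rate(np.concatenate([cut_r1, cut_r2])):.2%}")      # 84/92

# Independent two-sample t-test between the two rounds, as described in the text.
t_g, p_g = stats.ttest_ind(grasp_r1, grasp_r2)
t_c, p_c = stats.ttest_ind(cut_r1, cut_r2)
print(f"grasping: p = {p_g:.3f}, cutting: p = {p_c:.3f}")
```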

4. Discussion

4.1. Analysis of the Mango Harvesting Process

Due to illumination variations and complex backgrounds in orchard environments, detection algorithms inevitably produce false and missed detections, as highlighted by the red circles in Figure 21. Specifically, Figure 21a shows background leaves misidentified as a mango, Figure 21b shows a small target that is missed, Figure 21c shows a mango occluded by a fruit stem, resulting in an incomplete detection bounding box, and Figure 21d shows an unannotated distant mango and a mango truncated at the image edge being detected. We consider such detection errors acceptable in harvesting tasks; the key is to filter the detections so that only harvestable targets remain. Meanwhile, the tracked chassis enables the harvesting robot to repeatedly patrol the orchard, capturing images of trees and fruits from multiple viewpoints to obtain more accurate harvesting information.
To filter detection results and obtain information about harvestable mangoes, this study used a binocular vision and target-matching approach. First, by performing object detection and mango harvestability classification separately on the left and right images, the majority of mangoes truncated by image boundaries can be excluded. Second, using matching rules based on the row and column pixel coordinates of mango center points in the left and right images, the bounding box area, and the relative area error, false detections and incomplete targets can be filtered out. Third, after obtaining the spatial coordinates of the mango grasping point in the robot arm base coordinate system, we sorted the mangoes from near to far to reduce the possibility of the robot disturbing other mangoes during harvesting. Moreover, we limited the harvesting range to within 1.4 m, thereby filtering out distant targets. These strategies compensate for shortcomings of the detection algorithm and establish a robust mango detection-classification-spatial localization pipeline, enabling the robot to accurately harvest the majority of mangoes during the field experiments.
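As an illustration of these filtering rules, the sketch below pairs left- and right-image detections by row coordinate, column ordering and relative bounding-box area error, triangulates a depth for each pair, and then keeps only targets within the harvesting range, sorted from near to far. The threshold values, camera parameters and function names are placeholders introduced for illustration; they are not the exact values or code used in this study.

```python
# Illustrative left-right matching and range filtering for a rectified stereo pair.
from dataclasses import dataclass

@dataclass
class Det:
    cx: float    # bounding-box centre, column (pixels)
    cy: float    # bounding-box centre, row (pixels)
    area: float  # bounding-box area (pixels^2)

def match_stereo(left, right, row_tol=15.0, area_err_max=0.25):
    """Pair pickable detections from the left and right images."""
    pairs = []
    for dl in left:
        best, best_err = None, None
        for dr in right:
            # Rectified stereo: matching centres lie on (nearly) the same row,
            # and the left-image column is larger than the right-image column.
            if abs(dl.cy - dr.cy) > row_tol or dl.cx <= dr.cx:
                continue
            err = abs(dl.area - dr.area) / max(dl.area, dr.area)
            if err <= area_err_max and (best_err is None or err < best_err):
                best, best_err = dr, err
        if best is not None:
            pairs.append((dl, best))
    return pairs

def triangulate_depth(dl, dr, focal_px, baseline_m):
    """Depth from disparity for a rectified binocular pair: Z = f * B / d."""
    disparity = dl.cx - dr.cx
    return focal_px * baseline_m / disparity

def select_targets(pairs, focal_px, baseline_m, max_range_m=1.4):
    """Keep targets within the harvesting range, sorted near to far."""
    targets = []
    for dl, dr in pairs:
        z = triangulate_depth(dl, dr, focal_px, baseline_m)
        if z <= max_range_m:
            targets.append((z, dl, dr))
    return sorted(targets, key=lambda t: t[0])
```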

4.2. Analysis of Two-Stage and One-Stage Method

This study adopted a two-stage method. We first performed object detection on the images and then classified the detected bounding boxes into pickable and unpickable. We chose this approach because the one-stage method that performed detection and classification simultaneously yielded suboptimal results. The performance of the one-stage method on the validation set is shown in Table 13. The image dataset and data augmentation strategies were the same, and the only difference was that during annotation, the mangoes were directly divided into two categories, “Pickable” and “Unpickable”.
Comparing Table 13 with the two-stage results (Table 9, Table 10 and Table 11) shows a notable performance drop in the multi-class detection task, where a single model had to both detect mangoes and classify their occlusion status. In addition, since it was difficult to apply data augmentation specifically to the minority “Unpickable” class, class imbalance may have affected the model's performance. The detection and classification results of models trained with the two methods are shown in Figure 22. For the two-stage method, there were no missed detections; although the model detected mangoes at the image edges, it accurately judged the occlusion status of each target. For the one-stage method, missed detections occurred and the occlusion status was not classified accurately. Therefore, we ultimately used the two-stage method for mango detection and classification in natural orchard environments.
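For reference, a minimal sketch of the two-stage inference is given below: a YOLO detector first localizes mangoes, and each cropped bounding box is then passed to a YOLO classification model for pickability judgment. The weight file names and class labels are placeholders; the actual models are the YOLOv11m detector and YOLOv8l-cls classifier trained in this study, accessed here through the standard Ultralytics API.

```python
# Two-stage inference sketch: detect mangoes, then classify each crop.
import cv2
from ultralytics import YOLO

detector   = YOLO("mango_det_yolov11m.pt")   # stage 1: fruit detection (placeholder weight name)
classifier = YOLO("mango_cls_yolov8l.pt")    # stage 2: pickability classification (placeholder)

def detect_and_classify(image_path):
    image = cv2.imread(image_path)
    results = detector(image)[0]
    outputs = []
    for box in results.boxes:
        x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
        crop = image[y1:y2, x1:x2]
        cls_res = classifier(crop)[0]
        label = cls_res.names[cls_res.probs.top1]   # e.g. "Pickable" / "Unpickable"
        conf  = float(cls_res.probs.top1conf)
        outputs.append(((x1, y1, x2, y2), label, conf))
    return outputs
```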

4.3. Analysis of Mango Harvesting Strategy

Based on on-site observations, we adopted a combined fruit-grasping and stem-cutting harvesting strategy. This decision was driven by a key challenge: in the field of view of the 2.1 mm focal length camera, mango stems occupy very few pixels and exhibit colors similar to trunks and branches. Under varying illumination, they easily blend into the background, and a significant number of stems are occluded by leaves and fruit, as shown in Figure 23.
The primary advantage of our method lies in its simplified and more robust perception requirements. Whereas stem-based methods must solve the computationally challenging tasks of stem detection, fruit–stem relationship determination and precise stem-cutting point localization, our harvesting strategy requires only fruit detection and grasping point estimation. By eliminating the dependency on stem visibility, our method is inherently more suitable for the majority of harvesting scenarios in dense, unstructured canopies where stem occlusion is the norm.
Therefore, we posit the fruit-grasping approach as a superior primary harvesting solution for unstructured mango orchards. For the vast majority of mangoes whose stems are obscured, our method provides a direct and effective solution. For cases where the stem deviates significantly from the fruit center (e.g., in clustered fruit), a supplementary strategy can be employed: the robot can use an eye-in-hand camera for close-range stem detection after grasping and make minor adjustments to the cutting unit's position, ensuring successful severance. Combining the robust primary strategy with this supplementary procedure can ensure high overall success rates.

4.4. Comparative Analysis with State-of-the-Art Harvesting Systems

To evaluate the overall operational efficiency of the harvesting system, the total cycle time per fruit was analyzed. As precise programmatic timing was not implemented during the field experiments, the durations of key phases were estimated by reviewing video recordings of the harvesting process. The average time breakdown for a complete harvesting cycle, which spans from image acquisition to robot arm reset, is summarized in Table 14.
The total harvesting cycle averaged approximately 21 s per fruit. The most time-consuming phase was the robot arm trajectory execution (about 14 s). This duration resulted from a deliberate setting in which the robot arm's velocity and acceleration were limited to 20% of their maximum capabilities, chosen to give human operators sufficient time to execute an emergency stop if a potential collision was detected during operation in the cluttered orchard. Actuation times for the gripper and the linear stage (covering both extension and retraction) each averaged around 2 s. In addition, the robot arm response time, comprising the vision algorithm runtime (119.4 ms) and the inverse kinematics solution time (estimated at under 80 ms), was within 0.2 s.
A comprehensive comparison of key performance metrics between this study and prior research in mango harvesting is presented in Table 15. The table summarizes data on fruit/stem detection methods, models employed, detection metrics, harvesting strategies, harvesting success rates, harvesting times and fruit damage rates.
Analysis of the compiled data shows that most existing studies focus on detecting and segmenting mangoes and stems to localize picking points. The fruit detection AP achieved in this work is 94.45%, which falls within the medium-to-high range of the AP/mAP values (85.1% to 98.83%) reported in other studies, indicating robust and competitive detection performance. In terms of harvesting strategy, the fruit-grasping and stem-cutting strategy adopted in our research achieved a harvesting success rate of 91.3%, the highest among the reported field-test results. Given the sample of 92 mangoes attempted in field experiments, this result demonstrates the practical applicability and reliability of the method in orchard environments. Regarding harvesting time, the single-fruit harvesting time in this study was about 21 s. This duration is primarily attributable to the conservative safety settings, where the robot arm's velocity and acceleration were limited to 20% of their maximum to prevent collisions, and to the additional time required to deposit the fruit into a collection basket. In future implementations, harvesting efficiency could be significantly improved by integrating a flexible delivery tube beneath the end-effector, allowing fruits to be conveyed directly to a collection bin and enabling continuous robot arm operation. The fruit damage rate in our study was 4.35%. Comparable data from other mango harvesting studies are limited, with only one laboratory-based study reporting no fruit damage [36]. Further reduction in the damage rate may be achievable through structural refinement of the end-effector.

4.5. Optimization Analysis of Harvesting Performance

This study proposed a complete workflow for mango detection, harvestable-state classification and spatial localization of the grasping point, designed a grasping-cutting end-effector, and built a robot system on which field experiments were conducted. However, during the experiments we encountered several situations that posed challenges to robotic harvesting, and the next step is to improve the algorithmic pipeline and the hardware design to enhance harvesting performance.
(1)
The end-effector designed in this study used a grasp-and-cut harvesting method. It is applicable to fruit that is unobstructed or partially occluded, and to cases where the stem is occluded, but it is not suitable when the fruit is heavily occluded. To address this, a rubber clamping module could be installed below the scissors to clamp the stem first and then cut it, giving the end-effector two harvesting functions: grasping-cutting and stem clamping-cutting. We think this approach can help with harvesting clustered mangoes and in cases of severe fruit occlusion. For example, for clustered mangoes, since the fruit center point and the main stem cutting point are not collinear, a large lateral offset between these two points easily leads to cutting failure, as shown in Figure 24. Therefore, a dedicated detection and classification module for clustered mangoes can be added. When the target is identified as a clustered mango, the main stem is detected, and the cutting point is located, after which harvesting is performed via stem clamp-and-cut. Similarly, for heavily occluded mangoes, the corresponding stem can be detected and harvested using the stem clamp-and-cut approach.
(2)
During the robot arm's harvesting motion, contact with obstacles such as branches and leaves may damage the end-effector or disturb the target mango, thereby reducing the grasping and cutting success rates. Therefore, a camera-LiDAR fusion approach can be used to obtain point cloud data around the target mango, followed by path planning to enable obstacle-avoiding harvesting. For branch and leaf interference that is difficult to avoid, shields can be added to the outer sides of the gripper and scissors to block some of the obstacles and improve harvesting performance for severely occluded mangoes.
(3)
To prevent the scissors from damaging fruit adjacent to the target mango, and considering that the fruit stem is much smaller than the fruit, the rubber clamping module mentioned in (1) can first clamp the stem, keeping surrounding fruits away from the cutting range, and the scissors can then cut the stem.
(4)
In orchard environments, it is necessary to consider the equipment’s stability under high temperatures and direct sunlight, as well as its waterproof performance. Since waterproofing was not addressed, we transported the robot system to the test site four times but twice had to abort due to thunderstorms. In addition, the end-effector connectors in this study were primarily 3D-printed from resin. After multiple experiments under high temperature and sunlight, the parts showed yellowing (Figure 13) and some wear, although performance was not affected. Therefore, the harvesting robot system needs robust waterproofing and the ability to operate continuously in hot, humid conditions.

4.6. Analysis of Harvesting Completeness and Practical Performance

A critical metric for assessing the practical value of the robot system is the harvesting completeness ratio per plant (i.e., the percentage of all ripe fruits successfully harvested from a single tree). Our current field trials, which focused on validating the integrated detection-to-detachment pipeline, were conducted with the robot traversing along the tree row rather than performing exhaustive, multi-perspective harvesting around individual plants. Therefore, a precise, experimentally measured harvesting completeness ratio was not obtained.
However, based on the system's performance and operational constraints, we can provide a reasonable estimate. The key factors influencing harvesting completeness in our system are the following. (1) Fruit occlusion: while our strategy handles stem occlusion during grasping, severely obscured fruits may remain undetected; harvesting nearby fruits or adjusting the robot's position could expose some of these, but not all. (2) Collision avoidance: fruits growing very close to the main trunk pose a risk of collision between the robot arm and the trunk, as our system currently lacks sophisticated collision detection and path planning for such scenarios.
Given our achieved stem-cutting success rate of 91.3% for fruits that were attempted, and accounting for the fruits likely missed due to the above constraints, we estimate the per-plant harvesting completeness of our current system to be in the range of approximately 60% to 80%. This is lower than the near-complete harvest achievable by an experienced human worker who can navigate the canopy. Enhancing this ratio requires future advancements, particularly in mobile platform navigation for multi-view perception and real-time motion planning for collision-free access to densely packed fruits, which are important directions for bringing robot harvesting closer to commercial viability.
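The rough decomposition behind this estimate is sketched below. Only the 91.3% stem-cutting success rate is a measured quantity; the occlusion and reachability factors are hypothetical ranges introduced purely to illustrate how a per-plant completeness of roughly 60% to 80% can arise.

```python
# Back-of-the-envelope decomposition of per-plant harvesting completeness.
# All factor ranges below are hypothetical illustrations, not measured values.
cutting_success = 0.913            # measured stem-cutting success rate

fraction_detected  = (0.80, 0.95)  # hypothetical: fruits not lost to severe occlusion
fraction_reachable = (0.85, 0.95)  # hypothetical: fruits not skipped for trunk-collision risk

low  = fraction_detected[0] * fraction_reachable[0] * cutting_success
high = fraction_detected[1] * fraction_reachable[1] * cutting_success
print(f"estimated per-plant completeness: {low:.0%} to {high:.0%}")  # roughly 62% to 82%
```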

4.7. Limitations

A complete harvesting workflow should use a closed-loop control scheme, including autonomous navigation of the chassis, mango detection and classification, spatial localization of the grasping point, execution of the harvesting motion by the robot arm, and judgment of the harvesting status. Because autonomous navigation of the tracked chassis and the harvesting-status judgment algorithm were not yet integrated, the robot system developed in this study operated under open-loop control.
For autonomous navigation, we are trying to integrate an RTK (real-time kinematic) navigation system into the tracked chassis. During the harvesting process, when the fruit lies beyond a certain range of the robot arm’s workspace, the chassis is typically moved closer for harvesting. We think autonomous navigation should incorporate this decision-making and control process, so further field testing is required.
Harvesting status judgment covers both the fruit-grasping and the stem-cutting results. In this study, to obtain a global view of the mango tree, the binocular camera with a 2.1 mm focal length was fixed on the tracked chassis, so the fruit stem occupies only a small area in the image. Once the end-effector approaches the target fruit, the camera's view of the stem may be blocked by the robot itself. Therefore, after the robot arm has performed fruit grasping and stem cutting, it is difficult to ensure that the camera can re-detect and localize the fruit and stem accurately enough to judge the harvesting status. In subsequent work, a binocular camera can be mounted on the end-effector; after approaching the fruit, this camera can judge the grasping and cutting status and adjust the motion to achieve closed-loop control.

5. Conclusions

This study presented a novel, integrated robotic system for automated mango harvesting, designed to address key challenges in unstructured orchard environments. Its primary contribution is a comprehensive vision-to-action pipeline that operates without the need for stem detection or localization. The pipeline addresses the core harvesting stages in turn: robust fruit detection and occlusion judgment for perception, size-adaptive grasp-point estimation for the approach, and a stepwise grasping and cutting strategy for reliable detachment, which is effective even when the fruit stem is occluded. Furthermore, the size-specific harvesting strategy implemented through the end-effector effectively minimized damage to the fruit.
Experimental validation demonstrated the system’s effectiveness, achieving a fruit grasping success rate of 96.74% and a stem-cutting success rate of 91.30%, while maintaining a low fruit injury rate below 5%. The efficient image processing pipeline, with an average time of 119.4 ms, confirms the potential for real-time applications. Future work will focus on enhancing overall autonomy through improved mobile platform navigation, harvesting status judgment, and robot arm obstacle avoidance.

Author Contributions

Conceptualization, Z.L.; data curation, Q.L.; formal analysis, Q.L.; methodology, Z.L. and Q.L.; software, Q.L.; visualization, Q.L.; writing—original draft, Q.L.; writing—review and editing, Q.L. and Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

The work presented in this paper was supported by the National Natural Science Foundation of China (grant No. 32501773), the Specific Research Project of Guangxi for Research Bases and Talents (grant No. AD23026105), and the Basic Ability Promotion Project for Young Teachers in Guangxi (grant No. 2023KY0017).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to commercial restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ding, L.; Zhang, Y. China’s Mango Industry Development Situation and Countermeasures. J. Yunnan Agric. Univ. (Soc. Sci.) 2021, 15, 65–69. [Google Scholar] [CrossRef]
  2. Li, Y.P.; Ye, L.; Liang, W.H.; Deng, C.M.; Liu, Y.Q. Current Status and Countermeasures of Mango Industry Data Resources in China. Chin. J. Trop. Agric. 2020, 40, 105–109. [Google Scholar]
  3. Li, H.B.; Shi, Y. Review on orchard harvesting robots. China Agric. Inform. 2019, 31, 1–9. [Google Scholar] [CrossRef]
  4. Yuan, Y.; Bai, S.; Niu, K.; Zhou, L.; Zhao, B.; Wei, L.; Xiong, S.; Liu, L. Research progress on mechanized harvesting technology and equipment for forest fruit. Trans. Chin. Soc. Agric. Eng. (Trans. CSAE) 2022, 38, 53–63. [Google Scholar] [CrossRef]
  5. Liu, M.; Wang, F.L.; Xing, H.G.; Ke, W.L.; Ma, S.C. The Experimental Study on Apple Vibration Harvester in Tall-spindle Orchard. IFAC-Pap. 2018, 51, 152–156. [Google Scholar] [CrossRef]
  6. Zhang, J.X.; Zhao, H.; Chen, K.D.; Fei, G.B.; Li, X.F.; Wang, Y.W.; Yang, Z.Y.; Zheng, S.W.; Liu, S.Q.; Ding, H. Dexterous hand towards intelligent manufacturing: A review of technologies, trends, and potential applications. Robot. Cim.-Int. Manuf. 2025, 95, 103021. [Google Scholar] [CrossRef]
  7. Zhou, H.; Wang, X.; Au, W.; Kang, H.; Chen, C. Intelligent robots for fruit harvesting: Recent developments and future challenges. Precis. Agric. 2022, 23, 1856–1907. [Google Scholar] [CrossRef]
  8. Luo, L.F.; Zou, X.J.; Xiong, J.T.; Zhang, Y.; Peng, H.X.; Lin, G.C. Automatic positioning for picking point of grape picking robot in natural environment. Trans. Chin. Soc. Agric. Eng. 2015, 31, 14–21. [Google Scholar] [CrossRef]
  9. Zhu, Y.J.; Du, W.S.; Wang, C.Y.; Liu, P.; Li, X. Rapid Recognition and Picking Points Automatic Positioning Method for Table Grape in Natural Environment. Smart Agric. 2023, 5, 23–34. [Google Scholar] [CrossRef]
  10. Lin, Y.H.; Lv, Z.L.; Yang, C.C.; Lin, P.J.; Chen, F.Y.; Hong, J.W. Recognition of the overlapped honey pomelo images in natural scene and experiment. Trans. Chin. Soc. Agric. Eng. 2021, 37, 158–167. [Google Scholar] [CrossRef]
  11. Yu, Y.; Zhang, K.; Yang, L.; Zhang, D. Fruit detection for strawberry harvesting robot in non-structural environment based on Mask-RCNN. Comput. Electron. Agric. 2019, 163, 104846. [Google Scholar] [CrossRef]
  12. Huang, Y.K.; Zhong, Y.L.; Zhong, D.C.; Yang, C.C.; Wei, L.F.; Zou, Z.P.; Chen, R.Q. Pepper-YOLO: An lightweight model for green pepper detection and picking point localization in complex environments. Front. Plant Sci. 2024, 15, 1508258. [Google Scholar] [CrossRef] [PubMed]
  13. Chen, J.Q.; Ma, A.Q.; Huang, L.X.; Li, H.W.; Zhang, H.Y.; Huang, Y.; Zhu, T.T. Efficient and lightweight grape and picking point synchronous detection model based on key point detection. Comput. Electron. Agric. 2024, 217, 108612. [Google Scholar] [CrossRef]
  14. Wang, C.L.; Han, Q.Y.; Zhang, T.; Li, C.J.; Sun, X. Litchi picking points localization in natural environment based on the Litchi-YOSO model and branch morphology reconstruction algorithm. Comput. Electron. Agric. 2024, 226, 109473. [Google Scholar] [CrossRef]
  15. Wang, W.B.; Shan, Y.Z.; Hu, T.T.; Gu, J.N.; Zhu, Y.M.; Gao, Y. Locating apple picking points using semantic segmentation of target region. Trans. Chin. Soc. Agric. Eng. 2024, 40, 172–178. [Google Scholar] [CrossRef]
  16. Chen, K.; Li, T.; Yan, T.; Xie, F.; Feng, Q.; Zhu, Q.; Zhao, C. A Soft Gripper Design for Apple Harvesting with Force Feedback and Fruit Slip Detection. Agriculture 2022, 12, 1802. [Google Scholar] [CrossRef]
  17. Xiao, X.; Wang, Y.N.; Jiang, Y.M. End-Effectors Developed for Citrus and Other Spherical Crops. Appl. Sci. 2022, 12, 7945. [Google Scholar] [CrossRef]
  18. Li, Z.; Yuan, X.; Yang, Z. Design, simulation, and experiment for the end effector of a spherical fruit picking robot. Int. J. Adv. Robot. Syst. 2023, 20, 17298806231213442. [Google Scholar] [CrossRef]
  19. Jiang, Y.; Liu, J.; Hu, Z.; Zhang, W. Design and Experiment of End-effectors for Picking Navel Oranges Based on the Underactuated Principle. Jixie Chuandong 2024, 48, 105–113. [Google Scholar]
  20. Yin, Z.R.; Li, H.; Zuo, Z.J.; Guan, Z.X. Key technologies of tomato-picking robots based on machine vision. Int. J. Agric. Biol. Eng. 2025, 18, 247–256. [Google Scholar] [CrossRef]
  21. Park, Y.; Seol, J.; Pak, J.; Jo, Y.; Jun, J.; Son, H.I. A novel end-effector for a fruit and vegetable harvesting robot: Mechanism and field experiment. Precis. Agric. 2023, 24, 948–970. [Google Scholar] [CrossRef]
  22. Qu, G.; Zhong, X.; Yang, X.; Liu, L.; Hu, X.; Addy, M.M. Design and performance evaluation of a rigid-flexible coupling end-effector for tomato picking robots. Front. Agric. Sci. Eng. 2026, 13, 25643. [Google Scholar] [CrossRef]
  23. Li, M.; Liu, P. A bionic adaptive end-effector with rope-driven fingers for pear fruit harvesting. Comput. Electron. Agric. 2023, 211, 107952. [Google Scholar] [CrossRef]
  24. Yao, Z.; Xiong, J.; Yang, J.; Wang, X.; Li, Z.; Huang, Y.; Li, Y. Design and verification of a litchi combing and cutting end-effector based on visual-tactile fusion. Comput. Electron. Agric. 2025, 232, 110077. [Google Scholar] [CrossRef]
  25. Lian, J.; Pan, Q.; Wang, D. A Two-Stage Recognition and Planning Approach for Grasping Manipulators and Design of End-Effectors. J. Field Robot. 2025, 43, 230–256. [Google Scholar] [CrossRef]
  26. Fu, M.; Guo, S.; Cai, J.; Zhou, J.; Liu, X. Triz-aided design and experiment of kiwifruit picking end-effector. Inmateh—Agric. Eng. 2023, 71, 356–366. [Google Scholar] [CrossRef]
  27. Li, G.J.; Huang, X.J.; Li, X.H. Detection method of tree-ripe mango based on improved YOLOv3. J. Shenyang Agric. Univ. 2021, 52, 70–78. [Google Scholar]
  28. Xiong, J.; Liu, Z.; Chen, S.; Liu, B.; Zheng, Z.; Zhong, Z.; Yang, Z.; Peng, H. Visual detection of green mangoes by an unmanned aerial vehicle in orchards based on a deep learning method. Biosyst. Eng. 2020, 194, 261–272. [Google Scholar] [CrossRef]
  29. Chen, C.X. Research of Mango Detection and Segmentation Based on Adversarial. Master’s Thesis, South China Agricultural University, Guangzhou, China, 2020. Volume 2. [Google Scholar] [CrossRef]
  30. Li, X.L.; Lan, Y.B.; Wang, H.Z. Detecting mango fruits and peduncles in natural scenes using improved YOLOv10n. Trans. Chin. Soc. Agric. Eng. 2025, 41, 167–175. [Google Scholar] [CrossRef]
  31. Zhang, B.; Xia, Y.; Wang, R.; Wang, Y.; Yin, C.; Fu, M.; Fu, W. Recognition of mango and location of picking point on stem based on a multi-task CNN model named YOLOMS. Precis. Agric. 2024, 25, 1454–1476. [Google Scholar] [CrossRef]
  32. Zheng, C.; Chen, P.; Pang, J.; Yang, X.; Chen, C.; Tu, S.; Xue, Y. A mango picking vision algorithm on instance segmentation and key point detection from RGB images in an open orchard. Biosyst. Eng. 2021, 206, 32–54. [Google Scholar] [CrossRef]
  33. Li, H.W.; Huang, J.Z.; Gu, Z.N.; He, D.Q.; Huang, J.D.; Wang, C.L. Positioning of mango picking point using an improved YOLOv8 architecture with object detection and instance segmentation. Biosyst. Eng. 2024, 247, 202–220. [Google Scholar] [CrossRef]
  34. Chen, P.F. Research and Implementation of Mango Instance Segmentation and Detection of Picking Point Based on Mask R-CNN. Master’s Thesis, South China Agricultural University, Guangzhou, China, 2019. [Google Scholar] [CrossRef]
  35. Gu, Z.N. Research on the Clamping and Shearing Picking Technology of the Mango Picking Robot Under Visual Guidance. Master’s Thesis, Guangxi University, Nanning, China, 2025. Volume 11. [Google Scholar] [CrossRef]
  36. Ranjan, A.; Machavaram, R.; Patidar, P. Design and development of a peduncle-holding end effector for robotic harvesting of mango. Cogent. Eng. 2024, 11, 2403706. [Google Scholar] [CrossRef]
  37. Yin, C.; Huang, J.; Xia, Y.; Zheng, H.; Fu, W.; Zhang, B. Design, Development, Integration and Field Evaluation of a Dual Robotic Arm Mango Harvesting Robot. J. Field Robot. 2025, 42, 3705–3725. [Google Scholar] [CrossRef]
Figure 1. Mango plantation terrain.
Figure 2. Mango orchard: (a) multi-species mixed planting; (b) four commercial mango varieties.
Figure 3. Characteristic measurement of tree and fruit: (a) the spacing of the trees; (b) the highest and lowest points of fruit; (c) the size of the mango and the stem.
Figure 4. Comparison of fields of view: (a) 2.1 mm focal length camera, 40 cm from the outermost mango; (b) 4 mm focal length camera, 80 cm from the outermost mango.
Figure 5. Unannotated cases for mangoes: (a) truncated objects; (b) severely occluded objects; (c) out-of-reach distant objects.
Figure 6. Sample of datasets for detection and classification.
Figure 7. YOLOv11 network structure.
Figure 8. Binocular vision principle: (a) before and after binocular correction; (b) the XZ plane projection of the ideal binocular imaging model.
Figure 9. Positional relationship: (a) the grasping point P_M and the mango surface center point O_M; (b) the grasping center G_M of the gripper; (c) the gripper grasping a mango.
Figure 10. Harvesting strategy for different sizes of mangoes: (a) normal-size fruit; (b) small-size fruit; (c) large-size fruit, where P_M and O_M are not at the same height.
Figure 11. Mango harvesting robot.
Figure 12. The structure of the end-effector: 1. servo-driven four-finger soft gripper; 2. scissors for pruning; 3. push-pull solenoid; 4. linear stage.
Figure 13. End-effector workflow: (a) the grasping unit moves to the target fruit; (b) the gripper grasps the fruit and the linear stage extends; (c) the cutting unit performs fruit stem cutting.
Figure 14. Hand-eye calibration in the orchard.
Figure 15. The workflow of mango harvesting.
Figure 16. The mango detection results.
Figure 17. The mango harvesting process: (a) waiting-for-harvesting pose; (b) move to the target fruit; (c) grasp and pull down the fruit, then extend the linear stage; (d) cut the fruit stem and retract the linear stage; (e) place the fruit into the collection basket.
Figure 18. Mango grasping: (a) single fruit grasping; (b) clustered fruit grasping; (c) grasp failure due to fruit displacement.
Figure 19. Mango cutting failures: (a) leaves and branches around the main stem; (b) surrounding stems entered the cutting range; (c) large offset of the main stem in clustered mangoes.
Figure 20. Adjacent fruit injured: (a) adjacent fruit entered the cutting range; (b) adjacent fruit stem was cut off.
Figure 21. Incorrect detections: (a) leaves misidentified as mango; (b) missed detection of a small target; (c) incomplete bounding box due to stem occlusion; (d) unannotated mangoes in the distance and at image edges being detected.
Figure 22. Comparison of mango detection and classification: (a) two-stage method; (b) one-stage method.
Figure 23. Mango stems blend into the background or are occluded.
Figure 24. Morphology of clustered mangoes. The red dashed line indicates the vertical line passing through the mango center, and the "×" marks denote the positions of the main stem.
Table 1. Characteristic measurement data of trees and fruits.
Content | Item | Maximum | Minimum | Average
The spacing between trees (m) | Row spacing | 4.60 | 3.20 | 3.90
The spacing between trees (m) | Plant spacing | 3.40 | 2.90 | 3.12
The vertical distribution range of fruit growth (m) | Highest | 2.00 | 1.70 | 1.89
The vertical distribution range of fruit growth (m) | Lowest | 0.80 | 0.10 | 0.34
Mango size (mm) | Width | 88.5 | 45.2 | 65.9
Mango size (mm) | Thickness | 81.3 | 35.6 | 55.2
Mango size (mm) | Length | 152.0 | 70.3 | 109.1
Stem size (mm) | Diameter | 7 | 2.5 | 4.8
Table 2. The details of mango image acquisition.
Content | Time | Weather | Species
Location 1 | 22 July 2023, 11:00–17:00 | Sunny | Aumang, Jinhuang, Tainong
Location 2 | 7 June 2025, 13:00–17:00 | Sunny to cloudy | Jinhuang, Tainong, Guifei, Renong
Location 2 | 8 June 2025, 10:00–13:00 | Cloudy | Jinhuang, Tainong, Guifei, Renong
Note: Location 1 is the Mango Planting Demonstration Orchard of Guangxi Subtropical Agricultural Science and Technology City, Chongzuo City (22°55′ N, 108°05′ E); Location 2 is the Mango Orchard in Tiandong County, Baise City (23°40′ N, 107°04′ E).
Table 3. Environment configuration.
Configuration | Parameter
OS | Windows 10
RAM | 32 GB
GPU | NVIDIA GeForce RTX 4070 12 GB
CPU | i5-12400F 2.50 GHz
Framework | PyTorch 1.13.0, CUDA 11.7, cuDNN 8.0
Platform | PyCharm 2024
Table 4. Training parameters of object detection.
Parameter | Value | Parameter | Value
Image size | 640 × 640 | Optimizer | SGD
Batch size | 16 | Initial learning rate | 0.01
Epochs | 800 | Weight decay | 0.0005
Workers | 8 | Momentum | 0.9
Table 5. Training parameters of object classification.
Parameter | Value | Parameter | Value
Image size | 224 × 224 | Optimizer | SGD
Batch size | 64 | Initial learning rate | 0.01
Epochs | 500 | Weight decay | 0.0005
Workers | 16 | Momentum | 0.9
Table 6. Parameters of the robot arm.
Content | Parameter
Weight | 33.5 kg
Maximum load | 12 kg
Maximum working radius | 1304 mm
Joint range | ±360°
Table 7. Hand-eye calibration errors for harvesting experiments.
Experiment | Translation Error (RMS) | Translation Error (Max) | Rotation Error (RMS) | Rotation Error (Max)
First round | 1.92 mm | 3.22 mm | 0.32° | 0.57°
Second round | 1.84 mm | 2.46 mm | 0.38° | 0.78°
Table 8. Details of the field experiments.
Experiment | Time | Weather | Number of Targets
First round | 3 July 2025, 13:00–17:00 | Sunny | 50
Second round | 10 July 2025, 13:00–16:00 | Sunny to cloudy | 42
Table 9. Performance comparison of different YOLO models for mango detection.
Model | P | R | AP0.5 | AP0.5–0.95 | F1 | FPS | Size (MB)
YOLOv11m | 89.12% | 88.74% | 94.45% | 90.21% | 88.93% | 270 | 40.5
YOLOv11l | 88.03% | 89.18% | 94.74% | 90.16% | 88.60% | 208 | 51.2
YOLOv10m | 89.26% | 87.60% | 94.22% | 89.72% | 88.43% | 294 | 33.4
YOLOv8m | 88.08% | 88.82% | 94.15% | 89.58% | 88.45% | 256 | 52
YOLOv8l | 89.01% | 88.34% | 94.59% | 89.84% | 88.68% | 175 | 87.6
Table 10. Performance comparison of different YOLO models for pickability classification.
Model | Top-1 Accuracy | Speed (ms)
YOLOv8m-cls | 96.62% | 6
YOLOv8l-cls | 96.97% | 8
YOLOv11m-cls | 96.45% | 6
YOLOv11l-cls | 96.53% | 7
Table 11. Performance of the YOLOv8l-cls model on the pickable and unpickable mango subsets.
Category | Accuracy | Speed (ms)
Pickable | 98.85% | 9
Unpickable | 94.23% | 8
Table 12. Field harvesting performance in two experimental rounds.
Experiment | Number of Targets | R_G | R_C | N_I
First round | 50 | 96% | 90% | 2
Second round | 42 | 97.62% | 92.86% | 2
Total | 92 | 96.74% | 91.30% | 4
Table 13. Performance of the one-stage method for simultaneous mango detection and classification.
Category | P | R | mAP0.5 | mAP0.5–0.95
Pickable | 85.91% | 86.12% | 92.12% | 87.98%
Unpickable | 54.23% | 42.40% | 42.64% | 36.94%
Average | 70.07% | 64.26% | 67.38% | 62.46%
Table 14. Time consumption of the harvesting process (estimated from operational videos).
Category | Average Time
Gripper closing/opening time | 2 s
Linear stage extension/retraction time | 2 s
Robot arm trajectory execution time | 14 s
Single fruit harvesting time | 21 s
Table 15. Comprehensive comparison of key performance metrics.
Source | Target | Model | AP/mAP | Harvesting Strategy | Harvesting Success Rate | Harvesting Time | Damage Rate
[27] | Mango | Improved YOLOv3 | 94.91% | / | / | / | /
[28] | Mango | YOLOv2 | 86.40% | / | / | / | /
[29] | Mango | Improved Mask R-CNN | 85.10% | / | / | / | /
[30] | Mango and stem | Improved YOLOv10n | 95.50% | / | / | / | /
[31] | Mango and stem | Improved YOLOv5s | 82.42% | / | / | / | /
[32] | Mango and stem | Instance segmentation and key point detection | 94.7% (IoU = 0.75) | / | / | / | /
[34] | Mango and stem | Mask R-CNN | 92.90% | / | / | / | /
[33] | Mango and stem | Improved YOLOv8n | 98.9% for fruit and 97.1% for stem | / | / | / | /
[35] | Mango and stem | Improved YOLOv8 | 98.83% for fruit and 93.76% for stem | Grip and cut the stem | 80% in field test | <10 s | /
[37] | Mango and stem | Based on [31] | 83.94% | Grip and cut the stem | 73.92% in field test | 8.93 s | /
[36] | Mango and stem | End-effector | / | / | 85% in lab test | / | 0
Ours | Mango | YOLOv11m | 94.45% | Grasp the fruit and cut the stem | 91.3% in field test | 21 s | 4.35%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
