A Method of Grasping Detection for Kiwifruit Harvesting Robot Based on Deep Learning

Abstract: Kiwifruit harvesting with robotics can be troublesome due to the clustered growth of the fruit. With an inappropriate grasping angle, the gripper of the end effector can easily grasp the fruit unstably, or its bending and separation action can interfere with neighboring fruits, further reducing the success rate. Therefore, predicting the correct grasping angle for each fruit can guide the gripper to safely approach, grasp, bend, and separate the fruit. To improve the grasping rate and harvesting success rate, this study proposed a grasping detection method for a kiwifruit harvesting robot based on GG-CNN2. Based on the vertical downward growth characteristics of kiwifruit, the grasping configuration of the manipulator was defined. The clustered kiwifruit were mainly divided into single fruit, linear cluster, and other cluster, and the grasping dataset included depth images, color images, and grasping labels. GG-CNN2 was improved based on focal loss to prevent the algorithm from generating the optimal grasping configuration in the background or at the edge of the fruit. The performance test of the grasping detection network and the verification test of robotic picking were carried out in orchards. The results showed that GG-CNN2 had 66.7 k parameters, the average image calculation time was 58 ms, and the average grasping detection accuracy was 76.0%, which ensures that grasping detection can run in real time. The verification test results indicated that the manipulator, combining the position information provided by the target detection network YOLO v4 with the grasping angle provided by the grasping detection network GG-CNN2, could achieve a harvesting success rate of 88.7% and a fruit drop rate of 4.8%; the average picking time was 6.5 s.
Compared with the method in which the target detection network provides only fruit position information, this method showed advantages in harvesting success rate and fruit drop rate when harvesting linear clusters, and especially other clusters, while the picking time increased only slightly. Therefore, the grasping detection method proposed in this study is suitable for picking adjacent, clustered kiwifruit, and it can improve the success rate of robotic harvesting.


Introduction
Kiwifruit has an average vitamin C content of 70 mg per 100 g, and is considered a highly nutritious product [1]. China is the origin of kiwifruit and the largest kiwifruit producer in the world. The planting area in 2020 was 1.85 × 10^5 hectares, and the yield was 2.23 million tons [2]. Kiwifruit orchards need to be carefully managed throughout the year. In particular, fruit harvesting in autumn is labor-intensive work, accounting for more than 25% of the production costs [3]. In order to overcome the growing labor shortage, the development of efficient and adaptable kiwifruit harvesting robots has become a research hotspot [4][5][6][7].
Fruit target detection is one of the important steps in robotic harvesting. Robots working in orchards with complex lighting conditions require reliable information from 1.5–1.8 m above the ground [20,21]. Figure 1 shows the distribution characteristics of kiwifruit. The outline and calyx characteristics of kiwifruit are obvious. The fruits grow in clusters and are adjacent to each other. The clusters include single fruit, linear cluster, and other cluster, and the number of fruits in a single cluster is approximately 2–10 [14,22]. A single fruit is defined as one fruit with no adjacent fruit around it. A linear cluster is defined as two or more fruits distributed in a chain, in which each fruit has at most two adjacent fruits and only the fruits at the beginning and end of the chain have exactly one adjacent fruit. Approximately 87% of the fruits within the canopy have only two adjacent fruits [14]. Other cluster is defined as four or more fruits distributed in an irregular regional shape, with some fruits having three or more adjacent fruits.
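As an illustration of these definitions, the cluster type can be decided from each fruit's count of adjacent fruits. The function below is a hypothetical sketch for clarity, not part of the paper's pipeline:

```python
def cluster_type(adjacency_counts):
    """Classify a kiwifruit cluster from each fruit's number of adjacent
    fruits, following the definitions in the text (illustrative sketch)."""
    n = len(adjacency_counts)
    # single fruit: one fruit with no adjacent fruit
    if n == 1 and adjacency_counts[0] == 0:
        return "single fruit"
    # linear cluster: >= 2 fruits in a chain, each with at most two
    # neighbors, and exactly two chain ends with one neighbor each
    if (n >= 2 and all(c <= 2 for c in adjacency_counts)
            and sum(1 for c in adjacency_counts if c == 1) == 2):
        return "linear cluster"
    # everything else (e.g. fruits with >= 3 neighbors): other cluster
    return "other cluster"
```

For example, a two-fruit cluster (each fruit adjacent only to the other) is classified as a linear cluster, while any cluster containing a fruit with three or more neighbors falls into the other-cluster category.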
Robotic-arm grasping pose E2 = (x_k, y_k, z_k, R_x, R_y, R_z)^T is equivalent to the transformation matrix T^B_E of the end-effector coordinate system {E} relative to the robot base coordinate system {B}. T^B_E can be expressed as Equation (2) [23].
where T^B_K is the transformation matrix of the fruit coordinate system {K} relative to the robot base coordinate system {B}, T^K_E is the transformation matrix of the end-effector coordinate system {E} relative to the fruit coordinate system {K}, and R^E1_E2 is the rotation matrix of the end effector around the y-axis of its own coordinate system {E}. T^B_K can be calculated by combining the internal- and external-parameter transformation matrices of the camera [7]. T^K_E can be expressed as Equation (3).
T^K_E is composed of the rotation R(z, π/2)·R(y, 0)·R(x, π/2) and the translation vector (0, d, 0)^T, where d is the grasping distance (mm). Therefore, the problem of grasping-pose detection can be transformed into finding the optimal rotation matrix R^E1_E2; it can be expressed as in Equation (5).
where θ is the grasping angle (°), and r_y(θ) is the Rodrigues-transformed 3 × 3 rotation matrix for a rotation of θ around the y-axis. As shown in Figure 2b, the effective range of the grasping Euler angle R_z is [0, 2π]. Since the gripper is a two-finger gripper with rotational symmetry about its central axis (y-axis), the value range of the Euler angle R_z reduces to [0, π] ∪ [−π, 0]. At the same time, since the initial Euler angle is R_z1, the Euler angle R_z can be expressed as Equation (6).
Robotic-arm grasping pose E2 can be expressed as Equation (7).
Based on the above analysis, the detection problem of the grasping pose is finally transformed into the calculation of the grasping angle θ.
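The chain of transformations above can be sketched numerically. The helper names below are illustrative; only the structure of Equations (2), (3), and (5) is taken from the text:

```python
import numpy as np

def rot(axis, angle):
    """Rotation matrix about a principal axis ('x', 'y', or 'z')."""
    c, s = np.cos(angle), np.sin(angle)
    if axis == 'x':
        return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])
    if axis == 'y':
        return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def T_K_E(d):
    """Homogeneous transform of {E} relative to {K} (Eq. 3):
    rotation R(z, pi/2) R(y, 0) R(x, pi/2), translation (0, d, 0)."""
    T = np.eye(4)
    T[:3, :3] = rot('z', np.pi / 2) @ rot('y', 0) @ rot('x', np.pi / 2)
    T[:3, 3] = [0.0, d, 0.0]
    return T

def grasp_pose(T_B_K, d, theta):
    """Compose T^B_E = T^B_K @ T^K_E @ r_y(theta) (Eqs. 2 and 5),
    where r_y(theta) is the Rodrigues rotation about the gripper's y axis."""
    R_E = np.eye(4)
    R_E[:3, :3] = rot('y', theta)
    return T_B_K @ T_K_E(d) @ R_E
```

With T^B_K known from the camera calibration, the full end-effector pose is then fixed by the single scalar θ, which is the quantity the grasping-detection network predicts.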

Grasping Angle
In Figure 3, the planar grasping configuration g is defined as Equation (8).
where q represents the grasping quality, (u, v) represents the grasping point in pixel coordinates, the grasping angle θ is defined as the angle between the opening-closing direction of the gripper (green line) and the horizontal axis of the camera (sky-blue line), and w represents the grasping width in the image. Since the gripper spacing can adapt to the maximum diameter of the kiwifruit [6], we do not impose strict requirements on the prediction accuracy of the grasping width. In Figure 3, the black fruit area indicates that the gripper cannot touch any fruit other than the current one, and the white background area is regarded as the free area, which the gripper can reach. The bending normal vector is perpendicular to the opening-closing direction of the gripper. The gripper rotates by 60° to safely separate the fruit from the stalk [6].

Image Acquisition
The images of kiwifruit clusters were collected at the kiwifruit experimental station in Meixian County, Shaanxi Province, China, during the daytime from August to October 2022, as shown in Figure 4. The images were acquired by a depth camera (RealSense D435i, Intel Corporation, Santa Clara, CA, USA). The image resolution was set to 640 × 480. The depth camera was placed approximately 30 cm below the canopy for image acquisition from bottom to top. The color and depth images were saved in PNG and TIFF formats. Backlighting did not affect the quality of the depth images of kiwifruit. The effective filling rate of the fruit area was above 95% [24]. A total of 360 original images were obtained, including 50 single-fruit images, 220 linear-cluster images, and 90 other-cluster images. All images were collected at different locations to ensure that there were no overlapping regions between images.

Grasping Datasets
The Cornell grasping dataset can be used for training grasping-detection networks [25] and is mainly intended for two-finger grippers or suckers. In this paper, we refer to the AFFGA-Net annotation method to construct a grasping dataset [26], including depth images, color images, and grasping labels in MAT format. Figure 5 shows the visualized results of the grasping labels. According to the kiwifruit distribution characteristics of each cluster, a specific positive-sample labeling method was used. The grasping label of a single fruit (referred to as SF) is shown in Figure 5a. The blue area in the figure is the grasping point, corresponding to the kiwifruit calyx area, and the green circles indicate that the fruit can be safely grasped by the gripper at any grasping angle. The grasping label of a linear cluster (referred to as LC) is shown in Figure 5b. Since the fruits of the linear cluster are distributed in a chain shape, there is only one adjacent fruit along the chain. The green lines in the figure represent the opening-closing direction of the gripper. Each grasping point on the blue line corresponds to a grasping angle indicated by a green line. The grasping label of other cluster (referred to as OC) is shown in Figure 5c. Since three or more fruits are adjacent to the central fruit, there is no continuous free area of approximately 180 degrees, and it is difficult for the gripper to approach the fruit at a safe grasping angle; therefore, there may be unlabeled fruits in the other cluster. For a peripheral fruit of the other cluster, there are two fruits adjacent to it, and the blue line is approximately perpendicular to the line connecting the two adjacent fruits' calyxes. The grasping dataset was divided into the training set and the test set at a ratio of 4:1. In order to expand the number of samples in the training set, the images and labels were simultaneously enhanced by scaling, rotating, and flipping.
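The simultaneous enhancement of images and labels mentioned above requires care: rotating an image must also shift the per-pixel grasp-angle labels by the same amount. The sketch below illustrates this for 90° rotations; the function and the gripper-symmetric angle range are assumptions of this sketch, not the paper's exact procedure:

```python
import numpy as np

def augment(depth, angle_map, k):
    """Rotate a depth image and its per-pixel grasp-angle label together
    by k * 90 degrees (illustrative sketch of joint augmentation)."""
    depth_r = np.rot90(depth, k)
    # the label map rotates with the image, and each angle shifts by the
    # same rotation
    angle_r = np.rot90(angle_map, k) + k * np.pi / 2
    # fold back into a gripper-symmetric range [-pi/2, pi/2]
    angle_r = (angle_r + np.pi / 2) % np.pi - np.pi / 2
    return depth_r, angle_r
```

Flipping is handled analogously, with the angle map negated; scaling leaves the angle labels unchanged but rescales the width labels.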

Network Structure
The grasping configuration is predicted based on the grasping-detection network GG-CNN2 [19], and the network architecture is shown in Figure 6. First, the depth image is scaled to 300 × 300 pixels and fed into the network. Image features are then extracted by stacking four standard convolutions of different sizes and two max-pooling layers to generate a low-resolution feature map. The feature map is then restored in scale space by stacking two bilinear-interpolation up-sampling layers and standard convolutions. Finally, the three-channel grasping-pose maps Gθ are output, including the grasping-quality map Qθ, the grasping-width map Wθ, and the grasping-angle map Φθ. The grasping-quality map Qθ describes the grasping feasibility of each pixel in the depth image; the closer the value is to 1, the higher the grasping quality and the darker the color appears in the figure. The method for generating the optimal grasping configuration g* is based on the heatmap-maximum-value strategy [27], in which the position parameters of g* depend on the peak-point coordinates of Qθ, and the angle and width parameters are taken from Φθ and Wθ at the same coordinates, respectively. The formula is defined as follows. Normalization of the input features can speed up convergence of the model; the grasping width is normalized by dividing by the maximum width value of 250 pixels. The cosine and sine prediction maps of the grasping angle are obtained by linear regression, and Φθ is then obtained by solving Equation (10) [19].
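The encoder-decoder structure and the three output heads described above can be sketched in PyTorch as follows. The layer counts and channel sizes here are illustrative stand-ins, not the published GG-CNN2 architecture; the angle is recovered from its sine and cosine maps as in Equation (10):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraspNet(nn.Module):
    """Sketch of a GG-CNN2-style fully convolutional grasp predictor
    (layer sizes are illustrative assumptions)."""
    def __init__(self):
        super().__init__()
        # stacked convolutions and max pooling produce a low-resolution
        # feature map, as described in the text
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 11, padding=5), nn.ReLU(),
            nn.Conv2d(16, 16, 5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 32, 5, padding=2), nn.ReLU(),
        )
        # one 1x1 head per output map
        self.q_head = nn.Conv2d(32, 1, 1)    # grasping quality Q
        self.cos_head = nn.Conv2d(32, 1, 1)  # cos component of the angle
        self.sin_head = nn.Conv2d(32, 1, 1)  # sin component of the angle
        self.w_head = nn.Conv2d(32, 1, 1)    # grasping width W

    def forward(self, depth):
        x = self.features(depth)
        # restore the input resolution with bilinear up-sampling
        x = F.interpolate(x, size=depth.shape[-2:], mode='bilinear',
                          align_corners=False)
        q = torch.sigmoid(self.q_head(x))
        # angle recovered from its sine and cosine maps (Eq. 10)
        theta = 0.5 * torch.atan2(self.sin_head(x), self.cos_head(x))
        w = self.w_head(x)
        return q, theta, w
```

A 300 × 300 depth image thus yields three 300 × 300 maps, from which the optimal grasping configuration is read at the quality peak.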


In order to prevent the network from generating the optimal grasping configuration in the background or at the edge of the fruit due to the imbalance of positive and negative samples, the original mean-squared-error (MSE) loss function is replaced with a focal loss [28] built on binary cross entropy (BCE), which improves the learning efficiency and generalization ability of the network. Predicting the grasping region is a binary classification problem: the sigmoid function is used to normalize the prediction results, and the focal loss is used to calculate the loss. The grasping-quality loss L_qua is defined as Equation (11).
where N is the size of the feature map, p_q^n is the predicted probability, y_q^n is the sample label, α is the balance factor, and γ is the modulating factor. Predicting the grasping angle is a regression problem. First, the sigmoid function is used to normalize the output of the angle head, and then the BCE function is used to calculate the loss. The grasping-angle loss L_ang is defined as Equation (12).
where p_l^n is the predicted probability, and y_l^n is the sample label. Predicting the grasping width is a regression problem. The BCE function is used to calculate the loss, and the grasping-width loss L_wid is defined as Equation (13) [26].
where p_w^n is the predicted grasping width, and y_w^n is the sample label. In order to balance the loss of each branch, the total loss used to optimize the network is calculated from the output of each head, and the multi-task loss L_total is defined as Equation (14).
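A minimal sketch of the focal-loss and multi-task-loss formulation described above (Equations (11)–(14)). The α and γ values here are the common defaults from the focal-loss paper, not necessarily those used in this study, and the equal weighting of the three branches is an assumption:

```python
import torch

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss on sigmoid probabilities p vs. binary labels y (Eq. 11);
    alpha and gamma are the common defaults, assumed here."""
    eps = 1e-6
    p = p.clamp(eps, 1 - eps)
    # (1 - p)^gamma down-weights easy positives; p^gamma down-weights
    # easy negatives, countering the positive/negative imbalance
    pos = -alpha * (1 - p) ** gamma * y * torch.log(p)
    neg = -(1 - alpha) * p ** gamma * (1 - y) * torch.log(1 - p)
    return (pos + neg).mean()

def total_loss(q_pred, q_true, a_pred, a_true, w_pred, w_true):
    """Multi-task loss (Eq. 14 sketch): focal loss on the quality head,
    BCE on the angle and width heads, summed with equal weights."""
    bce = torch.nn.functional.binary_cross_entropy
    return (focal_loss(q_pred, q_true)
            + bce(a_pred, a_true) + bce(w_pred, w_true))
```

Because the focal term down-weights the abundant easy background pixels, the quality head concentrates its gradient on the hard fruit-region pixels, which is what discourages peaks in the background or at fruit edges.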

Evaluation and Hyperparameters
In this paper, the Jaccard index of the grasping rectangle [29] is used to determine whether the grasping estimation is effective. Specifically, the grasping prediction needs to meet two conditions at the same time: (1) the difference between the predicted grasping angle and the labeled grasping angle is less than 15°; and (2) the Jaccard index of the predicted grasping frame and the true grasping frame is not lower than 0.25. The Jaccard index is calculated by Equation (15).
where G_P represents the area of the predicted grasping frame, G_T represents the area of the true grasping frame, G_P ∩ G_T represents their intersection, and G_P ∪ G_T represents their union. The accuracy on the test-set data is used as the evaluation index and is calculated according to Equation (16).
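The two correctness conditions can be checked directly on binary grasp-frame masks; this is a straightforward sketch of the Jaccard criterion of Equation (15), with the mask representation as an assumption:

```python
import numpy as np

def jaccard(gp, gt):
    """Jaccard index of two binary masks G_P and G_T (Eq. 15)."""
    inter = np.logical_and(gp, gt).sum()
    union = np.logical_or(gp, gt).sum()
    return inter / union if union else 0.0

def grasp_correct(theta_p, theta_t, gp, gt):
    """A prediction is correct if the angle error is under 15 degrees
    and the Jaccard index is at least 0.25."""
    ang_ok = abs(np.degrees(theta_p - theta_t)) < 15.0
    return ang_ok and jaccard(gp, gt) >= 0.25
```

Accuracy (Equation (16)) is then the fraction of test-set predictions for which `grasp_correct` holds.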
The network is implemented based on the PyTorch deep-learning framework. The operating environment is Ubuntu 16.04 with an AMD Ryzen 7 PRO 4750U CPU with Radeon Graphics. The network uses the Adam optimizer. The initial learning rate is set to 0.001, the weight-decay coefficient is set to 0.01, and the batch size is set to 2; a total of 2000 epochs were trained.

Network Training Results
The network training-process data were exported from TensorBoard. Figure 7a shows that the loss curve gradually decays with the number of iterations. The loss decays quickly in the early iterations, then converges and stabilizes at approximately 1.8 after 1000 iterations of training. Figure 7b shows that the graspable-rate curve gradually increases with the number of iterations and converges to approximately 80%. This curve indicates that the GG-CNN2 network can effectively predict the grasping configuration and that its generalization ability gradually improves.


Grasping Detection Results
In order to evaluate the generalization ability of the algorithm in the orchard scenario, we took clustered-kiwifruit scenes at random locations in the orchard as the test environment and carried out the grasping-configuration detection test based on the grasping-detection algorithm. Figure 8 shows the results of the grasping detection. The grasping angle, the most important parameter in the grasping configuration, is annotated in the figure. The results show that, for different fruit-distribution scenarios, the grasping algorithm can generate an optimal grasping configuration with the highest grasping quality while meeting the requirements. Although the background in the figure shows different lighting conditions, and some features of the fruit are lost due to backlight, the network can still rely on the depth prior information provided by the depth image to complete the prediction of the grasping configuration.

Figure 9 shows the process of detecting the grasping angle. The depth image generated candidate grasping areas through the grasping-detection network (Figure 9b); then the candidate grasping angles corresponding to the grasping-quality peak pixels in each region were selected (Figure 9c); the final grasping angle was selected based on the principle of maximum grasping quality (Figure 9d). As the fruit depth information on the left side of the depth image (Figure 9a) was incomplete, that fruit was not detected as a candidate grasping area.

Figure 10 shows a case of false-positive prediction. The leaf outline around the fruit in the depth image is clear, its shape is approximately circular, and its depth value is close to that of the fruit, which leads to a false-positive prediction. Therefore, if grasping prediction depends only on the depth image, it will be affected by the interference of leaves and the depth filling rate in the actual orchard environment.

Table 1 shows the performance results of the grasping-detection network in different scenarios.
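The maximum-grasping-quality selection used in this step can be sketched as follows; the quality threshold is an illustrative assumption:

```python
import numpy as np

def select_grasp(q_map, theta_map, w_map, q_thresh=0.5):
    """Heatmap-maximum strategy sketch: take the pixel with the highest
    grasping quality and read the angle and width at that pixel."""
    v, u = np.unravel_index(np.argmax(q_map), q_map.shape)
    if q_map[v, u] < q_thresh:   # threshold value is an assumption
        return None              # no candidate grasping region
    return u, v, theta_map[v, u], w_map[v, u]
```

Regions with incomplete depth information, like the left-side fruit in Figure 9a, simply never produce a quality peak and are skipped by the same mechanism.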
The results show that GG-CNN2 had 66.7 k parameters, the average image calculation time was 58 ms, and the average grasping-detection accuracy was 76.0%, which ensures that grasping detection can run in real time. The algorithm shows better grasping-prediction ability for single fruit and linear clusters than for other clusters, and it can complete the grasping-prediction task for most fruits. At the same time, the lightweight network can be deployed on a portable graphics processing unit, enabling application in real scenes.

Verification Test of Robotic Picking
In order to verify whether the robot can improve the fruit-grasping rate and harvesting success rate when combining the position and grasping-angle information of the target fruit, a picking experiment was conducted in the trellis-cultivation kiwifruit orchard of the Yangling International Kiwifruit Innovation and Entrepreneurship Park.

Overall Structure
As shown in Figure 11, the overall structure of the kiwifruit-picking robot consists of five parts: robotic arm, end effector, vision system, fruit-collection device, and mobile platform. The robotic arm (UR5, Universal Robots, Odense, Denmark) is a lightweight, highly flexible multi-joint arm composed of six rotating joints, with a repeatability of ±0.1 mm, a working radius of 850 mm, and an effective payload of 5 kg. The end effector is composed of two 3D-printed lightweight grippers, photoelectric sensors, and pneumatic components. The inner curved surface of the grippers is designed to fit the shape of the kiwifruit, thereby reducing fruit damage during picking. The total weight of the end effector is 3.5 kg, and the separation force between stalk and fruit is 3–10 N [6], which meets the requirement that the load on the robotic arm be less than its 5 kg payload. The vision system includes an RGB-D camera (RealSense D435i, Intel, Santa Clara, CA, USA) and an image-processing unit (Jetson Nano, NVIDIA, Santa Clara, CA, USA). The camera detects and locates the target kiwifruit in a bottom-up direction through an eye-in-hand arrangement [10]. The fruit-collection device includes a bellows and a box; the harvested fruits slide into the box, cushioned by the bellows. The mobile platform (Safari-880T, Guoxing Intelligent Technology, Shenzhen, China) is a crawler chassis with good trafficability in the orchard.


Control System
The picking-robot control system is developed based on the ROS-MoveIt (Robot Operation System Motion Planning Framework) [30], as shown in Figure 12. The RGB-D camera captures fruit color images and depth images and transmits the images to the image-processing unit. The image-processing unit first performs fruit target detection and grasping detection based on the deep-learning model, and then obtains the pose information of the target fruit relative to the robot base coordinate system based on the internal and external parameter matrices of the camera. The fruit-pose information is sequentially published in the form of topics and the robotic-arm control node subscribes to the topic. The rapidly exploring random trees (RRT) algorithm in the Open Motion Planning Library (OMPL) is used for path planning. The inverse kinematics solution is solved by calling the inverse solver IKFast to form the dynamic trajectory of the robotic-arm kinematics group and drive the robotic arm to arrive at the target pose. After the robotic arm completes the current target-fruit picking task, the image-processing node updates the fruit-pose information until all fruit-picking tasks are completed.

Test Method
We implemented two methods for harvesting clustered kiwifruit. Method I is the original method: fruit target detection is performed with the YOLO v4 network, and the picking order is determined by the principle of shortest spatial distance; the manipulator then combines its current pose with the fruit-position information to plan motions and complete the fruit-grasping and picking tasks one by one. In Method II, fruit target detection and grasping detection are performed with the YOLO v4 network and the GG-CNN2 network, respectively; the manipulator plans its motion using both the fruit position and the grasping angle, and finally completes all fruit-picking tasks one by one. During the test of Method II, the robot removes the fruit associated with the optimal grasping configuration from the scene after each prediction; the fruits are removed one by one, which forms the picking sequence. Picking a fruit involves three steps. First, the manipulator receives the instruction to move to the underside of the canopy, beneath the axis of the target fruit. Then, the end effector moves vertically upward to the fruit positioning point, and the photoelectric sensor signal triggers the gripper to close and grasp the fruit. Finally, the fruit is separated from the peduncle by rotating the wrist joint of the robotic arm through a certain angle. During the second step, the manipulator adjusts the gripper according to the predicted grasping angle to safely approach and grasp the target fruit. In this test, kiwifruits located at different positions in the canopy were randomly selected, and clustered fruits such as single fruit, two-fruit linear cluster, three-fruit linear cluster, and other cluster were tested.
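The iterative select-and-remove loop of Method II can be sketched as below. The fruit records and quality scores are illustrative stand-ins for the GG-CNN2 output, and the function names are my own, not the paper's.

```python
def predict_best_grasp(fruits):
    """Stand-in for the grasping-detection network: return the fruit whose
    grasp configuration has the highest predicted grasping quality."""
    return max(fruits, key=lambda f: f["quality"])

def pick_all(fruits):
    """Repeat the Method II cycle: predict the optimal grasp, execute the
    three picking steps, remove the picked fruit from the scene, and
    re-evaluate the remaining fruits. The resulting order is the picking
    sequence that emerges one fruit at a time."""
    remaining = list(fruits)
    sequence = []
    while remaining:
        target = predict_best_grasp(remaining)
        # Three steps from the text: approach from below the canopy,
        # rotate the gripper to the predicted grasping angle and close,
        # then bend the wrist joint to separate fruit from peduncle.
        sequence.append((target["id"], target["angle"]))
        remaining.remove(target)
    return sequence

# Illustrative scene: three clustered fruits with predicted grasp angles.
scene = [
    {"id": "A", "quality": 0.62, "angle": 15.0},
    {"id": "B", "quality": 0.91, "angle": -30.0},
    {"id": "C", "quality": 0.48, "angle": 5.0},
]
order = pick_all(scene)  # [("B", -30.0), ("A", 15.0), ("C", 5.0)]
```

In the real system the scene changes physically as fruits drop into the box, so each iteration re-runs grasping detection on a fresh depth image rather than on a static list.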
The number of fruits picked into the fruit box, the number of fruits left unseparated from the branches, and the number of fruits dropped on the ground were counted, and the harvesting success rate and the fruit drop rate were calculated. In addition, a phone timer was used to record the total time from the initial position of the end effector to the end of a cluster being picked; this total time was divided by the number of kiwifruits in the cluster, and the average of several groups of mean times was taken as the average picking time. Figure 13 shows the picking process of the manipulator in the kiwifruit orchard. As shown in Figure 13a, the deep-learning-based target-detection network obtains the position information of all fruits in the color image. Figure 13b shows the changes in the acquired depth images. As shown in Figure 13c, the deep-learning-based grasping-detection network obtains the grasping angle corresponding to the highest grasping quality in the current fruit depth image. Figure 13d shows the manipulator combining the target position and pose information to complete motion planning and execute grasping. As the fruits were separated and dropped into the box along the bellows, the distribution of the fruits in the depth image changed accordingly. Therefore, the grasping network needs to re-evaluate the grasping quality and grasping angle of the remaining fruits in the current depth image and determine the next fruit to be grasped. The robotic-picking test results are shown in Table 2. The single fruit, linear cluster, and other cluster were used in 10, 25, and 27 grasping tests, respectively. The results show that the grasping rates of single fruit and linear cluster are both higher than that of other cluster, because the free area around the other cluster is relatively small.
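The evaluation metrics described above can be computed as follows. The counts in the example are hypothetical and chosen only to illustrate the formulas; they are not the paper's raw tallies.

```python
def harvest_metrics(picked, unseparated, dropped, total_time_s):
    """Compute the three metrics from the counting scheme in the text:
    harvesting success rate = fruits in the box / all attempted fruits,
    fruit drop rate = fruits dropped on the ground / all attempted fruits,
    average picking time = total elapsed time / all attempted fruits."""
    total = picked + unseparated + dropped
    success_rate = picked / total
    drop_rate = dropped / total
    avg_time_s = total_time_s / total
    return success_rate, drop_rate, avg_time_s

# Hypothetical tallies: 55 fruits boxed, 4 left on the branch,
# 3 dropped, 403 s of total picking time.
success, drop, avg = harvest_metrics(55, 4, 3, 403.0)
# success ≈ 0.887, drop ≈ 0.048, avg = 6.5 s per fruit
```

Note that the paper averages the per-cluster mean times over several groups of clusters; the sketch collapses this to a single pooled average for simplicity.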
Method II, combining target-detection and grasping-detection information, achieved a fruit-harvesting success rate of 88.7% and a fruit drop rate of 4.8%. Compared with Method I, the harvesting success rate increased by 8.1% and the fruit drop rate decreased by 4.9%; the average picking time was 6.5 s, a slight increase. There was no obvious difference in the grasping rate between the two methods for single-fruit picking, but for the linear cluster, and especially the other cluster, the difference was obvious, indicating that Method II is effective when the robot faces clustered fruit-picking tasks. It is a safe method that can effectively improve the harvesting success rate and reduce the drop rate for clustered fruit. In addition, the small number of fruits left on the branches was mainly due to unsuccessful detection caused by environmental factors such as leaf occlusion and backlighting; the picking sequence of clustered fruit is also an important influencing factor. Several kinds of fruit- and vegetable-picking robots using multi-joint manipulators were compared and analyzed, as shown in Table 3. The picking efficiency of greenhouse-vegetable robots, such as tomato and sweet-pepper picking robots, is relatively lower than that of fruit-picking robots; most of these robots need to cut stalks and pick single targets selectively, which places requirements on positioning accuracy. The harvesting rate of our kiwifruit-picking robot is 80.6%, and the picking time is 5.8 s. However, there is an obvious difference between robotic and manual picking in operating efficiency; we need to further optimize the perception and planning algorithms.

Table 3. Comparison of different fruit harvesting robots.

Conclusions
(1) In this study, a grasping-detection method for a kiwifruit harvesting robot was proposed based on the GG-CNN2, which enables the gripper to safely and effectively grasp clustered fruits and avoids interference of the bending action with neighboring fruits. The clustered kiwifruit was divided into three types: single fruit, linear cluster, and other cluster.
(2) The performance test of the grasping-detection network showed that the GG-CNN2 has 66.7 k parameters, an average image calculation speed of 58 ms, and an average accuracy of 76.0%, which ensures that the grasping prediction can complete most tasks and run in real time.
(3) The verification test of robotic picking showed that the manipulator, combining the position information provided by the target-detection network YOLO v4 and the grasping angle provided by the grasping-detection network GG-CNN2, achieved a harvesting success rate of 88.7% and a fruit drop rate of 4.8%; the average picking time was 6.5 s. Compared with the method based only on target-detection information, the harvesting success rate increased by 8.1% and the fruit drop rate decreased by 4.9%, while the picking time increased slightly. The grasping-detection method is suitable for near-neighbor multi-kiwifruit picking.