Real-time Fruit Recognition and Grasp Estimation for Autonomous Apple Harvesting

Abstract—In this research, a fully neural-network-based visual perception framework for autonomous apple harvesting is proposed. The framework comprises a multi-function neural network for fruit recognition and a Pointnet-based grasp estimation network that determines the proper grasp pose to guide robotic execution. The fruit recognition network takes raw RGB images from an RGB-D camera as input to perform fruit detection and instance segmentation, while the Pointnet grasp estimation network takes the point cloud of each fruit as input and predicts a grasp pose for it. The proposed framework is validated on RGB-D images collected in both laboratory and orchard environments, and a robotic grasping test in a controlled environment is also included in the experiments. Experimental results show that the proposed framework can accurately localise fruits and estimate grasp poses for robotic grasping.


I. INTRODUCTION
Autonomous harvesting plays a significant role in the recent development of the agricultural industry [1]. Vision is one of the essential tasks in autonomous harvesting, as it detects and localises the crop to guide the robotic arm to perform detachment [2]. The vision task in orchard environments is challenging, as many factors can influence the performance of the system, such as variances in illumination and appearance, and occlusion between the crop and other items within the environment. Meanwhile, occlusion between fruits and other items can also decrease the success rate of autonomous harvesting [3] (for example, autonomous harvesting of sweet pepper, strawberry, or apples). Therefore, to increase the efficiency of harvesting, the vision system should be capable of guiding the robotic arm to detach the crop from a proper pose. Overall, an efficient vision algorithm which can robustly perform crop recognition and grasp pose estimation is the key to the success of autonomous harvesting [4].
In this work, a fully deep-learning-based vision algorithm which can perform real-time fruit recognition and grasp pose estimation on raw sensor input for autonomous apple harvesting is proposed. The proposed vision solution includes two blocks: a fruit recognition block and a grasp pose estimation block. The fruit recognition block applies a one-stage multi-task neural network to perform fruit detection and instance segmentation on colour images. The grasp pose estimation block further processes the information from the fruit recognition block together with the depth images to estimate the proper grasp pose for each fruit by using Pointnet. The highlights of this paper are:
• Proposing the multi-task neural network Dasnet to perform real-time and accurate fruit detection and instance segmentation on input colour images from an RGB-D camera.
• Proposing an improved Pointnet network to perform fruit modelling and grasp pose estimation on input point clouds from the RGB-D camera.
• Realising real-time and accurate multi-task visual processing on RGB-D input by combining the two points above. The outputs of the multi-task vision processing block are used to guide the robot to perform autonomous harvesting.
The rest of the paper is organised as follows. Section II reviews the related works on fruit recognition and grasp pose estimation. Section III introduces the methods of the proposed vision processing algorithm. The experiment setup and results are included in Section IV. In Section V, conclusions and future works are presented.

II. RELATED WORKS

A. Fruit Recognition
Fruit recognition is an essential task in autonomous agricultural applications [5]. Many methods have been studied over the past decades, including traditional methods [6]-[8] and deep-learning-based methods. Traditional methods apply hand-crafted image features to describe the objects within images, and use machine-learning algorithms to perform classification, detection, or segmentation of such objects [9]. The performance of traditional methods is limited by the expressive ability of the feature descriptor, which requires adjustment before processing different classes of objects [10]. Deep-learning-based methods apply deep convolution neural networks to perform automatic image feature extraction, which has shown good performance and generalisation [11]. Deep-learning-based detection methods can be divided into two classes: two-stage detection and one-stage detection [12]. Two-stage detection applies a Region Proposal Network (RPN) to search for Regions of Interest (RoI) in the image, and a classification branch is applied to perform bounding box regression and classification [13], [14]. One-stage detection combines the RPN and classification into a single architecture, which speeds up the processing of the images [15], [16]. Both two-stage and one-stage detection are widely studied in autonomous harvesting [17]. Bargoti and Underwood [18] applied the Faster Region Convolution Neural Network (Faster-RCNN) to perform multi-class fruit detection in orchard environments. Yu et al. [19] applied Mask-RCNN [20] to perform strawberry detection and instance segmentation in non-structural environments. Liu et al. [21] applied a modified Faster-RCNN to kiwifruit detection by combining the information from RGB and NIR images, and an accurate detection performance was reported in this work. Tian et al. [22] applied an improved Dense-YOLO to monitor apple growth in different stages. Koirala et al. [23] applied a light-weight YOLO-V2 model named 'MangoYOLO' to perform fruit load estimation. Kang and Chen [24], [25] developed a multi-task network based on YOLO, which combines semantic segmentation, instance segmentation, and detection. Deep-learning-based methods are also widely applied in other agricultural applications, such as environmental modelling [26] and remote sensing [27].

B. Grasp Estimation
Grasp pose estimation is one of the key techniques in robotic grasping [28]. Similar to the methods developed for fruit recognition, grasp pose estimation methods can be divided into two categories: traditional analytical approaches and deep-learning-based approaches [29]. Traditional analytical approaches extract features/key points from the point clouds and then perform matching between the sensory data and templates from a database to estimate the object pose [30]. A predefined grasp pose can be applied in this condition. For unknown objects, some assumptions can be made, such as grasping the object along its principal axis [28]. The performance of traditional analytical approaches degrades when they are applied in the real world, as noisy or partial point clouds can severely influence the accuracy of the estimation [31]. Deep-learning-based methods have been developed more recently. Earlier works of this kind recast grasp pose estimation as an object detection task, which can directly produce grasp poses from images [32]. Recently, with the development of deep-learning architectures for 3D point cloud processing [33], [34], more works focus on performing grasp pose estimation on 3D point clouds. These methods apply convolution neural network architectures to process the 3D point clouds and estimate the grasp pose to guide the grasping, such as Grasp Pose Detection (GPD) [35] and Pointnet GPD [36]. There are several works which apply grasp pose estimation to autonomous harvesting. Lehnert et al. [37] modelled the sweet pepper as a super-ellipsoid to estimate the grasp pose of the fruits. In their following work [38], they used surface normals of the sweet pepper to generate grasp candidates. The generated grasp candidates are ranked by using a utility function, which combines the surface curvature, the distance to the point cloud boundary, and the angle with respect to the horizontal world axis.
Then, the grasp candidate with the highest score is chosen for execution. Currently, most of the works [39]-[41] developed for autonomous harvesting grasp the fruits by translating towards the target, which cannot secure the success rate of harvesting in unstructured environments [42]. Therefore, an efficient grasp pose estimation is the key to fully automatic harvesting.

III. METHODOLOGY
A. System Overview

The proposed vision perception and grasp estimation framework includes two stages: fruit recognition and grasp pose estimation. The workflow of the proposed vision processing algorithm is shown in Figure 1. In the first step, the fruit recognition block performs fruit detection and segmentation on input RGB images from the RGB-D camera. The outputs of the fruit recognition are projected onto the depth images, and the point cloud of each detected fruit is extracted and sent to the grasp pose estimation block for further processing. In the second step, the Pointnet architecture is applied to estimate the geometry and grasp pose of the fruits by using the point clouds from the previous step. The methods of the fruit recognition block and the grasp pose estimation block are presented in Sections III-B and III-C, respectively. The implementation details of the proposed method are introduced in Section III-D.

B. Fruit Recognition

1) Network Architecture: A one-stage neural network, Dasnet [43], is applied to perform the fruit detection and instance segmentation tasks. Dasnet applies a 50-layer residual network (resnet-50) [44] as the backbone to extract features from the input image. A three-level Feature Pyramid Network (FPN) is used to fuse the feature maps from the C3, C4, and C5 levels of the backbone (as shown in Figure 2). That is, the feature maps from a higher level are fused into the feature maps of the lower level, since feature maps at higher levels include more semantic information, which can increase the classification accuracy [45]. On each level of the FPN, an instance segmentation branch (which includes detection and instance segmentation) is applied, as shown in Figure 3. Before the instance segmentation branch, an Atrous Spatial Pyramid Pooling (ASPP) [46] module is used to process the multi-scale features within the feature maps. ASPP applies dilation convolutions with different rates (1, 2, and 4 in this work), which can process features of different scales separately.

The instance segmentation branch includes three sub-branches: a mask generation branch, a bounding box branch, and a classification branch. The mask generation branch follows the architecture design proposed in the Single Pixel Reconstruction Network (SPRNet) [47], which can predict a binary mask for an object from a single pixel within the feature maps. The bounding box branch predicts the confidence score and the bounding box shape. We apply one anchor bounding box on each level of the FPN (the anchor box sizes of the instance segmentation branch on the C3, C4, and C5 levels are 32 x 32, 80 x 80, and 160 x 160 pixels, respectively). The classification branch predicts the class of the object within the bounding box. The combined outputs of the instance segmentation branch form the results of the fruit recognition on colour images. Dasnet also has a semantic segmentation branch for environment semantic modelling, which is not applied in this research.
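To make the ASPP idea concrete, the sketch below shows how dilated ("atrous") convolutions with rates 1, 2, and 4 sample the same kernel over increasingly wide receptive fields. This is illustrative numpy code in 1-D, not the actual Tensorflow implementation; the function names are ours.

```python
import numpy as np

def dilated_conv1d(x, kernel, rate):
    """1-D atrous convolution (valid mode): kernel taps are spaced `rate` apart,
    widening the receptive field without adding parameters."""
    k = len(kernel)
    span = (k - 1) * rate + 1              # effective receptive field
    out = [sum(kernel[j] * x[i + j * rate] for j in range(k))
           for i in range(len(x) - span + 1)]
    return np.array(out)

def aspp_1d(x, kernel, rates=(1, 2, 4)):
    """ASPP-style multi-rate processing: run the same kernel at several dilation
    rates and stack the (zero-padded) responses as parallel feature channels."""
    rows = []
    for r in rates:
        y = dilated_conv1d(x, kernel, r)
        pad = len(x) - len(y)
        rows.append(np.pad(y, (pad // 2, pad - pad // 2)))
    return np.stack(rows)                  # shape: (len(rates), len(x))
```

In the real 2-D network the same principle applies per feature map; stacking the multi-rate responses lets later layers pick the scale that best matches each fruit.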
2) Network Training: More than 1000 images were collected from apple orchards located in Qingdao, China and Melbourne, Australia, covering varieties including Fuji, Gala, and Pink Lady. The images are labelled by using the LabelImg tool from Github [48]. We applied 600 images as the training set, 100 images as the validation set, and 400 images as the test set. We introduce multiple image augmentations in the network training, including random crop, random scaling (0.8-1.2), flip (horizontal only), random rotation (±10°), and random adjustment of saturation (0.8-1.2) and brightness (0.8-1.2) in HSV colour space. We apply the focal loss [49] in training, and the Adam optimiser is used to optimise the network parameters. The learning rate and decay rate of the optimiser are 0.001 and 0.75 per epoch. We train the instance segmentation branch for 100 epochs and then train the whole network for another 50 epochs.
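For readers unfamiliar with the focal loss [49], a minimal numpy sketch of its standard binary formulation follows; this is our illustrative code, not the authors' exact implementation.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).
    p: predicted probability of the positive class; y: ground-truth label {0, 1}.
    Well-classified examples (p_t near 1) are strongly down-weighted."""
    p = np.clip(p, 1e-7, 1 - 1e-7)            # numerical safety
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)
```

With gamma = 0 the expression reduces to alpha-weighted cross-entropy; increasing gamma focuses training on the hard, misclassified fruit candidates.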
3) Post Processing: The results of the fruit recognition are projected onto the depth image; that is, the mask region of each apple on the depth image is extracted. Then, the 3D position of each point in the point cloud of each apple is calculated. The generated point cloud is the visible part of the apple from the current view-angle of the RGB-D camera. These point clouds are further processed by the grasp pose estimation block to estimate the grasp pose, which is introduced in the following section.
C. Grasp Estimation

1) Grasp Planning: Since most apples are presented in a sphere or ellipsoid shape, we model the apple as a sphere for simplified expression. In natural environments, apples can be blocked by branches or other items within the environment from the view-angle of the RGB-D camera. Therefore, the visible part of the apple from the current view-angle of the RGB-D camera indicates a potential grasp pose which is proper for the robotic arm to pick the fruit. Unlike generating multiple grasp candidates and using a network to find the best grasp pose, as applied in GPD [35] and Pointnet GPD [36], we formulate grasp pose estimation as object pose estimation, similar to the Frustum PointNets [50]. We select the centre of the visible part as the position of the grasp pose, and the direction from the centre of the apple to this centre as its orientation (as shown in Figure 4). The Pointnet takes the 1-viewed point cloud of each fruit as input and estimates the grasp pose for the robotic arm to perform detachment.

2) Grasp Representation: The pose of an object in 3D space has 6 Degrees of Freedom (DoF), which include three positions (x, y, and z) and three rotations (θ, φ, and ω, along the Z-axis, Y-axis, and X-axis, respectively). We apply the Euler-ZYX angle to represent the orientation of the grasp pose, as shown in Figure 5. The value of ω is set to zero, since we can always assume that the fruit does not rotate along its X-axis (as apples are presented in a spherical shape). The grasp pose (GP) of an apple can therefore be formulated as

GP = [x, y, z, θ, φ]

that is, a parameter list [x, y, z, θ, φ] is used to represent the grasp pose of the fruit.
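Under the sphere model, the grasp pose selection described above can be sketched as follows. The axis conventions (yaw about Z, pitch about Y, roll fixed to zero) follow the Euler-ZYX description in the text, but the exact sign conventions and the function name are our assumptions.

```python
import numpy as np

def grasp_pose(apple_centre, visible_centre):
    """Grasp position = centre of the visible surface; approach direction = the
    ray from the apple centre through that point, with roll (omega) fixed at 0."""
    v = np.asarray(visible_centre, float) - np.asarray(apple_centre, float)
    v /= np.linalg.norm(v)
    theta = np.arctan2(v[1], v[0])                  # rotation about Z (yaw)
    phi = np.arctan2(-v[2], np.hypot(v[0], v[1]))   # rotation about Y (pitch)
    return np.concatenate([np.asarray(visible_centre, float), [theta, phi]])
```

For an unoccluded apple facing the camera, the approach direction simply points back along the camera ray, which is the intuitive picking direction.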
3) Data Annotation: The grasp pose estimation block uses point clouds as input and predicts a 3D Oriented Bounding Box (3D-OBB), oriented along the grasp orientation, for each fruit. Each 3D-OBB includes six parameters: x, y, z, r, θ, and φ. The position (x, y, z) represents the offsets on the X-, Y-, and Z-axes from the centre of the point cloud to the centre of the apple. The parameter r represents the radius of the apple, as the apple is modelled as a sphere; the length, width, and height can be derived from the radius. θ and φ represent the grasp orientation of the fruit, as described in Section III-C2. Since the parameters x, y, z, and r may vary largely between different situations, a scale parameter S is introduced. We apply S to represent the mean scale (radius) of the apple, which equals 30 cm in our case. The parameters x, y, z, and r are divided by S to obtain the united offsets and radius (x_u, y_u, z_u, r_u). After remapping, x_u, y_u, and z_u lie in (-∞, ∞), and r_u lies in [0, ∞). To keep the grasp pose within the range of motion of the robotic arm, θ and φ are limited to the range [-π/4, π/4]. We divide θ and φ by π/4 to map the grasp orientation into the range [-1, 1]. The united θ and φ are denoted as θ_u and φ_u. In total, we have six united parameters to predict the 3D-OBB of each fruit: [x_u, y_u, z_u, r_u, θ_u, φ_u]. Among these parameters, [x_u, y_u, z_u, θ_u, φ_u] represent the grasp pose of the fruit, while r_u controls the shape of the 3D-OBB.

4) Pointnet Architecture: Pointnet [33] is a deep neural network architecture which can perform classification, segmentation, or other tasks on point clouds. Pointnet can use the raw point cloud of an object as input and does not require any pre-processing. The architecture of the Pointnet is shown in Figure. Pointnet uses an n x 3 unordered point cloud (n is the number of points) as input.
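The parameter normalisation of Section III-C3 amounts to simple scaling, sketched here with S = 30 cm as stated in the text; the function and variable names are ours.

```python
import numpy as np

S = 30.0                 # mean apple scale (radius) in cm, as stated in the text
ANGLE_RANGE = np.pi / 4  # grasp angles are limited to [-pi/4, pi/4]

def normalise(params):
    """[x, y, z, r, theta, phi] -> united parameters [x_u, y_u, z_u, r_u, theta_u, phi_u]."""
    x, y, z, r, theta, phi = params
    return [x / S, y / S, z / S, r / S, theta / ANGLE_RANGE, phi / ANGLE_RANGE]

def denormalise(united):
    """Inverse mapping from united parameters back to metric parameters."""
    xu, yu, zu, ru, tu, pu = united
    return [xu * S, yu * S, zu * S, ru * S, tu * ANGLE_RANGE, pu * ANGLE_RANGE]
```

Normalising targets this way keeps all six regression outputs on comparable numeric ranges, which generally stabilises training.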
Firstly, Pointnet applies convolution operations to extract a multi-dimensional feature vector for each point. Then, a symmetric function is used to aggregate the features of all points into a global feature vector (see Figures 6 and 7):

f({x_1, ..., x_n}) ≈ g(h(x_1), ..., h(x_n))    (2)
In Eq. 2, g is a symmetric function, h is the per-point feature extraction, and f is the feature extracted from the set. Pointnet applies max-pooling as the symmetric function. In this manner, Pointnet can learn a number of features from the point set while being invariant to input permutation. The generated feature vectors are further processed by a Multi-Layer Perceptron (MLP) (fully-connected layers in Pointnet) to perform classification of the input point clouds. A batch-norm layer is applied after each convolution or fully-connected layer, and drop-out is applied in the fully-connected layers during training. In this work, the output of the Pointnet is changed to the 3D-OBB prediction, which includes the six parameters [x_u, y_u, z_u, r_u, θ_u, φ_u]. The parameters x_u, y_u, and z_u lie in (-∞, ∞); hence we do not apply an activation function to these three parameters. The range of r_u is from 0 to ∞, so the exponential function is used as its activation. The range of θ_u and φ_u is from -1 to 1; hence a tanh activation function is applied. Denoting the Pointnet outputs before activation as [x_p, y_p, z_p, r_p, θ_p, φ_p], we have

x_u, y_u, z_u = x_p, y_p, z_p
r_u = exp(r_p)
θ_u, φ_u = tanh(θ_p), tanh(φ_p)

The outputs of the Pointnet can be remapped to their original values by following the description in Section III-C3.
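The two key operations described above, the permutation-invariant max-pooling and the per-parameter output activations, can be sketched as follows (illustrative numpy code; the function names are ours):

```python
import numpy as np

def global_feature(point_features):
    """Symmetric aggregation: max-pool over the point dimension, so the result
    is invariant to the order of the input points (the g in Eq. 2)."""
    return point_features.max(axis=0)

def decode_obb(raw):
    """Apply the per-parameter activations to the raw network outputs
    [x_p, y_p, z_p, r_p, theta_p, phi_p]: identity for the offsets, exp for the
    radius (keeps it positive), tanh for the angles (keeps them in [-1, 1])."""
    x, y, z, r, t, p = raw
    return np.array([x, y, z, np.exp(r), np.tanh(t), np.tanh(p)])
```

Because max-pooling discards point order, shuffling the input cloud leaves the global feature, and hence the predicted 3D-OBB, unchanged.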

5) Network Training: The data labelling is performed with our own labelling tool, as shown in Figure 8. The labelling tool records the six parameters of the 3D-OBB and all the points within the point cloud. The training of the Pointnet for 3D-OBB prediction is independent of the fruit recognition network training. There are 570 1-viewed point clouds of apples labelled in total (250 collected in the laboratory, 250 collected in orchards). We apply 300 point sets as the training set (150 from each data set), 50 samples as the validation set (25 from each), and the remaining 220 samples as the test set (110 from each). We introduce scaling (0.8 to 1.2), translation (-15 cm to 15 cm on each axis), rotation (-10° to 10° on θ and φ), Gaussian noise (mean 0, variance 2 cm), and outlier addition (1% to 5% of the total number of points) in the data augmentation. One should notice that the orientation of the samples after augmentation should still lie in the range between -π/4 and π/4. The squared error between prediction and ground truth is applied as the training loss. The Adam optimizer in Tensorflow is used to perform the optimisation; the learning rate, decay rate, and total training epochs are 0.0001, 0.6 per epoch, and 100 epochs, respectively.

D. Implementation Details

1) System Setup: The proposed framework is implemented with ROS [51] on Linux Ubuntu 16.04. The calibration between the colour image and the depth image of the RGB-D camera is included in realsense-ros. The implementation code of the Pointnet (in Tensorflow) is from Github [52], and it is trained on an Nvidia GTX-980M GPU. The implementation code of the Dasnet is written in Tensorflow, and its training is performed on an Nvidia GTX-1080Ti GPU. In the autonomous harvesting experiment, an industrial robotic arm, the Universal Robot UR5, is applied (as shown in Figure 9). The communication between the UR5 and the laptop is performed by using universal-robot-ROS. MoveIt! [53] with the TRAC-IK inverse kinematic solver [54] is used for the motion planning of the robotic arm.
2) Point Clouds Pre-processing: Although the data augmentation in Pointnet training includes outlier addition to improve the robustness of the algorithm, we apply a Euclidean-distance-based outlier rejection algorithm to filter out outliers before processing by Pointnet. When the distance between a point and the point cloud centre is two times larger than the mean distance between the points and the centre, we consider this point an outlier and reject it. This step is repeated three times to ensure the efficiency of the algorithm. For inference efficiency, a voxel downsampling function (resolution 3 mm) from the 3D data processing library Open3D is used to extract 200 points as the input of the Pointnet grasp estimation. Point sets with fewer than 200 points after voxel downsampling are rejected, since an insufficient number of points is presented. The point clouds before and after outlier rejection and voxel downsampling are shown in Figure.

IV. EXPERIMENTS

A. Experiment Setup

We evaluate the proposed fruit recognition and grasp estimation algorithm on both simulation and robotic hardware. In the simulation experiment, we perform the proposed method on the RGB-D data of the test set, which includes 110 point sets each from the laboratory environment and the orchard environment. In the robotic harvesting experiment, we apply the proposed method to guide the robotic arm to grasp apples on an artificial plant in the laboratory. We apply the IoU between the predicted and ground-truth bounding boxes to evaluate the accuracy of the 3D localisation and shape estimation of the fruits. We use 3D Axis-Aligned Bounding Boxes (3D-AABB) to simplify the IoU calculation of the 3D bounding boxes [55]. The IoU between 3D-AABBs is denoted as IoU_3D. We set 0.75 (thres_IoU) as the threshold value of IoU_3D to determine the accuracy of fruit shape prediction.

In terms of the evaluation of the grasp pose estimation, we apply the absolute error between the predicted and ground-truth grasp poses, as it can intuitively show the accuracy of the predicted grasp pose. The maximum accepted error of grasp pose estimation for the robot to perform a successful grasp is 8°, which is set as the threshold value in the grasp pose evaluation. The experiments are conducted in several scenarios, including noise- and outlier-presented conditions, as well as a dense clutter condition.
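The IoU_3D metric over axis-aligned boxes can be sketched as follows (our illustrative code; boxes are given as (min_corner, max_corner) pairs):

```python
import numpy as np

def iou_3d_aabb(a, b):
    """IoU of two 3D axis-aligned boxes, each given as (min_corner, max_corner)
    with (3,) corners: intersection volume over union volume."""
    amin, amax = map(np.asarray, a)
    bmin, bmax = map(np.asarray, b)
    inter = np.clip(np.minimum(amax, bmax) - np.maximum(amin, bmin), 0, None)
    vi = inter.prod()                       # intersection volume (0 if disjoint)
    va = (amax - amin).prod()
    vb = (bmax - bmin).prod()
    return vi / (va + vb - vi)
```

Aligning both boxes to the axes, as in [55], turns the otherwise awkward oriented-box overlap into three independent 1-D interval intersections.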

B. Simulation Experiments
In the simulation experiment, we compare our method with traditional shape-fitting methods, the sphere Random Sample Consensus (sphere-RANSAC) [56] and the sphere Hough Transform (sphere-HT) [57], in terms of accuracy of fruit localisation and shape estimation. Both the RANSAC- and HT-based algorithms take point clouds as input and generate a prediction of the fruit shape. The 3D bounding boxes of the predicted shapes are then used in the accuracy evaluation and compared with our method. This comparison is conducted on RGB-D images collected from both the laboratory and orchard scenarios.

1) Experiments in Laboratory Environments:
We performed the Pointnet grasp estimation, RANSAC, and HT on the RGB-D images collected from the laboratory environment. The experimental results of the three methods in different tests are shown in Table I. From the experimental results, the Pointnet grasp estimation significantly increases the localisation accuracy of the 3D bounding box of the fruits, achieving 0.94 on IoU_3D, which is higher than both the RANSAC and HT methods. To evaluate the robustness of the different methods when dealing with noise and outliers, we randomly add Gaussian noise (mean 0, variance 2 cm) and outliers (1% to 5% of the total number of points) into the point clouds, as shown in Figure 11. The three methods show similar robustness in the outlier-presented condition, since both RANSAC and HT apply a voting framework to estimate the primitives of the shape, which is robust to outliers. However, in the noisy condition, the Pointnet grasp estimation achieves better robustness than RANSAC and HT, since noisy point clouds can influence the accuracy of a voting framework to a large extent. We also tested the Pointnet grasp estimation, RANSAC, and HT in a dense clutter scenario. Grasp estimation in the dense clutter condition is challenging, since the point cloud of an object can be influenced by neighbouring objects. The Pointnet grasp estimation can robustly perform accurate localisation and shape fitting of apples in this condition, which shows a significant improvement compared to RANSAC and HT. The experimental results obtained by the Pointnet grasp estimation are presented in Figure 12; the 3D-OBBs are projected into image space by using the method developed in the work [58]. In terms of the evaluation of the grasp orientation estimation, the Pointnet grasp estimation shows accurate performance in the experimental results, as shown in Table II.
2) Experiments in Orchard Environments: In this experiment, we performed the fruit recognition (Dasnet) and the Pointnet grasp estimation on the RGB-D images collected from apple orchards. The performance of the Dasnet is evaluated by using the RGB images in the test set. We apply the F1 score and IoU as the evaluation metrics of the fruit recognition; IoU_mask stands for the IoU value of the instance masks of fruits in colour images. Table III shows the performance of the Dasnet (in terms of detection accuracy and recall) and the Pointnet grasp estimation, and Figure shows the fruit recognition results of Dasnet on the test set. Experimental results show that Dasnet performs well on fruit recognition in the orchard environment, achieving 0.88 and 0.868 on accuracy and recall, respectively. The accuracy of the instance segmentation on apples is 0.873. The inaccuracy of the fruit recognition is due to illumination and fruit appearance variances. From the experiments, we found that Dasnet can accurately detect and segment the apples in most conditions. Table IV shows the performance comparison between the Pointnet grasp estimation, RANSAC, and HT. In orchard environments, grasp pose estimation is more challenging than in indoor environments, as the sensory depth data can be affected by various environmental factors, as shown in Figure 15. In this condition, the performance of RANSAC and HT shows a significant decrease from the indoor experiment, while the Pointnet grasp estimation shows better robustness. The IoU_3D achieved by the Pointnet grasp estimation, RANSAC, and HT in the orchard scenario are 0.88, 0.76, and 0.78, respectively. In terms of the grasp orientation estimation, the Pointnet grasp estimation shows robust performance when dealing with flawed sensory data. The mean error of the orientation estimation by using the Pointnet grasp estimation is 5.2°, which is still within the accepted range of orientation error.
The experimental results of the grasp pose estimation by using the Pointnet grasp estimation in the orchard scenario are shown in Figure 14.

3) Common Failures in Grasp Estimation:
The major cause of grasp estimation failure by using the Pointnet grasp estimation is sensory data defects, as shown in Figure 15. Under such conditions, the Pointnet grasp estimation always predicts a sphere with a very small radius. We can apply a radius threshold to filter out this kind of failure during operation.
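Such degenerate predictions can be screened with a simple radius check; the threshold values below are illustrative assumptions on our part, not values from the paper.

```python
def is_valid_grasp(radius_m, r_min=0.02, r_max=0.06):
    """Reject grasp predictions whose fitted sphere radius is implausible for an
    apple; defective depth data tends to collapse the radius towards zero.
    radius_m: predicted apple radius in metres (thresholds are illustrative)."""
    return r_min <= radius_m <= r_max
```

Rejected detections can simply be skipped in that cycle and re-attempted from the next camera frame.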

C. Experiments of Robotic Harvesting
The Pointnet grasp estimation was tested by using a UR5 robotic arm to validate its performance in a real working scenario. We arranged apples on a fake plant in the laboratory environment, as shown in Figure 10. We conducted multiple trials (each trial contains three to seven apples on the fake plant) to evaluate the success rate of the grasp. The success rate records the fraction of successful grasps in the total number of grasp attempts. The operational procedures follow the design of our previous work [59], as shown in Figure 16. We simulate the real outdoor environments of autonomous harvesting by adding noise and outliers into the depth data. We also tested our system in the dense clutter condition. The experimental results are shown in Table V. From the experimental results presented in Table V, the Pointnet grasp estimation performs efficiently in the robotic grasp tests. It achieves accurate grasp results in the normal, noise, and outlier conditions, with success rates of 0.91, 0.87, and 0.9, respectively. In the dense clutter condition, the success rate shows a decrease compared to the previous conditions. The reason for this decrease is the collision between the gripper and fruits located side by side: when a collision occurs during the grasp, it causes a shift of the target fruit and leads to the failure of the grasp. This defect can be improved either by re-designing the gripper or by proposing multiple grasp candidates to avoid the collision. Collisions between the gripper and branches can also lead to grasping failure in the other three conditions. Although such defects can affect the success rate of the robotic grasp, the system still achieves good performance in the experiments. The success rates of the robotic grasp under the dense clutter and all-factors-combined conditions are 0.84 and 0.837, respectively.
V. CONCLUSION AND FUTURE WORK

In this work, a fully deep-learning neural-network-based fruit recognition and grasp estimation method was proposed and validated. The proposed method includes a multi-function network for fruit detection and instance segmentation, and a Pointnet grasp estimation network to determine the proper grasp pose of each fruit. The proposed multi-function fruit recognition network and Pointnet grasp estimation network were validated on RGB-D images from the laboratory and orchard scenarios. Experimental results showed that the proposed method can accurately perform visual perception and grasp pose estimation. The Pointnet grasp estimation was also tested in the laboratory scenario, where it achieved a high success rate in the experiments. Future work will focus on optimising the design of the end-effector and proposing multiple grasp candidates to improve the success rate of the grasp in the dense clutter condition.