Manipulation Planning for Object Re-Orientation Based on Semantic Segmentation Keypoint Detection

In this paper, a manipulation planning method for object re-orientation based on semantic segmentation keypoint detection is proposed for a robot manipulator, enabling it to detect randomly placed objects and re-orientate them to a specified position and pose. There are two main parts: (1) a 3D keypoint detection system; and (2) a manipulation planning system for object re-orientation. In the 3D keypoint detection system, an RGB-D camera is used to obtain information about the environment and to generate 3D keypoints of the target object as inputs that represent its position and pose. This simplifies the 3D model representation so that manipulation planning for object re-orientation can be executed in a category-level manner by adding various training data of the object in the training phase. In addition, 3D suction points in both the object's current and expected poses are generated as inputs of the next operation stage. During that stage, the Mask Region-based Convolutional Neural Network (Mask R-CNN) algorithm is used for preliminary object detection and object image extraction. The image with the highest confidence index is selected as the input of the semantic segmentation system, which classifies each pixel of the picture into the corresponding pack unit of the object. After the convolutional neural network performs semantic segmentation, the Conditional Random Fields (CRFs) method performs several iterations to obtain a more accurate object recognition result. Once the target object is segmented into pack units, the center position of each pack unit can be obtained. Then, a normal vector at each pack unit's center point is generated from the depth image information, and the pose of the object is obtained by connecting the center points of the pack units.
In the manipulation planning system for object re-orientation, the pose of the object and the normal vector of each pack unit are first converted into the working coordinate system of the robot manipulator. Then, according to the current and expected poses of the object, the spherical linear interpolation (Slerp) algorithm is used to generate a series of movements in the workspace for object re-orientation on the robot manipulator. In addition, the pose of the object is adjusted about the z-axis of the object's geodetic coordinate system based on the image features on the surface of the object, so that the pose of the placed object approaches the desired pose. Finally, a laboratory-made robot manipulator and vacuum suction cup are used to verify that the proposed system can indeed complete the planned object re-orientation task.


Introduction
With the development of intelligent automation and artificial intelligence technologies such as deep learning, the application and development of intelligent robots have gradually attracted attention in both academic and industrial fields. Among the many application requirements of robots, the development of service robots has become one of the important directions. For example, if a service robot in a store is used to assist service personnel in placing and sorting goods on shelves, key technologies such as image recognition, object picking strategies, and collision avoidance are needed. The appearance, weight, size, and placement relationship of each product directly affect the complexity of image recognition, picking strategies, and mechanism design. Therefore, how to determine the pick-and-place method of an object for robot manipulators is worthy of discussion.
Pick-and-place tasks in robotics have been extensively developed and researched, not only in industrial manufacturing but also in academia. In recent years, with the rise of deep neural networks, deep learning has been widely used in pick-and-place tasks for robot manipulators, in addition to classification and logistic regression [1,2]. The work can be roughly divided into four topics: (i) object detection; (ii) object pose estimation; (iii) object picking planning; and (iv) robot motion planning. Regarding object detection, there has been considerable research on neural network architectures in recent years, such as the Region-based Convolutional Neural Network (R-CNN) series [3-6], the You Only Look Once (YOLO) series [7-9], and the Single Shot MultiBox Detector (SSD) [10]. The R-CNN series uses a two-stage method that first extracts candidate Regions Of Interest (ROIs) and then classifies each ROI, so it can achieve very high accuracy. The YOLO series and SSD are one-stage methods that use a deep neural network to simultaneously classify objects and detect object positions. One-stage methods are usually faster and suitable for real-time applications, but their overall recognition accuracy may be lower than that of two-stage methods. Object pose estimation is commonly based on RGB-D images [11] or 3D models [12,13] to calculate the 3D pose of an object. Manuelli et al. [14] proposed a keypoint method [15], originally used to detect human skeletons and find the relative keypoints, and combined it with local dense geometric information from a point cloud. In this way, a simpler pose vector can represent the object pose in the environment and facilitate subsequent object picking operations. Semochkin et al. [16] proposed a keypoint detection method for different camera perspectives based on a trained CNN and used keypoints as the representation of the object in grasping tasks. Vecerik et al. [17] proposed a self-supervised training system that detects 2D keypoints in all individual scene views and uses a CNN to estimate the possible depth of objects, so as to estimate the 3D location of each keypoint.
Research in robotic manipulation over the past decades has mostly focused on how to pick up objects [18], and has seldom discussed how to manipulate objects or how to place them after grasping. In fact, in most research on robot pick-and-place planning, the object is simply dropped from a height into a box or onto a table without considering the consequences [19-21]. There has not yet been mature research on the object placement task, which has given rise to the issues of multiple regrasps [22-24] for achieving a specific placement and of object re-orientation [25]. In recent years, there have been related studies on generating grasping strategies based on how the grasped object will be used [18]. For example, Do et al. [26] used the affordance of an object for task-oriented grasping, while Lai et al. [27] and Qin et al. [28] used the grasped object as a tool of the robot manipulator to achieve operation control according to specific objectives.
The purpose of this paper is to combine object detection, object pose estimation, and motion planning for the robot manipulator to achieve object re-orientation. However, since directly segmenting the target object in a 3D point cloud requires building a 3D model for each target object independently, that approach cannot generalize across objects of the same category. As a result, this paper starts from deep learning-based 2D image recognition, combines it with the depth image, and converts the pixel coordinates of the segmented area into 3D coordinates to complete the object pose estimation task. It also integrates functions such as coordinate conversion, grasp planning, and trajectory planning, so that the robot manipulator can complete the object re-orientation.
The rest of this paper is organized as follows. In Section 2, the system structure of the proposed object re-orientation manipulation planning method is described. In Section 3, five main parts of the proposed 3D keypoint detection system are described. In Section 4, the object pose comparison and robot operation of the proposed manipulation planning system are described. In Section 5, some experimental results are presented to illustrate the proposed method and a laboratory-made robot manipulator is used to verify that the proposed method can indeed run on the actual robot. Finally, the conclusions are summarized in Section 6.

System Structure
The system architecture of the proposed manipulation planning method for object re-orientation in a category-level manner is shown in Figure 1. There are two main modules: 3D keypoint detection and manipulation planning. In the 3D keypoint detection module, since this paper uses a Deep Convolutional Neural Network (DCNN) to detect the target object and each pack unit of the object, building a training dataset large enough to optimize the DCNN model usually requires plenty of time. To solve this problem, a data augmentation module is added to the system and used off-line to construct the training dataset; it creates ample training data by superimposing a small number of manually marked datasets on different backgrounds. As a result, the database can be efficiently used to optimize the DCNN model and fine-tune a pre-trained model. The data augmentation system process is introduced in detail in Section 3.5. In the on-line process, this paper uses a trained Mask R-CNN and a CNN-based semantic segmentation for object detection. First, Mask R-CNN is used to segment a target object at the pixel level. The segmented object picture is then input into the semantic segmentation network to segment each pack unit of the object. After obtaining the position of each pack unit, the centroid position of each pack unit is used as a keypoint of the object, and the 3D pose of the object is estimated through the normal vector at the keypoint. The detailed operation process is presented in Section 3 and the operating time on the real robot is described in Section 5.3.
In the manipulation planning module, the pose of the object and the normal vector of each pack unit are first converted into the working coordinate system of the robot manipulator. Then, according to the current pose and the expected pose of the object, the Slerp [29] algorithm is used to enable the robot manipulator to perform a series of movements in the workspace to re-orientate the object. Finally, the object is repositioned through the laboratory-made 7-DoF robot manipulator and vacuum suction cup. The detailed operation process is presented in Section 4.

3D Keypoint Detection System
The main purpose of this section is to introduce how to establish the 3D keypoint detection system and explain how to integrate multiple object detection algorithms to estimate the pose of the object. There are five main parts: (a) Mask R-CNN; (b) CNN-based semantic segmentation; (c) CRFs-based refinement; (d) keypoint annotation and object average normal vector; and (e) data augmentation for training dataset generation. They are described as follows.

Mask R-CNN
Mask R-CNN is a two-stage object detection algorithm that achieves category-level detection and produces a mask delimiting the pixels constituting each object, so the complete contour of the object can be accurately segmented while maintaining the operating speed. It is a deep learning algorithm based on Faster R-CNN [5], proposed by He et al. [6] for object detection and instance segmentation. The two-stage structure of the Faster R-CNN model is used for object detection. In the first stage, a Feature Pyramid Network (FPN) is used to generate a set of regions of interest as potential bounding box candidates, which improves the robustness of detecting objects of different sizes. In the second stage, the candidate regions of interest are classified, and the pixel positions of the object are predicted by deconvolution, providing three outputs for each object: a class label, a bounding box that delimits the object, and a mask that delimits the pixels that constitute it. The mask image with the highest confidence index is selected as the input of the semantic segmentation system. Figure 2 shows the detailed two-stage architecture of Mask R-CNN.

CNN-Based Semantic Segmentation
This paper uses a semantic segmentation method based on a convolutional neural network (CNN-based semantic segmentation) to perform secondary segmentation on the complete object mask image output by the previous stage. The purpose is to segment the precise position of each pack unit of the object. For example, a PET bottle can be split into two pack units: the body and the cap. In this paper, the existing DeepLab-ResNet [30] is employed as the network architecture at this stage, which is characterized by the use of atrous convolution, shown in Figure 3, for convolution processing. The benefit is that it maintains a larger Field-of-View (FOV) after the convolution layer while preserving the resolution of the feature map and the size of the convolution kernel, ensuring that more features are retained for the same image size during the convolution process.
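To make the FOV property concrete, a rate-r atrous kernel of size k covers a field of view of k + (k − 1)(r − 1) input samples without adding parameters. The following 1-D sketch illustrates the idea only; it is not the DeepLab implementation, and the function name is ours:

```python
import numpy as np

def atrous_conv1d(x, kernel, rate):
    """1-D atrous (dilated) convolution with 'valid' padding.

    A rate-r kernel of size k samples the input with (r - 1) holes
    between taps, so its field of view is k + (k - 1)(r - 1) samples.
    """
    k = len(kernel)
    fov = k + (k - 1) * (rate - 1)      # effective field of view
    out = []
    for i in range(len(x) - fov + 1):
        taps = x[i:i + fov:rate]        # sample the input with holes
        out.append(float(np.dot(taps, kernel)))
    return np.array(out)

# With rate=2 and a size-3 kernel, each output sees 5 input samples:
y = atrous_conv1d(np.arange(8, dtype=float), np.ones(3), rate=2)
```

At rate 1 this reduces to an ordinary convolution; increasing the rate widens the receptive field with the same three weights, which is exactly the trade-off the paragraph above describes.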

CRFs-Based Refinement
After the CNN-based semantic segmentation process described in Section 3.2, only an approximate object contour is obtained, which may affect the subsequent centroid calculation. To overcome this problem, a fully connected CRFs layer [31] is used to refine the CNN output, produce accurate semantic segmentation results, and recover object boundaries at a high level of detail. CRFs are a type of discriminative undirected probabilistic graphical model, representing a Markov random field over the output label variables conditioned on the input image. The energy function of the model is

E(x) = Σ_i θ_i(x_i) + Σ_{i,j} θ_ij(x_i, x_j) (1)

where x is the label assignment for the pixels, θ_i(x_i) = −log P(x_i) is the unary potential, and P(x_i) is the corresponding label assignment probability at pixel i obtained from the CNN output. The pairwise potential θ_ij(x_i, x_j), which connects all pairs of image pixels i, j, is

θ_ij(x_i, x_j) = µ(x_i, x_j)[ w_1 exp(−‖p_i − p_j‖²/(2σ_α²) − ‖I_i − I_j‖²/(2σ_β²)) + w_2 exp(−‖p_i − p_j‖²/(2σ_γ²)) ] (2)

In Equation (2), µ(x_i, x_j) = 1 if x_i ≠ x_j and 0 otherwise, which means that only nodes with distinct labels are penalized. The two Gaussian kernels operate in different feature spaces. The first kernel depends on both the pixel positions (denoted p_i and p_j) and their RGB values (denoted I_i and I_j). The second kernel enforces smoothness by considering spatial proximity and depends only on the pixel positions.
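A minimal sketch of the pairwise term of Equation (2) for a single pixel pair is shown below. The weights and kernel bandwidths (`w1`, `w2`, `sigma_alpha`, `sigma_beta`, `sigma_gamma`) are hypothetical illustrative values, not those used in the paper:

```python
import numpy as np

def pairwise_potential(xi, xj, pi, pj, Ii, Ij,
                       w1=1.0, w2=1.0,
                       sigma_alpha=60.0, sigma_beta=10.0, sigma_gamma=3.0):
    """Dense-CRF pairwise potential for one pixel pair.

    Potts compatibility: pairs with equal labels contribute nothing,
    so only nodes with distinct labels are penalized.
    """
    if xi == xj:
        return 0.0                      # µ(xi, xj) = 0 for equal labels
    dp2 = np.sum((np.asarray(pi, float) - np.asarray(pj, float)) ** 2)
    dI2 = np.sum((np.asarray(Ii, float) - np.asarray(Ij, float)) ** 2)
    appearance = w1 * np.exp(-dp2 / (2 * sigma_alpha ** 2)
                             - dI2 / (2 * sigma_beta ** 2))   # first kernel
    smoothness = w2 * np.exp(-dp2 / (2 * sigma_gamma ** 2))   # second kernel
    return appearance + smoothness
```

Nearby pixels with similar colors but different labels receive the largest penalty, which is what pushes the refined segmentation boundary toward true image edges.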

Keypoint Annotation and Object Average Normal Vector
After the above process, the complete contour of each pack unit is marked with a different color, so that the centroid position of each part can be calculated separately. Taking the bottle cap as an example, the centroid calculation method is described as follows.
Step 1: Convert the pack unit mask into grayscale, and define the points on the mask image by

P = {(x_1, y_1), (x_2, y_2), . . . , (x_k, y_k)} (3)

where k is the number of pixels of the bottle cap mask.
Step 2: Define the maximum and minimum coordinate values in the x-axis and y-axis by

x_min = min_i(x_i), x_max = max_i(x_i), y_min = min_i(y_i), y_max = max_i(y_i) (4)

Step 3: Calculate the moment of order (p + q) of the image in the interval range x, y for a 2D continuous function f(x, y) by

m_pq = Σ_x Σ_y x^p y^q f(x, y) (5)

where p, q = 0, 1, 2, . . ..
Step 4: Calculate the zero-order moment m_00 of the function f(x, y), which represents the total mass of the given image and is defined as

m_00 = Σ_x Σ_y f(x, y) (6)

Step 5: Calculate the two first-order moments m_10 and m_01, which represent the center of mass of the given image and are defined as

m_10 = Σ_x Σ_y x f(x, y), m_01 = Σ_x Σ_y y f(x, y) (7)

Step 6: Calculate the centroid (x̄, ȳ) by dividing m_10 and m_01 by m_00, which can be expressed as

x̄ = m_10 / m_00, ȳ = m_01 / m_00 (8)

Set the calculated centroid position to a depth range d_pcl, calculate the vector of each pixel in the range through the point cloud image, and compute the average normal vector of the object surface. The average normal vector calculation process is shown in Algorithm 1.

Data Augmentation for Training Dataset Generation
This subsection introduces the data augmentation module for deep learning proposed in this paper. In the preparation of training data, since this paper uses Mask R-CNN to perform the first stage of object detection, whose main purpose is to segment the complete object contour, the ground truth data are marked with the type of each individual object, as shown in the left picture of Figure 4. In the second stage of semantic segmentation, the main purpose is to use the complete object contour image output by the previous stage to detect each pack unit of the object, so as to annotate the keypoints of the object as in Section 3.4. The corresponding ground truth is shown in the right picture of Figure 4, where each color represents a pack unit of the object classified from the input image. The procedure of the proposed data augmentation system is shown in Figure 5. This paper randomly superimposes manually marked object images on a variety of background pictures and applies geometric transformations to the ground truth, including rotation, zoom, shift, random stacking, etc., so that the object can be combined with the background to synthesize a random training dataset while retaining the original features for training the DCNNs. In the training process, the data augmentation system is used to build individual datasets to train the Mask R-CNN and the CNN-based semantic segmentation network. Table 1 shows the information of data augmentation and network training, which includes the backbone network, the number of objects, the number of random scenes, the number of manually labeled photos, the number of individually generated training photos, and the resolution of the photos.
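The superimposing step can be sketched as follows, assuming single-channel images for brevity; `superimpose` is a hypothetical helper, not the authors' implementation, and in practice the same geometric transform would also be applied to the ground-truth mask so labels stay aligned:

```python
import numpy as np

def superimpose(background, obj_img, obj_mask, top_left):
    """Paste a masked object crop onto a background at top_left = (y, x).

    Only pixels where obj_mask is nonzero are copied, so the object
    keeps its original appearance while gaining a new background.
    """
    out = background.copy()
    h, w = obj_mask.shape
    y, x = top_left
    m = obj_mask.astype(bool)
    out[y:y + h, x:x + w][m] = obj_img[m]   # copy only object pixels
    return out

# Paste a 2x2 object patch onto a 10x10 background at (3, 4):
bg = np.zeros((10, 10))
obj = np.full((2, 2), 7.0)
m = np.ones((2, 2), dtype=np.uint8)
scene = superimpose(bg, obj, m, (3, 4))
```

Repeating this with random backgrounds, positions, rotations, and scales yields the synthetic training scenes described above.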

Manipulation Planning System
The main purpose of this section is to introduce how the keypoints of the object are used together with the robot manipulator to plan the operation of sucking the object, and to explain how the expected and current poses of the object are compared to re-orientate it. There are two main parts: (a) object pose comparison; and (b) robot operation. They are described as follows.

Object Pose Comparison
After obtaining the 3D keypoints of the object in Section 3, this section mainly discusses how to use the keypoint information of each pack unit to compare the pose difference of the object. The task "how to re-orientate a PET bottle on the shelf" is presented as an example; the keypoints p_target_cap and p_target_bottle of the pack units represent the positions of the PET bottle cap and body, respectively. The points p_target_bottle and p_target_cap are used to obtain a unit vector v_target_axis that represents the pose of the PET bottle, and the point p_target_cap is projected onto the y-axis of the camera coordinate system to obtain a projected point p_target_cal. Then, the rotation angle θ_obj_rotation of the PET bottle in space can be calculated by

θ_obj_rotation = π − acos( (u · v) / (‖u‖ ‖v‖) ), u = p_target_cal − p_target_cap, v = p_target_bottle − p_target_cap (10)

After calculating the rotation angle θ_obj_rotation of the object, the position of the object and its pose unit vector in the workspace can be obtained. Next, this pose unit vector is inverted and used as the orientation of the end-effector to suck the object; by converting the coordinate system from the camera to the robot, the robot manipulator can be used to suck the target object.
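Equation (10) reduces to the angle between two keypoint-difference vectors. A minimal sketch (the function name and argument order are ours, not the authors' code):

```python
import numpy as np

def object_rotation_angle(p_cap, p_cal, p_bottle):
    """Rotation angle of the object from its keypoints, per Eq. (10).

    u points from the cap keypoint to its projection p_cal, and v from
    the cap keypoint to the body keypoint; the angle between them,
    subtracted from pi, gives the object's rotation in space.
    """
    u = np.asarray(p_cal, float) - np.asarray(p_cap, float)
    v = np.asarray(p_bottle, float) - np.asarray(p_cap, float)
    cos_a = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # clip guards against tiny floating-point overshoot beyond [-1, 1]
    return np.pi - np.arccos(np.clip(cos_a, -1.0, 1.0))

# Cap directly above the body with the projection along the same axis:
# u and v are anti-parallel, so the rotation angle is pi - pi = 0.
theta = object_rotation_angle((0, 0, 1), (0, 0, 2), (0, 0, 0))
```

The unit vector v_target_axis itself is simply v normalized by its Euclidean norm.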
Finally, by comparing the expected pose v_goal_axis, which is initially given by the user, with the current pose v_target_axis, the initial and target suction positions and orientations of the end-effector of the robot manipulator can be obtained through coordinate conversion, and the object can be moved from the current pose to the goal pose to achieve the purpose of "re-orientating a PET bottle on the shelf". The schematic diagram is shown in Figure 6, where the black dashed line represents the movement trajectory of the pick-and-place action in the workspace between the two keypoints.

Robot Operation
This section mainly discusses the coordinate conversion and the linear trajectory planning in the workspace, the purpose of which is to generate the trajectory of the endpoint of the robot manipulator. The flowchart of the robot manipulator operation planning is shown in Figure 7. The unit vector of an object can be obtained through the 3D keypoint detection in Section 3; through coordinate conversion, an orientation of the robot endpoint for sucking the object can be obtained. Finally, the operation of the robot manipulator is planned through its kinematics [32]. After obtaining the position and orientation (represented by a quaternion) of the end-effector of the robot manipulator, the movement path between the current pose and the target pose of the object can be planned by interpolating the two quaternions at a given ratio. The purpose of this section is to use the Slerp algorithm to generate the trajectory points that the robot manipulator passes through and to arrange the rotation at each point, so that the orientation trajectory is also smooth.
To perform general linear interpolation between points p_1 and p_2, if T interpolation points are to be generated, the interpolation point at time t is calculated as

p_t = p_1 + (t / T)(p_2 − p_1), t = 0, 1, . . . , T (11)

In the orientation planning part, the spherical linear interpolation of the quaternion in four-dimensional space can be simplified to interpolating two unit vectors v_target_axis and v_goal_axis on the unit circle of a two-dimensional plane. When the two endpoints of the robot manipulator are fixed, the reachable orientations form a spherical surface in three-dimensional space, and the trajectory between the two orientation vectors is the arc swept on this sphere by the approach vector of the robot manipulator. With w the angle between the two vectors, the interpolation vector v_t at time t is calculated by

v_t = [ sin((1 − t/T)w) v_target_axis + sin((t/T)w) v_goal_axis ] / sin(w) (12)

and the schematic diagram is shown in Figure 8.
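The slerp of Equation (12) between two unit vectors can be sketched as follows (a generic implementation with the interpolation ratio t in [0, 1], not the authors' code):

```python
import numpy as np

def slerp(v0, v1, t):
    """Spherical linear interpolation between two unit vectors.

    v_t = [sin((1-t)w) v0 + sin(t w) v1] / sin(w), where w is the
    angle between v0 and v1; every intermediate vector stays on the
    unit sphere, so the orientation sweeps the arc at constant speed.
    """
    v0 = np.asarray(v0, dtype=float)
    v1 = np.asarray(v1, dtype=float)
    dot = np.clip(np.dot(v0, v1), -1.0, 1.0)
    w = np.arccos(dot)                       # angle between the vectors
    if np.isclose(np.sin(w), 0.0):           # (anti-)parallel: plain lerp
        return (1.0 - t) * v0 + t * v1
    return (np.sin((1.0 - t) * w) * v0 + np.sin(t * w) * v1) / np.sin(w)

# Halfway between the x- and y-axis unit vectors: [cos 45deg, sin 45deg],
# still unit length, unlike plain linear interpolation.
v_half = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

Evaluating this at t = 0, 1/T, 2/T, ..., 1 yields the sequence of approach vectors the manipulator follows between the current and goal orientations.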

Experimental Results
The purpose of this section is to verify the validity of the methods proposed in this paper. The three main hardware instruments used in the experiment are shown in Figure 9. The overall software system was implemented with the Robot Operating System (ROS), which provides a standard communication tunnel to support collaborative development. The DCNN and the normal vector computation were implemented with TensorFlow and the Point Cloud Library (PCL), respectively. In this section, the experimental results of the object re-orientation are divided into three main parts: (a) image recognition results; (b) real-world demonstration; and (c) computational efficiency. They are described as follows:

CNN-Based Object Detection
Since a DCNN is used to detect the target object and each pack unit of the object, a large training dataset is usually required to improve the accuracy of the method. To solve this problem, the data augmentation module described in Section 3.5 was added to the system. The information of data augmentation and network training of Mask R-CNN and CNN-based semantic segmentation is shown in Table 1. The Intersection-over-Union (IoU) metric was used to quantitatively evaluate the accuracy of the proposed visual perception algorithm. It is defined by

IoU = (Area of Overlap) / (Area of Union) (13)

Regarding the image recognition results, the original camera image was the input of the Mask R-CNN, and a single extracted object was the input of the CNN-based semantic segmentation for pack unit detection. After that, the semantic segmentation result was optimized by the CRFs-based refinement method. The image processing results of three types of objects by the proposed method are described in Tables 2 and 3. From the output results of the Mask R-CNN, object extraction, CNN-based semantic segmentation, and CRFs-based refinement processes, the mean IoUs of the Mask R-CNN and the pack-unit segmentation, compared with the ground truth data, are 0.942 and 0.913, respectively. The experimental results verify that the proposed method can completely segment the position of each pack unit.
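For binary segmentation masks, the IoU metric used above can be computed directly; a minimal sketch:

```python
import numpy as np

def iou(pred, gt):
    """Intersection-over-Union between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0                     # both masks empty
    return np.logical_and(pred, gt).sum() / union

# Two 2x2 masks sharing one pixel out of three occupied: IoU = 1/3.
a = np.array([[1, 1], [0, 0]])
b = np.array([[1, 0], [1, 0]])
score = iou(a, b)
```

The same function applies both to the per-pixel evaluation here and to the success check in the real-world demonstration, where a score above 0.9 counts as a successful re-orientation.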

Object Pose Estimation
Through the pack unit image output in Section 3.3, the keypoint annotation method described in Section 3.4 could be used to separately calculate the keypoints representing each pack unit. The experimental results marking the keypoints in the original image are shown in Figure 10, where in the left picture the PET bottle stands upright on the shelf and in the right picture it lies flat on the shelf. The keypoints of the cap and body of the PET bottle are marked in red and blue, respectively. After obtaining the keypoint information of the object, the point cloud-based unit vector calculation process described in Section 3.4 could be used to obtain a unit vector representing the pose of the object in the environment. In Figure 11, the unit vector in the workspace is indicated by the red arrow, and the coordinate system is that of the camera, which is installed on top of the slide rail of the robot. After obtaining this unit vector, its inverse could be used as the orientation of the end-effector of the robot manipulator to suck the object through coordinate conversion, so that the robot manipulator can suck the target object. Table 2. Image processing results of three types of objects obtained by the proposed CNN-based object detection method.

[Table 2 shows, for each object (PET bottle, carton drink, tomato can), the origin picture, the output of Mask R-CNN, the object extraction, the output of CNN-based semantic segmentation, and the CRFs-based refinement.]

Table 3. Image processing results of three different poses of the PET bottle obtained by the proposed CNN-based object detection method.

[Table 3 shows, for three poses of the PET bottle, the same processing stages as Table 2.]

To analyze the accuracy of the proposed 3D keypoint detection system, this study used the Mean Absolute Error (MAE) metric to measure the pose estimation results. Using manually marked keypoints as ground truth data, the two poses of the PET bottle in Figure 12 were detected multiple times, and the position and pose were transformed into the robot base frame. The position and rotation estimation errors of the proposed method are defined, respectively, as

E_T,Ω = |T_Ω − T̂_Ω|, E_R,Ω = |R_Ω − R̂_Ω| (14)

where Ω ∈ {x, y, z} represents one of the three axes of the Cartesian coordinate system, T_Ω and R_Ω represent the position and rotation of the ground truth, and T̂_Ω and R̂_Ω represent the position and rotation of the estimated data. Based on the above definitions, the mean absolute error metric is defined as

MAE(X_Ω) = (1/N) Σ_{n=1}^{N} |X_Ω,n − X̂_Ω,n| (15)

where N is the total number of tests and X represents one of the estimated variables T or R. The MAE measures of the pose estimation results are described in Table 4. In Table 4, the average errors of the position estimation results obtained from the proposed method are 0.13, 0.58, and 0.73 cm along the x-, y-, and z-axes, respectively, and the average rotation errors are 3.27, 5.68, and 2.36 degrees about the x-, y-, and z-axes, respectively. This accuracy level is suitable for the robot manipulator to perform the object picking and re-orientation task. Therefore, the experimental results validate the pose estimation accuracy of the proposed algorithm.

Real-World Demonstration
An experiment of the proposed manipulation planning for object re-orientation is presented, and the flow chart of the whole experimental process is described in Figure 12. First, an RGB-D camera was used to capture environmental information, and the expected position and pose of the object were estimated through the keypoint detection system described in Section 3. After obtaining the expected position and pose, the object was re-placed by the user, and the keypoint detection system was used to estimate the current position and pose of the object. The keypoint detection system suspended detection during the operation of the robot manipulator. By comparing the difference between the expected and current position and pose, a series of movements of the robot manipulator was generated to pick the object and re-orientate it to the expected position and pose. After completing the above process, the camera was used to capture the current pose of the object, and it was compared with the picture of the expected pose using the Intersection-over-Union metric. If the similarity was greater than 0.9, the process was determined to be successful; otherwise, the object re-orientation was performed again to achieve a closer result. To ensure that the detection system has good accuracy in real robot operation, the checkerboard calibration method provided by RealSense™ [33] was used to calibrate the camera before the experimental process. The movement states of the robot manipulator are shown in the pictures of Figure 13, ending with the robot returning to its initial state.

Computational Efficiency
The proposed system was implemented in C++ and Python running on an Ubuntu 16.04 personal computer equipped with an Intel® Core™ i7-7700 CPU, 32 GB of DDR4 system memory, and an NVIDIA GeForce GTX 1080 GPU with 8 GB of frame buffer memory. The average processing time of each phase of the proposed 3D keypoint detection system is described in Table 5. The average total processing time was about 1.742 s in real operation, with the CNN-based semantic segmentation and the CRFs-based refinement accounting for most of the processing time.

Conclusions
An object re-orientation planning method based on 3D keypoint detection is proposed for the robot manipulator, so that it can re-orientate an object from an arbitrary pose to a specified position and pose. There are three main contributions of this research: (i) In the object pose estimation system, the CNN-based object detection algorithm is used to recognize the position of each object by separating it into pack units, and the depth image is added to estimate the object's pose in the environment. Due to the characteristics of the CNN-based object detection algorithm, the system can be executed in a category-level manner by adding several relative categories of the object, without needing to build a 3D model of each object in the training phase. Compared with other 3D model-based object detection systems, this method simplifies the procedure more effectively. (ii) In the second stage of CNN-based pack unit detection, the mask image obtained by instance segmentation in the first stage is used as input, so that the environmental effect is reduced and the contour of each pack unit can be accurately and completely segmented. In addition, it requires fewer label categories while using CRFs for individual contour optimization. (iii) In the manipulation planning for object re-orientation, the spherical linear interpolation method is used to effectively generate a series of movements in the workspace for the robot manipulator by comparing the difference between the current pose and the expected pose of the object. It can be concluded from the experimental results that the proposed method can indeed complete the object re-orientation.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this paper: