Curve Set Feature-Based Robust and Fast Pose Estimation Algorithm

Bin picking refers to picking randomly-piled objects from a bin for industrial production purposes, and robotic bin picking is widely used in automated assembly lines. To achieve high productivity, a fast and robust pose estimation algorithm is necessary to recognize and localize the randomly-piled parts. This paper proposes a pose estimation algorithm for bin picking tasks using point cloud data. A novel descriptor, the Curve Set Feature (CSF), is proposed to describe a point by the surface fluctuation around it; it is also capable of evaluating poses. The Rotation Match Feature (RMF) is proposed to match CSFs efficiently. The matching process combines the idea of matching in 2D space from the original Point Pair Feature (PPF) algorithm with nearest neighbor search. A voxel-based pose verification method is introduced to evaluate the poses and is shown to be more than 30-times faster than kd-tree-based verification. Our algorithm is evaluated against a large number of synthetic and real scenes and shown to be robust to noise, able to detect metal parts, and both more accurate and more than 10-times faster than PPF and the Oriented, Unique and Repeatable (OUR)-Clustered Viewpoint Feature Histogram (CVFH) algorithm.


Introduction
Removing individual objects from an unordered pile of parts in a carrier or box (bin picking) is one of the classical problems of robotics research [1]. Typically, the system consists of a sensor mounted above the box, an industrial robot arm and a processor. As 3D sensors are becoming cost-effective, bin picking systems using 3D sensors have been developed in recent years [1][2][3][4]. In this paper, we address the challenge of efficiently estimating the poses of parts in bin picking systems from point cloud data.
Many state-of-the-art pose estimation algorithms incorporate color information with depth information [5][6][7][8]. Hinterstoisser et al. proposed multimodal-LINE (LINEMOD) by combining color images with a dense depth sensor in [7], and Rios-Cabrera et al. [5] proposed Discriminatively Trained Templates (DTT) based on LINEMOD. Hinterstoisser et al. improved LINEMOD in [8] and achieved an average recognition rate of 96.60% and a speed of 119 ms/frame on their ACCV3D dataset. These algorithms can achieve a high recognition rate and speed. However, if color information is not available, the performance declines.
Compared to daily objects such as those in the ACCV3D dataset of [8], objects in bin picking tasks have two characteristics. One is that the objects usually share the same color; therefore, algorithms incorporating color information, such as LINEMOD, cannot present their best performance. The other is that many industrial objects are composed of shape primitives such as cylinders, spheres and planes. Some features of the points on these shape primitives are very similar, which makes industrial objects more difficult to recognize and localize than daily objects for some algorithms.
One of the promising pose estimation algorithms was proposed by Drost et al. [9]. The algorithm combines an efficient voting scheme with point pair features and does not use color information.

Curve Set Feature and Rotation Match Feature
In this paper, we denote s i ∈ {S} for points in the scene cloud, m i ∈ {M} for points in the model cloud, n(m i ) for the normal of m i , N m for the number of model points, N s for the number of scene points and N select for the number of selected scene points for which we compute the descriptor and match with model points. The model diameter diam(M) is the diameter of the circumcircle of the model. We further denote SP(s i ) for the visible points within the sphere centered at s i with radius diam(M).

Curve Set Feature
The element used in our algorithm is a curve feature. A curve feature of a point m i is a quantized 2D curve that starts from m i and lies in a plane containing the normal n(m i ), as presented in Figure 1. It is computed by the following steps:

1. Choose a vector v 1 starting from m i and perpendicular to n(m i ). Build a 2D local coordinate system whose origin is m i , whose y axis is n(m i ) and whose x axis is v 1 .

2. All of the points within the local coordinate system whose x value is between zero and diam(M) are denoted as C 1 . Starting from x = 0, divide the local coordinate system into small intervals with length X int (in our experiment, we set X int to the integer not smaller than the downsampling size) in the x direction. In every small interval, keep the point with the largest y value and delete the others from C 1 ; this selects the visible points.

3. Divide the local coordinate system in the x direction again with a larger length X step > X int . For the points of C 1 within the n-th interval, compute the average y value y n . If there is no point in this interval, set y n = ∞.

4. The curve feature of point m i in the direction of v 1 is f(m i , v 1 ) = (y 1 , y 2 , . . . , y D1 ), where D 1 is the number of intervals of length X step within diam(M).

The above steps describe how to compute the curve feature of m i in one direction. To describe the surrounding points of m i , we need to compute the curve features in all directions. For a point, we compute a curve feature every one degree; therefore, every point has D 2 = 360 curve features, and the set of these features is the curve set feature of m i : F(m i ) = {f(m i , v p ) | p = 1, . . . , D 2 }. In Step 2, we need to find the points within the plane, but it is time consuming to traverse all points in SP(m i ) 360 times. Instead, we traverse SP(m i ) only once before computing the features and assign every point to the plane to which it belongs.
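As an illustration, Steps 2-4 can be sketched in C++. This is a minimal sketch, not the paper's implementation: it assumes the neighboring points have already been projected into the 2D local frame (x along v 1 , y along n(m i )), and the names Point2D and curveFeature are ours.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// A neighbour point already projected into the 2D local frame of m_i:
// x along v_1, y along the normal n(m_i).
struct Point2D { double x, y; };

// Sketch of Steps 2-4: build the curve feature for one direction.
// diam   : model diameter diam(M)
// x_int  : fine interval length X_int (visibility filtering)
// x_step : coarse interval length X_step (averaging), x_step > x_int
// Empty coarse intervals are marked with +infinity, as in Step 3.
std::vector<double> curveFeature(const std::vector<Point2D>& plane_pts,
                                 double diam, double x_int, double x_step) {
    const double INF = std::numeric_limits<double>::infinity();
    // Step 2: per fine interval, keep only the largest y (the visible point).
    int n_fine = static_cast<int>(std::ceil(diam / x_int));
    std::vector<double> top(n_fine, -INF);
    for (const Point2D& p : plane_pts) {
        if (p.x < 0.0 || p.x >= diam) continue;
        int bin = static_cast<int>(p.x / x_int);
        top[bin] = std::max(top[bin], p.y);
    }
    // Step 3: average the visible points inside each coarse interval.
    int n_coarse = static_cast<int>(std::ceil(diam / x_step));  // D_1
    std::vector<double> feature(n_coarse, INF);
    std::vector<int> count(n_coarse, 0);
    for (int i = 0; i < n_fine; ++i) {
        if (top[i] == -INF) continue;              // no visible point here
        int bin = static_cast<int>(i * x_int / x_step);
        if (count[bin] == 0) feature[bin] = 0.0;
        feature[bin] += top[i];
        ++count[bin];
    }
    for (int i = 0; i < n_coarse; ++i)
        if (count[i] > 0) feature[i] /= count[i];
    return feature;  // Step 4: (y_1, ..., y_{D_1})
}
```

The fine pass keeps only the highest point per X int slot, which is what discards the invisible (occluded) points before averaging.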
When computing the curve features, we delete invisible points to keep the scene cloud and the model cloud consistent. In general, the model cloud contains all points of the object, while the scene cloud contains only a part because of occlusion. If we did not account for this difference, the y values of the model cloud would be smaller than those of the scene cloud. When the object is in different poses, the visible part changes, and we cannot enumerate all of these situations. Therefore, when computing the curve features of a point (both model and scene points), we assume that the camera's view direction and the normal of this point are collinear.

Compare Curve Set Features
We define the Curve Similarity (CS) to compute the similarity between two curve features, where f p (m i ) denotes the curve feature of m i in the p-th direction and y p,n (m i ) its n-th element:

CS(f p (m i ), f q (s j )) = Σ n=1..D 1 1[ |y p,n (m i ) − y q,n (s j )| < y thres ]

where y thres is the threshold and only intervals with finite y values on both curves are counted. In our experiment, we set y thres = 0.05 × diam(M).
We further define the Curve Set Similarity (CSS) and the Summation of Curve Set Similarity (SCSS) to describe the similarity between two curve set features. Considering that the curve features in different directions of a point are different, we need to specify the rotation angle α in CSS and SCSS:

CSS(F(m i ), F(s j ), α) = ( CS(f p+α (m i ), f p (s j )) ) p=1,...,D 2

SCSS(F(m i ), F(s j ), α) = Σ p=1..D 2 CS(f p+α (m i ), f p (s j ))

where the direction index p + α is taken modulo D 2 . The curve similarity CS(f p (m i ), f q (s j )) is equivalent to matching the local coordinate systems of the two curves and counting the number of curve feature elements that share the same interval index and a similar value. Therefore, if the CS value of two curve features is large, the two curves are similar.
The curve set similarity CSS(F(m i ), F(s j ), α) is equivalent to aligning m i , s j and their normals, rotating F(m i ) around the normal by α and computing the curve similarities with F(s j ) in every direction, as shown in Figure 2. The transformation from F(m i ) to F(s j ) can be expressed as:

P(m i , s j , α) = T s j →g −1 R y (α) T m i →g (4)

where P(m i , s j , α) is the pose, T m i →g and T s j →g translate m i and s j to the origin of an intermediate coordinate system and rotate their normals onto its y axis, and R y (α) is the rotation by α around that axis; we borrow this idea from [9]. Therefore, CSS(F(m i ), F(s j ), α) shows the curve similarity in every direction for the pose P(m i , s j , α), and SCSS(F(m i ), F(s j ), α) can be used as a rough evaluation of the pose P(m i , s j , α). The dimension of a curve set feature is D 1 D 2 , which depends only on diam(M) and X step . Therefore, the downsampling sizes of the model cloud and the scene cloud can be different, as long as each is smaller than X step .

Figure 2. Transformation of the model and scene local coordinate systems, as proposed in [9]. A pose can be computed from Equation (4).
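Under this reading, CS counts matching intervals and SCSS sums the CS values over all rotated directions. A minimal sketch follows; the names curveSimilarity and scss are ours, and treating infinite y values as non-matching is an assumption about the empty-interval case.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Curve = std::vector<double>;    // one curve feature: y_1 ... y_{D_1}
using CurveSet = std::vector<Curve>;  // D_2 curves, one per degree

// CS: count the intervals whose y values are both finite and close.
int curveSimilarity(const Curve& a, const Curve& b, double y_thres) {
    int score = 0;
    for (std::size_t n = 0; n < a.size() && n < b.size(); ++n)
        if (std::isfinite(a[n]) && std::isfinite(b[n]) &&
            std::fabs(a[n] - b[n]) < y_thres)
            ++score;
    return score;
}

// SCSS: rotate F(m_i) by alpha direction steps and sum the CS values over
// all D_2 directions; a large value indicates a good rough pose.
int scss(const CurveSet& Fm, const CurveSet& Fs, int alpha, double y_thres) {
    const int D2 = static_cast<int>(Fm.size());
    int sum = 0;
    for (int p = 0; p < D2; ++p)
        sum += curveSimilarity(Fm[(p + alpha) % D2], Fs[p], y_thres);
    return sum;
}
```

For example, if Fs is exactly Fm rotated by one direction step, scss is maximal at alpha = 1 and drops for other angles.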

Rotation Match Feature
In pose estimation tasks, the key is to find correspondence between the model and scene. However, it is difficult to search corresponding points using curve set features without any preprocessing. Firstly, the dimension of curve set features is D 1 D 2 (usually larger than 1000), which makes it difficult to search efficiently. Secondly, the curve set features are not rotationally symmetric around the normal, and that is why we need to specify α when computing CSS. Therefore, we propose the Rotation Match Feature (RMF) to solve these problems.
The RMF of a point m i is computed by the following steps: 1. Randomly choose a model point m r as the reference point.

3.
The RMF of m i is the CSS when α = α max (m i , m r ).
A sample of the SCSS value against α is presented in Figure 3. The aim of selecting model reference points and computing α max (m i , m r ) is to eliminate the rotational degree of freedom and to decrease the dimension of the feature. If m i and s j are corresponding points, their RMFs with the same reference point should be close. Suppose a pair of corresponding points m i and s j is found, and we need to compute the transformation that matches the model to the scene. After aligning the two points and their normals, another rotation R y (α) around the normal is necessary, and we can determine α by using the RMF. When computing the RMFs of the two points, we matched both m i and s j to m r based on Equation (4), using the poses P(m i , m r , α max (m i , m r )) and P(s j , m r , α max (s j , m r )). Therefore, the transformation from the model to the scene is:

T m→s = P(s j , m r , α max (s j , m r )) −1 P(m i , m r , α max (m i , m r )) (7)

Instead of only one reference point, we use N r model reference points to improve the recognition rate. Besides, to improve efficiency, we do not traverse all 360 degrees when computing the RMFs of scene points. Instead, we compute the SCSS value every five degrees and find the angle α temp_max with the maximum SCSS value. Then, we check the neighboring angles of α temp_max and choose α max .
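The coarse five-degree search followed by local refinement can be sketched as follows. This is a simplified illustration: findAlphaMax is our name, the SCSS evaluation is abstracted into a callable, and we assume checking the four angles on each side of the coarse peak suffices.

```cpp
#include <cassert>
#include <functional>

// Coarse-to-fine search for alpha_max as described in the text: evaluate the
// SCSS every five degrees, then check the neighbours of the best coarse angle.
// 'scss_at' stands for any callable returning the SCSS value at an angle.
int findAlphaMax(const std::function<int(int)>& scss_at) {
    int best_alpha = 0, best_score = scss_at(0);
    for (int a = 5; a < 360; a += 5) {            // coarse pass, 72 evaluations
        int s = scss_at(a);
        if (s > best_score) { best_score = s; best_alpha = a; }
    }
    for (int d = -4; d <= 4; ++d) {               // refine around the coarse peak
        if (d == 0) continue;
        int a = (best_alpha + d + 360) % 360;
        int s = scss_at(a);
        if (s > best_score) { best_score = s; best_alpha = a; }
    }
    return best_alpha;                             // at most 80 SCSS evaluations
}
```

Compared with evaluating all 360 angles, this costs at most 80 SCSS evaluations, which matches the efficiency argument in the text.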

Matching Process
The workflow of the matching process is presented in Figure 4.

Normal Estimation and Modification
Normal estimation is performed by fitting a plane to a set of neighboring points; since this technique is widely used, we do not describe it in detail.
After estimating the normals, we have to decide the sign of each normal, and in general, there is no purely mathematical way to solve this problem. For our algorithm, the key requirement is that if the model and the scene are correctly matched, the signs of their normals should be consistent. Therefore, we make the normals of both the model cloud and the scene cloud point outward from the objects, as presented in Figure 5. Considering that there is always occlusion in the scene cloud, the signs of the model normals and of the scene normals are computed in different ways.
A scene cloud is always partially visible, and the viewpoint is always outside the object. Therefore, we define a vector vs i starting from a scene point s i to the viewpoint. The angle between n(s i ) and vs i should be less than 90°; if not, the sign of n(s i ) is flipped.
If the model cloud is from a CAD model, the triangle vertices (v 1 , v 2 , v 3 ) of the CAD meshes are always ordered consistently, so that the cross product (v 2 − v 1 ) × (v 3 − v 1 ) points outward. The sign of a model normal can thus be decided by the mesh it belongs to: if the normal points inward, opposite to this cross product, its sign is flipped. If the model cloud is from a 3D sensor, the same method as for the scene cloud can be used.
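The mesh-based sign rule can be expressed compactly. A minimal sketch under the stated assumption of consistently ordered triangle vertices; the helper types and the name orientModelNormal are ours.

```cpp
#include <cassert>

struct Vec3 { double x, y, z; };

Vec3 sub(Vec3 a, Vec3 b)   { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
Vec3 cross(Vec3 a, Vec3 b) { return {a.y * b.z - a.z * b.y,
                                     a.z * b.x - a.x * b.z,
                                     a.x * b.y - a.y * b.x}; }
double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Flip an estimated model normal when it disagrees with the outward
// direction (v2 - v1) x (v3 - v1) given by the consistent mesh winding.
Vec3 orientModelNormal(Vec3 n, Vec3 v1, Vec3 v2, Vec3 v3) {
    Vec3 outward = cross(sub(v2, v1), sub(v3, v1));
    if (dot(n, outward) < 0.0)
        return {-n.x, -n.y, -n.z};
    return n;
}
```

The scene-side rule is analogous, with the vector toward the viewpoint playing the role of the outward direction.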

Build Model Feature Library
The model feature library is built during the offline stage. Suppose we have N m model points and N r reference points; then the library contains N m N r items. Each item contains four pieces of information: the model point index, the reference point index, the RMF and the α max of these two points, as shown in Figure 6.
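The library layout described above can be sketched as a simple record plus a build loop. This is an illustrative sketch, not the paper's code: the field and function names are ours, and the RMF computation is abstracted into a callable.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// One item of the offline model feature library (cf. Figure 6).
struct LibraryItem {
    int model_index;          // index of the model point m_i
    int ref_index;            // index of the reference point m_r
    std::vector<int> rmf;     // RMF(m_i, m_r): the CSS vector at alpha_max
    int alpha_max;            // angle maximising SCSS(F(m_i), F(m_r), alpha)
};

// Build the N_m * N_r items; 'rmf_of' returns the feature vector together
// with alpha_max for a (model point, reference point) pair.
std::vector<LibraryItem> buildLibrary(
    int n_model, int n_ref,
    const std::function<std::pair<std::vector<int>, int>(int, int)>& rmf_of) {
    std::vector<LibraryItem> lib;
    lib.reserve(static_cast<std::size_t>(n_model) * n_ref);
    for (int r = 0; r < n_ref; ++r)
        for (int m = 0; m < n_model; ++m) {
            std::pair<std::vector<int>, int> f = rmf_of(m, r);
            lib.push_back({m, r, f.first, f.second});
        }
    return lib;
}
```

The RMF vectors of all items are what gets indexed for the nearest neighbor search in the online stage.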

Scene Cloud Preprocess
In real bin-picking tasks, the position of the bin is usually known. Therefore, in order to reduce the computation time and the number of noise points, we remove the points belonging to the bin from the scene cloud.
Then, we perform Euclidean segmentation on the remaining points. Performance increases by considering only scene points in the same cluster when computing curve features. However, our algorithm does not rely heavily on the segmentation result, as we will show in Section 4.
Selected scene points are the points for which we compute the curve set features and which we match with model points. The features and the matching process rely on the normals of these points, and we found that the normals of points near boundary points were not reliable. Therefore, boundary estimation is performed on the scene cloud, and the scene points that are far away from boundary points become candidates. Then, we randomly choose N select points from the candidates as the selected scene points.

Scene Feature Computation and Nearest Neighbor Search
For a selected scene point s j , RMF(s j , m r ) is computed and searched in the model feature library using the Fast Library for Approximate Nearest Neighbors (FLANN) [20].
Suppose we search for RMF(s j , m r ) in the library and get RMF(m i , m r ); then s j and m i may be corresponding points because they share a similar RMF. We match m i to s j as described in Section 2.3 and Equation (7).
Following [9], we call the transformation pair (m i , s j , α) a local coordinate. Because of noise, occlusion and other factors, the model point with the nearest RMF may not be the corresponding point of the scene point. Therefore, we search for the K nn nearest model RMFs for every scene RMF. The transformations (poses) between the model points and the scene points are saved for pose verification; there are in total N select N r K nn poses to verify. The process is presented in Algorithm 1.

Pose Verification
The number of poses (local coordinates) to verify from the last stage is large. As mentioned before, the SCSS value can be treated as a rough pose evaluation, and we use it to improve the verification efficiency. Given a local coordinate (m i , s j , α), we compute SCSS(F(m i ), F(s j ), α) to evaluate its quality. After computing the SCSS value for all of the local coordinates, we select the top N p among them for the next verification stage.
One approach to pose verification is to transform the model cloud into the scene space. Then, for every model point, the nearest scene point is searched, and the distance between these two points and the angle between their normals are computed. If the distance and the angle are smaller than the specified thresholds, this model point is considered to be fitted. If the number of fitted points is sufficiently large, the pose is considered to be correct [21]. This method is intuitive and effective, but time consuming, because it needs to search for the nearest scene point for every model point and every pose.
The key to pose verification is to search for the nearest scene point of each transformed model point efficiently. To achieve this, we divide the scene space into small cubic voxels with edge length L voxel . The edges of the voxels are parallel to the axes of the scene coordinate system. At first, the values of all voxels are −1. The 3D coordinate of a voxel center is converted to three non-negative integers by Equation (8):

x int = ⌊(x voxel − min_x)/L voxel ⌋, y int = ⌊(y voxel − min_y)/L voxel ⌋, z int = ⌊(z voxel − min_z)/L voxel ⌋ (8)

where (x voxel , y voxel , z voxel ) is the coordinate of the voxel center, min_x, min_y, min_z are the minimum coordinate components of the scene cloud and x int , y int , z int are the resulting integers. The voxel is accessed through these three integers. Then, for every scene point s j , Equation (8) is used to find the voxel that s j is in, and this voxel is a seed voxel Sv j . The values of Sv j and the surrounding voxels within the sphere centered at Sv j with radius Radius seed are changed to the index j of the scene point. The constant Radius seed is set to the distance threshold (in our experiment, Radius seed = 2 mm). In the verification, for a transformed model point m i , we access the voxel that m i is in using the same method, and the value of the voxel is the index of the nearest scene point. If the value is −1, the distance to the nearest scene point is larger than the threshold, and further verification is unnecessary, as presented in Figure 7. If a transformed model point m i finds a valid nearest scene point s p and the angle between their normals is less than a threshold, m i is a fitted model point. The score of a pose is the number of fitted model points. By using this method, we can verify poses efficiently, as presented in Figure 8. The pose scores computed by our method and by a kd-tree are very similar, but our method is more than 30-times faster. After verifying all of the poses, the final pose is the one with the highest score. The verification process is presented in Algorithm 2.
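The voxel indexing of Equation (8) and the constant-time lookup can be sketched as follows. This is a minimal illustration, not the paper's implementation: a hash map replaces the dense 3D array for brevity, the names are ours, and filling the neighborhood within Radius seed is elided.

```cpp
#include <cassert>
#include <cmath>
#include <unordered_map>

// Sketch of the voxel lookup of Equation (8). Values default to -1,
// meaning "no scene point within the distance threshold".
struct VoxelGrid {
    double min_x, min_y, min_z, L_voxel;
    std::unordered_map<long long, int> values;

    long long key(double x, double y, double z) const {
        long long xi = (long long)std::floor((x - min_x) / L_voxel);
        long long yi = (long long)std::floor((y - min_y) / L_voxel);
        long long zi = (long long)std::floor((z - min_z) / L_voxel);
        return (xi << 42) | (yi << 21) | zi;   // assumes < 2^21 voxels per axis
    }
    // Mark the seed voxel Sv_j of scene point j (a full implementation would
    // also mark the neighbouring voxels within Radius_seed).
    void markSeed(double x, double y, double z, int j) {
        values[key(x, y, z)] = j;
    }
    // For a transformed model point: index of the nearest scene point, or -1.
    int lookup(double x, double y, double z) const {
        auto it = values.find(key(x, y, z));
        return it == values.end() ? -1 : it->second;
    }
};
```

Each lookup is a single hash access instead of a kd-tree traversal, which is where the reported speedup over kd-tree-based verification comes from.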
Figure 8. (a-c) The pose score against displacement of the model cloud by our verification method and the kd-tree-based method. The score difference between the two methods is small, and our method is more than 30-times faster.
Algorithm 2: Voxel-based pose verification.
for every scene point s j do
    find the seed voxel Sv j by Equation (8);
    change the values of Sv j and the surrounding voxels within Radius seed to j;
end
for k = 1 to N p do
    transform the k-th selected local coordinate to pose P k according to Equation (4);
    for every transformed model point P k m i do
        read the value index of the voxel it is in;
        if index ≠ −1 and cos(n(s index ), P k n(m i )) > thres then
            Score(P k ) += 1;
        end
    end
end

Multiple Pose Detection
Sometimes, it is necessary to detect multiple poses in one scene, and our algorithm is capable of that. During pose verification, a large number of poses is evaluated, and these verification results are used in the following steps:

1. Rank all of the poses by their scores.

2. Suppose P 1 is the first selected pose. Transform the model cloud into the scene space according to P 1 .

3. For every transformed model point, check whether the value of the voxel it is in is −1. If not, change the value of all of the voxels sharing the same value with this voxel to −1.

4. Re-verify the poses with high scores from Step 1, and choose the pose P 2 with the highest score. P 2 is the new pose.

This procedure effectively deletes the scene points corresponding to the previously selected pose before selecting the next one. If we did not delete these points, we would always obtain the same pose.
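Step 3 of the procedure, deleting the scene points explained by an accepted pose, can be sketched as follows. This is a simplified illustration over a flattened voxel array; the function name suppressScenePoints is ours.

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

// Given the voxels hit by the transformed model points of an accepted pose,
// delete the scene points that pose explains by resetting every voxel
// carrying their indices back to -1.
void suppressScenePoints(std::vector<int>& voxel_values,
                         const std::vector<int>& hit_voxels) {
    std::unordered_set<int> consumed;
    for (int v : hit_voxels)
        if (voxel_values[v] != -1)
            consumed.insert(voxel_values[v]);   // scene indices to remove
    for (int& val : voxel_values)
        if (val != -1 && consumed.count(val))
            val = -1;                           // clear all voxels of those points
}
```

Because a scene point's index may be written into many voxels around its seed voxel, every voxel sharing a consumed index must be cleared, not just the hit ones.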

Experiment
We evaluated our algorithm against a large number of synthetic and real scenes. Six industrial parts were used in our experiment; the models and their diam(M) are shown in Figure 9. We were interested in the recognition rate and the speed of the algorithm. A resulting pose was considered to be correct if the error was less than the specified threshold; in our experiment, the threshold was set to diam(M)/10 for the translation and 5° for the rotation. For all experiments, we set X step to 3 mm, and the downsampling sizes of the model cloud and the scene cloud were smaller than X step . The default values of the selected scene point number and the reference model point number were N select = 50 and N r = 20. We compared our algorithm with the PPF algorithm of Drost et al. [9], denoted as PPF, and with OUR-CVFH [19]. The resulting poses of all three algorithms were refined by ICP. All given timings contain the whole process, including normal estimation, boundary estimation, matching and ICP refinement. The algorithms were implemented in C++ and run on an Intel Core i7-4810MQ CPU at 2.8 GHz with 8 GB RAM.

Synthetic Scenes
We first evaluated our algorithm against synthetic scenes. The scenes were generated with multiple instances of the same object in every scene using the simulator in [22]. One hundred synthetic scenes were generated for every model, and the number of objects in every scene varied from 7 to 12. Then, occluded points were removed based on the viewpoint. We ran our algorithm four times. The default parameters were used in the first set of experiments, and smaller values N select = 20 and N r = 5 were used in the second set. In order to measure the influence of segmentation, we ran the algorithm without Euclidean segmentation using the default parameters and the fast parameters in the third and fourth sets, respectively. The four experiment results are denoted as CSF Default, CSF Fast, CSF No Seg Default and CSF No Seg Fast, respectively. The recognition rate and speed of the algorithms on every model are presented in Tables 1 and 2, respectively, and some experiment results are presented in Figure 10. When using the default parameters, our algorithm achieved a recognition rate of 97.36%. When run with the fast parameters, our algorithm still gave a higher recognition rate than OUR-CVFH and PPF while being more than 10-times faster than OUR-CVFH and 35-times faster than PPF.
When using the default parameters without Euclidean segmentation, the recognition rate and speed did not change greatly. For the fast parameters, segmentation improved the recognition rate. As stated in Section 3.3, our algorithm does not depend heavily on segmentation: using the fast parameters without segmentation, it still achieved a recognition rate of 82.69%. We then tested our algorithm against noise. Gaussian noise with a standard deviation of σ = 0.05 diam(M) was applied to a part of the points (10-50%) in the synthetic scenes. Then, our algorithm was applied to these scenes using the default parameters with different percentages of noise points. The performance is presented in Figure 11, and some detection results are presented in Figure 12. Our algorithm worked well against noise: when noise was added to 50% of the scene points, the worst recognition rate was 83.00% (for the magnet), and an overall recognition rate of over 88.36% was achieved.

Real Scenes
We tested our algorithm on real 3D data scanned with our 3D sensor. We did not experiment on the switch: because of the occlusion and noise in real scenes, the part sometimes had two possible poses, as presented in Figure 13, and it was difficult for us to distinguish which one was correct and to make ground truth poses.
Firstly, we performed a quantitative evaluation on the gear, the L-shaped part and the magnet. We took 25 scenes for each part, and the ground truth poses of the objects were made manually. As in the simulation experiment, six poses were detected in every scene. The performance of the algorithms is presented in Tables 3 and 4, respectively, and some results are presented in Figure 14. Though OUR-CVFH presented good results in the synthetic experiment, it had the worst recognition rate on the L-shaped part. This is mainly because the segmentation in real scenes was not good enough: OUR-CVFH performs Euclidean segmentation before recognition, and as shown in Figure 15, the segmentation result of the real scene was worse than that of the simulated scene. For the gear and the magnet, OUR-CVFH performed better because the clouds were easier to segment. Compared to OUR-CVFH, our algorithm is more robust to noise and to the failure of segmentation. The metal L part and the bulge are metal parts, and we performed a qualitative evaluation on these two parts. Some results are presented in Figure 16. Our algorithm can estimate the poses of metal parts when the number of missing points is not very large.

Conclusions
This study proposes a 6D pose estimation algorithm for robotic bin picking systems. Two features, CSF and RMF, are proposed to describe scene points and match them with model points. To improve the efficiency of pose verification, we divide the scene space into voxels instead of using a kd-tree. Our algorithm was evaluated against a large number of synthetic and real scenes and presented a high recognition rate and efficiency. It is also proven to be robust to noise and heavily cluttered scenes and able to detect metal parts. However, the performance of our method tends to decline for occluded objects, because occlusion changes the RMF.