Accurate Object Pose Estimation Using Depth Only

Object recognition and pose estimation are important tasks in computer vision. This paper proposes a pose estimation algorithm that uses only depth information. Foreground and background points are distinguished based on their positions relative to boundaries. Model templates are selected using synthetic scenes to compensate for the weaknesses of the point pair feature algorithm. An accurate and fast pose verification method is introduced to select the result poses from thousands of candidates. Our algorithm is evaluated on a large number of scenes and shown to be more accurate than algorithms that use both color and depth information.


Introduction
Vision-based object recognition and pose estimation has been widely researched because of its importance in robotics applications. Given the CAD model of an object, the task is to recognize the object and accurately estimate its 6 Degree-of-Freedom (DoF) pose. Though much work has been conducted, this remains a challenging task in computer vision because of sensor noise, occlusion and background clutter. Generally, the objects are captured by 2D/3D sensors and, depending on the vision sensor, three kinds of information are utilized for recognition: RGB, depth and RGB-D.
In order to estimate pose of objects using the RGB cameras, some research has been carried out. In [1], an approach for building metric 3D models of objects using local descriptors from several images was proposed. Given an input image, local descriptors are matched to the stored models online, using a novel combination of the RANSAC and Mean Shift algorithms to register multiple instances of each object. However, this method can only be used for the objects with texture in household environments. For the texture-less objects, Munoz et al. [2] proposed a method using the edge information with only one image as the input. The pose is estimated using edge correspondences, where the similarity measurement is encoded using a pre-computed linear regression matrix. However, the edge detection is heavily affected by the illumination conditions so that some research using MFC (Multi Flash Camera) [3][4][5] has been conducted. In [3], the silhouettes are segmented into different objects and each silhouette is matched across a database of object silhouettes in different poses to find the coarse pose. Liu et al. [4] proposed the Chamfer Matching method to extract the depth edge and the method is able to perform pose estimation within one second in an extremely cluttered environment. In [5], a method for finding a needle in a specular haystack is proposed by reconstructing the screw axis as a 3D line.
As 3D sensors become more and more affordable, methods using point clouds or depth images have been proposed [6][7][8][9][10]. Rusu et al. [11] introduced the Viewpoint Feature Histogram (VFH) descriptor, which performs a 3D segmentation of the scene, calculates a single descriptor for the whole object surface and matches it with model descriptors. Based on it, the Clustered Viewpoint Feature Histogram (CVFH) [8] and the Oriented, Unique and Repeatable Clustered Viewpoint Feature Histogram (OUR-CVFH) [12] were proposed. These methods can detect multiple objects using only depth information.

Method
In our algorithm, the model size of the target diam(M) is defined as the maximum 3D distance between every two points in the model. The pipeline of our estimation algorithm is presented in Figure 1. The input is a depth image or a point cloud. Firstly, scene preprocessing is performed to remove some irrelevant points. Then, a point pair feature algorithm is performed on the remaining points to generate pose candidates. These poses are evaluated by the pose verification method. The result poses are selected from poses with high scores.

Scene Preprocessing
Before matching, a boundary-based scene preprocessing is performed to remove the points belonging to the background and to foreground objects whose sizes are larger than diam(M), as shown in Figure 2. For a depth image, the gradient of every pixel is calculated and, if the gradient magnitude of a pixel is larger than a threshold (in our experiment, 10 mm), the pixel is considered a boundary pixel. Then, based on the Connected-Component Labeling Algorithm of [25], the boundary pixels are clustered into curves if they meet the following conditions: (1) Every pixel of a curve can find at least one pixel of the same curve among its eight surrounding pixels. (2) The 3D distance between the corresponding 3D points of every two neighboring pixels is less than a threshold d_con (slightly larger than the average point distance of the cloud).
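The gradient-threshold step can be sketched in a few lines of numpy. This is an illustrative sketch, not the paper's implementation: the function name and the use of central differences are our own choices, and the curve clustering that follows is omitted.

```python
import numpy as np

def boundary_pixels(depth, grad_thresh=10.0):
    """Mark pixels whose depth-gradient magnitude exceeds grad_thresh (mm).

    depth: HxW array of depth values in millimetres.
    Returns a boolean HxW mask of boundary pixels.
    """
    # Central-difference gradients along rows (gy) and columns (gx).
    gy, gx = np.gradient(depth.astype(np.float64))
    mag = np.hypot(gx, gy)
    return mag > grad_thresh
```

On a synthetic depth step of 100 mm, the mask fires exactly on the two pixel columns straddling the discontinuity.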
The curves have two functions. Generally, the pixels of the same curve belong to the same object as long as d_con is not very large. If the length of a curve (the maximum 3D distance between every two points in the curve) is larger than diam(M), we assume that the curve does not belong to the target object and remove it, as shown in Figure 3. Therefore, curves can be used to remove useless boundary pixels. The curves are also used in the boundary verification, which will be introduced in Section 2.4.3. It should be noted that it is very difficult to ensure that all the boundary pixels of an object fall in one curve and, at the same time, that the pixels of different objects are not connected: the former needs a large d_con, which conflicts with the latter. Instead, we only ensure that all the pixels in a curve belong to the same object, even if an object contains multiple curves. Therefore, d_con is set slightly larger than the average point distance.
Then, we introduce how to distinguish foreground points from background points using the boundaries. Suppose there is a cuboid on a plane and the camera is above it, as shown in Figure 4a. For a foreground point on the cuboid, the angle between the gradient direction of a nearby boundary pixel and the vector from the point to that boundary pixel is less than 90°, while for a background point on the plane, as shown in Figure 4d, the angle is larger than 90°. This difference is used to distinguish foreground points from background points.
Starting from a point s_i, the nearest boundary point b_m in a given direction is searched on the 2D boundary map. If the angle between the gradient direction of b_m and the vector from s_i to b_m is less than 90° and the 3D distance between s_i and b_m is less than diam(M), s_i is considered to have found a valid intersection. This search is performed in 36 directions for s_i (every 10° on the 2D map) and, if the number of valid intersections is larger than a threshold N_valid, s_i is considered a foreground point and reserved. Otherwise, s_i is removed. We found that a threshold of 10 to 20 is proper for most objects. The result of the process is shown in Figure 2.
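The direction search above can be sketched as follows. This is a simplified illustration: the 3D-distance check against diam(M) is omitted, and the names (`is_foreground`, `grad_dir`) are ours, not the paper's.

```python
import numpy as np

def is_foreground(si, boundary_mask, grad_dir, n_dirs=36, n_valid=15,
                  max_steps=100):
    """Decide whether pixel si = (row, col) is a foreground point.

    boundary_mask: HxW bool array of boundary pixels.
    grad_dir: HxWx2 array with the unit depth-gradient direction of each
              boundary pixel. The diam(M) distance check is omitted here.
    """
    h, w = boundary_mask.shape
    valid = 0
    for k in range(n_dirs):
        theta = 2 * np.pi * k / n_dirs
        step = np.array([np.sin(theta), np.cos(theta)])
        # Walk along the ray until the first boundary pixel is hit.
        for t in range(1, max_steps):
            p = np.round(si + t * step).astype(int)
            if not (0 <= p[0] < h and 0 <= p[1] < w):
                break
            if boundary_mask[p[0], p[1]]:
                v = p - si  # vector from si to the boundary pixel
                # Valid intersection: angle(gradient, v) < 90 degrees.
                if np.dot(grad_dir[p[0], p[1]], v) > 0:
                    valid += 1
                break
    return valid > n_valid
```

A point surrounded by a boundary whose gradient points outward (the interior of an object) collects valid intersections in every direction; a point outside the object hits few or no valid boundaries and is removed.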

Point Pair Feature
To obtain an initial guess of the pose, we use the point pair feature algorithm [10]. Given an oriented scene point cloud or depth image and a target model, the point pair feature is calculated for pairs of oriented points. By aligning the point locations and the normals of point pairs sharing the same feature, the 6-DoF pose can be recovered. For two points m_1 and m_2 with normals n_1 and n_2, and d = m_2 − m_1, the feature is defined by Equation (1):

F(m_1, m_2) = (‖d‖_2, ∠(n_1, d), ∠(n_2, d), ∠(n_1, n_2)), (1)

where ∠(a, b) ∈ [0, π] denotes the angle between two vectors. In the point pair, the first point m_1 is called the reference point and the second point m_2 the referred point. During the offline stage, a hash table that stores all point pair features computed from the target model is built. The features are quantized and used as the keys of the hash table, and point pairs with the same feature are stored in the same slot.
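Under the feature definition of Equation (1), the computation and quantization can be sketched as below. The quantization steps (5 mm, 12°) are illustrative assumptions, not the paper's values.

```python
import numpy as np

def angle(a, b):
    """Angle between two vectors, in [0, pi]."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def point_pair_feature(m1, n1, m2, n2):
    """F(m1, m2) = (||d||, angle(n1, d), angle(n2, d), angle(n1, n2))."""
    d = m2 - m1
    return (np.linalg.norm(d), angle(n1, d), angle(n2, d), angle(n1, n2))

def quantize(f, dist_step=5.0, angle_step=np.deg2rad(12)):
    """Quantize the feature so it can serve as a hash-table key."""
    return (int(f[0] // dist_step),) + tuple(
        int(a // angle_step) for a in f[1:])
```

For example, two points 10 mm apart with parallel normals perpendicular to d yield the feature (10, π/2, π/2, 0), which quantizes to the key (2, 7, 7, 0).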
Given a depth image (scene cloud), pose hypotheses are computed by calculating the transformation between a scene point pair and a set of model point pairs. To make this search efficient, a voting scheme over 2D local coordinates is utilized. For a scene point pair (s_r, s_i), suppose a corresponding point pair (m_r, m_i) is found in the hash table H. Next, s_r and m_r are aligned in an intermediate coordinate system, as shown in Figure 5. By rotating the model pair around the normal by an angle α, the referred points s_i and m_i can be aligned. The 2D vector (m_r, α) is defined as a local coordinate. The transformation is defined by Equation (2):

s_i = T_s→g^{-1} R_x(α) T_m→g m_i, (2)

and is explained in Figure 5.

Figure 5. Transformation of corresponding points in model and scene. The transformation T_m→g translates the model point m_r to the origin and rotates its normal n_m_r onto the x-axis. T_s→g does the same for the scene point pair. In many cases, s_i and m_i will be misaligned, and the rotation R_x(α) around the x-axis by angle α is required to match them.
In our task, only the scene points reserved by the preprocessing are processed as reference points. For a reserved scene point s_r, point pairs with other scene points are computed and matched with model pairs using the above-mentioned process. Since the depth image is available, it is unnecessary to compute features with all other scene points. Instead, referred scene points far from the reference point on the depth image are rejected to save time. A 2D accumulator is created to count the number of times every local coordinate is computed (votes). The top local coordinates (the top five poses in our experiment) are selected based on their votes.
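The accumulator voting for one reference point can be sketched as follows. We assume the hash table has been flattened to map a quantized feature to (model reference index, α_m) entries; the α = α_m − α_s convention follows the original point pair feature formulation [10].

```python
import numpy as np

def vote(scene_pairs, hash_table, n_model_points, n_angle_bins=30):
    """2D accumulator voting for one scene reference point.

    scene_pairs: list of (feature_key, alpha_s) for pairs (s_r, s_i).
    hash_table:  maps a quantized feature to a list of
                 (model_point_index, alpha_m) entries (assumed layout).
    Returns the local coordinate (m_r index, alpha bin) with most votes.
    """
    acc = np.zeros((n_model_points, n_angle_bins), dtype=int)
    for key, alpha_s in scene_pairs:
        for m_idx, alpha_m in hash_table.get(key, ()):
            # Rotation aligning the referred points: alpha = alpha_m - alpha_s.
            alpha = (alpha_m - alpha_s) % (2 * np.pi)
            bin_ = int(alpha / (2 * np.pi) * n_angle_bins) % n_angle_bins
            acc[m_idx, bin_] += 1
    best = np.unravel_index(np.argmax(acc), acc.shape)
    return best, acc[best]
```

Each peak in the accumulator recovers one pose hypothesis via Equation (2); in the full algorithm the top few peaks per reference point are kept.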
It should be noted that the removed points of scene preprocessing are still used as referred points since a few foreground points may also be removed.
Finally, the pose hypotheses are clustered such that all poses in one cluster differ in translation and rotation by no more than a predefined threshold. Different from [10], which used the vote summation of clusters to select the result pose, the average pose of every cluster is computed and stored along with the pose hypotheses for the verification, because pose clustering improves the accuracy of the poses.

Partial Model Point Pair Feature
The hash table stores the features between every two points in the model to allow for the detection of any pose. However, if the camera views the target in such a viewpoint that only a small part of the target is visible, the point pair feature algorithm may fail to select the correct pose, as presented in Figure 6.
There are two reasons for this failure. One is that the features of the visible part are not distinguishable enough from other parts of the object. Another is that the normals of points near the boundaries in the scene could be quite different from those in the models. As a result, the correct poses can not get high votes in this case, which results in the detection failure.
Therefore, an additional hash table built from a model template is introduced to handle such viewpoints. The template is selected by the following steps: (1) Create a synthetic scene of the object and generate partial clouds from thousands of viewpoints on the upper hemisphere, as presented in Figure 7.
(2) For every generated cloud SS, find the points belonging to the object and perform the point pair feature algorithm using these points as reference points with H_all. Every reference point generates one pose. The score of SS is the number of points whose poses are correct. Find the nearest model template MT_j based on the viewpoint of SS and the pose of the object. (3) The score of a template MT_j is defined as the average score of the generated clouds whose nearest template is MT_j. (4) After all clouds are processed, find the template with the lowest score. If that score is less than 50% of the average template score, this template is selected for the additional table.
In the template selection, the score of a template reflects the difficulty of recognizing the object in similar poses with H_all. If the lowest template score is much lower than the average score, the object under similar poses is difficult to recognize with H_all. Therefore, an additional hash table built with that template is necessary to handle these situations. Generally, no more than one template is selected, to balance the trade-off between accuracy and computation time.

Pose Verification
Different from [10], which uses the summation of cluster votes to select result poses, we verify every pose proposed in the last step and select the best one. In order to improve efficiency while obtaining satisfying accuracy, the poses are first filtered by the depth verification in Section 2.4.1. The top poses (the top 5% in our experiment) are selected and three scores are evaluated for them, as introduced in Sections 2.4.2-2.4.4 and presented in Figure 8. Finally, the pose with the highest score is selected as the result pose.

Depth Verification
Given a pose P_i, the model points are transformed onto a depth map according to P_i. The score of P_i in depth verification is the number of transformed model points whose depth difference from the corresponding pixel on the depth map is less than a threshold (in our experiment, 0.02·diam(M)). Depth verification is a fast, rough verification method whose function is to remove bad poses efficiently. The top poses are selected for the next verification stage.
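A minimal sketch of this depth check, assuming a pinhole camera with intrinsic matrix K (the paper does not spell out the projection details):

```python
import numpy as np

def depth_verification(model_pts, pose, depth_map, K, thresh):
    """Score a pose by counting model points consistent with the depth map.

    model_pts: Nx3 model points; pose: 4x4 transform to the camera frame;
    K: 3x3 camera intrinsics; thresh: depth tolerance, e.g. 0.02*diam(M).
    """
    # Transform model points into the camera frame.
    pts = (pose[:3, :3] @ model_pts.T + pose[:3, 3:4]).T
    z = pts[:, 2]
    uv = (K @ pts.T).T
    u = np.round(uv[:, 0] / uv[:, 2]).astype(int)
    v = np.round(uv[:, 1] / uv[:, 2]).astype(int)
    h, w = depth_map.shape
    score = 0
    for ui, vi, zi in zip(u, v, z):
        # Count the point if it projects inside the image and its depth
        # agrees with the measured depth within the tolerance.
        if 0 <= vi < h and 0 <= ui < w and abs(depth_map[vi, ui] - zi) < thresh:
            score += 1
    return score
```
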

Inverse Verification
The inverse verification method is an improvement on the voxel-based verification method of [26] for wide-space search. The idea is that, if the pose is correct, the transformed model points will find corresponding scene points near them. The method of [26] divides the scene space into small voxels, each storing the scene points within it, and builds a hash table to access the voxels by 3D coordinate efficiently. To verify a pose P_i, it transforms all model points into scene space and checks, via the voxel hash table, whether there are scene points near the transformed model points.
However, this is difficult to implement when the scene space is very wide. If the length, width and height of the scene space are 1000 mm and the voxel length is 1 mm, 1000³ voxels are necessary to cover the scene space. The storage and time for this are unacceptable. Therefore, instead of transforming the model into scene space, we do it inversely: (1) During the offline stage, divide the model space into small voxels, each storing the model points within it. (2) Build a hash table to efficiently access the voxels by 3D coordinate.
(3) To verify a pose P_i, transform the model center c_m into scene space according to P_i: c_mt = P_i·c_m. Select scene points from the depth image whose distance from c_mt is less than 0.5·diam(M). (4) Transform the selected scene points into model space by P_i^{-1}. For every transformed scene point st_j, if the voxel containing st_j also contains a model point, st_j has a corresponding model point. The inverse score of P_i, denoted as S_inverse(P_i), is the number of transformed scene points with corresponding model points.
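Steps (1)-(4) can be sketched as below. For brevity the pre-selection of scene points within 0.5·diam(M) of the transformed model center is omitted, and a set of occupied voxel coordinates stands in for the voxel hash table.

```python
import numpy as np

def build_voxel_table(model_pts, voxel_len):
    """Offline: record each model point's voxel as a 3D integer coordinate."""
    table = set()
    for p in model_pts:
        table.add(tuple(np.floor(p / voxel_len).astype(int)))
    return table

def inverse_score(scene_pts, pose, voxel_table, voxel_len):
    """S_inverse(P_i): number of scene points that, transformed into model
    space by the inverse pose, land in a voxel occupied by a model point."""
    inv = np.linalg.inv(pose)
    score = 0
    for s in scene_pts:
        st = inv[:3, :3] @ s + inv[:3, 3]
        if tuple(np.floor(st / voxel_len).astype(int)) in voxel_table:
            score += 1
    return score
```

Because the voxel map covers only the model's bounding volume, its size is independent of how wide the scene is, which is the point of the inversion.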
The advantage of inverse verification is twofold: (1) It saves time and storage to build a voxel map for a model instead of a scene. (2) By using the model voxel map, it is quick to search for corresponding model points for the transformed scene points. One may then ask: since depth verification can do similar work, why is inverse verification used? It is true that depth verification is faster and can also calculate the distance between model and scene points. However, its accuracy is worse than that of inverse verification. Suppose the target is a planar object, as shown in Figure 9, where the z-axis is the camera axis, the red line is the scene points of the object we want to estimate and the blue line is the estimated pose. The transformation error between the estimated pose and the ground truth is approximately d_2, which is the error measured by inverse verification. However, if the pose is evaluated by depth verification, the measured error will be d_1, which is much larger than d_2. Therefore, depth verification is not accurate when the depth gradient is large.

Boundary Verification
Different from the inverse verification that evaluates poses in 3D model space, the boundary verification is performed in a 2D image because verification in 3D costs too much time and storage.
A scene boundary map B_scene is computed from the depth gradient, as introduced in Section 2.1. The model boundary map for pose P_i, denoted as B_model(P_i), is obtained by transforming the model points to scene space according to P_i, projecting the points onto the plane perpendicular to the camera axis and extracting the contour of the projected image.
Given B_scene and B_model(P_i), if two pixels in the same position (row and column) of the two maps are both boundary pixels, these two pixels are called corresponding boundary pixels and the model boundary pixel is called a fitted pixel. If many boundary pixels of B_model(P_i) are fitted pixels, the model boundary matches well with the scene boundary in 2D and the boundary score of P_i should be high.
In Section 2.1, scene boundary pixels are clustered into curves by their continuity, and this clustering information is utilized in boundary verification. In general, boundary pixels from the same curve belong to the same object. If only a small part of the pixels of a curve correspond to pixels of B_model(P_i), these corresponding boundary pixels are considered invalid for P_i, as presented in Figure 10. Therefore, the boundary verification is performed by the following steps: (1) Spread the boundary pixels in B_scene among neighboring pixels to allow for small pose errors. (2) Given a pose P_i, for every boundary pixel in B_model(P_i), if it is a fitted pixel, record the curve that the corresponding scene boundary pixel belongs to. (3) For a curve, if a certain percentage R_curve of its pixels correspond to B_model(P_i), this curve is considered valid for P_i. (4) Search corresponding boundary pixels for B_model(P_i) again, this time only among scene boundary pixels of curves valid for P_i. The boundary score of P_i is the number of fitted pixels divided by the number of boundary pixels in B_model(P_i):

S_boundary(P_i) = N_fitted / |B_model(P_i)|.
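Steps (2)-(4) can be sketched as follows. The pixel spreading of step (1) is omitted, and R_curve = 0.3 is an illustrative value; the actual per-object values are given in Appendix A.

```python
from collections import Counter

def boundary_score(model_pixels, curve_label, r_curve=0.3):
    """Boundary score of a pose (simplified: no pixel spreading).

    model_pixels: set of (row, col) boundary pixels of B_model(P_i).
    curve_label:  dict mapping each scene boundary pixel to its curve id.
    r_curve: fraction of a curve's pixels that must be fitted for the
             curve to count as valid (assumed value).
    """
    # Pass 1: count fitted pixels per scene curve.
    curve_sizes = Counter(curve_label.values())
    fitted_per_curve = Counter(
        curve_label[p] for p in model_pixels if p in curve_label)
    valid = {c for c, n in fitted_per_curve.items()
             if n / curve_sizes[c] >= r_curve}
    # Pass 2: only pixels on valid curves count as fitted.
    fitted = sum(1 for p in model_pixels
                 if p in curve_label and curve_label[p] in valid)
    return fitted / len(model_pixels)
```

A curve that overlaps the model boundary in only a few pixels is discarded wholesale, so accidental coincidences with unrelated objects do not inflate the score.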

Visible Points Verification
The inverse verification counts the number of scene points with corresponding model points: the more points are matched, the higher the score. However, the number of visible points of the object may be small in some poses, for example, the cuboid in Figure 6. In this case, the correct pose will get a low score and cause a recognition failure. Therefore, the visible score S_visible(P_i) is computed to compensate, by the following steps: (1) Compute the visible model points based on P_i and the camera viewpoint. (2) Transform the visible points onto the depth image according to P_i. Similar to the depth verification, count the number of fitted pixels whose depth difference from the estimated depth is less than a threshold. (3) S_visible(P_i) is defined as the fitted pixel number divided by the visible point number.

Select Result Pose
The score of a pose is the product of the three scores:

S(P_i) = S_inverse(P_i) · S_boundary(P_i) · S_visible(P_i).

If only one instance is to be detected in the scene, ICP refinement [9] is performed on the top poses (the top 10 in our experiment) and the refined pose with the highest verification score is selected as the result pose. When selecting multiple poses, we first select the rough result poses and then perform ICP on them.

Experiment
We compare our algorithm with state-of-the-art algorithms on the ACCV dataset of [21] and the Tejani dataset of [19]. We implemented our algorithm in C++ and ran it on an Intel Core i7-7820HQ CPU at 2.90 GHz with 32 GB RAM. No multicore or GPU acceleration was used.
In our experiments, the model and scene clouds were subsampled so that the model point number was around 500. Some parameters are presented in Appendix A. For a 3D model M with ground truth rotation R and translation T, and estimated rotation R̃ and translation T̃, we use the equation of [21] to compute the matching score of a pose:

m = avg_{x ∈ M} ‖(Rx + T) − (R̃x + T̃)‖. (5)

The pose is considered correct if m ≤ k_m·diam(M). Following state-of-the-art algorithms, we set k_m = 0.1 in our experiments. For ambiguous objects, Equation (6) is used:

m = avg_{x1 ∈ M} min_{x2 ∈ M} ‖(Rx1 + T) − (R̃x2 + T̃)‖. (6)
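The matching score and its ambiguous-object variant can be sketched directly from the definitions; the function names are ours.

```python
import numpy as np

def matching_score(model_pts, R_gt, t_gt, R_est, t_est):
    """Average distance between model points under the ground-truth and
    estimated poses; the pose is accepted if the score <= k_m * diam(M)."""
    gt = model_pts @ R_gt.T + t_gt
    est = model_pts @ R_est.T + t_est
    return np.mean(np.linalg.norm(gt - est, axis=1))

def matching_score_ambiguous(model_pts, R_gt, t_gt, R_est, t_est):
    """For ambiguous objects: for every ground-truth point, use the
    distance to the nearest estimated point, then average."""
    gt = model_pts @ R_gt.T + t_gt
    est = model_pts @ R_est.T + t_est
    d = np.linalg.norm(gt[:, None, :] - est[None, :, :], axis=2)
    return np.mean(d.min(axis=1))
```

The second form is insensitive to symmetries: any estimated pose that maps the model onto the same surface as the ground truth scores near zero.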

ACCV Dataset
This dataset consists of 15 objects with over 1100 images per object. We skipped two objects, the bowl and the cup, since state-of-the-art algorithms also excluded them.
Model templates were selected for nine models: Benchvise, Can, Cat, Driller, Glue, Hole puncher, Iron, Lamp and Phone; some of them are presented in Figure 11. The performance of the algorithms is shown in Table 1 and some detection results are shown in Figure 12. As in [10,17], we only used depth information in this experiment, yet we achieved the highest accuracy for seven objects as well as the highest average accuracy.

Tejani Dataset
The Tejani dataset [19] contains six objects with over 2000 images, and every image contains two or three instances with ground truth poses. Model templates were selected for four models: Camera, Juice, Milk and Shampoo. Following [19,28], we report the F1-scores of the algorithms in Table 2. Some detection results are shown in Figure 13. The results of the compared algorithms come from [28]. LINEMOD [20] and LC-HF [19] used RGB-D information, Kehl et al. [28] used RGB only, and our algorithm used only depth. Our algorithm achieved a better average accuracy than the compared algorithms.

Computation Time
The components of our computation time are presented in Figure 14. The average computation time for one scene of the ACCV dataset, including the time for the additional hash table, was 1018 ms, roughly six times faster than the point pair feature algorithm of [10], whose computation time was reported to be 6.3 s [21]; the speedup is mainly due to the scene preprocessing. Using an additional hash table for the model template increased the computation time by 106 ms. The most important parameter in our algorithm is the subsampling size, and we explored how it affects performance on the ACCV dataset. For every object, the model and scene clouds were subsampled such that the model point number N_model was around 300, 500, 700 or 900. The recognition rate and computation time are presented in Table 3. From N_model = 300 to N_model = 500, the recognition rate increases by 6.7% and the computation time increases by 530 ms. From N_model = 500 to N_model = 900, the recognition rate only increases by 0.8%, but the computation time increases by 2558 ms. Therefore, N_model = 500 was selected in our experiments.

Contribution of Each Step
In order to explore the contributions of the scene preprocessing, the additional hash table and the pose verification, we conducted experiments on the ACCV dataset. In the first experiment, only the scene preprocessing was disabled, for all 13 objects. In the second, only the pose verification was disabled and the poses were selected by the simple depth verification. In the third, only the additional hash table was disabled, for the eight objects. The results of the first and second experiments are presented in Table 4 and the results of the third are presented in Table 5. The tables show that the scene preprocessing decreases the computation time by 79.4% with only a 0.3% decrease in recognition rate, the pose verification improves the recognition rate by 29.8%, and the additional hash table improves the rate of the eight objects from 96.1% to 97.5%.

Conclusions
This paper proposed an object recognition and pose estimation algorithm using only depth information. The boundary-based scene preprocessing method improves efficiency in cluttered scenes. The model template selection method and the pose verification method improved the accuracy from 78.9% to 97.8% on the ACCV dataset. Our algorithm outperformed state-of-the-art algorithms, including those using both color and depth information, on a large number of scenes.
Acknowledgments: This work is partially supported by JSPS Grant-in-Aid 16H06536.
Author Contributions: Mingyu Li has designed the algorithm, carried out the experiment and written the paper. Koichi Hashimoto has revised the manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
For different objects, the valid intersection number threshold N valid for scene preprocessing and the curve corresponding percentage threshold R curve for boundary verification were different in the experiment. They are provided in this section (Table A1 for the ACCV dataset and Table A2 for the Tejani dataset), in order to allow exact re-implementation.