1. Introduction
Object recognition and localization are essential functionalities for autonomous robots working in unstructured applications [1]; in general, fast and accurate retrieval of 3D posture information is important for tasks such as material handling, manipulation, assembly, and welding [2].
Although both 2D images [3] and 3D point clouds can be used to achieve this goal, image-based methods are sensitive to ambient lighting conditions and object texture, and they have to estimate 3D information indirectly. Hence, point cloud-based methods are becoming promising, especially with the emergence of many affordable depth sensors, such as the Microsoft Kinect and Intel RealSense cameras.
According to the features used, there are two kinds of pipelines for recognizing 3D objects from point clouds: the global pipeline and the local pipeline. Global feature descriptors, such as the Ensemble of Shape Functions (ESF) [4], Global Fast Point Feature Histogram (GFPFH) [5], Viewpoint Feature Histogram (VFH) [6], and Clustered Viewpoint Feature Histogram (CVFH) [7], can be used to evaluate an object observed from different view angles. By generating candidate model clouds from viewpoints all around the object/CAD model and extracting their features, the segmented scene cloud can be evaluated and recognized, given some thresholds. The coarse postures can be further refined using the Iterative Closest Point (ICP) method [8]. Global feature-based algorithms are efficient in computation time and memory consumption. However, these methods require segmenting the object point cloud from the scene and are difficult to apply in cluttered environments with occlusion and complex foreground and background [9].
In contrast, local feature descriptors only model the surface characteristics around a single point. Because the points are generally located all around the object and cannot all be blocked at once, local descriptors are more robust to occlusion and clutter. The local pipeline mainly consists of keypoint detection, feature matching, hypothesis generation, and evaluation.
Johnson et al. [10] proposed a pipeline using the “spin image (SI)” to identify objects. SI is a data-level shape descriptor that transforms the point cloud into a stack of 2D spin images and then uses them to represent the 3D model. By matching the model and scene spin images, and then filtering and grouping these correspondences, the transformation can be established.
Guo et al. [11] introduced a local descriptor, Rotational Projection Statistics (RoPS), to recognize objects in a scene. It constructs local reference frames, projects the neighboring points onto the reference planes, and uses statistical values to describe the local feature. It is robust to noise and varying mesh resolution. By selecting some feature points and calculating their feature descriptors, the candidate models are evaluated one by one to recognize possible objects.
Rusu proposed the Point Feature Histograms (PFH) [12] and Fast Point Feature Histograms (FPFH) [13] descriptors, as well as the Sample Consensus Initial Alignment (SAC-IA) method, for point cloud registration and object recognition. The use of histograms makes the descriptors invariant to viewpoint differences and robust to noise; hence, they are widely used in registration and matching tasks.
To improve the performance, Buch proposed using both 2D image data and 3D contextual shape data to increase the quality of the correspondences [14]. The results show that the efficiency is significantly improved.
Although local feature-based algorithms have an advantage in dealing with occlusion, they require additional time to establish correspondences between the model and the scene. This process is typically done through Random Sample Consensus (RANSAC). Commonly used initial registration methods for point clouds include SAC-IA and Super4PCS, both of which are based on RANSAC [15]. RANSAC obtains the optimal solution through repeated random sampling, and it can provide reliable parameter estimates when the sample data contain abnormal points (outliers). Because it is highly robust, it is widely used in machine vision, particularly in image registration and object recognition. However, if outliers make up more than half of the data, the number of required iterations increases dramatically. This shortcoming makes the algorithm difficult to apply in actual engineering.
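The dependence of the iteration count on the outlier proportion can be illustrated with the standard RANSAC estimate N = log(1 − p) / log(1 − w^s), where w is the inlier ratio, s the sample size, and p the desired confidence. The following Python sketch is only an illustration; the function name and the default confidence of 0.99 are assumptions, not values taken from this paper.

```python
import math


def ransac_iterations(inlier_ratio: float, sample_size: int,
                      confidence: float = 0.99) -> int:
    """Samples needed so that, with probability `confidence`,
    at least one sample consists of inliers only."""
    if not 0.0 < inlier_ratio <= 1.0:
        raise ValueError("inlier_ratio must be in (0, 1]")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    # Probability that one random sample is free of outliers.
    p_good = inlier_ratio ** sample_size
    if p_good >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good))


if __name__ == "__main__":
    # Three correspondences per sample, as in SAC-IA.
    for w in (0.9, 0.7, 0.5, 0.3, 0.1):
        print(f"inlier ratio {w:.1f}: {ransac_iterations(w, 3)} iterations")
```

With three points per sample, the required count rises steeply once the inlier ratio falls below one half, which is the behavior described above.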
To overcome these shortcomings, researchers have improved the classic RANSAC algorithm in terms of model solving, sample pre-testing, sample selection, and post-stage optimization. The improvements to model solving mainly concern the loss function and include M-estimation [16], LMedS [17], MLESAC [18,19], and MAPSAC [20]. M-estimation assigns weights to the data and converts model solving into a weighted least-squares problem. LMedS obtains the optimal model by minimizing the median of the squared residuals of the data points. However, these two methods are useful only when the proportion of outliers is less than half. Torr et al. used the distributions of inliers and outliers to evaluate hypotheses and proposed MLESAC, which expresses the loss function through maximum likelihood estimation to find the optimal model. Torr et al. later improved MLESAC and proposed MAPSAC, in which the loss function is described by Bayesian maximum a posteriori probability. These methods improve the robustness of the classic algorithm.
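To make the difference between two of these scoring rules concrete, the sketch below compares the classic RANSAC inlier count with the LMedS cost on a list of residuals. It is a minimal illustration with hypothetical function names, not code from the cited works.

```python
from statistics import median
from typing import Sequence


def ransac_inlier_count(residuals: Sequence[float], threshold: float) -> int:
    """Classic RANSAC score: number of residuals within the threshold
    (higher is better)."""
    return sum(1 for r in residuals if abs(r) <= threshold)


def lmeds_cost(residuals: Sequence[float]) -> float:
    """LMedS cost: median of the squared residuals (lower is better)."""
    return median([r * r for r in residuals])


if __name__ == "__main__":
    mostly_inliers = [0.1, -0.2, 0.05, 3.0, -4.0]   # 2 of 5 are outliers
    mostly_outliers = [0.1, 3.0, -4.0, 5.0, 0.05]   # 3 of 5 are outliers
    print(ransac_inlier_count(mostly_inliers, 0.5), lmeds_cost(mostly_inliers))
    print(ransac_inlier_count(mostly_outliers, 0.5), lmeds_cost(mostly_outliers))
```

In the second data set the median falls on an outlier residual, so the LMedS cost no longer reflects the quality of the fit. This illustrates why the method is limited to outlier proportions below one half.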
For sample pre-testing, the improvement mainly consists of checking each sample and eliminating bad ones early, which decreases the computing time. RANSAC-Tdd [21] adds a pre-test step that checks the accuracy of the model parameters built from the sample. Matas used conditional probability to describe the quality of each matching point and a threshold to select the model, and proposed the RANSAC-SPRT algorithm [22,23]. The ARRSAC algorithm uses inliers to generate additional hypotheses after model parameter evaluation. It uses a competitive evaluation framework to reduce the running time of local optimization, which improves the efficiency of the algorithm [24].
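The idea of a pre-test can be sketched as follows: before the expensive full evaluation, a hypothesis is checked on a small number d of randomly chosen data points and is discarded as soon as one of them is inconsistent. The code below is a simplified illustration in the spirit of the Tdd test; the function name, its signature, and the consistency callback are assumptions made for this example.

```python
import random
from typing import Callable


def passes_pretest(is_consistent: Callable[[int], bool], n_points: int,
                   d: int, rng: random.Random) -> bool:
    """Check a hypothesis on d randomly chosen data points.

    Returns True only if all d points are consistent with the hypothesis;
    only then would the full evaluation over all points be carried out.
    """
    if not 0 < d <= n_points:
        raise ValueError("d must be in [1, n_points]")
    # all() stops at the first inconsistent point, so bad hypotheses
    # are rejected after very few checks.
    return all(is_consistent(i) for i in rng.sample(range(n_points), d))


if __name__ == "__main__":
    rng = random.Random(0)
    residuals = [0.01 * i for i in range(100)]  # toy residuals of one hypothesis
    print(passes_pretest(lambda i: residuals[i] < 0.5, len(residuals), 3, rng))
```

A small d keeps the pre-test cheap, at the price of occasionally rejecting a good hypothesis that happens to be tested on outliers.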
According to the selection strategy, sample selection methods can be divided into sorting methods (PROSAC [25]) and positional relationship methods (NAPSAC [26] and GROUPSAC [27]).
In post-stage optimization methods, the current optimal solution is used as a starting point to further optimize the model, which reduces the number of iterations and improves the efficiency and stability of the algorithm. Well-known methods include LO-RANSAC [28] and OPTIMAL-RANSAC [29].
With the development of machine learning, some researchers have combined reinforcement learning and deep learning with RANSAC. DSAC [30] and NG-SAC [31] have been proposed to improve efficiency and accuracy.
In summary, to improve the efficiency and accuracy of coarse point cloud registration, and thereby improve three-dimensional object recognition based on local features, it is necessary to start by improving the RANSAC algorithm.
The SAC-IA algorithm is a commonly used architecture for local feature-based methods. However, its efficiency and accuracy are limited in complex, cluttered environments. Therefore, the goal of this paper is to improve its performance. The paper analyzes the most time-consuming parts of the SAC-IA algorithm, namely sample generation and evaluation, and proposes two improvements to increase efficiency. The contributions are summarized as follows.
In the initial alignment stage, instead of sampling keypoints, the correspondence pairs between model and scene keypoints are generated in advance and sampled in each iteration, which eliminates redundant correspondence search operations.
A geometric filter is proposed to keep invalid samples out of the evaluation process, which is the most time-consuming operation because it requires transforming the model point cloud and calculating the distance between two point clouds. The geometric filter significantly increases the sample quality and reduces the required number of samples.
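A geometric filter of this kind can be sketched as follows. The sketch assumes that a sample consists of three model points and their three matched scene points, and that a rigid transformation must preserve the side lengths of the triangle they form. The function names, the minimum side length, and the relative tolerance are hypothetical and are not the parameters used in this paper.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def side_lengths(a: Point, b: Point, c: Point) -> Tuple[float, float, float]:
    """Lengths of the three sides of the triangle (a, b, c)."""
    return (math.dist(a, b), math.dist(b, c), math.dist(c, a))


def passes_geometric_filter(model_pts: Sequence[Point],
                            scene_pts: Sequence[Point],
                            min_dist: float = 0.05,
                            rel_tol: float = 0.1) -> bool:
    """Cheap test run before the costly transform-and-evaluate step.

    Rejects a sample if the model triangle has a side shorter than
    `min_dist`, or if any side of the scene triangle differs from the
    matching model side by more than `rel_tol` (relative).
    """
    m = side_lengths(*model_pts)
    s = side_lengths(*scene_pts)
    if min(m) < min_dist:
        return False
    return all(abs(dm - ds) <= rel_tol * dm for dm, ds in zip(m, s))


if __name__ == "__main__":
    model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    shifted = [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0), (5.0, 6.0, 5.0)]
    stretched = [(5.0, 5.0, 5.0), (7.0, 5.0, 5.0), (5.0, 6.0, 5.0)]
    print(passes_geometric_filter(model, shifted))    # rigid motion: accepted
    print(passes_geometric_filter(model, stretched))  # distorted: rejected
```

Because the check needs only six distances, it costs far less than transforming the model point cloud and measuring its distance to the scene.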
2. The Original SAC-IA Approach
A diagram of the SAC-IA is shown in Figure 1. The modifications lie in the sampling and fine refinement processes. Instead of sampling the keypoints, the correspondences are sampled. In addition, a geometric constraint is proposed to filter out “impossible” samples in advance to increase the efficiency. To overcome the problem caused by a small object in a large scene and to avoid getting stuck in a local minimum, an adaptive ROI extraction operation is proposed to restrict the ICP to a local area of the scene point cloud.
2.1. Keypoints Detection
Typically, there are thousands of points in a point cloud. Recognizing an object with all of them is complex and unnecessary. A commonly used approach is to focus on the more distinctive points instead of ordinary ones. The simplest keypoint detection method is uniform subsampling. However, this operation does not consider the distinctive information among the points; hence, the obtained keypoints are not informative and lack repeatability.
Several more sophisticated keypoint detection methods are available. The NARF 3D keypoint detector [32] extracts keypoints from range images. It can only be used on a range image captured from a single perspective and cannot be used on point clouds generated from multiple view angles. The Harris 3D keypoint detector and SIFT keypoints [33] follow ideas similar to their 2D versions. Harris 3D uses surface normals in place of the image gradient, while SIFT 3D retains scale and rotation invariance. ISS3D keypoints [34] is an effective keypoint detection method that mainly uses eigenvalues to select keypoints. A local coordinate frame is defined for each point using its neighbors within a given radius. A 3 × 3 scatter matrix is generated, and its eigenvalues and eigenvectors are obtained. Keypoints can then be picked out by evaluating those eigenvalues against some thresholds.
In this paper, ISS3D keypoints are chosen because of their efficiency and robustness (a detailed comparison can be found in Reference [35]).
2.2. Local Shape Descriptors
To identify an object and estimate its pose, it is necessary to find correspondences between the model point cloud and the scene point cloud. Unlike global feature descriptors, local shape descriptors encode the features of a point instead of the whole point cloud. Most of the descriptors use histograms to achieve viewpoint invariance. They accumulate geometric or topological measurements into histograms according to a specific domain (e.g., point coordinates, geometric attributes) to describe the local surface.
The “spin image (SI)” [11], introduced by Johnson and Hebert, is a typical spatial distribution histogram-based descriptor. It represents a 3D feature with 2D data: it counts the 3D points in a surrounding support volume and produces a 2D histogram. Therefore, this method loses valuable 3D shape information.
Signature of Histogram of Orientations (SHOT) [36] encodes histograms of the surface normals in different spatial locations. Therefore, it has the advantages of rotational invariance and robustness to noise. In addition, Prakhya improved on the SHOT descriptor by converting its real-valued vector into a binary vector, and named the result Binary Signatures of Histogram of Orientations (B-SHOT) [37].
Rusu et al. [13] proposed the Point Feature Histogram (PFH) descriptor. It calculates the normal vector angles and the distance between every pair of points within the search radius of a keypoint to generate the histogram. For a cloud of $n$ points with $k$ neighbors each, this descriptor has a computational complexity of $O(nk^2)$; hence, it is not suitable for real-time applications. The Fast Point Feature Histogram (FPFH) was proposed by Rusu [14] to resolve this problem; FPFH only encodes the relations between the query point and its $k$ neighboring points. Therefore, its computational complexity is $O(nk)$. The FPFH descriptor is a robust and efficient feature for characterizing the local geometry around a point. In this paper, it is used to measure the similarity between different keypoints.
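The gap between the two descriptors can be illustrated by counting the point pairs examined for a single query point with k neighbors. The sketch below assumes that a PFH-style descriptor looks at every pair among the query point and its neighbors, while the simplified histogram step of FPFH only pairs the query point with each neighbor. The function names are illustrative.

```python
def pfh_pair_count(k: int) -> int:
    """Pairs examined per query point by a PFH-style descriptor:
    all pairs among the query point and its k neighbors."""
    return (k + 1) * k // 2


def fpfh_pair_count(k: int) -> int:
    """Pairs examined per query point by the simplified histogram step
    of FPFH: the query point paired with each of its k neighbors."""
    return k


if __name__ == "__main__":
    for k in (10, 50, 100):
        print(k, pfh_pair_count(k), fpfh_pair_count(k))
```

The first count grows quadratically in k and the second linearly, which matches the per-point factors k² and k in the usual complexity statements for PFH and FPFH.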
2.3. Initial Alignment
For a solid object, at least three corresponding points are required to estimate the transformation matrix between the model and scene point clouds. It is not easy to find the correct match due to similarities between different keypoints and uncertainties in the point clouds. Although there are fewer keypoints than points in the original cloud, the search space is still large. Assume there are M model keypoints and N scene keypoints, and that K candidates are considered for each model keypoint; the total search space is $(M \cdot K)^3$, which is 125,000 even with 10 keypoints and 5 candidates.
Compared to a brute-force grid search, random search is more efficient. One such coarse registration method for 3D point clouds is Sample Consensus Initial Alignment (SAC-IA) [14]. It consists of the following four steps:
Keypoints Sampling. Select three sample points from the model point cloud, and ensure that their pairwise distances are greater than a user-defined minimum distance.
Correspondence Searching. For each sample point, find a list of points (K candidates) in the scene point cloud with similar local descriptors. This is usually done by searching a KD-tree. Then, randomly choose one point from the candidates to form the correspondence pair.
Transformation Matrix Estimation. With three correspondence pairs, the transformation matrix $T_j$ can be estimated, where j is the sampling iteration index.
Performance Evaluating. Transform the model point cloud with $T_j$ and compare it against the scene point cloud. Compute an overall error measure between the two point clouds using a Huber penalty, which relies on a predefined threshold to exclude the outliers and on the minimum distance between each model point and the scene point cloud.
Finally, by repeating the above four steps, the true correspondence is expected to occur eventually. Robustness increases with the number of trials. However, too many trials decrease the efficiency; hence, there is a trade-off between stability and efficiency. Nevertheless, SAC-IA can still obtain a good result after many iterations and is widely used in the initial alignment process.
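The Huber penalty used in the evaluation step can be sketched as follows. The piecewise form below is the standard Huber loss and is assumed here, since the exact expression is not reproduced in this text; the names `huber_penalty`, `alignment_error`, and the threshold symbol `t_e` are illustrative.

```python
from typing import Iterable


def huber_penalty(e: float, t_e: float) -> float:
    """Standard Huber loss of a point-to-cloud distance e with threshold t_e:
    quadratic for small distances, linear beyond the threshold."""
    e = abs(e)
    if e <= t_e:
        return 0.5 * e * e
    return 0.5 * t_e * (2.0 * e - t_e)


def alignment_error(distances: Iterable[float], t_e: float) -> float:
    """Overall error of one hypothesis: sum of the Huber penalties of the
    minimum distances from each transformed model point to the scene."""
    return sum(huber_penalty(d, t_e) for d in distances)


if __name__ == "__main__":
    # Toy distances (in meters) of transformed model points to the scene.
    distances = [0.002, 0.004, 0.5, 0.003]
    print(alignment_error(distances, t_e=0.01))
```

Because the penalty grows only linearly beyond the threshold, a few model points that have no counterpart in the scene (for example, occluded ones) do not dominate the score.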
2.4. Fine Alignment
SAC-IA randomly chooses three corresponding points in the model and scene point clouds and obtains the best guess for the object location. However, due to the uncertainty in keypoint extraction and in the sensors, the obtained transformation matrix may deviate from the real one. Refinement therefore has to be performed on the aligned model and scene with all points involved.
Iterative Closest Point (ICP) [9] is a commonly used registration method. It is based on point-to-point registration and uses the Levenberg-Marquardt (LM) algorithm to calculate the transformation matrix. ICP has good precision when aligning two similar point clouds with a small initial posture error. If the initial posture differs too much, it may get stuck in a local minimum. Therefore, ICP typically follows a coarse alignment method, such as SAC-IA.
2.5. The Overall Workflow of Standard SAC-IA Algorithm
The above process can be summarized in the following Algorithm 1. By providing model point cloud and scene point cloud, the object pose is returned in the form of a transformation matrix.
Algorithm 1: Standard SAC-IA Pose Estimation Algorithm.
Input: The model point cloud; the scene point cloud
Output: The transformation matrix of the best match
1: Downsample the model and scene point clouds
2: Detect keypoints
3: Calculate FPFH descriptors
4: for each iteration, up to the maximum iteration number, do
5: Randomly choose three points from the model keypoints
6: Filter samples by the distance threshold
7: For each chosen point, find similar points among the scene keypoints
8: Randomly choose one similar point for each chosen point
9: Estimate the transformation with the three correspondence pairs
10: Transform the model point cloud, count inliers with the threshold
11: end for
12: Find the transformation matrix with the most inliers
13: Use ICP to find the optimal transformation
14: return the optimal transformation matrix
Note that the model processing, keypoint detection, and FPFH calculation can be done offline. Therefore, the scene point cloud processing and the sampling loop (steps 4–11 of Algorithm 1) take most of the computation time.
2.6. The Problems
The standard SAC-IA algorithm depends on RANSAC to estimate the model. Some limitations of conventional RANSAC will affect the efficiency.
Random sampling:
The sampling is performed at the keypoint level. For a typical configuration, 50,000 samples of three keypoints each are drawn, and 5 candidates are explored for each keypoint to find its correspondence. This means that, in total, the KD-tree search has to return 3 × 5 × 50,000 = 750,000 candidates.
Transform matrix estimation and evaluation:
In standard SAC-IA, each model hypothesis is represented as a transformation matrix. One has to estimate the matrix parameters, transform the whole model point cloud, and calculate its distance to the scene point cloud. Therefore, this estimation and evaluation process is time-consuming. Moreover, this operation is repeated many times for the sake of robustness, which greatly restricts the efficiency.
4. Experiments
4.1. Experimental Setup
A Kinect v2 camera is used to capture the point clouds of the object. The algorithm is implemented in C++ with the Point Cloud Library (PCL) on Windows 10. The computer is equipped with an Intel Core i7-8650U CPU (2.11 GHz, 4 cores) and 16 GB of memory.
A stackable storage bin is used as the test object, whose CAD model is shown in Figure 4. By placing the box randomly on the table and grasping it in different poses (with occlusion), point clouds are captured, as shown in Figure 4. In total, there are 17 scene point clouds in the dataset.
4.2. Efficiency Verification Experiments
The proposed Efficient SAC-IA (E-SAC-IA) and original SAC-IA were both implemented for comparison. The used parameters are listed in
Table 1.
The CAD file is converted into a point cloud by subdividing the triangular meshes; therefore, the point density is not even. To resolve this issue, the mesh is first sampled at a fine resolution and then downsampled with a leaf size of 0.01 m to match the scene point clouds captured with a real sensor.
In the implementation of Efficient SAC-IA, generic subsampling is used instead of a specific keypoint detector. By setting the leaf size to 0.04 m, a sparse point cloud is generated from the original dense one. The remaining points are treated as keypoints. The model point cloud has 160 keypoints, while the scene point clouds typically have 370 keypoints.
4.2.1. Performance against Max Iteration Number
The max iteration number is the key parameter controlling the robustness and efficiency of RANSAC-based algorithms. For the proposed E-SAC-IA, the change in performance against the max iteration number is evaluated. The 17 scene point clouds are fed into the algorithm with different max iteration numbers (50, 100, 200, 300, 400, 500, and 600). The results are given in Figure 5.
Figure 5a shows the success rate and recognition time when the max valid sample number changes from 50 to 600. The success rate increases with the max valid sample number. When 500 or more samples are used, the success rate is above 90%. In addition, the recognition time is proportional to the sample number. The average recognition time is around 200–300 ms.
Figure 5b shows the ratio between total samples and valid samples. Valid samples are those that fulfill the minimum distance and triangular similarity constraints. The ratio is found to be about 100:1 and remains nearly constant as the max valid sample number changes.
Due to the randomness and the variations between scene point clouds, the actual recognition time may change. Figure 5c shows the mean and standard deviation of the recognition time. The standard deviation is 30 ms.
In comparison, the original SAC-IA is also tested. The max sample number is chosen as 50,000, to guarantee the success rate. The success rate is 0.941, while the recognition time is shown in
Figure 5d. It can be found that the recognition time is around 10 s, which is nearly 30 times that of the proposed E-SAC-IA (Algorithm 2).
4.2.2. Performance against Sampling Order
Besides the iteration number in the RANSAC procedure, the sampling order and type also affect the efficiency. As previously mentioned, in the conventional SAC-IA algorithm, the keypoints are randomly selected in each iteration, and the correspondence is then established online by finding their nearest neighbors through a KD-tree search. This nearest neighbor search is queried as many times as the iteration number, which is significantly larger than the number of keypoints. Therefore, by moving the correspondence building process outside the loop and randomly selecting point pairs, this issue can be eliminated.
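Moving the correspondence search outside the loop can be sketched as follows. The sketch is an illustration only: descriptors are plain lists of floats, the neighbor search is brute force rather than a KD-tree, and all function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[int, int]  # (model keypoint index, scene keypoint index)


def nearest_descriptors(query: Sequence[float],
                        descriptors: Sequence[Sequence[float]],
                        k: int) -> List[int]:
    """Indices of the k descriptors closest to `query` (brute force here;
    a KD-tree would be used in practice)."""
    def sq_dist(i: int) -> float:
        return sum((a - b) ** 2 for a, b in zip(query, descriptors[i]))
    return sorted(range(len(descriptors)), key=sq_dist)[:k]


def build_correspondences(model_desc: Sequence[Sequence[float]],
                          scene_desc: Sequence[Sequence[float]],
                          k: int) -> List[Pair]:
    """Run the neighbor search once per model keypoint, before the loop."""
    return [(m, s) for m, d in enumerate(model_desc)
            for s in nearest_descriptors(d, scene_desc, k)]


def sample_pairs(correspondences: List[Pair], n: int,
                 rng: random.Random) -> List[Pair]:
    """Draw n pre-built pairs that use n distinct model keypoints."""
    if len({m for m, _ in correspondences}) < n:
        raise ValueError("not enough distinct model keypoints")
    while True:
        sample = rng.sample(correspondences, n)
        if len({m for m, _ in sample}) == n:
            return sample


if __name__ == "__main__":
    model_desc = [[0.0], [1.0], [2.0], [3.0]]
    scene_desc = [[0.1], [0.9], [2.2], [2.9], [10.0]]
    pairs = build_correspondences(model_desc, scene_desc, k=2)
    print(sample_pairs(pairs, 3, random.Random(0)))
```

With this arrangement the number of neighbor queries equals the number of model keypoints (for example, 160 in the setup of Section 4.2) instead of three queries per iteration over tens of thousands of iterations.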
In Figure 5d, the recognition times of three configurations are given: the original SAC-IA with a sample number of 50,000; the E-SAC-IA with a sample number of 500 (equivalent to about 50,000 total samples); and the original SAC-IA with the above improved sample selection process but without sample filters.
From the results, one can see that reorganizing the sampling order also improves the overall efficiency. Compared to the original SAC-IA, it saves about 2.5 s over 50,000 iterations, i.e., about 50 µs for each nearest neighbor search operation (2.5 s / 50,000).
4.3. Test on Bologna 1 Dataset
To verify the performance on objects with different shapes, the algorithm is tested on the Bologna 1 dataset [37]. This dataset, created from some of the models belonging to the Stanford 3D Scanning Repository, is composed of 6 models (armadillo, buddha, bunny, Chinese dragon, dragon, and statuette, as shown in Figure 6) and 45 scenes. Each scene contains a subset of the 6 models that are randomly rotated and translated. The 45 scenes belong to three types: 3 objects, 4 objects, and 5 objects. Each type has 15 scenes.
In this experiment, the ISS algorithm is used to detect the keypoints. The configuration is listed in Table 2.
Each scene, regardless of whether it contains a specific model, is processed to detect all of the models. The detection time is averaged and listed in Table 3. Because the ISS keypoint detector is used, there are fewer keypoints than in the previous experiments. Therefore, the overall runtime is lower. As the number of points in the scene increases, the detection time increases slightly. The recognition time is below 200 ms for the proposed E-SAC-IA method, while, for the original SAC-IA method, it is around 2000 ms.
The acceleration ratio and success rate are listed in Table 4. The acceleration ratio is within the range of 11–18 times.
The success rate is identical for the original and improved SAC-IA algorithms. In most cases, the rate is above 90%. As shown in the previous experiments, the success rate is related to several factors, such as the max iteration number, the quality of the keypoints, and the fuzziness of the scene point cloud.
Therefore, it can be concluded that the proposed E-SAC-IA method indeed increases the object recognition efficiency without sacrificing the success rate. For a typical configuration, the recognition time is about 200–300 ms, i.e., roughly 3–5 fps. Therefore, it can be applied in some real-time applications. Because the method still follows the local pipeline, the keypoint detection, feature calculation, and ICP restriction all cost some time (about 150 ms), and the room for further decreasing the recognition time is limited if no GPU acceleration or multi-core parallelization techniques are used. The acceleration ratio is related to the number of keypoints, the randomness, and the number of points. According to the experimental results, the typical ratio is 10× to 30×.