3-D Point Cloud Registration Using Convolutional Neural Networks

: This paper develops a registration architecture for the purpose of estimating relative pose including the rotation and the translation of an object in terms of a model in 3-D space based on 3-D point clouds captured by a 3-D camera. Particularly, this paper addresses the time-consuming problem of 3-D point cloud registration which is essential for the closed-loop industrial automated assembly systems that demand ﬁxed time for accurate pose estimation. Firstly, two different descriptors are developed in order to extract coarse and detailed features of these point cloud data sets for the purpose of creating training data sets according to diversiﬁed orientations. Secondly, in order to guarantee fast pose estimation in ﬁxed time, a seemingly novel registration architecture by employing two consecutive convolutional neural network (CNN) models is proposed. After training, the proposed CNN architecture can estimate the rotation between the model point cloud and a data point cloud, followed by the translation estimation based on computing average values. By covering a smaller range of uncertainty of the orientation compared with a full range of uncertainty covered by the ﬁrst CNN model, the second CNN model can precisely estimate the orientation of the 3-D point cloud. Finally, the performance of the algorithm proposed in this paper has been validated by experiments in comparison with baseline methods. Based on these results, the proposed algorithm signiﬁcantly reduces the estimation time while maintaining high precision.


Introduction
Automated assembly systems capable of repeatedly performing assembly tasks with high-precision and high-speed play an indispensable role in modern manufacturing lines. Intuitively, to perform these automated assembly tasks, cameras are employed for the purpose of providing vision feedback to estimate the pose (position and orientation) of objects to the system. Based on the feedback, the system controls manipulators to perform object picking and assembly tasks in order to interact with the environment. In recent years, much of the work related to the vision-based automated assembly was published. S. Huang et al. [1,2] proposed the dynamic compensation to realize a fast peg-in-hole alignment in the case of a large position and attitude uncertainty based on vision information provided by high-speed cameras. In References [3,4], 3-D scenes used to estimate the pose of objects were reconstructed from multiple images acquired by several cameras. Nonetheless, these algorithms encountered some problems when extracting features as well as reconstructing 3-D feature points from images in the case of general objects that have no distinctly geometric and color features. Thus, in order to address these problems, in Reference [2], color features were added to objects which are not real in general automated assembly systems. Besides, in Reference [3], B. Wang et al. reconstructed 3-D

Related Work
Basically, point cloud registration consists of two stages including rough registration and precise registration stage for algorithms based on Iterative Closest Point (ICP) approach and its variants. In particular, the rough registration is performed to give an initial estimation about the transformation between two point clouds, followed by the precise registration to refine this estimation to best align these two point clouds. On the one hand, to address the rough registration problem, a well-known algorithm, RANdom Sample Consensus (RANSAC) approach, was proposed in Reference [12] to roughly estimate the rotation and translation between two point clouds. Because of slow computational speed, several variants of the RANSAC algorithm were developed to address this problem such as IMPSAC [13], Preemptive RANSAC [14], Distributed RANSAC [15] and Recursive-RANSAC [16]. Besides, Han et al. [17] proposed an improved RANSAC algorithm based on 3-D Region Covariance (RC) descriptor, RC-RANSAC method, to enhance the successful percentage of point cloud registration in a single repetition when describing region covariance of each point. Additionally, other global registration algorithms in the case of arbitrary initial poses between two point clouds were also published in recent years. In Reference [18], Aiger et al. developed a robust and fast alignment principle, 4-Points Congruent Sets (4PCS) algorithm, employing wide bases robust to outliers and noise for 3-D point clouds based on extracting all sets of coplanar four-points from a 3-D point cloud. Moreover, similarity and affine transform problems were taken into account in their research. However, a drawback existed in their method because 4PCS approach took approximately a quadratic time complexity according to the quantity of data points. Hence, they presented an improvement of 4PCS method, called Super 4PCS approach [19], that degraded the quadratic complexity to linear complexity Appl. Sci. 2019, 9,3273 3 of 20 and confined the volume of candidate conjugate pairs created to tackle this negative point. Super 4PCS approach appeared to be an effective solution to handle 3-D point cloud registration when existing a low overlap and outliers. The other existing approach, the Iterative Closest Point method integrated with the Normal Distribution Transform approach (NDT-ICP), was proposed in Reference [20] with the aim of accelerating the registration speed. In addition, Liu et al. [21] developed a point cloud registration algorithm based on improved digital image correlation approach, followed by the ICP method with a high accuracy, anti-noise capability and efficiency. Moreover, Chang et al. proposed the candidate-based axially switching (CBAS) computed closer point (CCP) algorithm [22] to effectively and robustly address rough registration problem in the case of an arbitrarily initial pose between two point clouds.
On the other hand, as regards the precise registration fulfilled after the rough registration, the most common registration algorithm is the ICP approach proposed by Besl et al. in Reference [23] that minimizes the distance between two point clouds. In addition, many variants of the ICP method were presented in recent years with the aim of improving the original method in some aspects such as high accuracy, high speed, robust registration, scale registration, affine registration and globally-optimal solution. Particularly, dual interpolating point-to-surface algorithm adopting surface fitting by using B-spline to establish correspondences between two point clouds was proposed in Reference [24] to address the precise registration for two point clouds with a partial overlap. Chen et al. [25] developed the Hong-Tan based ICP method (HT-ICP) for the purpose of improving the speed and precision of registration for partially overlapping point clouds. Additionally, a registration algorithm by using the hand and soft assignments to improve the robustness and accuracy for partially overlapping point cloud registration was proposed in Reference [26]. To extend the robustness to noise and/or outliers, the Trimmed ICP approach was presented in Reference [27] that applied the Least Trimmed Squares algorithm in all stages of the operation. In Reference [28], Phillips et al. presented the Fraction ICP method that can robustly identify and discard outliers and guarantee the convergence to a locally optimal solution. The other existing approach, Efficient Sparse ICP method, using sparsity-inducing norms to considerably improve robust registration caused by noise and outliers whilst employing a Simulated Annealing search in order to efficiently resolve the underlying optimization problem was proposed in Reference [29]. Yang et al. [30] developed the ICP-based algorithm integrating with a branch-and-bound (BnB) principle to figure out the globally optimal solution for Euclidean registration between two 3-D surfaces or 3-D point clouds starting from any initialization. Scale-ICP approach integrating a scale factor with the original ICP method was proposed in Reference [31] so as to accurately perform the registration between two 3-D point clouds in the case of large-scale stretches and noises. Wang et al. [32] developed a multi-directional affine registration (MDAR) algorithm by using the statistical characteristics and shape features of point clouds for affine registration problem. In addition to the aforementioned distance-minimization-based registration algorithms, another registration approach using distinct features extracted from 3-D point cloud data sets was also presented in References [33][34][35][36], to robustly and efficiently enhance the registration problem. Particularly, many feature descriptors including local-based descriptor, global-based descriptor and hybrid-based descriptor were presented in recent years and a comprehensive review of the existing algorithms can be sought in Reference [37].
Nevertheless, the aforementioned existing algorithms all are based on randomly selecting different points in two 3-D point clouds to form a pair of corresponding bases in rough registration phase whilst iteratively minimizing the distance between two 3-D point clouds until the convergence within the desired tolerance of ICP-based methods in the precise registration phase. These are the reasons why those algorithms cannot be employed in real-time closed-loop object picking and assembly systems due to their slow computation. In addition to the aforementioned algorithms, there are some approaches related to point cloud registration using deep learning which were developed in recent years. In Reference [38], a point cloud registration method using a deep neural network auto-encoder for outdoor localization between a close-proximity scanned point cloud and a large-scale point cloud Appl. Sci. 2019, 9, 3273 4 of 20 was proposed. Specially, Aoki et al. [39] developed an efficient and robust point cloud registration architecture, PointNetLK architecture, based on an iterative image registration approach [40] and PointNet architecture [9]. However, the PointNetLK architecture involved iterative computation with the number of iterations in accordance with a minimum threshold of the optimal twist parameters. In this paper, a seemingly novel efficient algorithm using deep learning is proposed to accelerate the estimation time while maintaining the required accuracy.
The key contributions of this research can be summarized as follows: 1. We propose a seemingly novel registration architecture including 3-D convolutional neural networks (3-D CNNs) for the rough registration stage and 2-D CNNs for the precise registration stage that transforms the 3-D point cloud registration problem to a regression problem with estimated outputs as angular disparities between the model point cloud and a data point cloud. 2. We propose two descriptors in order to extract distinct features such as Simplified Spatial Point Distribution and 8-Corner-Based Closest Points from a given 3-D point cloud as data sets for the purpose of training the proposed 3-D convolutional neural network (3-D CNN) model and the proposed 2-D CNN model, respectively. After training, the trained CNN models can quickly estimate the rotation between two 3-D point clouds, followed by the translation estimation based on computing average values of these two point cloud data sets in fixed time with high precision.
The remainder of this paper is organized as follows. Section 3 shows the problem formulation and the strategy of 3-D point cloud registration using CNNs. In Section 4, two categories of descriptors employed to extract coarse and detailed features in terms of the orientation of a 3-D point cloud are proposed with the aim of generating training data sets. Section 5 develops the proposed registration architecture which consists of two stages including the rough and precise registration based on 3-D CNN and 2-D CNN architecture, respectively. Finally, Section 6 provides experimental results whilst conclusions of this paper are drawn in Section 7.

Problem Formulation
Given the model point cloud P and a data point cloud Q, point cloud registration is to estimate the transformation including the rotation matrix R and the translation vector t between these two point clouds. P = {p 1 , p 2 , . . . , p n }, where p i and q j stand for the i th point and j th point in the model and data point cloud, respectively. To fulfill this task, ICP-based approaches minimize an objective function, the mean squared error, based on corresponding closest point pairs (p i , q j ) in two point clouds as follows.
where m is the number of points in the data point cloud. Because the ICP algorithm can only successfully match if the uncertainty between two point clouds is small enough, an initial registration stage is necessary to address a large uncertainty problem based on the well-known RANSAC approach. Thus, inspired by this two-stage registration process, we propose a deep CNN framework with two consecutive CNN models including the first model for the rough registration and the second model for the precise registration. In the proposed architecture, only rotation angles are estimated instead of estimating both rotation and translation because the translation can be computed by subtracting coordinate values of two corresponding center points of two point clouds. In addition, because of arbitrary translations between two point clouds in 3-D space, it is not reasonable to use the translation vector as ground truth for the purpose of training. Moreover, the limitation of rotations is confined in the range from −180 • to +180 • that make rotation angles potential ground truths for the convergence of the learning process. Figure 1 shows a block diagram of the proposed architecture for 3-D point cloud registration.  Figure 1. A block diagram of the proposed registration architecture using convolutional neural networks (CNNs).
Hence, point clouds of an object with different rotation angles in the xyz coordinate system and these corresponding rotation angles are collected with the aim of creating a data set. As regards the CNN architecture, the first CNN model in the rough registration stage is to cover an arbitrary initial rotation in a full range of uncertainty from −180 • to 180 • whilst the second CNN model is used in order to refine the rotation estimation in covering a smaller range with a higher angular resolution in the precise registration stage. In particular, in this precise registration stage, a trade-off about selecting the range of the uncertainty of the data set should be satisfied in a such way that this uncertainty is small enough to make the CNN model learn deep characteristics and large enough to cover errors existing after the rough registration stage. In this research, the range of this uncertainty is selected from −15 • to 15 • based on experiments. Ideally, only one CNN model can be considered to handle the registration problem if the number of samples for training is large enough to cover the full range of uncertainty with a high resolution of rotation angles. Nonetheless, it is impossible to realize this idea as a huge data set is required in this case that leads to a difficulty in the convergence of solutions and the time-consuming problem during the training process due to using too much training data. Thus, the precision of the rotation estimation is reasonably guaranteed by using this two-stage CNN model. To demonstrate this estimation process, a simple representation of the model and data point cloud in a plane is visualized in Figure 2. Initially, the arbitrarily angular disparity α is witnessed between model and data which is covered in the full range of uncertainty from −180 • to 180 • . After the first stage, data can be transformed to data 1 based on a rough estimation α 1 that exists an error α − α 1 compared with the ground truth α. Then, the second CNN model seemingly precise estimates the disparity α 2 between model and data 1 to finalize the whole angular disparity α 1 + α 2 between model and data.  In addition, because points in a general 3-D point cloud are not in order and the number of points is not fixed according to different viewpoints for the same object, a descriptor extracting distinct features from a point cloud is employed to tackle these problems. In this case, a powerful descriptor must be able to capture the geometric structure of a data point cloud that can precisely describe the orientation of this point cloud in the coordinate frame and be invariant to translation. Thereby, instead of directly inputting 3-D point cloud data sets, features extracted from 3-D point clouds based on the descriptors are employed as the input of these CNN models. By using features, every sample in the data set has the same format such as the same order and the same finite size, which is good for the Appl. Sci. 2019, 9, 3273 6 of 20 learning process. In accordance with using these descriptors, a comprehensive architecture of the proposed algorithm is illustrated in Figure 3.  Figure 3. A comprehensive block diagram of the proposed registration architecture.

The Proposed Descriptors for 3-D Point Clouds
In this section, two types of descriptors used to extract features in the rough and precise registration stage, respectively, with the aim of generating the training data sets are given. Intuitively, the first descriptor should extract coarse features roughly describing the orientation of an object to ensure the fast convergence in the learning process whilst detailed features precisely expressing the orientation of the object should be captured by the second descriptor. Before showing procedures to compute these two descriptors, two corresponding data sets in the case of a large uncertainty of rotation angles from −180 • to 180 • and a small uncertainty from −15 • to 15 • , need to be created in advance.

Point Cloud Data Set Preparation
In order to generate the data set, several 3-D point clouds with different orientations of an object need to be collected. Nevertheless, it is not suitable to collect a large number of 3-D point clouds of an object when using a 3-D scanner to capture these point cloud data sets. Therefore, the data set can be obtained by randomly rotating and downsampling the model point cloud of the object. Particularly, the box grid filter algorithm provided by Matlab is used to downsample the rotated model point cloud to low-resolution point clouds with the different number of points and distribution characteristics. Based on this approach, the created point clouds can simulate disparities among point clouds that would exist in real 3-D point clouds captured by a 3-D camera. The principal difference between two kinds of data sets of an object is that each rotation angle with respect to an axis is randomly initialized with 80 different values in the range from −180 • to 180 • for the first data set whereas the second data set is generated based on 80 random values of each rotation angle in the range from −15 • to 15 • . Totally, there are 3 rotation angles with respect to x, y, and z axis that results in generating 80 3 = 512,000 point clouds with different orientations. Each data set is divided into 3 subsets including training set (392,000 samples), validation set (60,000 samples) and test set (60,000 samples). Algorithm 1 shows procedures step by step in order to generate these data sets.

Algorithm 1 Data set generation
Input: Model point cloud P; the number of random values of each rotation angle n ra ; range of uncertainty r u ; 1: Randomly initialize rotation angles (ψ, θ, φ) with n ra values for each rotation angle; 2: for i = 1 to n ra do for j = 1 to n ra do for k = 1 to n ra do -Rotate the model point cloud, P = rot (ψ,θ,φ) (P); -Downsample the point cloud, Q l ⇐ downsample(P ); -Record the data point cloud Q l and (ψ, θ, φ) as a sample; end end end Output: Data point clouds Q l .

Simplified Spatial Point Distribution Descriptor
A 3-D data point cloud Q employed as an input of the descriptors can be given as follows where q j denotes the j th data point. In the first stage, the rough registration stage, features extracted from the point set Q should generally describe the orientation of Q such that a successful and fast convergence in the learning process can be guaranteed. Because the rough registration covers a full range of uncertainty from −180 • to 180 • that leads to significant disparities in characteristics of these features, it is hard to obtain optimal weights with a small enough loss for a CNN model after training. Therefore, a Simplified Spatial Point Distribution (SSPD) descriptor is proposed in this paper to describe coarse characteristics of the orientation of a 3-D data point cloud in 3-D space. The key point is to divide a spatial region involving the point set into several sub-regions and count the number of points in each sub-region, followed by a normalization with the total quantity of points of the point cloud. To begin with, a cubic bounding box with the length of an edge l cube is created that consists of the point set.
where norm 2 denotes the Euclidean distance between two points q i and q j . Secondly, this cube is divided into several sub-cubes as shown in Figure 4 with the size of s x s x s where the length of an edge of a sub-cube can be computed as follows.  Finally, the quantity of points in each sub-cube is counted and then divided by the total number of points of the point cloud. Because each sub-cube has one value, SSPD is a 3-D matrix with the size of s × s × s where the parameter s = 15 is selected based on experiments. Algorithm 2 illustrates steps in details to compute SSPD based on a given 3-D point cloud.

Algorithm 2 Offline computations of SSPD
Input: Data point cloud Q; length of an edge of a sub-cube l sub−cube , the number of slices s; 1: for i = 1 to m do -Shift all points to the first quadrant; .., m ; -Bin points into sub-cubes by converting coordinate values to according positions; Output: SSPD of a data point cloud.

8-Corner-Based Closest Points Descriptor
In comparison with the SSPD descriptor for the rough registration, features employed for training a CNN model to precisely estimate the spatial orientation of a given 3-D point cloud should be able to thoroughly describe the characteristics of this point set in details. Thereby, 8-Corner-Based Closest Points (8CBCP) descriptor is proposed in this paper for the purpose of extracting oriented characteristics of a 3-D point cloud in 3-D space. As shown in the name of this descriptor, the main point is to find d closest points according to each corner of the Axis-Aligned Bounding Box (AABB) [41] of a data point cloud. Based on considering each corner of AABB as a viewpoint when observing the point cloud, the 8CBCP descriptor possesses the robustness in describing oriented features of this point cloud and a simple format because only key points can be collected. In addition, by numbering corners and arranging closest points with a suitably predetermined order, the problem without a specific order of a general point cloud can be effectively addressed. Moreover, to tackle the problem about the disparity in the point distribution among point clouds, the coordinate values of these closest points are normalized with the total number of points of the corresponding point cloud. Hence, the result of 8CBCP is a 2-D matrix with the size of 8d × 3 because 8d corresponding closest points can be obtained based on 8 corners. In this research, d is selected based on a trade-off between the total number of points of a point cloud at approximately 1000 points after downsampling and the necessary number of points according to each corner for the purpose of describing orientation features of a point set. Thus, by testing in experiments, d is chosen at 40 points. Although this paper focuses on whole point cloud registration, 8CBCP can be a potential descriptor when resolving partial point cloud registration problem due to multiple different observations applied for one object. Algorithm 3 shows computation procedures in details to generate 8CBCP features of a given 3-D point cloud.

Algorithm 3 Offline computations of 8CBCP
Input: Data point cloud Q; 1: Compute AABB and use these parameters to make a cube that bounds of the data point cloud; 2: Shift all points to the first quadrant; where {k} α i , {k} β i , {k} γ i stand for three coordinates according to three axes of point q i with respect to the local frame {k}; 6: Compute 8CBCP, a 2-D matrix with the size of 8d x 3; Output: 8CBCP of a data point cloud.

The Proposed CNN Architectures
Based on two kinds of features including SSPD and 8CBCP shown in Section 4, the corresponding CNN architectures are developed in this section with the aim of estimating the orientation of a 3-D point cloud on the generated point cloud data set. Related to the first stage, the rough registration, because the input of CNN model is SSPD, a 3-D matrix with the size of s × s × s, a 3-D CNN model is selected. When it comes to the second stage, the precise registration, a 2-D CNN model is employed in accordance with the size of the input 8CBCP in the case of a 2-D matrix. Figure 8 shows the proposed 3-D architecture for the rough registration whilst the proposed 2-D architecture for the precise registration is demonstrated in Figure 9.

3-D CNN Architecture for Rough Registration
As shown in Figure 8, to start with, SSPD with the size of s × s × s extracted from an input point cloud is convoluted with each of 16 kernels with the size of 2 × 2 × 2 that results in 16 feature maps in the layer C1. Then, to continuously increase the number of feature maps, we apply 3-D convolutions with the same kernel size 2 × 2 × 2 as the first convolutional layer for the layer C2 and C3. Consequently, the layer C2 has 32 feature maps while there are 64 feature maps in the layer C3. Additionally, in these layers C1, C2 and C3, zero padding is used in convolutional operations such that the output of feature maps has the same size as the original input. Moreover, after the convolutional operation, the result is added with a bias and then passed through a hyperbolic tangent activation function tanh(.) to generate the output of each feature map as follows.
where χ k stands for the output of the k th feature map, b k is a bias of the k th feature map, w ki denotes the weight of kernel connected from the i th input map, x i means the i th input map and * denotes the convolutional operation. In the subsequent layer, the 3-D max-pooling layer S1 with a window size of 2 × 2 × 2 is applied to each of the feature maps in the layer C3 that results in the same quantity of feature maps but a reduced spatial dimension, which is two times smaller. Next, the output of the max-pooling layer S1 is flattened to become a feature vector such that each element in this vector is fully connected to each of 2048 neurons in the layer H1. Similarly, the output of the layer H1 is fully connected to the layer H2 with 1024 neurons whilst the fully connected layer H3 has 512 neurons connected from the output of the layer H2. Finally, neurons of the last fully connected layer H3 are connected to 3 neurons representative for the estimation of Euler angles in the output layer. The activation function used in each neuron in these fully connected layers is tanh(.) function.

2-D CNN Architecture for Precise Registration
With the explanations in the previous section, because the 8CBCP descriptor collects d closest points according to each corner of AABB to generate one sample, there are totally 8 closest point groups that play an equal role in describing the orientation of a 3-D point cloud. Inspired by this idea, the proposed architecture of the 2-D CNN model employs 8 sub-models in parallel in order to independently estimate Euler angles, followed by an average computation of these estimated Euler angles. As shown in Figure 9, as regards the first sub-model, the first part with the size of d × 3 of 8CBCP extracted from an input point cloud is convoluted with each of 8 kernels with the size of 5 × 1 that results in 8 feature maps in the layer C1. Then, the quantity of feature maps is continuously grown in the sequential layers by applying 2-D convolutional operations with a kernel size of 5 × 1 similar to the previous layer C1. As a result, the layer C2 and C3 have 16 feature maps and 32 feature maps, respectively. Similar to 3-D CNN architecture used in the rough registration, zero padding is applied in convolutional operations so as to make no modification about the size of the output of feature maps compared with the original input. Additionally, before being connected to the next layers, the output of the convolutional operation is added by a bias and then delivered to the activation function tanh(.) to create the output of feature maps. Next, the output of feature maps in the layer C3 is passed through the 2-D max-pooling layer S1 with a window size of 2 × 1 that leads to a reduced spatial dimension, which is two times smaller but the same number of feature maps. Then, the output of the max-pooling layer S1 is flattened as a feature vector and elements in this vector is fully connected to each of 1024 neurons in the layer H1. In the same way, the output of the layer H1 is fully connected to 512 neurons in the layer H2 and the outputs of the layer H2 are fully connected to the layer H3 with 256 neurons. Eventually, neurons in the layer H3 are connected to 3 neurons in the output layer Output1 which figures out a more precise estimation of Euler angles ψ 1 , θ 1 , φ 1 in comparison with the estimation in the rough registration stage. Besides, the activation function applied for neurons in fully connected layers is tanh(.) function. Similar to the first sub-model, 7 sub-models left from the second sub-model to the eighth sub-model possess the same 2-D CNN architecture with the output from ψ 2 , θ 2 , φ 2 to ψ 8 , θ 8 , φ 8 , respectively. After all estimations of 8 sub-models, an average computation of these Euler angles is required to obtain the mean estimation of the orientation of a data point cloud as the output of the whole 2-D CNN model as follows.

Experiments
In this section, experimental results for 3-D point cloud registration are provided to validate the proposed registration algorithm based on CNN architectures. In particular, when it comes to point cloud data, open data including "Bunny", "Dragon" & "Buddha", "Armadillo" and "Horse" from the Stanford 3-D Scanning Repository [42][43][44] and NTU CSIE [45], respectively, are employed. Besides, two algorithms such as the "RANSAC then ICP" (RANSAC+ICP) algorithm using Fast Point Feature Histograms (FPFH) [46] to describe features for point clouds and the "Fast Global Registration" (FGR) algorithm [47] provided in Open3D [48,49], a modern library for 3-D data processing, are applied to evaluate the performance of the proposed algorithm. Moreover, in order to give the quantitative evaluation, in addition to the computation time, the mean squared error (MSE) value used to compare the matching rate among these methods is given as follows [22].
where m is the number of points in the data point cloud and p i and q j stand for the corresponding closest points in the model point cloud and the data point cloud, respectively. All programs performing these approaches are programmed in Python on an Intel Core i9-9900K CPU @3.6GHz PC equipped with a GeForce RTX 2080 Ti SEA HAWK X Graphics Processing Unit (GPU). In the training process, for each of point cloud data, it takes about 6 hours and 3 hours to train a 3-D CNN model for the rough registration and a 2-D CNN model for the precise registration, respectively. Nevertheless, transfer learning can be employed to reduce the training time of CNN models for a new 3-D point cloud data. After training, 20 unforeseen samples which are generated by randomly rotating and translating, followed by downsampling the model point cloud, according to each of these five data sets including Bunny, Horse, Dragon, Buddha and Armadillo are employed to validate the performance of the trained CNN models as well as two other approaches such as RANSAC+ICP and FGR approach. Table 1 shows a comparison on the MSE and computation time for test samples from 1 to 10 of Bunny data whilst the similar comparison for test samples from 11 to 20 is given Table 2 where n and m stand for the number of points in the model and data point cloud, respectively. When it comes to the MSE, the smaller MSE is, the better registration result is. Generally, it is noticeable that from all the three algorithms, FGR algorithm gives the worst result in terms of MSE value whereas the results of the proposed approach are a little bit worse than RANSAC+ICP method. Particularly, although FGR method shows the smaller computation time compared with RANSAC+ICP method, MSE values according to this method are quite large with three imprecise estimations observed in samples 3, 14 and 15. Besides, the maximum MSE value at approximately 3.418 mm in sample 3 and the minimum MSE at around 2.851 mm in sample 8 experienced in the proposed algorithm shows the lowest disparity, nearly 0.567 mm, in all estimations of 20 test samples while these differences are witnessed at around 1.063 mm and 11.442 mm in accordance with RANSAC+ICP and FGR approach, respectively. This shows that the developed registration architecture possesses the best stable result in estimating the transformation between two 3-D point clouds in the case of Bunny data. In addition to this advantage, obviously, the smallest computation time is also experienced in the proposed algorithm with the largest value at around 0.027 s whilst it takes approximately 0.140 s and 0.028 s (an imprecise estimation) for the fastest estimation in the case of RANSAC+ICP and FGR approach, respectively. Similar to Bunny data, comparable results on two benchmarks including the MSE and computation time for these three algorithms are shown in Tables 3-10 for Horse, Dragon, Buddha and Armadillo data, respectively.         In order to draw thorough comparisons and reinforce an efficient improvement on reducing the computation time but maintaining the high precision when estimating the pose of a 3-D data point cloud of the proposed algorithm, average values in terms of the MSE and computation time of all tested samples for 5 objects are figured out and visualized into two bar charts as shown in Figures 10 and 11. In fact, it is obvious that FGR method witnesses the strongest fluctuation as well as the highest error of MSE values that lead to the worst performance in terms of the precision of this method. Particularly, the average MSE value of 5 tested objects of this approach is approximately two times as high as the average MSE value of the proposed and RANSAC+ICP approach as illustrated in the last column group in Figure 10. In addition, the lowest discrepancy among the average MSE values of 5 tested objects at roughly 1.130 mm, followed by a larger disparity at approximately 1.840 mm according to RANSAC+ICP approach, draws that our CNNs-based algorithm achieves the robust performance for a variety of objects. Moreover, the proposed approach matches the comparable precision with a slightly smaller average MSE value at around 0.108 mm in comparison with RANSAC+ICP algorithm as given in the column group "Average" in Figure 10. As regards the computation time, the presented algorithm significantly outperforms with approximately 15 times and 2 times as fast as RANSAC+ICP and FGR method, respectively, as shown in the last column group in Figure 11.    To conclude these comparisons, registration results of these three algorithms with specific samples are visualized in Figures 12-16 for Bunny, Horse, Dragon, Buddha and Armadillo data, respectively. It is noticeable that data point clouds transformed by the proposed and RANSAC+ICP algorithm are closer to the model point clouds than FGR approach. Additionally, as shown in Figures 12b-16b, the registration result in the precise stage (blue point cloud) is better than the result in the rough stage (green point cloud) because the 2-D CNN model covers a smaller uncertainty from −15 • to 15 • than the 3-D CNN model with the full range of uncertainty from −180 • to 180 • . Therefore, based on 3-D point clouds of an object captured by a 3-D camera, the pose of the object can be precisely estimated in fixed time that is sufficient for the application of real-time closed-loop object picking and assembly systems because of two redeeming features such as the high accuracy and fast estimation.

Conclusions
This paper resolves the time-consuming problem of 3-D point cloud registration. The estimation time is significantly shortened by using the proposed two-stage CNN architecture, 3-D CNN model and 2-D CNN model. To generate the training data sets, firstly, the model point cloud of a specific object is randomly rotated in the full range from −180 • to 180 • and the small range from −15 • to 15 • and then downsampled several times. Then, these two point cloud data sets, according to the full range and the small range of rotation angles, are passed through two descriptors, SSPD and 8CBCP, respectively. Based on these training data sets, CNN models are independently trained for each specific object to update optimal weights with the aim of estimating the relative orientation between the 3-D model point cloud and a 3-D data point cloud. Specifically, to guarantee comparable accuracy with existing algorithms such as RANSAC then ICP approach, the 2-D CNN model is trained on the labeled data set created by the 8CBCP descriptor. Hence, in the training process, this 2-D CNN model can thoroughly learn distinct characteristics of several point clouds with different orientations in the case of the small uncertainty. After training, the proposed two-stage CNN architecture can quickly estimate the transformation between the model point cloud and a data point cloud in which the translation is estimated based on the average values of these two point cloud data sets. The experimental results show that the proposed algorithm can precisely and quickly estimate the relative transformation between the model point cloud and a data point cloud in fixed time. This leads to potential significance and applicability of the proposed algorithm for the purpose of estimating the pose of an object based on 3-D point clouds in automated assembly systems. Additionally, according to the results of this research, 3-D partial point cloud registration employing CNNs appears to be an important direction of future research.