An Improved Pose Estimation Algorithm for Space Targets Based on Multi-Modal Feature Fusion

Traditional methods for estimating the pose of space targets match the transformation relationship between the image and the object model based on handcrafted features. With the rapid development of deep learning technology, approaches based on deep neural networks (DNNs) have significantly improved the performance of pose estimation. However, current methods still suffer from complex calculation, low accuracy, and poor real-time performance. Therefore, a new pose estimation algorithm is proposed in this paper. First, the mask image of the target is obtained by an instance segmentation algorithm, and its point cloud is recovered from the depth map combined with the camera parameters. Then, the correlation among points is established to predict the pose based on multi-modal feature fusion. Experimental results on the YCB-Video dataset show that the proposed algorithm can process complex images at a speed of about 24 frames per second with an accuracy of more than 80%. In conclusion, the proposed algorithm can realize fast pose estimation for complex stacked objects and has strong stability across different objects.


Introduction
Target pose estimation refers to establishing the transformation relationship between the world coordinate system and the camera coordinate system, which allows the 3D position and pose of the object to be estimated accurately. In this way, more flexible space robot grasping and manipulation can be realized. The traditional pose estimation method is mainly based on local feature operators [1]. This method is only suitable for objects with rich texture information, but it fails on weakly textured objects. Aiming at the problem of pose estimation for weakly textured objects, researchers have proposed a two-dimensional plane detection method based on edge extraction, but the method relies heavily on key point matching, target detection, and a known three-dimensional model [2].
With the development of deep learning technology in the field of two-dimensional images, methods estimating pose directly from images based on CNN architectures have been proposed repeatedly [3]. For example, the BB8 algorithm [4] first segments the pixels of the object from the two-dimensional image; then, it uses another CNN to regress the eight corners of the object's three-dimensional bounding box, and it finally optimizes the 6D pose of the object through the PnP algorithm. The SSD-6D algorithm [5] generates the position and rotation parameters on the basis of the target detection algorithm SSD. First, a convolutional neural network is used to estimate the initial 6D pose of the target; then, the ICP algorithm is used for iterative correction. The common point of these methods is to improve an existing target detection algorithm and estimate the 6D pose of the object in an RGB image by calculating the offset between the object and the bounding box. However, these algorithms extract information from RGB images alone. The lack of depth information seriously limits their application in complex environments [6].
With the popularity of 3D cameras, point cloud data based on depth information can not only express the basic shape features of an object but also, combined with the color image, capture its texture features [7]. Therefore, pose estimation methods based on RGB-D have been proposed that combine the local features of the image with the 3D point cloud to build a template library with 3D information, which can effectively improve the robustness of the algorithm. In order to avoid the interference of illumination, environment, and other factors, the representative RGB-D algorithm Linemod [8] extracts image boundary features and surface normal features of the depth map as templates, which greatly improves the stability of the algorithm. After binary coding, these features can be matched through a parallel template search, which improves the processing speed of the algorithm. The PoseCNN algorithm [9] is the first end-to-end 6D pose estimation network; it directly extracts features and then completes the output by adding three branch networks for image segmentation, position, and pose. Finally, the ICP algorithm iteratively optimizes the pose. However, the defect of this algorithm is that it needs to know the 3D model of the target, and in order to improve the accuracy of pose estimation, many algorithms need to add a post-processing step, which cannot meet the real-time requirements of space robot grasping [10].
Accurate remote estimation of the 6D pose is the key factor for a space robot to grasp successfully. It provides the 3D translation coordinates and 3D rotation vectors of the object in the camera coordinate system. Traditional pose estimation methods have disadvantages such as complex calculation, low accuracy, insufficient robustness to occlusion, and poor generalization. Therefore, in complex scenes such as space robot grasping, methods based on deep learning have unparalleled advantages. In addition, compared with RGB-based pose estimation methods, RGB-D methods contain more modal information and are not limited to the target's texture information. To solve the above problems, this paper obtains three-dimensional data of the object from the image based on an instance segmentation algorithm, extracts the semantic features and point cloud features of the two heterogeneous kinds of data, and fuses the multi-modal features at the pixel level. In addition, the post-processing module for pose fine-tuning is integrated into the deep learning framework to further improve the accuracy of the algorithm without increasing the computation too much. Experiments on a common dataset show that the algorithm has high accuracy and significantly improves the real-time performance of pose estimation.
The main contributions of the proposed algorithm framework are as follows. Firstly, a multi-modal feature fusion method is proposed, which integrates the two-dimensional information contained in each pixel into the corresponding three-dimensional point cloud of the model, effectively preserving the spatial structure information of the model. Secondly, the network structure based on posture estimation is integrated into a refine module and integrated into the end-to-end neural network to avoid the dependence on ICP and other post-processing methods, which effectively improves the processing speed of the model.
The remainder of the study is arranged as follows. In Section 2, we introduce the related work of pose estimation. Section 3 focuses on the construction of a new neural network that can estimate the pose of the object based on feature fusion. In Section 4, the performance of the proposed neural network is discussed according to the experiment result. Finally, the conclusion that the new method has faster processing speed and better generalizability is given, and future research is described.

Related Work
The traditional architecture of a pose estimation network is to first extract local or template features and then classify the extracted features to realize the recognition of different targets [11]. However, the performance of such algorithms is seriously constrained by the artificially designed features and the fixed matching process. Therefore, pose estimation methods based on CNN architectures have been proposed one after another, such as PoseNet and PVNet [12]. However, due to the lack of depth information, pose estimation remains a difficult problem, so remote pose estimation based on 3D data has become a trend.
The traditional method is based on RGB-D to extract 3D features, which integrates the depth information as an additional channel of the image into the CNN-based architecture [13]. Recent algorithms transform depth information into 3D data, such as voxels and point clouds, and then estimate the target pose from the 3D data. For example, Song et al. [14] estimated the target pose based on the voxelized input of a 3D convolution network and achieved a good result. However, these methods still need expensive post-processing steps to make full use of 3D data, resulting in low real-time performance of the algorithm.
A two-dimensional image can only reflect the projection of the target at a certain angle in three-dimensional space, so its representation of the related attributes of the object is not accurate, while three-dimensional data can accurately represent the real shape and geometric structure of the target [15]. Previous methods based on CNNs treat depth information as an additional image channel, but these methods ignore the inherent three-dimensional structure of depth information. Therefore, this paper chooses to process 3D data directly. The representation methods of 3D data mainly include point clouds, voxels, meshes, and multi-view images [16]. A point cloud is composed of a number of three-dimensional discrete points. Voxels are the individual cells of a cubical grid, which are combined into objects in a standard way. In a mesh, polygon surfaces are formed by the connectivity relationships between point pairs. Multi-view images are several images of the same scene from different viewpoints that together form a complete three-dimensional model. As a result of the disordered and unstructured characteristics of a 3D point cloud, methods based on point clouds are more robust, but the feature extraction and characterization of a point cloud are more difficult.
At present, many methods choose to voxelize the point cloud data and then extract features based on deep learning, but these methods suffer from information loss and are not suitable for real-time requirements [17]. The multi-view image method predicts the pose of the target by learning from multiple perspectives on the basis of two-dimensional object detection. However, this method is limited by the number of views and needs a scene without occlusion, so it is not suitable for complex scenes. Similarly, grid-based methods transform the point cloud into regular 3D data for processing, which inevitably has defects [18].

Pose Estimation Based on Feature Fusion
The objects to be located by our algorithm are defined by RGB-D data and by 3D point clouds segmented from images together with their associated 3D bounding boxes. When presented with a new RGB-D image, our goal is to recognize these objects and determine their 3D positions and orientations as needed in a space robot grasping task. To solve these problems, according to the known camera intrinsic parameters, the depth information obtained from segmentation can be converted into 3D point cloud data, and then the point cloud data of the model can be processed directly. According to the characteristics of point cloud data, a symmetric function is proposed to extract the point cloud information, and the corresponding position and category of each point can be obtained directly, as shown in Formula (1).
Here, $S(n)$ is the symmetric group of all permutations of $n$ points, $h$ is a multilayer perceptron, and $f$ is the final function, i.e., $f(x_1,\dots,x_n)=\max_{1\le i\le n} h(x_i)$, in which each point $x_i$ of the point cloud is processed by the shared multilayer perceptron $h$. Because the pooling is performed by a symmetric function, the disorder of the point cloud does not affect the value of the final function $f$ for any permutation in $S(n)$. This function extracts the multi-dimensional features of the 3D point cloud model through a neural network and then takes the maximum value of each feature dimension, so as to effectively avoid the disorder problem. At the same time, in order to improve the accuracy of point cloud processing, a three-dimensional spatial transformer network (STN) is added to address the invariance problem. A rotation matrix that is most conducive to network processing can be learned from the pose information of the point cloud, as shown in Formula (2).
Here, $H$ and $W$ denote the height and width of the image, $U^{c}_{nm}$ is the value at location $(n, m)$ of input channel $c$, and the network performs the conversion of the original coordinates into the transformed output. After adjusting the point cloud in space, the point cloud can be rotated to a more appropriate angle for processing. However, for the scene segmentation problem, the relationship between local point pairs is not learned, and the uneven density in the feature extraction process also restricts the effect of the network model.

Aiming at the problem of missing local features, this paper establishes a hierarchical grouping of points based on a hierarchical point cloud feature learning method to extract abstract local features. First, local regions of the 3D model are sampled by farthest point sampling (FPS) within a given range. Then, in the aggregation layer, the sampled points are taken as centers, and the points inside each aggregation sphere form a subset of the point cloud. Finally, the global features of each point cloud subset are extracted by the PointNet network. After several iterations of this operation, fewer points remain in the point cloud, and the local features extracted from it contain more information. The overall process of the hierarchical point cloud feature learning method is shown in Figure 1.

Finally, we fuse the point cloud information and the 2D information of the image: the image is processed by a ResNet network and the point cloud data by PointNet, and the two sub-networks are fused to obtain the 3D bounding box. In order to avoid the influence of targets of different sizes, the concatenation technique from image target detection is used for reference, and the point cloud features and 2D features are spliced together. Finally, the offset and confidence of each point relative to the center of the bounding box are obtained. The loss function of the network is shown in Formula (3).
Here, $x$ is the predicted corner position, $L_{stn}$ is the spatial transformation regularization loss, which strengthens the orthogonality of the spatial transformation matrix, and $L_{score}$ is the score loss, which makes the network learn the offset from the point cloud to the nearest target box [19]. It is expected that the algorithm can be robust to noise and changing illumination in complex environments and meet real-time requirements. This paper proposes an end-to-end deep learning method that estimates the 6D pose of objects based on RGB-D data. The core is to fuse image features and point cloud features at the pixel level. This fusion method can clearly express the local appearance and geometric information, and it can effectively improve the accuracy of pose estimation. At the same time, integrating the pose fine-tuning module into the end-to-end learning framework can not only improve the performance of the model but also meet the requirements of real-time tasks.
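The farthest point sampling step used in the hierarchical feature learning described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation; the function name and the random starting point are our own choices.

```python
import numpy as np

def farthest_point_sampling(points, k, seed=0):
    """Greedily pick k points that spread out over the cloud.

    points: (N, 3) array of point coordinates; k: number of samples.
    Returns an array of k distinct indices into `points`.
    """
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    chosen = [int(rng.integers(n))]                        # random starting point
    dist = np.linalg.norm(points - points[chosen[0]], axis=1)
    for _ in range(k - 1):
        idx = int(np.argmax(dist))                         # farthest from the chosen set
        chosen.append(idx)
        # keep, for every point, its distance to the nearest chosen sample
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return np.array(chosen)
```

The points returned here would then serve as the centers of the aggregation spheres whose members form the point cloud subsets fed to PointNet.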
First, the segmented objects are obtained based on the instance segmentation algorithm PSPNet [20]. At the same time, the depth pixels of the mask are converted into 3D point clouds containing spatial structure information. Then, for each segmented object, the 2D features of the image are extracted based on the CNN network, and the geometric feature vector of the 3D point cloud is extracted based on the PointNet network. Such separate processing can effectively preserve the data information. Finally, multi-modal features are fused at the pixel level to estimate the 6D pose of the target. The refine module based on this network framework can iterate repeatedly to optimize the pose estimation results so as to improve the accuracy of the algorithm.

Feature Extraction
RGB and depth information are located in different spaces. How to extract features from heterogeneous data sources and fuse them properly is the key technology studied in this paper. Previous algorithms either use high-cost 3D convolution or only use the depth information as an additional channel beside the RGB channels, resulting in low pose estimation accuracy. In this paper, RGB information and depth information are processed separately. First, the input image is segmented to obtain the target bounding box and mask image. At the same time, the depth pixels of the mask are converted into three-dimensional point clouds containing spatial structure information. Then, for each segmented object, 2D features of the image are extracted based on a CNN network, and geometric feature vectors of the 3D point cloud are extracted based on the PointNet network. The whole process is shown in Figure 2. As can be seen from the figure, the first step of the algorithm is to segment the region of interest in the image; then, we need to extract information from the RGB and depth channels and fuse them. Although the structures of these two information sources are similar, they reside in different spaces. The RGB-D image is a collection of pixels, each with an RGB value and a depth value, and the mask of the segmented object of interest is a subset of these pixels. In this case, there are as many 3D points as RGB pixels, so there is no discrepancy in size. In this paper, we choose to process the two data sources separately so as to retain the internal structure information of each data source. The CNN network for extracting two-dimensional image features is a pyramid scene parsing network. At the same time, the up-sampling structure of the network should make the sizes of the two-dimensional features and the three-dimensional point cloud features consistent, which is convenient for pixel-by-pixel fusion.
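The conversion of masked depth pixels into a 3D point cloud follows the standard pinhole back-projection. The sketch below assumes intrinsics $f_x$, $f_y$, $c_x$, $c_y$ are known; the function name is ours.

```python
import numpy as np

def depth_to_point_cloud(depth, mask, fx, fy, cx, cy):
    """Back-project masked depth pixels into camera-frame 3D points.

    depth: (H, W) depth in meters; mask: (H, W) boolean segmentation mask;
    fx, fy, cx, cy: pinhole camera intrinsics.
    Returns an (N, 3) array with one 3D point per masked pixel.
    """
    v, u = np.nonzero(mask)          # pixel rows (v) and columns (u)
    z = depth[v, u]
    x = (u - cx) * z / fx            # standard pinhole back-projection
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```

Because every masked pixel yields exactly one point, the resulting cloud stays pixel-aligned with the CNN feature map, which is what makes the later pixel-wise fusion possible.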
The main body of PSPNet (Pyramid Scene Parsing Network) uses ResNet18 as the basic network, which can extract information at various scales from different regions, greatly improving the scene understanding ability of the algorithm, as shown in Figure 3. As can be seen from the figure, the core of the two-dimensional feature extraction algorithm based on PSPNet is Spatial Pyramid Pooling (SPP), which can understand the scene more accurately through the combination of local and global features. Point cloud data processing is accomplished by the PointNet network, which mainly uses a symmetric function to solve the problem of permutation invariance in unordered point cloud processing.
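To make the symmetric-function idea concrete, the following sketch builds a permutation-invariant global feature from a single shared layer followed by max pooling. The single-layer $h$ and all names are illustrative; the actual network uses a deeper shared MLP.

```python
import numpy as np

def pointnet_global_feature(points, weights, bias):
    """Permutation-invariant global feature: shared per-point layer + max pool.

    points: (N, 3); weights: (3, D); bias: (D,).
    Every point passes through the same layer h; taking the maximum over
    points is a symmetric function, so reordering the cloud cannot change
    the result.
    """
    per_point = np.maximum(points @ weights + bias, 0.0)   # shared ReLU layer h
    return per_point.max(axis=0)                           # order-independent pooling
```

Feeding the same cloud in any order yields an identical feature vector, which is exactly the disorder-robustness property discussed above.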

Multi-Modal Data Fusion
After extracting features from the RGB image and the point cloud with different networks, in order to retain the target information to the greatest extent, features from different levels of the networks are extracted and fused, and finally, they are spliced into a pixel-level fusion feature, as shown in Figure 4. In order to keep consistent with the dimension of the point cloud features, the image features are first adjusted to a matching dimension. Then, the two layers of features in the network are concatenated per pixel to generate fusion features. Fusion feature 3 is obtained by average pooling fusion feature 2 and is then repeated so that its size stays consistent with fusion feature 1 and fusion feature 2. Finally, a multi-layer fusion feature is obtained, ensuring that the final feature contains information at multiple scales. The preliminary prediction of the 6D pose can then be realized based on this fusion feature. At the same time, three groups of parameters are predicted from the fusion features: a rotation parameter, a translation parameter, and a confidence. In this pose prediction model, the loss function is the least-squares distance between the real points in three-dimensional space and the corresponding model points after pose regression, as shown in Formula (4). Here, $R$ and $\hat{R}$ are the matrices of the predicted and real rotation parameters, and $T$ and $\hat{T}$ are the vectors of the predicted and real translation parameters. However, this loss function is not suitable for centrosymmetric or axisymmetric objects, because a single value of the loss function then corresponds to many poses, which is not conducive to the stability of the network. Therefore, the loss function for a symmetric target is modified so that each point is matched to its closest transformed model point, as shown in Formula (5).
The total loss function of the whole network combined with confidence is Formula (6).
In the formula, $i$ denotes the current point, $N$ is the total number of points, $c_i$ is the confidence, and $\alpha$ is a hyperparameter. Finally, the final 6D pose parameters are determined according to the maximum confidence in an unsupervised manner.
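Formulas (4)-(6) can be illustrated with a small NumPy sketch. The closest-point matching for symmetric objects and the confidence-weighted total loss follow the description above; the regularization weight `w` stands in for the hyperparameter $\alpha$, and its default value is an assumption, not taken from the text.

```python
import numpy as np

def pose_loss(model_pts, R_pred, t_pred, R_gt, t_gt, symmetric=False):
    """Formulas (4)/(5): mean distance between model points under the two poses.

    For symmetric objects each ground-truth-transformed point is matched to
    the closest predicted point, so equivalent symmetric poses cost nothing.
    """
    pred = model_pts @ R_pred.T + t_pred
    true = model_pts @ R_gt.T + t_gt
    if not symmetric:
        return float(np.mean(np.linalg.norm(pred - true, axis=1)))
    d = np.linalg.norm(true[:, None, :] - pred[None, :, :], axis=2)
    return float(np.mean(d.min(axis=1)))

def total_loss(per_point_losses, confidences, w=0.015):
    """Formula (6): confidence-weighted loss; the -w*log(c) term keeps the
    network from driving every confidence to zero (w stands in for alpha)."""
    c = np.asarray(confidences, dtype=float)
    L = np.asarray(per_point_losses, dtype=float)
    return float(np.mean(L * c - w * np.log(c)))
```

Note how a 180-degree rotation of a centrosymmetric object incurs zero loss only in the symmetric variant, which is the stability problem Formula (5) addresses.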

Pose Refine Network
After the initial pose is obtained, in order to improve the accuracy of space robot grasping, the pose parameters need to be optimized to realize fine matching of the 6D pose. Based on the idea of the ICP algorithm, this paper proposes a fine registration method integrated into the whole network, which can not only improve the accuracy of the algorithm but also ensure the real-time performance of the model. This module first processes the target point cloud according to the predicted pose to obtain the network input for iterative fine matching, as shown in Figure 5. It can be seen from the figure that the predicted pose has been obtained from the target point cloud, but the predicted pose inevitably deviates from the real pose. Therefore, the point cloud is transformed by the inverse matrix of the predicted pose to generate the input of the next iteration. As can be seen from the figure, taking the result of the last iteration as the input point cloud, the RGB features and the geometric structure features of the point cloud are extracted again, and the resulting fusion feature is used to predict the 6D pose parameters in refine mode. The difference between this part of the network and the main network is that there is no confidence output, but the category parameters are still included in the pose parameters. This is the first iterative process of fine matching. According to the output pose, the new input can be obtained by inversion, and so on until convergence. From the experiments presented later in this paper, it can be seen that this iterative method effectively improves the accuracy of 6D pose estimation, while the added computation is very low.
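The iterative composition step can be sketched as follows, with the refine network abstracted as a callable `predict_residual` that returns a residual rotation and translation. The pose convention $p_{cam} = R\,p_{obj} + t$ and all names are assumptions for illustration.

```python
import numpy as np

def refine_pose(points, R0, t0, predict_residual, iters=2):
    """Iteratively refine a pose estimate (convention: p_cam = R @ p_obj + t).

    predict_residual(cloud) -> (dR, dt) stands in for the refine network: it
    receives the observed cloud expressed in the current estimated object
    frame (i.e. after applying the inverse of the current pose) and returns
    a residual rotation and translation to compose with the estimate.
    """
    R, t = np.asarray(R0, dtype=float), np.asarray(t0, dtype=float)
    for _ in range(iters):
        canonical = (points - t) @ R          # row-vector form of R.T @ (p - t)
        dR, dt = predict_residual(canonical)
        t = R @ dt + t                        # compose: p = R @ (dR @ p + dt) + t
        R = R @ dR
    return R, t
```

Each pass costs only one forward evaluation of the residual predictor, which is why the refinement adds little computation compared with an external ICP stage.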

Network Model Optimization
This section mainly studies the problems encountered in training the network model, as well as optimizations of the network, so as to ensure that the algorithm achieves the desired effect. First of all, the point clouds in the YCB-Video dataset [21] focus more on the boundary information of the objects, so the 3D models can describe the shape and key point information of the objects well, which is very helpful for shape extraction. However, this leads to an uneven distribution of the point cloud and low recognizability of the model, which affects the training results. To solve this problem, this paper expands the number of point cloud samples from 3000 to 10,000 and adjusts the distribution of the template point cloud to make it evenly distributed [22], as shown in Figure 7. In the upper row of the figure, it can be seen that the distribution of the original 3D model point cloud is uneven, which leads to missing surface information in many parts. In the lower row, the distribution of the point cloud is adjusted so that the 3D model takes both the contour information and the texture information of the object into account, which effectively improves the recognition accuracy of the algorithm.

Experiment Results and Discussion
The experimental workstation is configured with an i7 CPU, 16 GB memory, and an NVIDIA GTX 1080Ti graphics card. In order to study the effect of the pose estimation network in the space robot grasping application, this paper evaluates the algorithm on the YCB-Video dataset. It contains 133,827 frames depicting various object instances, with complete annotation information. It is a typical dataset in the field of pose estimation, with wide coverage and high difficulty, and it is very close to an actual scene of space robot grasping. The dataset contains real data and synthetic data, including asymmetric objects, centrosymmetric objects, and axisymmetric objects, and it contains a variety of background surfaces.
Each example contains an RGB image, a depth image, and an instance segmentation annotation image. After obtaining the 6D pose estimation results, we need a quantitative evaluation index to evaluate the algorithm, so as to intuitively compare algorithm performance. Since 6D pose estimation is a regression problem, error metrics used in classification scenarios, such as accuracy and recall, are not applicable. For 6D pose estimation, it is necessary to define an error measure as the standard to evaluate the performance of the model. In this paper, the average distance metric (ADD) is selected as the error measurement standard for 6D pose estimation. It represents the average two-norm of the three-dimensional distance between model points transformed by the predicted 6D pose and by the real 6D pose, as shown in Formula (7).
According to the formula, ADD calculates the average distance from each 3D point of the model transformed by the real pose to the corresponding point transformed by the estimated pose. In order to avoid deviations caused by pose ambiguity, the real pose and the estimated pose can both be projected into the camera coordinate system, and the average distance to the closest point can then be calculated for symmetric objects. In this paper, ADD is used to integrate symmetric and asymmetric objects into the overall evaluation.
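A sketch of the ADD computation and of a threshold-based success rate follows; the 2 cm threshold in `add_accuracy` is a common evaluation choice assumed here, not stated in the text.

```python
import numpy as np

def add_metric(model_pts, R_pred, t_pred, R_gt, t_gt):
    """Formula (7): mean two-norm distance between corresponding model points
    transformed by the estimated pose and by the real pose."""
    pred = model_pts @ R_pred.T + t_pred
    true = model_pts @ R_gt.T + t_gt
    return float(np.mean(np.linalg.norm(pred - true, axis=1)))

def add_accuracy(add_values, threshold=0.02):
    """Fraction of frames whose ADD falls below a distance threshold
    (e.g. 2 cm, with all quantities in meters)."""
    return float(np.mean(np.asarray(add_values, dtype=float) < threshold))
```

Reporting the fraction of frames under a fixed threshold is what allows a single accuracy percentage, such as the "above 80%" figure quoted below, to summarize performance across a whole test set.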
By comparing the experimental results of the two models on the YCB-Video dataset, the effectiveness of the proposed pose estimation algorithm is demonstrated. After the initial pose estimation of the whole network and the iterative refinement of the refine network, we can see how the pose estimation process gradually calibrates the pose of the 3D model, as shown in Figure 8. The blue point cloud represents the real pose of the target, and the red point cloud represents the estimated pose. It can be seen that some error remains between the initial pose and the real pose after network training. However, through repeated iterations of the fine-tuning network, most of the final point cloud models clearly come close to the real point cloud, which repeatedly demonstrates the effectiveness of the pose estimation algorithm proposed in this paper. In Table 1, the algorithm is compared with state-of-the-art pose estimation methods, including PoseCNN based on RGB and DenseFusion based on RGB-D. For consistency, all algorithms are configured with the same training set and test set. The comparison of ADD values in the quantitative evaluation of the 6D pose on the YCB-Video dataset is shown in Table 1. It can be seen from the table that the RGB-based PoseCNN does not perform as well as the RGB-D-based DenseFusion, which proves that depth data are indispensable in pose estimation. The performance of the proposed method is similar to that of DenseFusion, indicating that the multi-modal data fusion module plays a major role in improving the accuracy. Through the actual scene test, it can be found that the calculation speed of the proposed algorithm can reach about 24 frames per second, while the accuracy still remains above 80%. More importantly, compared with the existing methods, it can better solve the grasping problem when objects occlude each other.
At the same time, in order to prove the advantages of the proposed algorithm in the space robot grasping, the three algorithm models were deployed on the actual platform respectively, and the running time of each step was counted, as shown in Table 2. As can be seen from Table 2, compared with the RGB-based method PoseCNN, the proposed algorithm in this paper has a great improvement in the real-time performance of remote pose estimation. The previous two methods spend most of their effort on the post-processing, while the new algorithm integrates the fine-tuning module into the whole network, which improves the performance without much increase in computation. Although the proposed pose estimation algorithm may not be the most outstanding in average distance, the real-time performance of the model is of great value for space robot grasping.
Traditional 6D pose estimation requires either 3D convolution or a 3D model for ICP registration, both of which require a great amount of computation. Generally, these algorithms can only process one part in about 20 s, which obviously does not meet production requirements. On the other hand, it is difficult for an algorithm that calculates object pose from RGB images alone to deal with complicated stacking of parts. We believe that the depth map contains the spatial pose information of the object, while the RGB map contains the category and boundary information of the object. Therefore, the feature information of these two modalities can be extracted separately and fused to maximize the use of depth information. At the same time, the depth map is transformed into a point cloud, which greatly reduces the computation without reducing the running speed. The proposed algorithm can obtain the 6D pose quickly and accurately when dealing with complex stacked parts. More importantly, it is robust enough for complex industrial environments, which has important practical significance for industrial production.

Conclusions
Aiming at the problems of low accuracy and slow speed of current pose estimation algorithms, an end-to-end pose estimation algorithm based on multi-modal feature fusion is proposed and verified in this paper. First, the proposed algorithm segments the RGB information and depth information of the target at the same time, and it effectively preserves the spatial information of the data by extracting semantic features and point cloud features from the two heterogeneous data sources respectively. Then, multi-modal features are fused at the pixel level to obtain a new global feature, which is used to complete the task of remote pose estimation. At the same time, the post-processing module for pose fine-tuning is integrated into the end-to-end framework, so that the algorithm can improve the accuracy without adding too much computation.

Experimental results show that the pose estimation algorithm based on RGB-D input can realize real-time detection of the 6D pose while maintaining sufficient detection accuracy. This algorithm differs from previous methods that treat the depth map as an extra channel of the RGB image. Instead, RGB and point cloud features are extracted separately; then, the image features with the strongest correlation are screened out and spliced along one dimension. In future work, we will continue to optimize the accuracy and calculation speed of the algorithm so that it can be applied to actual production in more complicated environments.