Robust Image Matching Based on Image Feature and Depth Information Fusion

Abstract: In this paper, we propose a robust image feature extraction and fusion method that fuses image features with depth information to improve the registration accuracy of RGB-D images. The proposed method directly splices the image feature point descriptors with the corresponding point cloud feature descriptors to obtain the fusion descriptor of the feature points. The fusion feature descriptors are constructed from the SIFT, SURF, and ORB image feature descriptors and the PFH and FPFH point cloud feature descriptors. The registration performance of the fusion features is tested on the YCB and KITTI RGB-D datasets. Compared with ORB, ORBPFH reduces the matching failure rate by 4.66~16.66%, and ORBFPFH reduces it by 9~20%. The experimental results show that the proposed RGB-D feature extraction and fusion method is suitable for fusing ORB with PFH and FPFH, which improves feature representation and registration and represents a novel approach for RGB-D image matching.


Introduction
Since the advent of the Microsoft Kinect camera, various new RGB-D cameras have been launched. RGB-D cameras can simultaneously provide color images and dense depth images. Owing to this data acquisition advantage, RGB-D cameras are widely used in robotics and computer vision. The extraction and matching of image features are the basis for realizing these applications. Significant progress has been made in the feature extraction, representation, and matching of images and depth maps (or point clouds). However, there is room for further improvement. For example, the depth image includes information not contained in the original color image, and further research is required to utilize the color and depth information jointly to improve feature-matching accuracy. Therefore, a robust RGB-D image feature extraction and fusion method based on image and depth feature fusion is proposed in this paper. The main idea of the proposed method is to directly splice the image feature point descriptor and the corresponding point cloud feature descriptor to obtain the fusion descriptor of the feature points, which is then used as the basis of feature matching. The methodology framework comprises image feature extraction and representation, point cloud feature extraction and representation, and feature fusion, as shown in Figure 1.
The main contributions of this paper are as follows:

1. A feature point description method that fuses image features and depth information is proposed, which has the potential to improve the accuracy of feature matching.
2. The feature-matching performance of different fusion features constructed with the proposed method is verified on public RGB-D datasets.

Related Work
The aim of the present study is to design a robust RGB-D image feature extraction and fusion method to improve RGB-D image registration accuracy. However, a method that fully fuses RGB images with depth information remains to be established. In this section, we review current research on feature extraction, representation, and fusion methods for images and point clouds.
(1) Image feature extraction and representation Lowe proposed the well-known scale-invariant feature transform (SIFT) algorithm [1]. SIFT is both a feature detector and a feature descriptor. The algorithm is theoretically scale-invariant and is robust to illumination changes, rotation, scaling, noise, and occlusion. The SIFT feature descriptor is a 128-dimensional vector. However, the calculation process of this algorithm is complicated, and it is slow. Rosten et al. proposed the features from accelerated segment test (FAST) algorithm [2]. FAST is a corner-detection method that can quickly extract feature points. It uses a 16-pixel circle around the candidate point, p, to classify whether the candidate point is a corner. The most significant advantage of this method is its high computational efficiency, but FAST is not a feature descriptor, so it must be combined with a feature descriptor. Bay et al. proposed the speeded-up robust features (SURF) algorithm [3]. SURF is a fast, high-performance, scale- and rotation-invariant feature point detector and descriptor that combines the Hessian matrix and the Haar wavelet. The SURF descriptor uses only a 64-dimensional vector, which reduces the time required for feature calculation and matching. Leutenegger et al. proposed the binary robust invariant scalable keypoints (BRISK) algorithm [4]. BRISK usually uses the FAST algorithm to detect feature points quickly, then samples the gray levels in the neighborhood of each keypoint and obtains a 512-bit binary code by comparing the sampled gray levels. BRISK has low computational complexity, good real-time performance, scale invariance, rotation invariance, and noise resistance but poor matching accuracy. Rublee et al. proposed the oriented FAST and rotated BRIEF (ORB) algorithm [5], which combines the FAST and BRIEF [6] algorithms, making it both a feature detector and a feature descriptor. The ORB feature descriptor is generally a binary string of 128, 256, or 512 bits. The contribution of ORB is that it adds a fast and accurate orientation component to FAST and computes BRIEF features efficiently, so that it can run in real time. However, it is not scale-invariant and is sensitive to brightness. Alahi et al. proposed the fast retina keypoint (FREAK) algorithm [7]. FREAK is not a feature detector; it can only be applied to keypoints that other feature detection algorithms have detected. FREAK is inspired by the human retina, and its binary feature descriptors are computed by efficiently comparing image intensities over a retinal sampling pattern.
(2) Point cloud feature extraction and representation Johnson et al. proposed a 3D mesh description called spin images (SI) [8]. SI computes 2D histograms of points falling within a cylindrical volume utilizing a plane that "spins" around the normal of the plane. Frome et al. proposed regional shape descriptors called 3D shape contexts (3DSC) [9]. 3DSC directly extends 2D shape contexts [10] to 3D. Rusu et al. designed the point feature histograms (PFH) algorithm [11,12], which calculates angular features and constructs the corresponding feature descriptors by observing the geometric structure of adjacent points. For most real-time applications, one of the biggest bottlenecks of the PFH algorithm is its computational cost. To improve the calculation speed, Rusu et al. developed the fast point feature histogram (FPFH) algorithm [13], which is a representative handcrafted 3D feature descriptor. It provides similar feature-matching results with reasonable computational complexity. Tombari et al. proposed a local 3D descriptor for surface matching called the signature of histograms of orientations (SHOT) [14,15]. SHOT allows for simultaneous encoding of shape and texture, forming a local feature histogram. Steder et al. developed the normal aligned radial feature (NARF) [16], a 3D feature point detection and description algorithm. Guo et al. proposed rotational projection statistics (RoPS) [17], a local feature descriptor for 3D rigid objects that is sensitive to occlusions and clutter. In addition, many other high-performance 3D point cloud features have emerged in recent years, including B-SHOT [18], Frame-SHOT [19], LFSH [20], 3DBS [21], 3DHoPD [22], TOLDI [23], BSC [24], BRoPH [25], and LoVS [26], among others.
(3) Image and point cloud feature fusion Rehman et al. proposed a method to fuse the local binary pattern, wavelet moments, and color autocorrelogram features of RGB data with principal component analysis (PCA) features of the corresponding depth data [24]. Khan et al. proposed an RGB-D data feature generation method based on color autocorrelograms, wavelet moments, local binary patterns, and PCA [27]. Alshawabkeh fused image color information with point cloud linear features [28]. Chen et al. achieved point cloud feature extraction by selecting three pairs of two-dimensional image and three-dimensional point cloud feature points, calculating the transformation matrix between the image and point cloud coordinates, and establishing a mapping relationship [29]. Li et al. proposed a voxel-based local feature descriptor, used a random forest classifier to fuse point cloud RGB information and geometric structure features, and constructed a classification algorithm for color point clouds [30]. With the development of artificial intelligence, many feature extraction and fusion techniques based on deep learning have emerged, such as those presented in [31-34]. These methods require a large amount of data to train network models, and obtaining such extensive training data may be difficult under some application conditions. Therefore, in this paper, we focus on traditional feature extraction and fusion methods.

Feature Extraction and Matching
The proposed feature point extraction and fusion method proceeds as follows. First, the feature points of the RGB image are extracted, and the corresponding image feature descriptors are established. Three classical image features are selected, namely SIFT, SURF, and ORB. Then, according to the pixel correspondence between the RGB and depth images, the depth image is transformed into a point cloud. The features of the three-dimensional points corresponding to the image feature points are extracted, i.e., the PFH and FPFH features. Finally, the image feature descriptor and the point cloud feature descriptor are spliced into a fusion descriptor.

RGB-D Camera Calibration
The depth image is generally captured by a depth camera, and the RGB image is generally captured by a separate RGB camera. Because the two sensors differ in hardware, the RGB image and the depth image often differ in size. Therefore, RGB-D camera calibration must be carried out to obtain the transformation matrix between the RGB camera and the depth camera. The calibration principle is as follows.
A schematic diagram of the RGB-D camera coordinate systems is shown in Figure 2. Let the world coordinate system be $O_W - X_W Y_W Z_W$; let the RGB camera and depth camera coordinate systems be $O_{RGB} - X_{RGB} Y_{RGB} Z_{RGB}$ and $O_{Depth} - X_{Depth} Y_{Depth} Z_{Depth}$, respectively; and let the corresponding image pixel coordinate systems be $o_{rgb} - u_{rgb} v_{rgb}$ and $o_{depth} - u_{depth} v_{depth}$, respectively. The positions of a world point, $P_W = [X_W, Y_W, Z_W, 1]^T$, in the RGB camera and depth camera coordinate systems are given by the following formula.

$$P_{RGB} = T_{W2RGB} P_W, \qquad P_{Depth} = T_{W2Depth} P_W \tag{1}$$

where $T_{W2RGB}$ and $T_{W2Depth}$ denote the transformation matrices from the world coordinate system to the RGB camera and depth camera coordinate systems, respectively.
The positional relationship between the RGB camera and the depth camera can be represented by the transformation matrix $T_{Depth2RGB}$, which maps the homogeneous coordinates $P_{Depth}$ of a point in the depth camera coordinate system to its coordinates $P_{RGB}$ in the RGB camera coordinate system, as follows:

$$P_{RGB} = T_{Depth2RGB} P_{Depth}, \qquad T_{Depth2RGB} = \begin{bmatrix} R & t \\ 0^T & 1 \end{bmatrix} \tag{2}$$

The camera coordinate system can be converted to the image pixel coordinate system by the following equation.

$$Z_{RGB} \begin{bmatrix} u_{rgb} \\ v_{rgb} \\ 1 \end{bmatrix} = K_{RGB} \begin{bmatrix} X_{RGB} \\ Y_{RGB} \\ Z_{RGB} \end{bmatrix}, \qquad Z_{Depth} \begin{bmatrix} u_{depth} \\ v_{depth} \\ 1 \end{bmatrix} = K_{Depth} \begin{bmatrix} X_{Depth} \\ Y_{Depth} \\ Z_{Depth} \end{bmatrix} \tag{3}$$

where $K_{RGB}$ and $K_{Depth}$ represent the intrinsic parameter matrices of the RGB camera and the depth camera, respectively. By combining Equations (2) and (3), the depth image pixel coordinate system can be converted into the RGB image pixel coordinate system, as shown in the following equation.

$$Z_{RGB} \begin{bmatrix} u_{rgb} \\ v_{rgb} \\ 1 \end{bmatrix} = K_{RGB} \left( R \, Z_{Depth} K_{Depth}^{-1} \begin{bmatrix} u_{depth} \\ v_{depth} \\ 1 \end{bmatrix} + t \right) \tag{4}$$

where $Z_{Depth}$ is the depth value measured by the depth camera, $R$ and $t$ are the rotation and translation blocks of $T_{Depth2RGB}$, and $K_{RGB}$, $K_{Depth}$, and $T_{Depth2RGB}$ can be obtained by the Zhang camera calibration method [35]. Through Equation (4), we can obtain the projection of the depth data in the RGB image pixel coordinate system. However, because the depth image size usually differs from the RGB image size, the projected depth data are generally resampled in the RGB image pixel coordinate system so that the depth image size matches the RGB image size.
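The pixel mapping described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function name `depth_pixel_to_rgb_pixel` and the calibration values are hypothetical, and the transformation is assumed to be a 4 × 4 homogeneous matrix.

```python
import numpy as np

def depth_pixel_to_rgb_pixel(u_d, v_d, z_d, K_depth, K_rgb, T_depth2rgb):
    """Map one depth pixel (u_d, v_d) with depth z_d into the RGB image."""
    # Back-project the depth pixel to a 3D point in the depth camera frame.
    pixel_h = np.array([u_d, v_d, 1.0])
    p_depth = z_d * (np.linalg.inv(K_depth) @ pixel_h)
    # Move the point into the RGB camera frame with the 4x4 transform.
    p_rgb = (T_depth2rgb @ np.append(p_depth, 1.0))[:3]
    # Project with the RGB intrinsics and divide by the RGB-frame depth.
    uvw = K_rgb @ p_rgb
    return uvw[0] / uvw[2], uvw[1] / uvw[2], p_rgb[2]

# Hypothetical calibration values, chosen only to make the example concrete.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
T = np.eye(4)
T[0, 3] = 0.05  # assumed 5 cm offset between the cameras along X

u_rgb, v_rgb, z_rgb = depth_pixel_to_rgb_pixel(320, 240, 1.0, K, K, T)
```

With these assumed values, the principal-point pixel of the depth image at 1 m lands 25 pixels to the right in the RGB image (500 × 0.05 / 1.0), which is the kind of offset the resampling step has to absorb.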

Feature Extraction from RGB Maps
(1) SIFT The process of SIFT feature point extraction and representation is shown in Figure 3a. After determining the location of the feature point, SIFT takes 4 × 4 subregion blocks around the feature point (each subregion block is 4 × 4 pixels), calculates the gradient magnitude and direction in each subregion, divides the gradient direction into eight intervals, and accumulates an eight-dimensional subfeature histogram for each subregion. The subfeature histograms of the 4 × 4 subregion blocks are combined to form a 128-dimensional (4 × 4 × 8) SIFT feature descriptor. A schematic diagram is shown in Figure 3b.
(2) SURF The process of SURF feature point extraction and representation is shown in Figure 4a. After determining the position of the feature point, SURF takes 4 × 4 subregion blocks around the feature point. In the standard SURF formulation, each subregion contributes four sums of Haar wavelet responses (Σdx, Σ|dx|, Σdy, and Σ|dy|), which yields the 64-dimensional (4 × 4 × 4) descriptor.
(3) ORB The process of ORB feature point extraction and representation is shown in Figure 5a. After determining the position of the feature point, ORB selects a 31 × 31 image block centered on the feature point, rotates it to the main direction, and then randomly selects N pairs of points in this block (N is generally 128, 256, or 512). For each point pair A and B, one bit is obtained by comparing the average gray levels of the 5 × 5 subwindows around the two points; comparing all N pairs yields a binary feature descriptor of length N. A schematic diagram is shown in Figure 5b.
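The two descriptor layouts that matter most later (a floating-point histogram and a binary string) can be illustrated with simplified sketches. These are assumptions for illustration only: `sift_like_descriptor` omits the Gaussian weighting, interpolation, and orientation normalization of real SIFT, and `brief_like_descriptor` omits the rotation to the main direction that ORB applies; both function names are hypothetical.

```python
import numpy as np

def sift_like_descriptor(patch):
    """128-D gradient histogram from a 16x16 patch (4x4 cells, 8 bins each)."""
    patch = np.asarray(patch, dtype=np.float64)
    assert patch.shape == (16, 16)
    gy, gx = np.gradient(patch)                    # row and column gradients
    mag = np.hypot(gx, gy)                         # gradient magnitude
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)    # direction in [0, 2*pi)
    bins = np.minimum((ang / (2 * np.pi) * 8).astype(int), 7)
    desc = np.zeros((4, 4, 8))
    for r in range(16):
        for c in range(16):
            # Each pixel votes into the histogram of its 4x4-pixel cell.
            desc[r // 4, c // 4, bins[r, c]] += mag[r, c]
    desc = desc.ravel()
    norm = np.linalg.norm(desc)
    return desc / norm if norm > 0 else desc

def brief_like_descriptor(patch, n_pairs=256, seed=0):
    """N-bit binary string from random point-pair tests in a 31x31 patch."""
    patch = np.asarray(patch, dtype=np.float64)
    assert patch.shape == (31, 31)
    rng = np.random.default_rng(seed)
    # Keep centers 2 pixels from the border so every 5x5 window fits.
    pts = rng.integers(2, 29, size=(n_pairs, 4))

    def mean5(r, c):
        return patch[r - 2:r + 3, c - 2:c + 3].mean()

    bits = [mean5(r1, c1) < mean5(r2, c2) for r1, c1, r2, c2 in pts]
    return np.array(bits, dtype=np.uint8)

ramp = np.tile(np.arange(16.0), (16, 1))   # intensity increases left to right
d_float = sift_like_descriptor(ramp)
d_bits = brief_like_descriptor(np.zeros((31, 31)))
```

The difference in data type between `d_float` and `d_bits` is the reason the fusion step treats ORB separately from SIFT and SURF.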

Feature Extraction from Point Cloud
(1) PFH PFH parameterizes the spatial differences between a reference point and its neighborhood to form a multidimensional histogram describing the geometric properties of the point neighborhood. The multidimensional space in which the histogram lies provides a measurable information space for feature expression and is robust to pose, sampling density, and noise of 3D surfaces. As shown in Figure 6a, $p_q$ represents the sampling point (red). The scope of PFH is a sphere with $p_q$ as the center and radius $r$; the other points within this scope (blue) contribute to the PFH of $p_q$. After obtaining all the neighboring points in the k neighborhood of sampling point $p_q$, a local coordinate system, $uvw$, is established at $p_q$, as shown in Figure 6b, where $p_k$ represents a neighborhood point, and $n_q$ and $n_k$ represent the normals at $p_q$ and $p_k$, respectively.
In Figure 6b, the angle eigenvalues $\alpha$, $\phi$, and $\theta$ are as follows, written here in the standard PFH form [11,12] with $u = n_q$, $v = u \times (p_k - p_q) / \lVert p_k - p_q \rVert_2$, and $w = u \times v$:

$$\alpha = v \cdot n_k, \qquad \phi = u \cdot \frac{p_k - p_q}{\lVert p_k - p_q \rVert_2}, \qquad \theta = \arctan\left(w \cdot n_k, \; u \cdot n_k\right)$$

Each angle eigenvalue is divided into five intervals. All adjacent points in the K neighborhood are combined in pairs, and the number of times the $\alpha$, $\phi$, and $\theta$ values of each point pair fall in each angle interval is counted. Finally, a 125-dimensional ($5^3$) point feature histogram is obtained.
(2) FPFH As a simplified version of PFH, the FPFH algorithm maintains good robustness and recognition characteristics. It also improves the matching speed and achieves excellent real-time performance by reducing the computational complexity. The calculation process of FPFH is as follows:
1. For each sample point, the three angle eigenvalues are calculated between the point and each point in its K neighborhood, and each angle eigenvalue is divided into 11 intervals, so a 33-dimensional simplified point feature histogram (SPFH) is obtained;
2. The SPFH of each point in the K neighborhood is calculated in the same way;
3. The final FPFH is calculated with the following formula:

$$FPFH(p_q) = SPFH(p_q) + \frac{1}{k} \sum_{i=1}^{k} \frac{1}{\omega_i} SPFH(p_i)$$

where $p_i$ is the $i$-th neighboring point of $p_q$, and $\omega_i$ represents the weight coefficient, which is generally expressed by the distance between the sampling point $p_q$ and that neighboring point. A schematic diagram of the FPFH affected area is shown in Figure 7.
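The PFH and FPFH steps above can be sketched as follows. This is a minimal NumPy illustration that assumes the standard PFH pair features with an unnormalized v axis; library implementations such as PCL differ in details (for example, v is normalized there). The helper names `pair_features`, `pfh_histogram`, and `fpfh_from_spfh` are hypothetical.

```python
import numpy as np

def pair_features(p_q, n_q, p_k, n_k):
    """Return (alpha, phi, theta) for one point pair in the uvw frame at p_q."""
    diff = p_k - p_q
    d = np.linalg.norm(diff)
    u = n_q
    v = np.cross(u, diff / d)
    w = np.cross(u, v)
    alpha = float(np.dot(v, n_k))
    phi = float(np.dot(u, diff) / d)
    theta = float(np.arctan2(np.dot(w, n_k), np.dot(u, n_k)))
    return alpha, phi, theta

def _bin(value, lo, hi, n_bins):
    """Index of the interval of [lo, hi] that contains value."""
    idx = int((value - lo) / (hi - lo) * n_bins)
    return min(max(idx, 0), n_bins - 1)

def pfh_histogram(points, normals, n_bins=5):
    """Normalized n_bins**3 histogram over all point pairs of a neighborhood."""
    hist = np.zeros(n_bins ** 3)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            a, p, t = pair_features(points[i], normals[i], points[j], normals[j])
            ia = _bin(a, -1.0, 1.0, n_bins)
            ip = _bin(p, -1.0, 1.0, n_bins)
            it = _bin(t, -np.pi, np.pi, n_bins)
            hist[ia + n_bins * ip + n_bins ** 2 * it] += 1
    total = hist.sum()
    return hist / total if total > 0 else hist

def fpfh_from_spfh(spfh_q, spfh_neighbors, distances):
    """FPFH(p_q) = SPFH(p_q) + (1/k) * sum_i SPFH(p_i) / w_i, w_i = distance."""
    w = np.asarray(distances, dtype=np.float64)
    weighted = np.asarray(spfh_neighbors, dtype=np.float64) / w[:, None]
    return np.asarray(spfh_q, dtype=np.float64) + weighted.mean(axis=0)

# Toy planar neighborhood with identical normals (illustrative values only).
pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
nrm = np.tile(np.array([0.0, 0.0, 1.0]), (3, 1))
pfh = pfh_histogram(pts, nrm)
```

The double loop in `pfh_histogram` is what makes PFH expensive for large neighborhoods; `fpfh_from_spfh` only needs one pass over the neighbors once their SPFHs are available.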

Feature Fusion
Due to the varying data types of different descriptors, we propose different descriptor fusion methods for different types of feature descriptors.

(1) The SIFT and SURF feature descriptors, as well as the PFH and FPFH descriptors, are floating-point descriptors. For this kind of descriptor, we directly splice the normalized point cloud feature descriptor after the normalized image feature descriptor to form the fusion feature descriptors SIFTPFH, SIFTFPFH, SURFPFH, and SURFFPFH.
(2) The ORB image feature descriptor is a binary string, whereas the PFH and FPFH point cloud feature descriptors are floating-point descriptors. To preserve the descriptive ability of both, the data types of the two descriptors are kept unchanged, and the descriptors are combined into a tuple, giving the fusion feature descriptors ORBPFH and ORBFPFH. Because the norm of PFH or FPFH is small, we usually multiply it by a coefficient so that the norm of the scaled PFH or FPFH descriptor is close to the length of the ORB descriptor, which increases the weight of the point cloud features.
As can be seen from the schematic figures of the fusion descriptors, fusing the two feature descriptors expands the descriptor's length, enriches the descriptor information, strengthens the constraints of the descriptor, and makes it more distinctive.
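The two fusion rules can be sketched as follows. This is an illustration under assumptions: L2 normalization is assumed for the floating-point case, and the default coefficient in `fuse_orb` is one possible reading of "close to the length of ORB features" (it rescales the point cloud descriptor so its norm equals the number of ORB bits). The function names are hypothetical.

```python
import numpy as np

def _l2_normalize(x):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    x = np.asarray(x, dtype=np.float64)
    norm = np.linalg.norm(x)
    return x / norm if norm > 0 else x

def fuse_float(img_desc, cloud_desc):
    """SIFT/SURF + PFH/FPFH: normalize both parts, then concatenate."""
    return np.concatenate([_l2_normalize(img_desc), _l2_normalize(cloud_desc)])

def fuse_orb(orb_bits, cloud_desc, scale=None):
    """ORB + PFH/FPFH: keep binary and float parts side by side in a tuple."""
    orb_bits = np.asarray(orb_bits, dtype=np.uint8)
    cloud = np.asarray(cloud_desc, dtype=np.float64)
    if scale is None:
        # Assumed coefficient: make the float norm equal the ORB bit length.
        norm = np.linalg.norm(cloud)
        scale = len(orb_bits) / norm if norm > 0 else 1.0
    return orb_bits, cloud * scale

# Placeholder descriptors with the dimensions quoted in the text.
sift_fpfh = fuse_float(np.ones(128), np.ones(33))            # 161 values
orb_part, fpfh_part = fuse_orb(np.zeros(256, dtype=np.uint8), np.ones(33))
```

Keeping the ORB part as bits rather than casting it to floats means the Hamming distance can still be used for that half of the tuple during matching.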

Feature Matching
The SIFTPFH, SIFTFPFH, SURFPFH, and SURFFPFH feature descriptors are floating-point vectors. Therefore, the Euclidean distance is used as the feature point similarity evaluation index, and the specific formula is as follows.

$$d(h_1, h_2) = \sqrt{\sum_{i=1}^{n} \left(h_{1i} - h_{2i}\right)^2}$$

where $h_1 = (h_{11}, \ldots, h_{1n})$ and $h_2 = (h_{21}, \ldots, h_{2n})$ are the feature descriptors to be registered. As mentioned earlier, the ORBPFH or ORBFPFH feature descriptor is a tuple: the Hamming distance is calculated for the ORB part, the Euclidean distance is calculated for the PFH or FPFH part, and the two distances are added to obtain the final feature distance. The Hamming distance is computed by comparing each bit of the two binary descriptors and adding 1 whenever the bits differ. The specific formula is as follows.

$$d(h_1, h_2) = \sum_{i=1}^{n_1} \left(1 - \mathrm{isBitEqual}(h_{1i}, h_{2i})\right) + \sqrt{\sum_{i=n_1+1}^{n} \left(h_{1i} - h_{2i}\right)^2}$$

where $\mathrm{isBitEqual}(h_{1i}, h_{2i})$ indicates whether the two bits are the same,

$$\mathrm{isBitEqual}(h_{1i}, h_{2i}) = \begin{cases} 1, & h_{1i} = h_{2i} \\ 0, & h_{1i} \neq h_{2i} \end{cases}$$

$n_1$ represents the length of the ORB feature descriptor, and $n$ represents the total length of the ORBPFH or ORBFPFH feature descriptor.
Then, the rough registration of the feature point is realized based on the Fast Library for Approximate Nearest Neighbors (FLANN) algorithm. Finally, the random sample consensus (RANSAC) algorithm is used to accurately register feature points.
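The distance used for the tuple descriptors can be sketched as follows. For brevity, the sketch swaps in an exhaustive nearest-neighbor search where the text uses FLANN, and it leaves out the RANSAC refinement; the function names and the tuple layout `(bits, floats)` are assumptions for illustration.

```python
import numpy as np

def euclidean_distance(h1, h2):
    """Euclidean distance between two floating-point descriptors."""
    diff = np.asarray(h1, dtype=np.float64) - np.asarray(h2, dtype=np.float64)
    return float(np.linalg.norm(diff))

def hamming_distance(b1, b2):
    """Number of positions at which two bit arrays differ."""
    return int(np.count_nonzero(np.asarray(b1) != np.asarray(b2)))

def orb_cloud_distance(f1, f2):
    """Hamming distance on the ORB part plus Euclidean on the PFH/FPFH part."""
    bits1, cloud1 = f1
    bits2, cloud2 = f2
    return hamming_distance(bits1, bits2) + euclidean_distance(cloud1, cloud2)

def match_brute_force(queries, candidates, dist):
    """For each query, return (query index, best candidate index, distance)."""
    matches = []
    for qi, q in enumerate(queries):
        d = [dist(q, c) for c in candidates]
        best = int(np.argmin(d))
        matches.append((qi, best, d[best]))
    return matches

# Two toy tuple descriptors: 4 ORB bits plus a 2-D point cloud part.
f_a = (np.array([0, 1, 1, 0], dtype=np.uint8), np.array([0.0, 0.0]))
f_b = (np.array([1, 1, 0, 0], dtype=np.uint8), np.array([3.0, 4.0]))
dist_ab = orb_cloud_distance(f_a, f_b)
matches = match_brute_force([f_a], [f_b, f_a], orb_cloud_distance)
```

Because the two terms are simply added, the scaling coefficient applied to the PFH or FPFH part during fusion directly controls how much the depth information influences the match.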

Experiment and Results
The performance of the proposed feature extraction and fusion method is verified on the YCB and Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) RGB-D datasets. The indices characterizing the performance of a descriptor are the number of extracted features and the extraction time, the number of matched features and the matching time, and the matching failure rate. The matching failure rate (MFR) is defined as follows.

$$MFR = \frac{N_{failure}}{N_{total}} \times 100\%$$

where $N_{failure}$ represents the number of frames for which matching failed, and $N_{total}$ represents the total number of frames. The resolution of the RGB-D images is 640 × 480, so the point cloud obtained from a depth image contains about 300,000 points (640 × 480 = 307,200). Such a large point cloud consumes considerable computing resources and time when the normal vectors and the PFH/FPFH descriptors are calculated. Therefore, the point cloud is downsampled to keep the number of points in the range of 2000 to 5000, which reduces the calculation time while maintaining calculation accuracy.
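The two quantities in this paragraph can be illustrated with a short sketch. The text does not state which downsampling method was used; the voxel-grid centroid filter below is an assumed stand-in, and the function names and numeric inputs are hypothetical.

```python
import numpy as np

def matching_failure_rate(n_failure, n_total):
    """MFR in percent: failed frames divided by total frames."""
    return 100.0 * n_failure / n_total

def voxel_downsample(points, voxel_size):
    """Replace all points falling in the same voxel by their centroid."""
    points = np.asarray(points, dtype=np.float64)
    keys = np.floor(points / voxel_size).astype(np.int64)
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.ravel()                  # one voxel id per input point
    n_voxels = int(inverse.max()) + 1
    sums = np.zeros((n_voxels, 3))
    np.add.at(sums, inverse, points)           # accumulate points per voxel
    counts = np.bincount(inverse, minlength=n_voxels)
    return sums / counts[:, None]

# Illustrative inputs only; they are not results from the paper.
mfr = matching_failure_rate(30, 200)
cloud = np.array([[0.0, 0.0, 0.0], [0.1, 0.1, 0.1], [1.5, 0.0, 0.0]])
reduced = voxel_downsample(cloud, voxel_size=1.0)
```

In practice the voxel size would be tuned per dataset until the reduced cloud falls in the 2000 to 5000 point range mentioned above.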
A sample image of the YCB dataset is shown in Figure 12, and its indices are shown in Table 1. A sample image of the KITTI dataset is shown in Figure 13, and its indices are shown in Table 2. Tables 1 and 2 show that the feature extraction and registration times are ordered as follows: image features < image features + FPFH < image features + PFH. Notably, the time consumed by ORBFPFH is less than that of SURF and SIFT, indicating that ORBFPFH has the potential to be applied in a real-time system.
Taking the first frame in the YCB 0024 dataset as the reference frame, the failure rates of feature matching between the reference frame and the first 200, 280, and 300 frames in the dataset are counted. The results are shown in Table 3. Taking the first frame in the KITTI fire dataset as the reference frame, the failure rates of feature matching between the reference frame and the first 100, 125, and 150 frames in the dataset are counted. The results are shown in Table 4.
In Tables 3 and 4, the feature-matching failure rate of the fused feature descriptors SIFTPFH and SIFTFPFH is much higher than that of SIFT, indicating that the point cloud feature descriptors PFH and FPFH reduce the feature-representation ability of SIFT. The feature-matching failure rate of the fused feature descriptors SURFPFH and SURFFPFH is similar to that of SURF, indicating that PFH and FPFH do little to improve the feature-representation ability of SURF. The feature-matching failure rates of the fusion feature descriptors ORBPFH and ORBFPFH are lower than those of ORB. On the test datasets, ORBPFH reduces the matching failure rate by 4.66~16.66% compared with ORB, and ORBFPFH reduces the matching failure rate by 9~20% compared with ORB, indicating that the point cloud feature descriptors PFH and FPFH improve the feature-representation ability of the ORB descriptor. Some examples in which registration succeeds with ORBPFH and ORBFPFH but fails with ORB are shown in Figures 14 and 15. The above results show that the feature extraction and fusion method proposed in this paper is suitable for fusing PFH and FPFH features with ORB features, offering a novel approach for RGB-D image matching.

Conclusions
To effectively fuse image and depth information and improve the feature-matching accuracy of RGB-D images, a robust image feature extraction and fusion method based on image feature and depth information fusion is proposed in this paper. The proposed method directly splices the image feature point descriptor with the corresponding point cloud feature descriptor to obtain the fusion descriptor of the feature points. The fusion feature descriptors are constructed from the SIFT, SURF, and ORB image feature descriptors and the PFH and FPFH point cloud feature descriptors. The performance of the fusion features is tested on the YCB and KITTI RGB-D datasets. On the test datasets, ORBPFH reduces the matching failure rate by 4.66~16.66% and ORBFPFH reduces it by 9~20% compared with ORB, and ORBFPFH has potential for real-time application. The test results show that the proposed feature extraction and fusion method is suitable for fusing ORB features with PFH and FPFH features and can improve feature representation and registration, representing a novel approach for RGB-D image matching.