An Improved Point Cloud Descriptor for Vision Based Robotic Grasping System

In this paper, a novel global point cloud descriptor is proposed for reliable object recognition and pose estimation, which can be effectively applied to robotic grasping. The viewpoint feature histogram (VFH) is widely used for three-dimensional (3D) object recognition and pose estimation in real scenes captured by depth sensors because of its recognition performance and computational efficiency. However, when an object has a mirrored structure, VFH often cannot distinguish poses that are mirrored relative to the viewpoint. To solve this problem, this study presents an improved feature descriptor named the orthogonal viewpoint feature histogram (OVFH), which contains two components: a surface shape component and an improved viewpoint direction component. The improved viewpoint component is calculated from a vector orthogonal to the viewpoint direction, which is obtained from the reference frame estimated for the entire point cloud. An evaluation of OVFH on a publicly available data set indicates that it improves the ability to distinguish mirrored poses while preserving object recognition performance. The proposed method uses OVFH to recognize objects and register them against the database, and obtains precise poses with the iterative closest point (ICP) algorithm. The experimental results show that the proposed approach can effectively guide a robot to grasp objects with mirrored poses.


Introduction
Three-dimensional (3D) machine vision is a key technology in the field of robotics. Although 3D vision technology [1,2] emerged later than two-dimensional (2D) vision technology [3,4], it offers advantages that 2D vision lacks for complex visual tasks in 3D space. For example, a 3D point cloud provides a wealth of geometric information (3D coordinates, curvature variations, surface normals, depth boundaries) and photometric information (color, color variations, transparency, reflectance intensity), which helps in recognizing objects with little appearance information (e.g., textureless objects) and in directly estimating the full 6 degrees of freedom (DOF) object pose. In addition, under unfavorable lighting conditions, 3D data provided by infrared laser technology can achieve better results than 2D images. In recent years, real-time 3D sensors such as the Microsoft Kinect and Asus Xtion have become low-cost consumer devices accessible to ordinary users. These sensors can generate colored 3D point clouds of the surfaces in a given scene in real time, which has also promoted research on 3D object recognition and registration.

Improved Global Feature Descriptor
In this section, VFH is reviewed and used to derive the proposed OVFH, which is consistent with our goal of performing well in object recognition and distinguishing mirrored poses. The specific calculation procedures are detailed in the following subsections.

Global Feature Descriptor VFH
The object's global feature descriptor is a high-dimensional representation of the object's 3D shape and is designed for object recognition and pose retrieval. VFH [8] is an effective point cloud feature used for object recognition and 6DOF pose estimation. VFH is a combined histogram containing a viewpoint direction component and an extended FPFH [5], which represents the distributions of the four angles that characterize the geometry of the point cloud. In point cloud P, let pv denote the position of the viewpoint, pc denote the center of gravity of all points pi in the point cloud, and nc denote the average of all normals ni, attached at point pc. pc and nc are calculated by Equations (1) and (2). For each pi in the cloud P, a local coordinate system is defined at point pc, as in Equation (3). Using the local coordinate system uvw defined above, the relative deviation between the centroid normal nc and the unit normal vector ni of the point pi can be represented by a set of angles, as shown in Figure 1.
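For reference, the quantities named in this paragraph are usually written as below in the VFH literature. This is a sketch under that assumption, not a quotation of Equations (1)-(3); in particular, whether nc and v are renormalized to unit length is not stated here.

```latex
\begin{align}
p_c &= \frac{1}{n}\sum_{i=1}^{n} p_i, \\
n_c &= \frac{1}{n}\sum_{i=1}^{n} n_i, \\
u &= n_c, \qquad
v = \frac{p_i - p_c}{\lVert p_i - p_c \rVert} \times u, \qquad
w = u \times v .
\end{align}
```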
The feature descriptor of each point in the point cloud can be represented by a quintuple (α, ϕ, θ, d, β), which is calculated as follows: The percentages of the values of cosα, cosϕ, θ, d, and cosβ of each point in the point cloud P falling in different bins are counted; they correspond, respectively, to the abscissa ranges [1,45], [46,90], [91,135], [136,180], and [181,308] of the feature histogram. Since the distance d between points gradually increases along the viewpoint direction and the local point density affects the result, d is often omitted for the 2.5-dimensional data acquired by the robot to improve robustness. The VFH descriptor then describes the point cloud using a total of 263 bins (three 45-bin shape components plus 128 viewpoint bins). Figure 2 shows an example of the VFH.
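The binning described above can be sketched in a few lines of NumPy. The angle definitions follow the common VFH formulation and are an assumption here, as are the renormalization of nc and v, the value ranges used for binning, and the function name `vfh_like_histogram`. This is an illustration of the 3 × 45 + 128 layout, not the reference implementation.

```python
import numpy as np

def vfh_like_histogram(points, normals, viewpoint, shape_bins=45, view_bins=128):
    """Sketch of a VFH-style histogram with the distance component d omitted."""
    points = np.asarray(points, dtype=float)
    normals = np.asarray(normals, dtype=float)
    normals = normals / np.linalg.norm(normals, axis=1, keepdims=True)

    p_c = points.mean(axis=0)                 # centroid of the cloud
    n_c = normals.mean(axis=0)
    n_c = n_c / np.linalg.norm(n_c)           # mean normal, renormalized (assumption)

    diff = points - p_c
    d = np.linalg.norm(diff, axis=1)
    u = n_c
    v = np.cross(diff, u)
    v_len = np.linalg.norm(v, axis=1)
    valid = (d > 1e-12) & (v_len > 1e-12)     # frame is undefined for degenerate points
    unit_diff = diff[valid] / d[valid, None]
    v = v[valid] / v_len[valid, None]
    w = np.cross(u, v)
    n_i = normals[valid]

    cos_alpha = np.clip((v * n_i).sum(axis=1), -1.0, 1.0)
    cos_phi = np.clip(unit_diff @ u, -1.0, 1.0)
    theta = np.arctan2((w * n_i).sum(axis=1), n_i @ u)

    view_dir = np.asarray(viewpoint, dtype=float) - p_c
    view_dir = view_dir / np.linalg.norm(view_dir)
    cos_beta = np.clip(normals @ view_dir, -1.0, 1.0)

    def fraction_hist(values, bins, lo, hi):
        # Fraction of points per bin, matching the "percentages" in the text.
        counts, _ = np.histogram(values, bins=bins, range=(lo, hi))
        return counts / max(len(values), 1)

    return np.concatenate([
        fraction_hist(cos_alpha, shape_bins, -1.0, 1.0),
        fraction_hist(cos_phi, shape_bins, -1.0, 1.0),
        fraction_hist(theta, shape_bins, -np.pi, np.pi),
        fraction_hist(cos_beta, view_bins, -1.0, 1.0),
    ])
```

With the default arguments the returned vector has 3 × 45 + 128 = 263 entries, and each of the four sub-histograms sums to one.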
Although Rusu et al. [8] obtained promising results using VFH as a 3D feature for object recognition and 6DOF pose estimation, the descriptor has limitations in accurate 3D pose estimation. For example, if the surface of the object has mirror symmetry, symmetrical poses yield similar VFHs, as shown in Figure 3, where the vector ni is the normal vector at point pi. Although the two poses are mirrored with respect to the viewpoint direction and are therefore different, the two VFHs are highly similar because the surface normals at corresponding points have similar or identical angular deviations from the viewpoint direction. In the case shown in the figure, the object category can be correctly identified using VFH, but the mirrored poses are confused.

Improved Global Feature Descriptor OVFH
To overcome this drawback, the existing VFH descriptor must be modified. Chen et al. [18] did groundbreaking work in this area and proposed the MVFH descriptor to improve the viewpoint direction component of VFH. The MVFH splits the viewpoint direction component into three components using a method similar to that used to estimate the extended FPFH. However, just as the geometric features characterized by the extended FPFH are very similar in mirrored poses, the viewpoint direction component computed in this way offers no theoretical basis for distinguishing mirrored poses.
The core idea for solving this problem is to change how the viewpoint direction component is calculated so that the angle statistics differ between mirrored poses. Following this idea, an OVFH descriptor is proposed, which is described in detail below.
First, the reference frame of the point cloud is estimated using principal component analysis (PCA), which makes it possible to compute a vector orthogonal to the viewpoint direction. The points pi of the point cloud, where i ∈ {1,..., n}, represent the view of the object. Their centroid pc is calculated according to Equation (1) and used as the origin of the object reference frame. After that, the covariance matrix C of all points is calculated from pi and pc as in the following equation: Then, the eigenvalues λj of C and the corresponding eigenvectors vj that satisfy Cvj = λjvj, where j ∈ {1, 2, 3}, are computed. The eigenvector vmin corresponding to the minimum eigenvalue λmin is taken as the z-axis of the reference frame. To eliminate the sign ambiguity of the z-axis, if the angle between vmin and the observation direction is in the range [−90°, 90°], the opposite vector of vmin is taken. This ensures that the z-axis always points toward the observer. The eigenvector vmax corresponding to the maximum eigenvalue λmax is taken as the x-axis of the reference frame. The y-axis is then computed as y = vmin × vmax. The reference frame estimated for a given partial view is shown in Figure 4.
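The reference frame estimation described above can be sketched as follows. The covariance normalization (division by n), the convention that `observation_dir` points from the sensor toward the object, and the function name `estimate_reference_frame` are assumptions made for illustration.

```python
import numpy as np

def estimate_reference_frame(points, observation_dir):
    """Sketch of a PCA reference frame with the z-axis flipped toward the observer."""
    points = np.asarray(points, dtype=float)
    observation_dir = np.asarray(observation_dir, dtype=float)

    p_c = points.mean(axis=0)                     # origin of the frame
    centered = points - p_c
    C = centered.T @ centered / len(points)       # 3x3 covariance (normalization assumed)

    eigvals, eigvecs = np.linalg.eigh(C)          # eigenvalues in ascending order
    z = eigvecs[:, 0]                             # direction of smallest variance
    x = eigvecs[:, 2]                             # direction of largest variance

    # If z lies within 90 degrees of the observation direction, flip it so
    # that it points back toward the observer.
    if np.dot(z, observation_dir) >= 0:
        z = -z

    y = np.cross(z, x)                            # y = v_min x v_max
    return p_c, x, y, z
```

Because `np.linalg.eigh` returns orthonormal eigenvectors for a symmetric matrix, x, y, and z form a right-handed orthonormal frame. The sign of x remains ambiguous in this sketch, since the text above only disambiguates z.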

Once the reference frame representing the overall point cloud of the partial view has been determined, the z-axis pointing toward the observer is available. The cross product of the z-axis and the viewpoint direction is calculated by Equation (6) to obtain a vector orthogonal to the viewpoint direction. Unlike VFH [8], OVFH builds the viewpoint component as a histogram of the angles between this orthogonal vector and each normal, so that the viewpoint direction components of mirrored poses differ from each other, as shown in Figure 5.
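A small numerical check illustrates why the orthogonal vector separates mirrored poses. The sketch assumes that the orthogonal vector is the normalized cross product of the z-axis and the viewpoint direction, and that a mirrored pose is a reflection across a plane containing the viewpoint direction, so that the normals and the z-axis are reflected together. The name `view_cosines` and the synthetic normals are illustrative only.

```python
import numpy as np

def view_cosines(normals, z_axis, view_dir):
    """Return (cos_beta, cos_beta_o): cosines with the view direction and its orthogonal vector."""
    view_dir = view_dir / np.linalg.norm(view_dir)
    ortho = np.cross(z_axis, view_dir)            # orthogonal vector of the view direction
    ortho = ortho / np.linalg.norm(ortho)
    return normals @ view_dir, normals @ ortho

rng = np.random.default_rng(0)
normals = rng.normal(size=(100, 3))
normals[:, 2] = np.abs(normals[:, 2])             # synthetic normals facing a sensor on +z
normals /= np.linalg.norm(normals, axis=1, keepdims=True)

view_dir = np.array([0.0, 0.0, 1.0])
yaw = np.radians(60.0)
z_axis = np.array([np.sin(yaw), 0.0, np.cos(yaw)])   # frame z-axis of a pose yawed by +60 degrees

mirror = np.diag([-1.0, 1.0, 1.0])                # reflection across the y-z plane

cos_b, cos_bo = view_cosines(normals, z_axis, view_dir)
cos_b_m, cos_bo_m = view_cosines(normals @ mirror, mirror @ z_axis, view_dir)

same_view_component = np.allclose(cos_b, cos_b_m)        # VFH-style cosines are unchanged
flipped_ortho_component = np.allclose(cos_bo_m, -cos_bo) # orthogonal cosines change sign
```

The cosines with the viewpoint direction are identical for the two poses, while the cosines with the orthogonal vector change sign, because a cross product reverses under reflection. The corresponding histogram is therefore reversed rather than repeated.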
The angular deviation cosβO between the orthogonal viewpoint vector and each normal ni is calculated by Equation (7). To retain the speed and discriminative power of the FPFH, and thereby the strong recognition performance of the OVFH, the FPFH is extended to the entire object point cloud. The difference between the normal ni of each point pi and the central normal nc can be represented by cosαO, cosϕO, and θO, which represent relative pan, tilt, and yaw angles, respectively. They are given by the following equations:
In summary, the proposed OVFH descriptor contains two components: a surface shape component consisting of the extended FPFH, and a viewpoint component improved by the orthogonal vector of the viewpoint direction. By default, the OVFH uses 45 bins for each of the three values of the extended FPFH and another 128 bins for the improved viewpoint component; thus, the OVFH descriptor has 263 dimensions. Figure 6 shows the principle and result of OVFH. Figure 7 shows the VFH and OVFH computed when the object faces the viewpoint and when it deviates from the viewpoint direction by +60° and −60° in yaw, respectively. As shown in the figure, although both VFH and OVFH assign 128 bins to the viewpoint direction component, the viewpoint direction information of VFH is distributed over only 64 bins. This is because the normals of the point cloud always point toward the sensor, so their dot product with the central viewpoint direction is confined to one half of the interval [−1, 1].
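The bin-occupancy argument can be checked with synthetic values. In this sketch, the uniform samples stand in for real cosine values, and both the choice of the positive half-interval for VFH and the binning range [−1, 1] are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: sensor-facing normals give one-signed cosines for VFH,
# while cosines with the orthogonal vector can take either sign.
cos_beta = rng.uniform(0.001, 1.0, size=5000)
cos_beta_o = rng.uniform(-1.0, 1.0, size=5000)

vfh_view, _ = np.histogram(cos_beta, bins=128, range=(-1.0, 1.0))
ovfh_view, _ = np.histogram(cos_beta_o, bins=128, range=(-1.0, 1.0))

occupied_vfh = np.count_nonzero(vfh_view)     # at most 64 of the 128 bins
occupied_ovfh = np.count_nonzero(ovfh_view)   # can use bins on both sides of zero

descriptor_length = 3 * 45 + 128              # three shape components plus viewpoint bins
```

Under these assumptions the VFH viewpoint histogram leaves its lower 64 bins empty, whereas the OVFH viewpoint histogram can use the full range.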

Visual Guidance Algorithm for the Robotic Grasping System
The robotic visual grasping algorithm includes two phases, offline and online, as shown in Figure 8. In the offline phase, a database covering the complete set of poses of the experimental objects is created by changing the sensor viewpoint. The object poses under each viewpoint are combined with the available grasping poses to teach the robot how to grasp the object in different poses. All relevant information is then stored in the database, including point clouds, feature descriptors, classification information, and grasping poses for each viewpoint. In the online phase, the scene point cloud is captured using a depth camera. After the scene point cloud has been filtered and segmented, the global feature descriptor of the object is calculated and matched against the database. The recognition result is the database sample whose feature histogram is most similar to that of the object. Finally, the iterative closest point (ICP) algorithm [19] is used to calculate the precise pose, from which the trajectory and pose for the robotic grasping operation are generated.

Creation of the Database
The database consists mainly of two parts: a multi-view point cloud database and a grasping pose database. The multi-view point cloud database is created to contain all the poses of the object to be grasped. Point cloud data from different perspectives can be captured by building a rotating platform, as in [8], but creating a training database for a reasonable number of objects with a real device is cumbersome, and it becomes difficult if all the different views and poses of an object are required. Alternatively, as described in [9], if a 3D computer-aided design (CAD) model of the object is available, a virtual camera can be placed directly around the object in a rendering system, and all desired viewpoints can be obtained without calibrating the system or running a time-consuming capture process. This method of database creation is low-cost and makes it easy to extend the object set; therefore, it is the method used in this paper.
Taking the CAD model as the center, a bounding sphere with radius r is established. A virtual depth camera [20] is placed at viewpoints uniformly selected on the spherical surface to capture the object point clouds. As shown in Figure 9a, to ensure the coverage and uniformity of viewpoint selection, viewpoints are first selected on the bounding sphere at every 15° of yaw angle and pitch angle, denoted as (ϕ, ψ). Then, the virtual depth camera is placed at each viewpoint to capture the corresponding object point cloud and to record the pose of the object model in the camera coordinate system. Finally, the point cloud data is processed offline: the OVFH descriptors of the point clouds under each viewpoint are calculated, and the feature files are saved in the database.
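The viewpoint grid can be sketched as follows. The pitch range (here −75° to 75°, excluding the poles), the axis convention, and the function name `sample_viewpoints` are assumptions, since only the 15° step is specified above.

```python
import numpy as np

def sample_viewpoints(center, radius, step_deg=15.0):
    """Sketch: camera positions on a bounding sphere at fixed yaw/pitch steps."""
    center = np.asarray(center, dtype=float)
    yaws = np.arange(0.0, 360.0, step_deg)                 # 0, 15, ..., 345
    pitches = np.arange(-90.0 + step_deg, 90.0, step_deg)  # poles excluded (assumption)

    views = []
    for pitch in pitches:
        for yaw in yaws:
            phi, psi = np.radians(yaw), np.radians(pitch)
            direction = np.array([
                np.cos(psi) * np.cos(phi),
                np.cos(psi) * np.sin(phi),
                np.sin(psi),
            ])
            # Each entry: (yaw in degrees, pitch in degrees, camera position).
            views.append((yaw, pitch, center + radius * direction))
    return views

views = sample_viewpoints(center=[0.0, 0.0, 0.0], radius=1.0)
```

With these assumed ranges the grid contains 24 × 11 = 264 viewpoints, each at distance r from the model center. An equal-angle grid is denser near the poles than near the equator, which may or may not matter for a given object set.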
Objects usually have multiple stable poses. Because of the limitations of the environment and the robot's workspace, the robot cannot grip the object with the same grasping pose in every case. Therefore, it is necessary to teach the robot how to grasp objects in different poses. Several stable robot grasping poses relative to the object are recorded in the database, and the robot grasping pose for a given object pose can be obtained by applying the rotation and translation of the object pose under the corresponding viewpoint in the multi-view point cloud database. Figure 9b shows the database that records the grasping poses.
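One common way to express this step is with homogeneous transforms: a grasp stored relative to the object is chained with the object pose in the camera frame. The sketch below assumes 4 × 4 homogeneous matrices; the names `T_cam_obj` and `T_obj_grasp` and the numeric values are hypothetical.

```python
import numpy as np

def make_transform(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a translation."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def rot_z(angle_deg):
    """Rotation about the z-axis by angle_deg degrees."""
    a = np.radians(angle_deg)
    return np.array([
        [np.cos(a), -np.sin(a), 0.0],
        [np.sin(a),  np.cos(a), 0.0],
        [0.0,        0.0,       1.0],
    ])

# Object pose in the camera frame (e.g. from the matched database view).
T_cam_obj = make_transform(rot_z(90.0), [0.1, 0.0, 0.8])
# Taught grasp pose expressed relative to the object.
T_obj_grasp = make_transform(np.eye(3), [0.05, 0.0, 0.0])

# Grasp pose in the camera frame for this particular object pose.
T_cam_grasp = T_cam_obj @ T_obj_grasp
grasp_position = T_cam_grasp[:3, 3]
```

In this example the grasp offset of 0.05 along the object's x-axis ends up along the camera's y-axis, because the object is rotated by 90° about z.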

Object Recognition and Pose Estimation
A 640 × 480 pixel RGB-D image captured by the Kinect v1 sensor is processed with the Point Cloud Library (PCL) and converted into a point cloud containing 307,200 points (see Figure 10a). The cloud requires initial filtering before segmentation. First, invalid points (NaN), which carry no depth information because of factors such as specular surfaces, occlusion, or transparency and are therefore useless for 3D processing, are removed. Then, a passthrough filter is used to remove all points located outside a defined range. Experiments have shown that small objects cannot be identified reliably outside the range of 0.4-1.5 m from the sensor. Therefore, the cut-off distances of the passthrough filter along the Z-axis are set to this range. Appropriate X-axis and Y-axis ranges are set to confine the region of interest (ROI) to the graspable workspace of the robotic arm (see Figure 10b). Then, the random sample consensus (RANSAC) algorithm is used to detect the principal plane of the remaining point cloud and remove its inlier points, that is, to remove the ground points. Finally, the cluster of the target object is obtained by Euclidean cluster segmentation, and noise clusters are eliminated by setting an appropriate threshold on the number of points per cluster (see Figure 10c).
Once the target point cloud in the scene has been obtained through the above preprocessing, its OVFH feature is calculated. The obtained OVFH descriptor is compared with the multi-view point cloud database by a k-nearest neighbor search based on a k-dimensional (k-d) tree, and the best match is selected as the object recognition result with a rough pose estimate. Then, the point cloud of the best-matching pose is translated to the centroid of the target object and iteratively refined by the ICP algorithm.
This algorithm iteratively updates the rigid transformation matrix between the two point clouds to minimize their distance until the iteration error falls below a threshold or the number of iterations exceeds the maximum. Figure 11 shows the registration of the template point cloud to the target point cloud. Finally, the robotic manipulator grasps the object based on the object pose refined by ICP.
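The refinement loop described above can be sketched as a minimal point-to-point ICP. This is an illustration rather than the PCL implementation: it uses a brute-force nearest-neighbor search instead of a k-d tree, an SVD-based (Kabsch) rigid fit, and the hypothetical names `best_rigid_transform` and `icp`. The tolerance and iteration limit are placeholder values.

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Least-squares R, t such that R @ src_i + t is close to dst_i (SVD/Kabsch)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection in the fitted rotation.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = dst_c - R @ src_c
    return R, t

def icp(template, target, max_iter=50, tol=1e-8):
    """Align template to target; returns R, t with template @ R.T + t close to target."""
    template = np.asarray(template, dtype=float)
    target = np.asarray(target, dtype=float)

    # Initial step from the text: move the template onto the target centroid.
    R_total = np.eye(3)
    t_total = target.mean(axis=0) - template.mean(axis=0)
    moved = template + t_total

    for _ in range(max_iter):
        # Brute-force closest target point for every template point.
        d2 = ((moved[:, None, :] - target[None, :, :]) ** 2).sum(axis=2)
        nn = d2.argmin(axis=1)
        err = np.sqrt(d2[np.arange(len(moved)), nn]).mean()
        if err < tol:                       # error below threshold: stop
            break
        R, t = best_rigid_transform(moved, target[nn])
        moved = moved @ R.T + t
        R_total = R @ R_total               # accumulate the overall transform
        t_total = R @ t_total + t
    return R_total, t_total
```

As with any ICP variant, the result depends on the initial pose: a good rough pose from descriptor matching keeps the closest-point correspondences mostly correct from the first iteration.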

Experimental Results on the Data Set
The hardware used in the evaluation experiment was a computer with Intel Core i5-7500 CPU @ 3.40 GHz processor and 16 GB RAM. In order to prove that the proposed descriptor OVFH is improved in pose retrieval compared with VFH, a publicly available dataset [21] was used for test experiments. Each object in the data set has 600 point clouds which are captured from five polar angles and 120 turntable positions, with an azimuth equidistance of 3°. The 12 objects shown in Figure  12 were selected from the BIGBIRD data set for a comparative experiment of pose retrieval using VFH and OVFH. A complete pose database (12 * 5 * 24 = 1440 in total) was created by selecting poses at 24 equally spaced 15° azimuths from each polar angle of each object. In order to verify the pose identification ability of OVFH and VFH, 24 point clouds with mirrored poses of each object were selected as the test set. Figure 13 shows the object recognition accuracy using two kinds of feature descriptors. As far as the test was concerned, OVFH had similar accuracy to VFH in object The OVFH feature of the target point cloud is calculated after the target point cloud in the scene is obtained through the above preprocessing. The obtained OVFH descriptor is compared with the multi-view point cloud database by the k-nearest neighbor search based on K dimension (K-D) tree, and the winning result is selected as the object recognition result with rough pose estimation. Then, the point cloud of winning pose is translated to the centroid of the target object and iteratively optimized by ICP algorithm. This algorithm iteratively modifies the rigid transformation matrix between two point clouds to minimize their distance until the iteration error is less than the threshold or the current iteration number is greater than the maximum number. Figure 11 shows the registration effect of the template point cloud and the target point cloud. 
Finally, the robotic manipulator grasps the object based on the object refined pose after using ICP. The OVFH feature of the target point cloud is calculated after the target point cloud in the scene is obtained through the above preprocessing. The obtained OVFH descriptor is compared with the multi-view point cloud database by the k-nearest neighbor search based on K dimension (K-D) tree, and the winning result is selected as the object recognition result with rough pose estimation. Then, the point cloud of winning pose is translated to the centroid of the target object and iteratively optimized by ICP algorithm. This algorithm iteratively modifies the rigid transformation matrix between two point clouds to minimize their distance until the iteration error is less than the threshold or the current iteration number is greater than the maximum number. Figure 11 shows the registration effect of the template point cloud and the target point cloud. Finally, the robotic manipulator grasps the object based on the object refined pose after using ICP.

Experimental Results on the Data Set
The evaluation experiment was run on a computer with an Intel Core i5-7500 CPU @ 3.40 GHz and 16 GB RAM. To show that the proposed OVFH descriptor improves pose retrieval compared with VFH, a publicly available data set [21] was used for the tests. Each object in the data set has 600 point clouds, captured from five polar angles and 120 turntable positions with an azimuth spacing of 3°. The 12 objects shown in Figure 12 were selected from the BIGBIRD data set for a comparative pose retrieval experiment using VFH and OVFH. A complete pose database (12 × 5 × 24 = 1440 point clouds in total) was created by selecting poses at 24 azimuths equally spaced by 15° from each polar angle of each object. To verify the pose identification ability of OVFH and VFH, 24 point clouds with mirrored poses of each object were selected as the test set.
Figure 13 shows the object recognition accuracy of the two feature descriptors. In this test, OVFH had similar object recognition accuracy to VFH: the average recognition rates of OVFH and VFH were both 94.44% over all 12 objects. Figure 14 shows the mirrored pose distinction accuracy of the two descriptors. For mirrored poses, the average distinction rate of OVFH (95.49%) was markedly higher than that of VFH (82.6%). Table 1 presents the computation time of each procedure for the two descriptors. Although the OVFH descriptor required an additional reference frame estimation step, it avoided false pose recognition and provided a better initial pose for ICP, which greatly reduced the time spent refining the pose. Table 1 shows that the average computation time was 856.866 ms when object recognition and rough pose estimation were performed using VFH and the pose was further refined using ICP. OVFH reduced the average computation time to 546.212 ms by reducing the number of ICP iterations.
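The counts and timings quoted above can be cross-checked with a few lines of arithmetic. The sketch below assumes that the turntable azimuths start at 0° and that the database keeps every azimuth divisible by 15°; the text does not state which 24 azimuths were actually chosen, so the selection rule and all variable names are illustrative only.

```python
# Figures taken from the text.
AZIMUTH_STEP_DEG = 3       # spacing between turntable positions
TURNTABLE_POSITIONS = 120  # positions per polar angle
POLAR_ANGLES = 5
NUM_OBJECTS = 12
DATABASE_STEP_DEG = 15     # spacing of the poses kept in the database

# Assumption: azimuths start at 0 degrees.
all_azimuths = [i * AZIMUTH_STEP_DEG for i in range(TURNTABLE_POSITIONS)]
database_azimuths = [a for a in all_azimuths if a % DATABASE_STEP_DEG == 0]

clouds_per_object = POLAR_ANGLES * TURNTABLE_POSITIONS
database_size = NUM_OBJECTS * POLAR_ANGLES * len(database_azimuths)

# Relative saving in average computation time (Table 1 values from the text).
vfh_ms, ovfh_ms = 856.866, 546.212
time_saving = (vfh_ms - ovfh_ms) / vfh_ms

print(clouds_per_object, len(database_azimuths), database_size)
print(f"relative time saving: {time_saving:.1%}")
```

Under these assumptions the script reproduces the 600 point clouds per object, the 24 azimuths per polar angle, and the 1440 database entries given in the text, and it expresses the drop from 856.866 ms to 546.212 ms as a relative saving.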
The distance root mean squared error (RMSE) of corresponding point pairs in the point cloud registration process is used to describe the pose estimation error. Table 1 shows that OVFH obtains a higher average pose estimation precision by providing better initial values for ICP. Taken together, Figures 13 and 14 and Table 1 indicate that OVFH enhances the ability to distinguish mirrored poses while preserving the identification capability of VFH, and that it is more computationally efficient and more precise when combined with ICP for accurate pose estimation. These experimental results show that the proposed feature (OVFH) can improve the performance of pose retrieval.
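For reference, the RMSE over corresponding point pairs can be computed as in the following sketch. It assumes the correspondences have already been established by the registration step; the function name and the pair representation are hypothetical.

```python
import math


def registration_rmse(pairs):
    """Root mean squared distance over corresponding 3D point pairs.

    `pairs` is a list of (p, q) tuples, where p is a point of the aligned
    template cloud and q is its matched point in the target cloud.
    """
    if not pairs:
        raise ValueError("at least one point pair is required")
    squared = sum(
        sum((a - b) ** 2 for a, b in zip(p, q)) for p, q in pairs
    )
    return math.sqrt(squared / len(pairs))
```

A lower value means that the registered template lies closer to the observed target cloud, which is why this quantity is used as a proxy for pose estimation error.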
Figure 15 shows the hardware setup of the robotic grasping experiment. A KUKA Youbot robot was used to grasp objects, and a Microsoft Kinect v1 was mounted on the robot's stand to capture point clouds for robotic vision. The computer used for object recognition and pose estimation was equipped with an Intel i5 CPU and 16 GB RAM. The eight objects used in the grasping experiment are shown in Figure 16, where objects A-E have a mirrored structure. A multi-view point cloud database was built from the CAD models of all experimental objects, and the robot was taught how to grasp objects in different poses.
Experiments were performed on the eight objects, each of which was placed in 10 initial poses in the robot's workspace. Figure 17 shows the results of object recognition and pose estimation.
As shown in the figure, the point cloud collected online was precisely registered with the recognition result from the database. To test the effectiveness of the proposed OVFH descriptor, the first nine matching scores were used to assess how well mirrored poses were distinguished, as shown in Figure 18. Figure 18a,b show the object recognition results using the VFH and OVFH descriptors, respectively, when object A was placed in a pose of −60° relative to the viewpoint. The result in the lower left corner was the best match. With VFH, the true pose and its mirrored poses alternated in the ranking, and a false positive was generated for the mirrored pose. With OVFH, the true pose was correctly identified, and the matching scores of the mirrored poses differed substantially from that of the true pose, which effectively avoids the false positive. The experimental results for the eight objects are recorded in Table 2. For objects F-H, which have no mirrored structure, VFH and OVFH exhibited similar performance. For objects A-E, which have a mirrored structure, VFH could not distinguish the mirrored poses. Identifying the wrong initial pose increases the ICP registration time and the pose estimation error, and can even lead to registration failure. Since OVFH avoided false pose identification, ICP converged faster when refining the pose, and the average computation time was reduced to 0.523 s. At the same time, correct pose recognition produced a better fit between the model point cloud and the target point cloud, and therefore a higher pose estimation accuracy. After object recognition and registration, the refined pose was used to guide the robot to grasp the object. Figure 19 shows the grasping results for an object with mirrored poses.
Figure 19. The grasping results for an object with mirrored poses.
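The inspection of the first nine matching scores can be illustrated with a short sketch. It assumes that a matching score is a descriptor distance (smaller is better) and that the mirrored counterpart of a −60° pose is the +60° pose; neither assumption is stated in the text. The candidate values below are toy numbers chosen for illustration, not measurements from the experiment.

```python
import heapq


def top_matches(candidates, k=9):
    """Return the k candidates with the smallest distance, best first.

    Each candidate is a (pose_deg, distance) tuple.
    """
    return heapq.nsmallest(k, candidates, key=lambda c: c[1])


def mirror_margin(ranked, true_pose_deg, mirrored_pose_deg):
    """Distance gap between the best mirrored-pose and true-pose matches.

    A positive value means the true pose ranks ahead of its mirror image;
    a value near zero means the descriptor barely separates the two.
    """
    def best(pose):
        dists = [d for p, d in ranked if p == pose]
        if not dists:
            raise ValueError(f"pose {pose} not among the ranked matches")
        return min(dists)

    return best(mirrored_pose_deg) - best(true_pose_deg)


# Toy distances (hypothetical) for a query placed at -60 degrees.
toy = [(-60, 12.0), (60, 48.0), (-45, 20.0), (45, 51.0), (-75, 23.0)]
ranked = top_matches(toy)
margin = mirror_margin(ranked, -60, 60)
```

A descriptor that separates mirrored poses well gives a large positive margin, while a descriptor for which the true and mirrored poses alternate in the ranking gives a margin close to zero or negative.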

Conclusions and Future Work
To correctly distinguish mirrored poses relative to the viewpoint, this paper proposes an effective global feature descriptor, OVFH, and applies it successfully to object recognition and pose estimation. The proposed method computes a vector orthogonal to the viewpoint direction by using a reference frame estimated for the entire point cloud. This orthogonal vector is used to improve the viewpoint component of the feature descriptor. Experimental results on a public data set show that the OVFH descriptor characterizes object poses well and enhances the ability to distinguish mirrored poses. Based on the OVFH descriptor, an object recognition and pose estimation method for a vision-guided robotic grasping system is designed. The experimental results show that the proposed vision-guided robotic grasping method can effectively distinguish mirrored poses and guide the robot to grasp different objects.
In future work, the proposed feature descriptor can be extended with a color description so that it can recognize objects with the same geometry but different patterns. Combining the proposed idea of calculating an orthogonal viewpoint direction with other feature descriptors will also be studied to obtain better results.
