RGB-D Image Processing Algorithm for Target Recognition and Pose Estimation of Visual Servo System

This paper studies the control performance of a visual servoing system under a planar camera and an RGB-D camera. Its contribution is to strengthen performance indicators of the visual servoing system, such as real-time capability and accuracy, through rapid identification of the target in RGB-D images and precise measurement in the depth direction. Firstly, color images acquired by the RGB-D camera are segmented based on optimized normalized cuts. Next, the gray scale is restored according to the histogram feature of the target image. Then, the depth information of the obtained 2D image and the enhanced gray image information are merged using the distortion coefficients to complete the target pose estimation based on the Hausdorff distance, and the current image pose is matched with the target image pose. The end angle and the speed of the robot are calculated to complete a control cycle, and the process is iterated until the servo task is completed. Finally, the accuracy and real-time performance of the control system based on the proposed algorithm are tested under a position-based visual servoing system. The results validate that the proposed RGB-D image processing algorithm delivers the intended performance of the visual servoing system in both aspects.


Introduction
In recent years, the use of robotic operations based on visual instructions in modern manufacturing has rapidly increased. Visual servoing, as an active vision technology, has attracted wide attention. However, the two main visual servoing methods, position-based visual servoing (PBVS) and image-based visual servoing (IBVS), both have bottlenecks that constrain the development of this field. After successfully identifying the target object, PBVS needs to reconstruct the target object in three dimensions, while IBVS needs to approximate the distance between the end of the arm and the target image. Both therefore place high precision requirements on the depth information of the target image in the camera field of view. Researchers have done a lot of work to resolve these issues [1,2], but some problems remain unresolved. Single-camera attitude estimation generally cannot produce a transformation matrix with absolute scale. Stereo attitude estimation, which uses the stereo disparity relationship to calculate depth, requires at least two cameras, which increases system cost against the low-cost development trend of modern smart factories [3,4]. In addition, this method is limited by the distance between the two cameras: if the distance is too small, the estimation accuracy for long-distance targets is poor [5].
RGB-D cameras (Leap-Motion, Kinect, RealSense, uSens, and ThisVR) have been applied in many fields due to their ability to capture both RGB images and depth information [6,7]. These features provide a reference for solving the target depth estimation problem in visual servoing systems [8]. For

System Structure
The system structure of the proposed platform is shown in Figure 1. First, the color and the depth images of the scene are acquired by the RGB-D camera, and the color image is subjected to scene segmentation, feature extraction, and gray scale restoration. Next, the object recognition module detects the foreground objects in the scene and records them in the database. The feature descriptors are matched with the target feature model to identify the objects of interest. The 3D reconstruction of the target image is completed by combining the distortion coefficients with the depth information of the image pixels. Lastly, the current image is matched with the target image to complete the servo process. During this process, the manipulator performs an inverse kinematics calculation based on the difference between the current and target images to obtain the angle and the speed of the robot. The angle and the speed commands are then used to control the arm of the robot in an iterative cycle. The servo process utilizes the depth ranging module of the RGB-D camera to improve the depth estimation accuracy, optimize the image processing speed, and increase the system stability.
Sensors 2020, 20, x FOR PEER REVIEW 3 of 16

Target Recognition
This section describes the processing steps of the proposed RGB-D sensor-based 3D object recognition and matching module, which is mainly composed of five units: scene segmentation, feature extraction and description, 3D model matching, pose estimation, and the control law of visual servoing. After the color image is acquired and binarized, the scene segmentation unit removes the background points from the binarized image. Then, the feature description unit constructs the feature descriptors of all foreground objects in the scene. Lastly, the descriptors are matched against the fused depth ranging information to complete the 3D reconstruction of the current target object and to match the target pose.

Scene Segmentation
Scene segmentation is the primary task of target detection and recognition. Its purpose is to separate the foreground target from the planar background area of the scene in order to simplify or change the representation of the scene, make the scene easier to understand and recognize, and improve the computational efficiency of the subsequent feature extraction and description process. It has a wide range of applications in scene recognition, positioning, and even mapping for industrial robots [14,15]. In recent years, a large number of image segmentation methods have been proposed, such as graph theory-based methods [16], clustering algorithm-based methods [17], digital image watermarking [15], and watershed transformations [18]. However, these methods are not ideal for segmenting objects whose colors are similar to each other or identical to the background of the scene. The reason is that the depth information is lost when the 3D scene is projected onto the 2D plane, so the information obtained from a single image is insufficient.

In this paper, considering the requirements of the visual servoing system for estimating image depth information, the working environment of an industrial scene camera is restored to the maximum extent, and the camera depth module is adopted to obtain the depth information of the target image in the scene.
The proposed scene segmentation algorithm is shown in Figure 2. Suppose a general scene $S$ is composed of $n$ points $p_i$ $(i = 1, 2, \cdots, n)$ characterized by color and depth information. First, the color and the depth information need to be unified in an efficient way. In addition, a uniform color space makes the distances in each color component comparable, thus simplifying the clustering of the 3D vectors related to the color information. In this paper, RGB space is selected as the expression space of the color information; that is, the color information of each scene point $p_i$ $(i = 1, 2, \cdots, n)$ is represented by a 3D vector as:
$$p_i^{c} = \left[ R(p_i),\; G(p_i),\; B(p_i) \right]^{T}, \quad i = 1, 2, \cdots, n \tag{1}$$
Since the gray value of the depth map is proportional to the distance of the actual object from the depth camera, the depth information can be simply represented by the gray value $Z(p_i)$ of each point $p_i \in S$ in the depth map. As the depth map is used in this paper, depth information is represented by the vector shown in Equation (2):
$$p_i^{g} = \left[ Z(p_i) \right], \quad i = 1, 2, \cdots, n \tag{2}$$
Because the scene acquisition system provides the relative rather than the absolute distance of the scene depth, the ideal scene segmentation algorithm should be insensitive to the relative scaling of the depth information described by the point cloud or depth map. To ensure that the segmentation algorithm is independent of depth image scaling, the depth vector is normalized by the standard deviation $\sigma_z$ of the scene point depth data. The final depth description is the vector shown in Equation (3):
$$\left[ z(p_i) \right] = \left[ \frac{Z(p_i)}{\sigma_z} \right] \tag{3}$$
where $z(p_i)$ represents the normalized depth, which is a key parameter in the estimation of the visual servo Jacobian matrix; the derivation of this parameter can be found in [19]. To balance the two types of information (color and depth) in the merging process, the color information vector in Equation (1) is normalized by the average $\sigma_c$ of the standard deviations of the three components R, G, and B. The final color information is described as:
$$\left[ \bar{R}(p_i),\; \bar{G}(p_i),\; \bar{B}(p_i) \right]^{T} = \left[ \frac{R(p_i)}{\sigma_c},\; \frac{G(p_i)}{\sigma_c},\; \frac{B(p_i)}{\sigma_c} \right]^{T}, \quad \sigma_c = \frac{\sigma_R + \sigma_G + \sigma_B}{3} \tag{4}$$
From the above normalized depth and color information vectors, the representation of each scene point is obtained as:
$$p_i^{f} = \left[ \bar{R}(p_i),\; \bar{G}(p_i),\; \bar{B}(p_i),\; \lambda z(p_i) \right]^{T} \tag{5}$$
where $\lambda$ is a parameter that balances the color and the depth information. In the experiments, $\lambda = 1$ is adopted, chosen empirically for best results. Due to space limitations, this paper describes the scene representation only briefly. A detailed description of the normalized cuts algorithm based on the original image segmentation obtained by the RGB-D sensor is available in [20].
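The normalization and fusion steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `fuse_color_depth` and the array layout (one row per scene point) are assumptions made for illustration, and the normalized cuts clustering itself is not shown.

```python
import numpy as np

def fuse_color_depth(rgb, depth, lam=1.0):
    """Build one 4-D feature vector [R, G, B, lam * z] per scene point.

    rgb:   (n, 3) array with the R, G, B values of the n scene points.
    depth: (n,) array with the depth-map gray value of the same points.
    lam:   weight balancing color against depth (the text adopts 1).
    """
    rgb = np.asarray(rgb, dtype=float)
    depth = np.asarray(depth, dtype=float)

    # Depth is normalized by its standard deviation over the scene.
    sigma_z = depth.std()
    # Color is normalized by the mean of the three channel standard deviations.
    sigma_c = rgb.std(axis=0).mean()
    if sigma_z == 0 or sigma_c == 0:
        raise ValueError("constant color or depth cannot be normalized")

    return np.column_stack([rgb / sigma_c, lam * depth / sigma_z])
```

Because the depth column is divided by its own standard deviation, multiplying every depth value by a constant leaves the fused vectors unchanged, which is the scale insensitivity the text asks of the segmentation input.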
Figure 3 shows the image segmentation results for the same scene obtained using different algorithms, where Figure 3a is the original image and Figure 3b-f are the results of traditional segmentation algorithms. Figure 4a,b shows the binarized images obtained using the conventional algorithm and the clustering algorithm in this paper, respectively. Comparing the segmentation results in Figures 3 and 4 shows that the binarization algorithm based on normalized cuts separates the target object from other objects using the clustering method and effectively reduces the interference noise around the target object. After the segmentation based on normalized cuts is completed, the feature values of the target object are extracted, the grayscale image is restored while the target feature values remain unchanged, and target recognition is performed on this basis.

Feature Extraction and Description
After the image segmentation task is completed, the image information needs to be processed further, for example by image denoising, color correction, signal-to-noise ratio improvement, smoothing, sharpening, positioning, and separation. The goals are to restore or reconstruct degraded images, improve image fidelity, and provide clear images for viewers. In terms of image processing content, two problems must be solved: the first is to determine whether the image contains the required information, and the second is to determine the information required to further extract and identify the image features. Based on the characteristics of the binarized images shown in Figure 4, the processing flow in Figure 5 is employed.

The image enhancement module follows. Image enhancement refers to processing that highlights certain information in an image according to specific requirements while reducing or removing unnecessary information. The purpose is to improve the image quality and make certain information in the image easier to recognize for analysis and use [7]. Currently available image enhancement methods can be divided into two major categories: spatial domain methods and frequency domain methods. The steps of the image enhancement algorithm used in this paper are as follows. First, a histogram of the target image is constructed. A histogram is a statistical graph expressing the gray scale distribution of an image, where the abscissa is the gray value $r$ and the ordinate is the gray value probability density $p(r)$:

$$p(r_i) = \frac{\text{number of pixels with gray value } r_i}{\text{total number of pixels of the image}}, \quad i = 0, 1, \cdots, n \tag{6}$$
Then, the image histogram is equalized. Let the total number of pixels of the original image be $N$, with $L$ ($L = 256$) gray levels, and let the frequency of the $i$-th gray level $r_i$ be $n_i$. If the gray level of the original image at a pixel is $r_i$, the gray level of the histogram-equalized image at the same pixel is given by the cumulative distribution:
$$s_i = \sum_{k=0}^{i} \frac{n_k}{N} = \sum_{k=0}^{i} p(r_k) \tag{7}$$
The gray value after histogram equalization is then transformed back to the gray-level range of the original image, and the gray value of each pixel is reset to reconstruct the final binarized image.
The image processing results are shown in Figure 6, where (a) is the original gray scale image and its histogram, and (b) is the target image after equalization and its histogram. The figure shows that the equalized histogram effectively highlights the target image information in the environment while reducing unnecessary image information around the periphery, which improves the target image quality and makes the interest points in the image easier to recognize.
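The equalization steps can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' code: it assumes an 8-bit grayscale input (`uint8`, 256 levels), uses the standard cumulative-distribution mapping scaled to the range 0 to 255, and the function name `equalize_histogram` is hypothetical.

```python
import numpy as np

def equalize_histogram(gray, levels=256):
    """Histogram-equalize an 8-bit grayscale image (integer array in [0, levels))."""
    gray = np.asarray(gray)
    # n_k: number of pixels at each gray level; p(r_k) = n_k / N.
    counts = np.bincount(gray.ravel(), minlength=levels)
    p = counts / gray.size
    # Cumulative distribution of the gray levels, values in [0, 1].
    cdf = np.cumsum(p)
    # Map the cumulative values back onto the gray-level range [0, levels - 1].
    mapping = np.round((levels - 1) * cdf).astype(np.uint8)
    # Reset every pixel to its equalized gray value.
    return mapping[gray]
```

For a low-contrast patch whose pixels occupy only the neighboring levels 50, 51, and 52 in equal numbers, the mapping spreads them to 85, 170, and 255, which is the contrast stretching effect described for Figure 6.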

Model Matching
After the image of interest is extracted with the RGB-D image algorithm, a descriptor matching process needs to be performed to find the 3D correspondence between the detected model object and the recorded model. Precisely positioning moving objects with known 3D models from a monocular gray image is an important issue in machine vision and a primary task of visual detection and target tracking based on 3D models. The key to positioning lies in establishing a correspondence between the target image and the model [21].
The Hausdorff distance can be used to measure the degree of matching between two sets of points. Since it does not need an exact point-to-point correspondence between the model and the image, it is more fault tolerant than other matching methods [22]. In this section, a simple and feasible boundary extraction is used together with the MHD (modified Hausdorff distance) algorithm to filter out the influence of boundary noise points on the model matching. At the same time, to avoid searching for visible points among the image and model projection pixels, the displacement and the rotation of the model are defined directly in 3D space, and a global matching optimization function between the target image and the planar projection of the 3D model is established from the coordinate transformation and the Hausdorff distance. Finally, the simulated annealing algorithm is used to obtain the optimal solution of the pose and improve the accuracy of matching.
The derivation of the matching optimization function based on the Hausdorff distance is described below. The establishment of the model and the solution of the matching parameters can be found in [21]. Let $T = \{ t_j \mid 1 \le j \le I \}$ be the boundary contour point set extracted from the target interest point, $M = \{ m_i \mid 1 \le i \le L \}$ be the line segment collection of the wireframe model, and $R = R_2 R_1 R_0$ is the
In the above formula, $d(t_j, m_i)$ is defined as the distance from point $t_j$ to line segment $m_i$. Given point $t_j$ and line segment $m_i$ (determined by endpoints $b_1$ and $b_2$), this distance is given by Equation (11), as shown in Figure 7.
Figure 7 is a visual representation of Equation (11); it shows the distance from point $t_j$ to line segment $m_i$ under different conditions. As Equations (9)-(11) show, the objective function in Equation (9) is a continuous, nondifferentiable, multivariable function with multiple local minima in the search space. Note that in the actual calculation, some parameters in the formula can be fixed according to the situation to avoid unnecessarily complicated operations.
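To make the point-to-segment distance and its use in a Hausdorff-type matching score concrete, the following NumPy sketch may help. It is an assumption-laden illustration, not the paper's Equations (9)-(11): the clamped-projection form of the distance is the standard construction for a segment with endpoints b1 and b2, the score is the directed mean of minimum distances used in the modified Hausdorff distance, and the function names are hypothetical. Points and segment endpoints are taken to be 2D image coordinates.

```python
import numpy as np

def point_segment_distance(t, b1, b2):
    """Distance from 2-D point t to the segment with endpoints b1 and b2."""
    t, b1, b2 = (np.asarray(v, dtype=float) for v in (t, b1, b2))
    seg = b2 - b1
    length_sq = seg @ seg
    if length_sq == 0.0:
        # Degenerate segment: both endpoints coincide.
        return float(np.linalg.norm(t - b1))
    # Position of the perpendicular foot along the segment; clamping to
    # [0, 1] selects the nearer endpoint when the foot falls outside it.
    u = np.clip((t - b1) @ seg / length_sq, 0.0, 1.0)
    return float(np.linalg.norm(t - (b1 + u * seg)))

def directed_mhd(points, segments):
    """Mean over image points of the distance to the closest model segment."""
    return float(np.mean([
        min(point_segment_distance(t, b1, b2) for b1, b2 in segments)
        for t in points
    ]))
```

Averaging the per-point minima, instead of taking their maximum as in the classical Hausdorff distance, keeps a few noisy boundary points from dominating the score.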
In this paper, the simulated annealing algorithm is used to obtain the global optimal solution of the objective function. The algorithm is simple, versatile, robust, and suitable for parallel processing. The calculation is a series of iterative steps, as shown in Figure 8.
Figure 8. Simulated optimal annealing algorithm calculation process.
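A generic simulated annealing loop of the kind outlined in Figure 8 can be sketched as follows. The cooling schedule, step size, and stopping temperature are illustrative assumptions and not the settings used in the paper; the cost function is passed in by the caller, so a pose-dependent matching score could be substituted for the toy cost used in testing.

```python
import math
import random

def simulated_annealing(cost, x0, step=0.5, t_start=1.0, t_end=1e-3,
                        cooling=0.95, moves_per_temp=50, seed=0):
    """Minimize cost(x) over a list of floats by simulated annealing."""
    rng = random.Random(seed)
    x, fx = list(x0), cost(x0)
    best, fbest = list(x), fx
    temp = t_start
    while temp > t_end:
        for _ in range(moves_per_temp):
            # Propose a random perturbation of the current solution.
            cand = [xi + rng.uniform(-step, step) for xi in x]
            fc = cost(cand)
            # Always accept improvements; accept worse candidates with
            # probability exp(-(fc - fx) / temp) to escape local minima.
            if fc <= fx or rng.random() < math.exp(-(fc - fx) / temp):
                x, fx = cand, fc
                if fx < fbest:
                    best, fbest = list(x), fx
        # Geometric cooling schedule.
        temp *= cooling
    return best, fbest
```

The method needs only cost evaluations, which is why it suits a continuous but nondifferentiable objective with several local minima.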
In order to avoid duplication of content, matching results of the model are presented in the experiment section.

Pose Estimation
In the visual servoing system of this paper, the RGB-D camera or planar camera is fixed at the end of the robot arm. After the current image is captured, it needs to be matched with the target image. The angle and the speed of the robot arm are obtained by inverse kinematics, which requires a composite Jacobian matrix that relates the speed of the robot to the image change. Inspired by the registration of point cloud images [23], this section uses the RANSAC (random sample consensus) algorithm to provide an initial transformation matrix for the fine registration of the RGB-D images. RANSAC is an iterative algorithm for estimating the parameters of a mathematical model; here it is used to determine the corresponding points that satisfy a particular model and to obtain the initial transformation matrix accordingly. After the initial transformation matrix is applied to the RGB-D images, LM-ICP, an improved version of the ICP algorithm, is applied to the transformed RGB-D images and scenes. A common algorithm for the nonlinear optimization of objective functions is least squares optimization, whose final solution can be obtained using the SVD algorithm [24]. However, such norm optimization problems are sensitive to outliers because the residuals are squared. Therefore, the Huber loss function is used as the cost function for the corresponding point sets:
$$p = \{ p_0, p_1, p_2, \cdots, p_{n-1}, p_n \}, \quad q = \{ q_0, q_1, q_2, \cdots, q_{n-1}, q_n \}$$
$$e^2(n) = \begin{cases} n^2/2, & |n| \le k \\ k|n| - k^2/2, & |n| > k \end{cases} \tag{12}$$
where $n = \left\| (R \cdot p - t) - q \right\|_2$ is the distance between the corresponding points and $k$ is the distance threshold. The Huber loss function is smooth and differentiable.
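The robust cost can be illustrated with a short NumPy sketch. It uses the standard Huber loss, whose linear branch is k|n| - k^2/2, and keeps the residual sign convention (R p - t) - q from the text. The function names and the row-wise point layout are assumptions for illustration; the RANSAC initialization and the LM-ICP iterations are not shown.

```python
import numpy as np

def huber_loss(residual, k):
    """Standard Huber loss: quadratic for |n| <= k, linear beyond k."""
    r = np.abs(np.asarray(residual, dtype=float))
    return np.where(r <= k, 0.5 * r**2, k * r - 0.5 * k**2)

def registration_cost(R, t, p, q, k):
    """Sum of Huber losses over corresponding points p[i] and q[i].

    R: (3, 3) rotation, t: (3,) translation, p and q: (m, 3) point arrays.
    The residual follows the text: n_i = ||(R p_i - t) - q_i||.
    """
    n = np.linalg.norm(p @ R.T - t - q, axis=1)
    return float(huber_loss(n, k).sum())
```

With k = 1, a residual of 0.5 costs 0.125 while a residual of 2 costs 1.5 instead of the 2.0 that a pure quadratic would assign, so distant mismatches pull less strongly on the estimated pose.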
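Similar to the objective function of point cloud registration, the expression is:
$$(\hat{R}, \hat{t}) = \arg\min_{R,\, t} \sum_{i} e^2(n_i)$$
where $n_i$ is the distance between the $i$-th pair of corresponding points. The optimal $\hat{R}$, $\hat{t}$ are obtained by the LM-ICP algorithm [25]. If $(\alpha, \beta, \gamma)$ and $(t_x, t_y, t_z)$ are the rotation angles about and the translation components along the three coordinate axes, respectively, the pose of the end of the 6 DOF manipulator can be expressed as $(\alpha, \beta, \gamma, t_x, t_y, t_z)$, and the optimal transformation matrix can be expressed as:
$$\hat{T} = \begin{bmatrix} \hat{R} & \hat{t} \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix}$$
from which it follows that
$$(\alpha, \beta, \gamma) = \left( \arctan(r_{32}/r_{33}),\; \arcsin(-r_{31}),\; \arctan(r_{21}/r_{11}) \right)$$
As mentioned above, $\alpha$, $\beta$, $\gamma$ are the rotation angles about the three axes. The hand-eye relationship between the camera and the robot needs to be calibrated before the experiment.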
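The angle formulas can be checked numerically with the sketch below. It assumes that the rotation is composed as R = Rz(gamma) Ry(beta) Rx(alpha), which is the convention consistent with the stated expressions for r31, r32, r33, r21, and r11; the paper does not state the convention explicitly. The sketch also uses `arctan2` in place of `arctan` so that the quadrant of each angle is preserved. Function names are hypothetical.

```python
import numpy as np

def rotation_zyx(alpha, beta, gamma):
    """Rotation matrix Rz(gamma) @ Ry(beta) @ Rx(alpha) (assumed convention)."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    Rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    Rz = np.array([[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def pose_from_transform(T):
    """Return (alpha, beta, gamma, tx, ty, tz) from a 4x4 homogeneous transform."""
    R, t = T[:3, :3], T[:3, 3]
    alpha = np.arctan2(R[2, 1], R[2, 2])             # arctan(r32 / r33)
    beta = np.arcsin(np.clip(-R[2, 0], -1.0, 1.0))   # arcsin(-r31)
    gamma = np.arctan2(R[1, 0], R[0, 0])             # arctan(r21 / r11)
    return np.array([alpha, beta, gamma, *t])
```

The extraction is singular when beta approaches plus or minus 90 degrees, where r11, r21, r32, and r33 all vanish; this sketch does not handle that case.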

Control Law of Visual Servoing
The aim of all vision-based control schemes is to minimize an error $e(t)$, which is typically defined by
$$e(t) = s(m(t), a) - s^{*} \tag{18}$$
This control law comes from [19]. The parameters in Equation (18) are defined as follows. The vector $m(t)$ is a set of image measurements (e.g., the image coordinates of interest points or the image coordinates of the center of an object). These image measurements are used to compute a vector of $k$ visual features, $s(m(t), a)$, in which $a$ is a set of parameters that represent potential additional knowledge about the system (e.g., coarse camera intrinsic parameters or 3D models of objects). The vector $s^{*}$ contains the desired values of the features; it is constant, and changes in $s$ depend only on camera motion. Further, we consider here the case of controlling the motion of a camera with six degrees of freedom (6 DOF), e.g., a camera attached to the end effector of a six-degree-of-freedom arm.
In this paper, we describe a position-based visual servo control (PBVS), which consists of a set of 3D parameters, which must be estimated from image measurements.
Once s is selected, the design of the control scheme can be quite simple. Then, we need to design a velocity controller, to solve this problem, we should know the relationship between the time variation of s and the camera velocity. Let the spatial velocity of the camera be denoted by V c = (v c , ω c ), with v c the instantaneous linear velocity of the origin of the camera frame and ω c the instantaneous angular velocity of the camera frame. The relationship between . s and v c is given by: . s = L s v c (19) in which L s ∈ R k×6 is named the interaction matrix related to s. The term feature Jacobian is also used somewhat interchangeably in the visual servo literature. Using (18) and (19), we can get the relationship between camera velocity and the time variation of the error: . e = L e v c (20) where L e = L s . Considering v c as the input to the robot controller, and if we would like, for instance, to try to ensure an exponential decoupled decrease of the error (i.e., . e = −λe), we obtain using (20): where L + e ∈ R 6×k÷ is chosen as the Moore-Penrose pseudoinverse of L e . In the next experiment part, the computed velocity of image Jacobian is v c .
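The velocity controller described above can be sketched in a few lines. The sketch computes the camera velocity as $-\lambda$ times the Moore-Penrose pseudoinverse of $L_e$ applied to the error, and then integrates Equation (19) with a forward Euler step to show the exponential decrease of the error. The interaction matrix, the feature values, the gain, and the 30 ms step are hypothetical values chosen for illustration; in a real PBVS loop $L_e$ and $s$ would come from the pose estimation stage.

```python
import numpy as np

def pbvs_camera_velocity(L_e, s, s_star, lam):
    """Return the 6-vector camera velocity -lam * pinv(L_e) @ (s - s_star)."""
    e = np.asarray(s, dtype=float) - np.asarray(s_star, dtype=float)
    return -lam * np.linalg.pinv(L_e) @ e

# Hypothetical setup: k = 6 features and a simple diagonal interaction matrix.
L_e = np.diag([1.0, 1.0, 1.0, 2.0, 2.0, 2.0])
s_star = np.zeros(6)                                  # desired features
s = np.array([0.10, -0.05, 0.30, 0.20, 0.00, -0.10])  # current features
lam, dt = 0.5, 0.03                                   # gain and step (assumed)

e0 = s - s_star
v0 = pbvs_camera_velocity(L_e, s, s_star, lam)

# Forward Euler integration of s_dot = L_s V_c with L_s = L_e.
for _ in range(300):
    v = pbvs_camera_velocity(L_e, s, s_star, lam)
    s = s + dt * (L_e @ v)

e_final = s - s_star
print("initial error norm:", np.linalg.norm(e0))
print("final error norm:  ", np.linalg.norm(e_final))
```

Because $L_e L_e^{+}$ is the identity for this full-rank square matrix, each step scales the error by $(1 - \lambda\,dt)$, which is the discrete counterpart of $\dot{e} = -\lambda e$.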

Experiment Results
To verify the effectiveness of the proposed RGB-D image processing algorithm in the visual servoing system, this section uses an Intel RealSense SR300 (hereinafter referred to as SR300) camera as the RGB-D sensor. First, the performance of the traditional camera and the RGB-D camera in target recognition and pose estimation under the visual servoing system is compared. Then, based on the SR300 camera, the image algorithm proposed in this paper is compared with the general processing algorithm [10] in terms of system convergence speed and accuracy. The experiment covers three aspects: 1. The planar camera image algorithm and the proposed image processing algorithm are compared through the change of the manipulator translation error and rotation error during the visual servo process, which illustrates the effect of the RGB-D algorithm on servo convergence accuracy and speed; 2. The improved RGB-D image processing algorithm based on the SR300 is compared with the traditional RGB-D image processing algorithm, which shows the improvement in image recognition accuracy and the reduction in image processing load; 3. Based on the improved RGB-D image processing algorithm, static and dynamic targets are tracked and tested, respectively. To illustrate the recognition, extraction, and matching ability of the proposed image algorithm in an unstructured environment, an object with low contrast is selected as the target of interest. The experimental system is shown in Figure 9.
In the experimental system, all the parameters involved are set according to the previous description. The robot is a Panda produced by Franka Emika; the RGB-D sensor is an Intel RealSense SR300 camera connected to the image processing system via USB 3.0, and the planar camera is a Basler AC-640 connected to the image processing system via a GigE network. The target of interest is a pink rectangular wooden block; the mobile target carrier (AGV) is a homemade car based on a Raspberry Pi; the image processing platform runs a Linux 16.04 LTS operating system with a PREEMPT_RT kernel and is equipped with an Intel(R) Core(TM) i7-8700 CPU and 8 GB of installed memory. Figure 9a shows the visual servoing system based on the Intel SR300, and Figure 9b shows the visual servoing system based on the Basler AC-640. The results are presented as follows.

Comparison between Planar Camera Algorithm and Proposed RGB-D Algorithm
To test the effectiveness of the proposed algorithm, the effect of the image processing results on the convergence of the visual servo system is illustrated with different cameras under different algorithms. Figures 10 and 11 respectively show the change of the translation and rotation errors of the manipulator at different initial positions under the planar camera (AC-640) algorithm and the RGB-D (SR300) image processing algorithm proposed herein. The abscissa is the number of iterations performed by the servo process, and the ordinate is the translation change and the rotation change of the corresponding iteration, in units of m/t and rad/t (t represents an iteration cycle). Figures 10 and 11 show that, for all eight initial positions, the proposed RGB-D algorithm increases the image processing speed and reduces the servo cycle time (the planar algorithm converges after around 600 iterations, whereas our algorithm converges after around 350-400 iterations) without affecting the system convergence accuracy.

Comparison between Traditional RGB-D Algorithm and Proposed RGB-D Algorithm
This section compares the performance of the traditional RGB-D image algorithm with that of the proposed SR300-based image algorithm in the visual servoing system. As shown in Figure 12, the RGB-D camera successfully identifies the target under both algorithms, completes image processing steps such as 3D modeling and pose estimation, and performs a visual servo experiment with the same initial and target positions. Figures 12 and 13 show the change of the robot end speed in the same servo task under the traditional algorithm and the proposed algorithm, respectively. The abscissa indicates the number of servo iterations, and the ordinate indicates the change of the robot end speed. The abscissas of the two graphs show that, at the same target distance, the servo system based on the traditional RGB-D algorithm converges after 500 to 600 iterations (a single iteration takes around 55 ms), whereas the servo system based on the proposed SR300 algorithm converges after about 300 iterations (about 30 ms per iteration).
In addition, the speed changes rapidly at the beginning of the servo task, indicating that the image processing algorithm based on the SR300 camera reduces the image processing time, the computer load, and the system convergence time (to show the smoothness of the speed change more clearly, the pose difference between the initial image and the target image and the convergence threshold were reduced). As the arm speed changes in Figures 12 and 13 show, the servo system reaches the same convergence precision under the two RGB-D image algorithms, which indicates the high precision and high speed of the proposed image algorithm in the visual servo system.
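The iteration counts and per-iteration times reported above can be combined into approximate total convergence times. The following sketch only multiplies the figures quoted in the text (500 to 600 iterations at about 55 ms, and about 300 iterations at about 30 ms); the function name is hypothetical, and the results are rough estimates derived from those rounded values, not additional measurements.

```python
def convergence_time_s(iterations, ms_per_iteration):
    """Total servo time in seconds for a given iteration count and period."""
    return iterations * ms_per_iteration / 1000.0

# Figures quoted in the text for the traditional RGB-D algorithm.
traditional_low = convergence_time_s(500, 55)
traditional_high = convergence_time_s(600, 55)

# Figures quoted in the text for the proposed SR300-based algorithm.
proposed = convergence_time_s(300, 30)

speedup_low = traditional_low / proposed
speedup_high = traditional_high / proposed

print(f"traditional: {traditional_low:.1f}-{traditional_high:.1f} s")
print(f"proposed:    {proposed:.1f} s")
print(f"speed-up:    {speedup_low:.2f}x to {speedup_high:.2f}x")
```

Under these rounded figures the traditional algorithm needs roughly 27.5 to 33 s and the proposed algorithm roughly 9 s, i.e., about a threefold reduction in total servo time.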

Effect of Image Algorithm on Visual Servoing Performance
In this section, the traditional industrial camera is replaced by the RGB-D camera SR300 in order to obtain a convenient, accurate, and economical visual servoing control system. It is necessary to verify the overall performance of the visual servoing system that uses the RGB-D camera as sensor; for comparison, the image error change under the planar camera is also provided. The experimental hardware is shown in Figure 9. The system is an eye-in-hand (camera fixed at the end of the arm) PBVS system, in which the identification and matching of a static target and the recognition and tracking of a dynamic target are verified, respectively. Figure 14a,b show the image error changes for the static target and the dynamic target under the planar camera, respectively, where $\Delta_x, \Delta_y, \Delta_z$ represent the translation image errors and $\theta_x, \theta_y, \theta_z$ represent the rotation image errors. It can be seen from Figure 14 that the system converges successfully for the static target. For the dynamic target, however, the robot arm keeps following the moving target and ultimately fails to converge, because the iteration period is long and the convergence speed is slow. Figure 14c shows the identification and tracking results for the static target with the SR300 as sensor. After effectively identifying the target object, the RGB-D camera guides the robot arm slowly according to the inverse kinematics algorithm: the arm first moves from the initial position toward the target object, is adjusted after reaching the set depth position, and finally reaches the target position under the guidance of the visual image. The convergence speed is clearly faster than that of the planar camera, which supports the earlier claims of this paper. Figure 14d shows the robot arm tracking the moving target object under the SR300.
After the servo starts, the robot arm quickly tracks the moving target object (2500-6000 ms) and finally completes the servo convergence to the moving target, which again supports the claims of this paper.

Conclusions
In this paper, a new image recognition algorithm based on an RGB-D camera is proposed. By improving the binarization algorithm based on normalized cuts, the scene can be segmented effectively in a complex environment, which reduces the workload of 3D reconstruction and the computer load (a single visual servoing iteration takes less than 30 ms). Model matching and pose estimation based on the Hausdorff distance are realized by combining the smoothed target depth information with the enhanced and restored target image. The proposed RGB-D image processing algorithm improves the image matching efficiency and is applied in a PBVS system based on the SR300 camera. The accuracy of the proposed algorithm is demonstrated by comparing the manipulator translation and rotation errors in the visual servoing system with those obtained with a traditional industrial camera. Then, the traditional RGB-D image processing algorithm is compared with the proposed image processing algorithm under the SR300 camera. The results demonstrate that the proposed algorithm improves the image processing speed while maintaining the system convergence precision. Finally, for the tracking of stationary and moving targets, the overall performance of the proposed RGB-D image algorithm and the planar image algorithm in the visual servoing system is verified and analyzed. The positive impact of the proposed RGB-D image algorithm on the accuracy and speed of the system is demonstrated.
Despite this, some aspects of the RGB-D image processing still need to be improved. For example, when the target object or the point of interest is too small, environmental factors have a significant influence. Although the target object can be effectively identified in a complex environment, if the environment is filled with objects similar in color, shape, texture, or size, the image algorithm may misjudge the target. Thus, the recognition accuracy needs to be improved. The accuracy is also related to the resolution of the RGB-D camera hardware. The response speed of the visual servo system is affected by the camera frame rate, the image processing algorithm, and the system hardware configuration. Optimizing the image processing and control algorithms and upgrading the hardware (such as introducing cloud/fog/edge computing structures [4] in image processing) will further improve the system performance.