The Integration of the Image Sensor with a 3-DOF Pneumatic Parallel Manipulator

This study integrates an image sensor with a three-axis pneumatic parallel manipulator that picks and places objects automatically using feature information extracted from images by the SURF algorithm. The SURF algorithm is adopted for defining and matching the features of a target object against an object database. To mark the center of the target accurately and strengthen the feature matching results, the random sample consensus (RANSAC) method is utilized. The ASUS Xtion Pro Live depth camera, which can directly estimate the 3-D location of the target point, is used in this study. A set of coordinate estimation calibrations is developed to enhance the accuracy of target location estimation. This study also presents hand gesture recognition that exploits skin detection and noise elimination to determine the active finger count, which serves as the input signal of the parallel manipulator. The end-effector of the parallel manipulator can be moved to the desired poses according to the measured finger count. Finally, the proposed methods successfully achieve feature recognition and pick and place of the target object.


Introduction
In recent years, more and more countries have developed various kinds of robots to make human life more convenient. Abundant literature on robots has been published over several decades. For instance, robots are widely adopted in automobile, mechanical, aerospace and medical applications. In this research, an industrial manipulator of the parallel type is presented and implemented. This kind of manipulator possesses a high ratio of rigidity to weight, high stiffness, high accuracy and fast response, so parallel manipulators have become popular in diverse industries for handling complex and harsh tasks. In most robot application research, the interaction in the workspace between robots and workpieces is a critical issue; in particular, a position mismatch may cause the task to fail. In recent years, vision systems have become the most prominent means of addressing this issue. To achieve such a vision-guided system, the robot should be able to recognize the target object and determine its pose so as to grasp it. In 1988, the Harris corner detector was proposed as a feature detector [1]. Furthermore, the robot needs to modify its motion trajectory according to the target object's pose. In 2011, a 3D parallel mechanism robot with a stereo vision measurement system was presented by Chiang et al. [2,3]. The stereo vision measurement system is a non-contact measuring strategy that uses two parallel CCDs, instead of contact displacement sensors, to capture the 3D pose of the end-effector. The system can determine the location of the end-effector in the three-dimensional Cartesian coordinate system. In 2016, 3D visual data-driven spatiotemporal deformations for non-rigid object grasping using robot hands were introduced by Mateo et al. [4]. Their experiments show that the proposed method can grasp several objects in various configurations.
Recently, ASUS (Taipei, Taiwan) launched the ASUS Xtion Pro Live camera, a 3D camera system that consists of an RGB sensor and a depth sensor for capturing color images and per-pixel depth information simultaneously. This device largely resolves a major problem in vision-guided robots, namely reconstructing 3D object information from the images of a 2D camera system. Furthermore, Human-Robot Interaction (HRI) plays a critical role in accomplishing interactive tasks between humans and robots. Much research focuses on kinematics, communication, computer vision and control systems, making HRI an inherently interdisciplinary endeavor. Gesture-based interfaces hold the promise of making HRI more natural and efficient [5,6].
This paper combines the depth camera with the 3-DOF pneumatic parallel manipulator, instead of using a stereo vision system, which is more expensive and time consuming, to estimate the 3D location of objects. In addition, hand gestures are used as signals for the manipulator to grasp the desired objects. The HRI makes the entire system user-friendly. In a nutshell, a 3-DOF pneumatic parallel manipulator with an image sensor system is successfully developed and implemented.

Mechanism
The proposed parallel manipulator is a 3-DOF parallel manipulator driven by a pneumatic servo system. Figure 1 shows a photograph of the device. Three limbs driven by rodless pneumatic actuators are assembled and connected to the fixed base so that the geometric structure of the manipulator has an inverted pyramidal shape. The three sliders translate along the linear guideways through three 1-DOF prismatic joints driven by the pneumatic rodless cylinders. The moving platform is linked to each slider by 3-DOF spherical joints. Mobility analysis by the Grübler-Kutzbach formula verifies that the proposed manipulator is a 3-DOF mechanism whose moving platform possesses only translational motion. Furthermore, the 3D camera system, an ASUS Xtion Pro Live depth camera, is set up on the A axis of the parallel manipulator for non-contact measurements. The camera system can directly capture the 3D information of the object from color images and the depth of each pixel [7].

Test Rig Layout
The structure of the 3-DOF parallel manipulator, including the geometric structure and the linkage configuration, is illustrated in Figure 2. The test rig layout of the 3-DOF pneumatic parallel manipulator developed in this research is shown in Figure 3. The upper part of Figure 3 shows the pneumatic servo system that drives the 3-DOF parallel manipulator. The overall pneumatic servo system mainly contains an air pump, three proportional directional flow control valves (model MPYE-M5, Festo, Esslingen am Neckar, Germany) and three pneumatic rodless cylinders (Festo model DGC-25-500). In addition, to measure the actual position of each slider, a position sensor with 1 µm resolution is attached to each pneumatic actuator. Two pressure sensors are also installed on each cylinder to measure the pressures of the two cylinder chambers.
Sensors 2016, 16, 1026
Both the measured position signals ($y_A$, $y_B$, $y_C$) and the chamber pressure signals ($P_{1,A}$, $P_{2,A}$, ..., $P_{2,C}$) are fed back to a PC-based controller through the counters and A/D converters on the DAQ card. The input command voltages for the servo valves are sent from the analogue output ports on the DAQ card via the D/A converters. The control system is designed and realized as a real-time system with MATLAB Simulink (MathWorks), and the overall algorithms are built in Simulink using embedded MATLAB function blocks. Furthermore, the Real Time Windows Target (RTWT) automatically translates the Simulink model into C code. The control system runs on a Windows-based personal computer with a sampling frequency of 1 kHz.

Object Recognition
In this paper, the SURF algorithm, a fast feature detector and descriptor, is used to reduce the feature complexity and enhance the robustness of object recognition.

Interest Point Detection
The points of interest are detected by the Hessian-matrix approximation technique. The "Fast-Hessian" detector relies on integral images, as used by Viola and Jones [8], which greatly reduce the computational time so that objects can be detected rapidly. Integral images also fit into the more general framework of boxlets, a fast convolution algorithm proposed by Simard [9].

Integral Image
At $X = (x, y)^T$, the integral image is the sum of all pixels in the rectangular area spanned by the origin and $X$. Integral images make box-type convolution filters quick to compute: once the integral image is available, the sum over any rectangle requires only a constant number of entries of that single integral image, regardless of the filter size. The calculation time is therefore dominated mainly by the image size.
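As an illustration of the constant-cost rectangle sum described above, the following is a minimal NumPy sketch. The function names `integral_image` and `box_sum` are hypothetical and are not taken from the paper.

```python
import numpy as np

def integral_image(img):
    """Return S where S[y, x] is the sum of img[0..y, 0..x] (inclusive)."""
    img = np.asarray(img, dtype=np.float64)
    return np.cumsum(np.cumsum(img, axis=0), axis=1)

def box_sum(ii, top, left, bottom, right):
    """Sum of the original pixels in rows top..bottom and columns left..right.

    Only four entries of the integral image are read, whatever the box size.
    """
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

# Small synthetic image used only for demonstration.
img = np.arange(16, dtype=np.float64).reshape(4, 4)
ii = integral_image(img)
inner = box_sum(ii, 1, 1, 2, 2)   # same value as img[1:3, 1:3].sum()
```

The four lookups in `box_sum` are what make the box filters used by SURF independent of filter size.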

Hessian Matrix Based Interest Points
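The advantage of the SURF feature detector with the Hessian matrix is its accuracy. The Hessian matrix $H(X, \sigma)$ at a location $X = (x, y)^T$ of an image $I$ at scale $\sigma$ can be expressed as:
$$H(X, \sigma) = \begin{bmatrix} G_{xx}(X, \sigma) & G_{xy}(X, \sigma) \\ G_{xy}(X, \sigma) & G_{yy}(X, \sigma) \end{bmatrix} \qquad (1)$$
where $G_{xx}(X, \sigma)$, $G_{xy}(X, \sigma)$ and $G_{yy}(X, \sigma)$ are the convolutions of the Gaussian second-order derivatives $\frac{\partial^2}{\partial x^2} g(\sigma)$, $\frac{\partial^2}{\partial x \partial y} g(\sigma)$ and $\frac{\partial^2}{\partial y^2} g(\sigma)$ with the image $I$ at location $X$. When $G_{xx}(X, \sigma)$ and $G_{yy}(X, \sigma)$ are positive, and $G_{xy}(X, \sigma)$ is negative, the maximum will occur. In addition, $D_{xx}$, $D_{yy}$ and $D_{xy}$ are 9 × 9 box filters that approximate these derivatives. The determinant of the approximated Hessian is expressed as:
$$\det(H_{approx}) = D_{xx} D_{yy} - (0.9 D_{xy})^2 \qquad (2)$$
where 0.9 is the relative weight of the filter responses for balancing the Gaussian kernel errors.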

Scale Space Representation
Points of interest are located at various scales, and the scale space can be realized as an image pyramid. Lowe [10] showed that subtracting pyramid layers reveals the edges and blobs of an image. The scale space is divided into octaves, each of which denotes a series of filter response maps obtained by convolving the same input image with filters of increasing size. Each octave has a constant ratio between scale levels, so each layer can be determined by calculating the determinant of the approximated Hessian matrix of the same input image with a filter of increasing size. Figure 4 shows the relation between each octave and the various filter sizes. Note that the octaves overlap in order to cover all possible scales seamlessly.
If the intensity of the central pixel (marked with a cross) is higher than the intensities of all its surrounding pixels, namely the eight pixels around the feature point in its own layer and the nine pixels in each of the first and third layers (a 3 × 3 × 3 neighborhood of 27 pixels in total, including the central pixel), it is considered a local maximum [11].
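The neighborhood test described above can be sketched as follows. This is only an illustration, assuming the filter response maps of one octave are stacked into a NumPy array indexed as `[layer, row, col]`; the function name is hypothetical.

```python
import numpy as np

def is_local_maximum(responses, layer, row, col):
    """True if the sample is strictly greater than its 26 neighbours
    in the surrounding 3 x 3 x 3 block of the response stack."""
    n_layers, n_rows, n_cols = responses.shape
    # A full neighbourhood needs one sample on every side, so borders are skipped.
    if not (0 < layer < n_layers - 1 and 0 < row < n_rows - 1 and 0 < col < n_cols - 1):
        return False
    cube = responses[layer - 1:layer + 2, row - 1:row + 2, col - 1:col + 2]
    centre = responses[layer, row, col]
    # Strict maximum: the centre equals the block maximum and nothing ties with it.
    return bool(centre == cube.max() and np.count_nonzero(cube == centre) == 1)

# Synthetic three-layer stack used only for demonstration.
responses = np.zeros((3, 5, 5))
responses[1, 2, 2] = 1.0
first = is_local_maximum(responses, 1, 2, 2)    # isolated peak in the middle layer
responses[0, 2, 3] = 2.0                        # stronger neighbour in the first layer
second = is_local_maximum(responses, 1, 2, 2)
```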

Point of Interest Localization
To localize a point of interest, the blob responses of the neighborhood (denoted as $H$) are taken in each dimension around the detected maximum described above. The maximum is then located to sub-pixel and sub-scale accuracy by fitting a 3D quadratic to the scale-space blob-response map:
$$H(X) = H + \frac{\partial H^T}{\partial X} X + \frac{1}{2} X^T \frac{\partial^2 H}{\partial X^2} X \qquad (3)$$
where $X = (x, y, \sigma)^T$ are the scale-space coordinates and $H(X)$ is the blob response at the location $X$. The quadratic coefficients can be approximated by a second-order Taylor series expansion over the neighboring samples, and the location of the extremum is:
$$\hat{X} = -\left(\frac{\partial^2 H}{\partial X^2}\right)^{-1} \frac{\partial H}{\partial X} \qquad (4)$$
Substituting the above expression into Equation (3):
$$H(\hat{X}) = H + \frac{1}{2} \frac{\partial H^T}{\partial X} \hat{X} \qquad (5)$$
If $|H(\hat{X})| \geq 0.03$, the point is regarded as a high-contrast point and the best interest point is updated as $X_{best} = X + \hat{X}$. However, a point with $|H(\hat{X})| < 0.03$ has to be discarded because of its low contrast.
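The refinement step can be sketched numerically as below. This is a minimal illustration, assuming the derivatives are estimated by central finite differences on a response stack indexed as `[scale, row, col]`; the function name and the synthetic test surface are hypothetical, and only the 0.03 threshold comes from the text.

```python
import numpy as np

def refine_extremum(stack, s, y, x):
    """Return the offset (dx, dy, dsigma) of the fitted quadratic's extremum
    and the interpolated response there, using central finite differences."""
    c = stack[s, y, x]
    grad = 0.5 * np.array([
        stack[s, y, x + 1] - stack[s, y, x - 1],
        stack[s, y + 1, x] - stack[s, y - 1, x],
        stack[s + 1, y, x] - stack[s - 1, y, x],
    ])
    dxx = stack[s, y, x + 1] - 2 * c + stack[s, y, x - 1]
    dyy = stack[s, y + 1, x] - 2 * c + stack[s, y - 1, x]
    dss = stack[s + 1, y, x] - 2 * c + stack[s - 1, y, x]
    dxy = 0.25 * (stack[s, y + 1, x + 1] - stack[s, y + 1, x - 1]
                  - stack[s, y - 1, x + 1] + stack[s, y - 1, x - 1])
    dxs = 0.25 * (stack[s + 1, y, x + 1] - stack[s + 1, y, x - 1]
                  - stack[s - 1, y, x + 1] + stack[s - 1, y, x - 1])
    dys = 0.25 * (stack[s + 1, y + 1, x] - stack[s + 1, y - 1, x]
                  - stack[s - 1, y + 1, x] + stack[s - 1, y - 1, x])
    hessian = np.array([[dxx, dxy, dxs], [dxy, dyy, dys], [dxs, dys, dss]])
    offset = -np.linalg.solve(hessian, grad)   # extremum of the fitted quadratic
    value = c + 0.5 * grad.dot(offset)         # interpolated response at the extremum
    return offset, value

# Synthetic quadratic blob with its true peak at (x, y, sigma) = (0.2, -0.1, 0.3).
coords = np.arange(-1, 2)
S, Y, X = np.meshgrid(coords, coords, coords, indexing="ij")
stack = 1.0 - (X - 0.2) ** 2 - (Y + 0.1) ** 2 - (S - 0.3) ** 2
offset, value = refine_extremum(stack, 1, 1, 1)
keep = abs(value) >= 0.03   # contrast test with the threshold quoted in the text
```

Because the synthetic surface is exactly quadratic, the finite differences recover the true peak; on real response maps the result is only an approximation.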

Feature Points Matching
Interest points of two images are matched by the smallest Euclidean distance:
$$d(P_i, Q_i) = \lVert P_i - Q_i \rVert_2 \qquad (6)$$
where $P_i$ and $Q_i$ are two feature points in the two images. However, some mismatches between the two images still remain. For the image transformation that maps each $x_i$ to $x_i'$, the homography matrix $H$ can be written as in Equation (7):
$$x_i' = H x_i, \qquad H = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix} \qquad (7)$$
According to [12], the RANSAC algorithm is a robust estimation technique for obtaining the estimated parameters of homographies. The putative correspondences and the inlier correspondences can be adopted in the RANSAC algorithm [13]. Four correspondences define a homography, and the number of samples is based on the outliers from each consensus state. The detailed process can be described as follows:
1. Randomly select four point pairs from the putative correspondences.
2. Check whether these points are collinear; if so, redo the above step.
3. Compute the homography $H_{curr}$ by normalized DLT from the four point pairs.
4. For each putative correspondence, calculate the Euclidean distance $d_i$ between the two points after mapping with $H_{curr}$.
5. Count the number of inliers $m$ that have a distance $d_i < T$ (threshold).
6. Repeat the above steps until a sufficient number of inlier pairs is counted.
7. Update the best $H = H_{curr}$ and record all the inliers.
8. Use the normalized DLT algorithm to recompute the homography from all consistent correspondences (inliers).
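The steps listed above can be sketched in NumPy as follows. This is a hedged illustration rather than the authors' implementation: the function names, the 3-pixel threshold, the fixed iteration count (used here in place of the "sufficient number of inliers" stopping rule), the one-way transfer distance and the synthetic data are all assumptions.

```python
import numpy as np

def normalize(pts):
    """Shift to zero mean and scale so the mean distance from the origin is sqrt(2)."""
    mean = pts.mean(axis=0)
    scale = np.sqrt(2) / np.mean(np.linalg.norm(pts - mean, axis=1))
    T = np.array([[scale, 0, -scale * mean[0]], [0, scale, -scale * mean[1]], [0, 0, 1]])
    return (T @ np.column_stack([pts, np.ones(len(pts))]).T).T, T

def dlt_homography(src, dst):
    """Normalized DLT: H such that dst ~ H @ src, from N >= 4 point pairs."""
    s, Ts = normalize(src)
    d, Td = normalize(dst)
    rows = []
    for (x, y, w), (u, v, k) in zip(s, d):
        rows.append([0, 0, 0, -k * x, -k * y, -k * w, v * x, v * y, v * w])
        rows.append([k * x, k * y, k * w, 0, 0, 0, -u * x, -u * y, -u * w])
    _, _, Vt = np.linalg.svd(np.array(rows))
    H = np.linalg.inv(Td) @ Vt[-1].reshape(3, 3) @ Ts   # undo the normalization
    return H / H[2, 2]

def project(H, pts):
    homog = np.column_stack([pts, np.ones(len(pts))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]

def has_collinear_triple(pts, eps=1e-6):
    """True if any three of the four sampled points are (nearly) collinear."""
    for i in range(4):
        a, b, c = np.delete(pts, i, axis=0)
        if abs((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])) < eps:
            return True
    return False

def ransac_homography(src, dst, threshold=3.0, iterations=200, seed=0):
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(iterations):                                    # step 6 (fixed count)
        idx = rng.choice(len(src), size=4, replace=False)          # step 1
        if has_collinear_triple(src[idx]) or has_collinear_triple(dst[idx]):
            continue                                               # step 2
        H_curr = dlt_homography(src[idx], dst[idx])                # step 3
        d = np.linalg.norm(project(H_curr, src) - dst, axis=1)     # step 4
        inliers = d < threshold                                    # step 5
        if inliers.sum() > best_inliers.sum():                     # step 7
            best_inliers = inliers
    return dlt_homography(src[best_inliers], dst[best_inliers]), best_inliers  # step 8

# Synthetic correspondences: 32 exact matches plus 8 gross mismatches.
rng = np.random.default_rng(1)
H_true = np.array([[1.1, 0.05, 20.0], [-0.03, 0.95, -10.0], [1e-4, 2e-4, 1.0]])
src = rng.uniform(0, 300, size=(40, 2))
dst = project(H_true, src)
dst[:8] += rng.uniform(30, 60, size=(8, 2))
H_est, inliers = ransac_homography(src, dst)
```

In practice the threshold and the number of iterations would be tuned to the image resolution and the expected outlier ratio.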
After applying the RANSAC algorithm, the inaccurate correspondences are efficiently eliminated. Because the homography has the property of being scale- and rotation-invariant, the targets can be highlighted precisely in the current image plane. Once the correct homography $H$ is calculated, the desired object can be found in complicated backgrounds by averaging the four corners of the reference image after applying the homogeneous transformation.
Figure 5 shows the hand gesture recognition process. The gesture is determined by the finger count and is used to command the manipulator to grasp the specified objects.

Skin Color Classification
Although the RGB model can reduce the time needed for computer graphics design, it is still hard to execute image processing algorithms with it because the RGB color components are highly correlated. To improve the tolerance to changes in image intensity, RGB images can be transformed into the HSI color space, where intensity and chromaticity are separated. Equation (8) transforms an RGB image into the HSI color space [14]:
$$H = \cos^{-1}\left\{ \frac{\frac{1}{2}\left[(R - G) + (R - B)\right]}{\sqrt{(R - G)^2 + (R - B)(G - B)}} \right\}, \quad S = 1 - \frac{3}{R + G + B}\left[\min(R, G, B)\right], \quad I = \frac{1}{3}(R + G + B) \qquad (8)$$
If $B$ is greater than $G$, then $H = 360^\circ - H$.
The RGB image from the webcam is converted to the HSI color space because skin color is more easily identified there. The hue value should be between 0.4 and 0.6, and the saturation value should be between 0.1 and 0.9. Figure 6 shows the results of skin color segmentation.
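The conversion and thresholding can be sketched as below. This is a minimal illustration under stated assumptions: RGB values are floats in [0, 1], and the hue is divided by 360° so that it can be compared with the 0.4 to 0.6 range quoted above (the text does not state the hue scaling). The function names are hypothetical.

```python
import numpy as np

def rgb_to_hsi(rgb):
    """Convert a float RGB array of shape (..., 3) to H (scaled to [0, 1]), S and I."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    eps = 1e-12                                   # avoids division by zero
    num = 0.5 * ((r - g) + (r - b))
    den = np.sqrt((r - g) ** 2 + (r - b) * (g - b)) + eps
    theta = np.degrees(np.arccos(np.clip(num / den, -1.0, 1.0)))
    # Hue is 360 - theta when B > G; it is undefined (arbitrary) for gray pixels.
    h = np.where(b > g, 360.0 - theta, theta) / 360.0
    s = 1.0 - 3.0 * np.minimum(np.minimum(r, g), b) / (r + g + b + eps)
    i = (r + g + b) / 3.0
    return h, s, i

def skin_mask(rgb, h_range=(0.4, 0.6), s_range=(0.1, 0.9)):
    """Boolean mask of pixels whose hue and saturation fall inside the given ranges."""
    h, s, _ = rgb_to_hsi(rgb)
    return (h >= h_range[0]) & (h <= h_range[1]) & (s >= s_range[0]) & (s <= s_range[1])

# One-row test image: red, green, blue and a desaturated cyan pixel.
pixels = np.array([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.4, 0.8, 0.8]]])
h, s, i = rgb_to_hsi(pixels)
mask = skin_mask(pixels)
```

The default ranges simply copy the values in the text; they would need to be re-tuned for a different hue scaling or lighting condition.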

Noise Rejection
In a general environment, we cannot guarantee that the image background will be clean. Skin-like objects in the image produce unexpected noise. In that case, an area condition is used to filter out the noise. First, the pixel area of each connected component $B(i, j)$ is calculated by Equation (10) as follows:
$$A = \sum_{i} \sum_{j} B(i, j) \qquad (10)$$
After applying the area filter, the result is shown in Figure 7.
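An area filter of this kind can be sketched as follows. The 4-connectivity, the `min_area` parameter and the function name are assumptions made for illustration; the text does not state the connectivity or the area threshold that was used.

```python
from collections import deque
import numpy as np

def area_filter(mask, min_area):
    """Keep only the 4-connected components of a binary mask that contain
    at least min_area pixels."""
    mask = np.asarray(mask, dtype=bool)
    seen = np.zeros_like(mask)
    out = np.zeros_like(mask)
    rows, cols = mask.shape
    for r0 in range(rows):
        for c0 in range(cols):
            if not mask[r0, c0] or seen[r0, c0]:
                continue
            # Breadth-first search collects one connected component.
            seen[r0, c0] = True
            queue, component = deque([(r0, c0)]), []
            while queue:
                r, c = queue.popleft()
                component.append((r, c))
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < rows and 0 <= nc < cols and mask[nr, nc] and not seen[nr, nc]:
                        seen[nr, nc] = True
                        queue.append((nr, nc))
            if len(component) >= min_area:   # the area is the pixel count of the component
                for r, c in component:
                    out[r, c] = True
    return out

# Synthetic mask: a 3 x 3 blob standing in for the hand and one isolated noise pixel.
mask = np.zeros((6, 6), dtype=bool)
mask[1:4, 1:4] = True
mask[5, 5] = True
clean = area_filter(mask, min_area=4)
```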

Distance Transform
The distance transform assigns to each pixel in the hand region its distance from the boundary, so the value increases as the pixel lies farther from the boundary [15]. Using this distance value, the centroid of the palm region can be calculated. Figure 8 (left) shows the image of the hand after applying the distance transform. The right image of Figure 8 shows an enlarged view of the region within the red rectangle. The white color is most intense in the center and fades toward the boundary: the pixels near the boundary have lower distance values and the pixels far from the boundary have higher distance values. The middle region, which has the highest distance value, is considered the centroid.
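The idea can be sketched with a simple breadth-first distance transform. The city-block metric is chosen here only for simplicity and is an assumption, since the text does not state which distance metric was used; the function names are hypothetical.

```python
from collections import deque
import numpy as np

def distance_transform(mask):
    """City-block distance from each foreground pixel to the nearest background
    pixel. Everything outside the image is treated as background."""
    padded = np.pad(np.asarray(mask, dtype=bool), 1, constant_values=False)
    dist = np.full(padded.shape, -1, dtype=int)
    queue = deque()
    for r, c in zip(*np.nonzero(~padded)):       # background pixels are the seeds
        dist[r, c] = 0
        queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < padded.shape[0] and 0 <= nc < padded.shape[1] and dist[nr, nc] < 0:
                dist[nr, nc] = dist[r, c] + 1
                queue.append((nr, nc))
    return dist[1:-1, 1:-1]

def palm_centre(mask):
    """Pixel with the largest distance value and that value."""
    dist = distance_transform(mask)
    r, c = np.unravel_index(np.argmax(dist), dist.shape)
    return (int(r), int(c)), int(dist[r, c])

# Synthetic hand: a 7 x 7 palm with a one-pixel-wide finger on its right side.
hand = np.zeros((9, 12), dtype=bool)
hand[1:8, 1:8] = True
hand[4, 8:12] = True
centre, radius = palm_centre(hand)
```

In this synthetic case the maximum lies in the middle of the palm and is unaffected by the thin finger, which mirrors the behavior described above.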

Morphology
The width of the hand region is approximately twice the distance from the centroid to the nearest boundary pixel, as shown in Figure 9. The width of each finger is approximately one fourth of the width of the hand. A suitable structuring element $S$ that can erode the fingers completely is chosen, and erosion is performed on the segmented hand region $I$:
$$R_{p1} = I \ominus S$$
After erosion, only a part of the palm region, $R_{p1}$, is left behind and the finger region is completely eroded. The palm region $R_{p1}$ that remains after erosion is then dilated using the same structuring element, which gives the region $R_{p2}$, larger than the eroded palm region:
$$R_{p2} = R_{p1} \oplus S$$
The result of $R_{p2}$ is shown in Figure 10. The dilated palm region $R_{p2}$ is subtracted from the original binary image $I$ to leave the finger area $F_R$ alone, as shown in Figure 11:
$$F_R = I - R_{p2}$$
The finger count that represents the gesture is found from the image $F_R$.
Figure 9. Image of the hand width.
Figure 10. The left image is the binary image of the hand region; the right image is $R_{p2}$.
Figure 11. The image processing results.
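The erosion, dilation and subtraction sequence can be sketched in NumPy as below. The square structuring element of half-width `k`, the helper names and the component-counting step used to obtain the finger count are assumptions made for illustration; the paper does not specify how the fingers are counted from $F_R$.

```python
import numpy as np

def erode(mask, k):
    """Binary erosion with a (2k+1) x (2k+1) square structuring element."""
    padded = np.pad(mask, k, constant_values=False)
    out = np.ones_like(mask, dtype=bool)
    for dr in range(2 * k + 1):
        for dc in range(2 * k + 1):
            out &= padded[dr:dr + mask.shape[0], dc:dc + mask.shape[1]]
    return out

def dilate(mask, k):
    """Binary dilation with the same square structuring element."""
    padded = np.pad(mask, k, constant_values=False)
    out = np.zeros_like(mask, dtype=bool)
    for dr in range(2 * k + 1):
        for dc in range(2 * k + 1):
            out |= padded[dr:dr + mask.shape[0], dc:dc + mask.shape[1]]
    return out

def finger_region(hand, k):
    palm_eroded = erode(hand, k)        # R_p1: fingers removed, palm shrunk
    palm = dilate(palm_eroded, k)       # R_p2: palm restored to about its original size
    return hand & ~palm                 # F_R: what remains is the finger area

def count_components(mask):
    """Number of 4-connected components (used here as the finger count)."""
    mask = mask.copy()
    count = 0
    for r, c in zip(*np.nonzero(mask)):
        if not mask[r, c]:
            continue                    # already consumed by an earlier flood fill
        count += 1
        stack = [(r, c)]
        mask[r, c] = False
        while stack:
            y, x = stack.pop()
            for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                if 0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1] and mask[ny, nx]:
                    mask[ny, nx] = False
                    stack.append((ny, nx))
    return count

# Synthetic hand: a 7 x 7 palm with two one-pixel-wide fingers on top.
hand = np.zeros((12, 12), dtype=bool)
hand[5:12, 2:9] = True
hand[0:5, 3] = True
hand[0:5, 6] = True
fingers = finger_region(hand, k=1)
n_fingers = count_components(fingers)
```

On real images the structuring element would be sized from the measured hand width, for example slightly wider than one fourth of it, so that every finger is removed by the erosion.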

3D Object Localization
After applying the image processing algorithm described in the previous sections, we can recognize desired feature points in RGB color images and depth images. The problem we are dealing with is how to estimate the feature point location in 3D world coordinates (the manipulator end-effector frame).

Calibration of Depth Camera
Bouguet adapted the calibration method of Zhang [16], which employs a chessboard as the calibration pattern. Figure 12 shows the corner extraction process, where "+" marks the image points and "o" marks the re-projected grid points. After obtaining the depth camera's image, the intrinsic parameters can be calculated by the camera calibration toolbox. Table 1 illustrates the depth camera's intrinsic parameters.

Object 3D Location via Depth Camera
The depth camera returns raw depth data $x$ with 11-bit resolution, so the depth information ranges from 0 to 2047. The depth distance $Z$ is obtained from the raw depth data, which the camera converts into a depth image, using the conversion given in [17]. Once the depth distance from the camera and the intrinsic parameters of the camera model are known, the 3D location of the desired feature points in the depth images can be estimated. According to [16], the 3D location of an object can be determined as follows:
$$X = \frac{Z (u - c_x)}{f_x}, \qquad Y = \frac{Z (v - c_y)}{f_y} \qquad (15)$$
where $(X, Y, Z)$ is the 3D location of the feature point, $(c_x, c_y)$ is the principal point, where the optical axis meets the image plane, $(f_x, f_y)$ are the focal lengths from the intrinsic parameters, and $(u, v)$ are the pixel coordinates. Figure 13 shows the coordinate relation between the end-effector and the depth camera. This calibration requires a red marker attached to the end-effector as the feature point.
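The back-projection in Equation (15) amounts to a few lines of arithmetic, sketched below. The intrinsic values in the example are placeholders chosen for illustration only; they are not the calibrated values of Table 1, and the function name is hypothetical.

```python
def pixel_to_camera(u, v, z, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth z into camera coordinates (X, Y, Z).

    fx, fy are the focal lengths in pixels and (cx, cy) is the principal point.
    The result has the same length unit as z.
    """
    x = z * (u - cx) / fx
    y = z * (v - cy) / fy
    return x, y, z

# Placeholder intrinsics (NOT the values from Table 1) and a depth of 1000 mm.
FX, FY, CX, CY = 570.0, 570.0, 320.0, 240.0
on_axis = pixel_to_camera(320, 240, 1000.0, FX, FY, CX, CY)
off_axis = pixel_to_camera(377, 240, 1000.0, FX, FY, CX, CY)
```

A pixel at the principal point maps onto the optical axis, while a pixel 57 columns to the right at 1000 mm depth maps to roughly 100 mm off the axis with these placeholder values.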

Hand-Eye Coordinates Calibration
The transformation between the Xtion Pro Live depth camera coordinates and the manipulator end-effector reference frame can be written as follows:
$$P_{end\text{-}eff} = H^{cam}_{end\text{-}eff} \, P_{cam} \qquad (16)$$
where $P_{cam} = [x_c \; y_c \; z_c \; 1]^T$ is the position of the marker in the depth camera frame and $P_{end\text{-}eff} = [x_e \; y_e \; z_e \; 1]^T$ is the position of the marker attached to the end-effector, expressed in the end-effector reference frame. The marker is attached at the center of the gripper. The homogeneous matrix $H^{cam}_{end\text{-}eff}$ includes 12 parameters that map the depth camera coordinates to the robot end-effector reference frame. Therefore, Equation (16) can be rewritten as follows:
$$H^{cam}_{end\text{-}eff} \, P_{cam} = \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_c \\ y_c \\ z_c \\ 1 \end{bmatrix} = \begin{bmatrix} x_e \\ y_e \\ z_e \\ 1 \end{bmatrix} \qquad (17)$$
To solve for the twelve unknown parameters (nine rotational and three translational), ten different end-effector positions are considered and mapped through Equation (17) in the experiments. Because $H^{cam}_{end\text{-}eff}$ is the fixed relationship between the depth camera coordinates and the end-effector reference frame, it is a time-invariant matrix. Therefore, by moving the manipulator to the desired poses and using the Xtion Pro Live to extract the red feature point on the end-effector, the transformation matrix is computed by the least squares method.
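A least-squares fit of the twelve parameters can be sketched as follows. This is an assumption about how such a fit could be set up with `numpy.linalg.lstsq`, not the authors' exact computation; the function name and the synthetic poses are hypothetical. Note that an unconstrained fit like this does not force the 3 × 3 block to be an exact rotation matrix.

```python
import numpy as np

def fit_hand_eye(p_cam, p_end):
    """Least-squares estimate of the 4 x 4 matrix H with p_end ~ H @ [p_cam, 1].

    p_cam, p_end: (N, 3) arrays holding the same marker positions measured in
    the camera frame and in the end-effector frame (N >= 4, not coplanar).
    """
    A = np.column_stack([p_cam, np.ones(len(p_cam))])    # N x 4 homogeneous points
    X, *_ = np.linalg.lstsq(A, p_end, rcond=None)        # 4 x 3 block holding [R t]^T
    H = np.eye(4)
    H[:3, :] = X.T
    return H

# Synthetic check with ten poses, as many as used in the experiments.
rng = np.random.default_rng(0)
angle = np.radians(30.0)
H_true = np.array([
    [np.cos(angle), -np.sin(angle), 0.0, 120.0],
    [np.sin(angle),  np.cos(angle), 0.0, -40.0],
    [0.0,            0.0,           1.0, 300.0],
    [0.0,            0.0,           0.0,   1.0],
])
p_cam = rng.uniform(-200.0, 200.0, size=(10, 3))
p_end = (H_true @ np.column_stack([p_cam, np.ones(10)]).T).T[:, :3]
H_est = fit_hand_eye(p_cam, p_end)
```

Using more poses than the minimum of four averages out measurement noise in the marker positions.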

Experiments
In the previous sections, the Speeded-Up Robust Features (SURF) detection with the RANSAC algorithm, the finger-counting Human-Robot Interaction and the coordinate transformation have been analyzed and derived. In this section, the SURF object recognition algorithm is verified first; the desired poker cards are then found and the location of the center of each pattern is estimated in the manipulator reference frame. In the next step, the finger-counting HRI is used to command the manipulator to grasp the selected target. Once the target locations and the preset placing locations are known, the program automatically generates a 5th order trajectory for the end-effector to pick and place in three dimensions. The equation of the 5th order trajectory is as follows:
$$x_d(t) = a_0 + a_1 t + a_2 t^2 + a_3 t^3 + a_4 t^4 + a_5 t^5$$
$$a_0 = x_{d0}; \quad a_1 = \dot{x}_{d0}; \quad a_2 = \tfrac{1}{2}\ddot{x}_{d0};$$
$$a_3 = \frac{1}{2 t_f^3}\left[20 (x_{df} - x_{d0}) - (8 \dot{x}_{df} + 12 \dot{x}_{d0}) t_f - (3 \ddot{x}_{d0} - \ddot{x}_{df}) t_f^2\right];$$
$$a_4 = \frac{1}{2 t_f^4}\left[30 (x_{d0} - x_{df}) + (14 \dot{x}_{df} + 16 \dot{x}_{d0}) t_f + (3 \ddot{x}_{d0} - 2 \ddot{x}_{df}) t_f^2\right];$$
$$a_5 = \frac{1}{2 t_f^5}\left[12 (x_{df} - x_{d0}) - 6 (\dot{x}_{df} + \dot{x}_{d0}) t_f - (\ddot{x}_{d0} - \ddot{x}_{df}) t_f^2\right]$$
where $x_{d0}$, $\dot{x}_{d0}$ and $\ddot{x}_{d0}$ are the position, the velocity and the acceleration at $t = 0$; $x_{df}$, $\dot{x}_{df}$ and $\ddot{x}_{df}$ are the position, the velocity and the acceleration at $t = t_f$; and $t_f$ is the terminal time of the 5th order trajectory. The whole experiment process is illustrated in Figure 14 and the overall manipulator control scheme is illustrated in Figure 15.
We use poker K, Q and J patterns to construct the database and apply six scale levels in the third octave for feature extraction. The king of hearts result is shown in Figure 16. The green crosses denote the feature point locations, and the circles, of radius 6s, mark the feature points found at different scales s. Figures 17 and 18 illustrate the results of the RANSAC algorithm applied to find the inlier correspondences and the recognized patterns.
After the poker pattern is recognized by the SURF feature point detection with the RANSAC algorithm and the user selects the targets for grasping by counting active fingers, the depth camera estimates the center of each target in the end-effector frame by the coordinate transform from the camera frame. Once the pick and place locations are calculated, the program automatically generates the customized 5th order trajectory of the end-effector for path tracking control.
The king of diamonds and the jack of spades are chosen, so the finger counting results must be one and two to select the desired patterns. By using the coordinate transformation, the center points of the poker cards are shown as:

After grasping the target, we need to determine the location to place it. The placement location is as follows:

After the poker pattern is recognized by the SURF feature point detection with the RANSAC algorithm and the user selects the targets for grasping by counting active fingers, the depth camera estimates the center of each target in the end-effector frame by the coordinate transform from the camera frame. Once the pick and place locations are calculated, the program automatically generates the customized 5th order trajectory of the end-effector for path tracking control. In the experiments, the end-effector moves from (X, Y, Z) = (−150, −100, −150) mm back to (0, 0, 0) mm in 2 s. The red line of Figure 19 illustrates the estimated trajectory of the end-effector, calculated by the forward kinematics from the experimental tracking responses of the three actuators. Figure 20 shows the trajectory tracking error of the end-effector for the 3-DOF pneumatic parallel manipulator. Figures 21-23 show the experimental responses of each actuator, respectively.
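To make the reference trajectory of this experiment concrete, the following minimal Python sketch evaluates a 5th order polynomial per axis for the move from (−150, −100, −150) mm to (0, 0, 0) mm in 2 s. The function names are hypothetical, and the zero start and end velocities and accelerations are an assumption made for illustration; the boundary values actually used in the experiments are not stated here.

```python
def quintic_coefficients(x0, v0, a0, xf, vf, af, tf):
    """Coefficients c0..c5 of x(t) = sum(c_k * t**k) that satisfy the
    position, velocity and acceleration conditions at t = 0 and t = tf."""
    d = xf - x0
    c0 = x0
    c1 = v0
    c2 = a0 / 2.0
    c3 = (20.0 * d - (8.0 * vf + 12.0 * v0) * tf
          - (3.0 * a0 - af) * tf**2) / (2.0 * tf**3)
    c4 = (-30.0 * d + (14.0 * vf + 16.0 * v0) * tf
          + (3.0 * a0 - 2.0 * af) * tf**2) / (2.0 * tf**4)
    c5 = (12.0 * d - 6.0 * (vf + v0) * tf
          - (a0 - af) * tf**2) / (2.0 * tf**5)
    return [c0, c1, c2, c3, c4, c5]


def evaluate(coeffs, t):
    """Return (position, velocity, acceleration) of the polynomial at time t."""
    pos = sum(c * t**k for k, c in enumerate(coeffs))
    vel = sum(k * c * t**(k - 1) for k, c in enumerate(coeffs) if k >= 1)
    acc = sum(k * (k - 1) * c * t**(k - 2)
              for k, c in enumerate(coeffs) if k >= 2)
    return pos, vel, acc


# Move taken from the experiment description; the rest-to-rest boundary
# conditions (zero velocity and acceleration at both ends) are assumed.
START_MM = (-150.0, -100.0, -150.0)
GOAL_MM = (0.0, 0.0, 0.0)
TF_S = 2.0

axis_coeffs = [
    quintic_coefficients(s, 0.0, 0.0, g, 0.0, 0.0, TF_S)
    for s, g in zip(START_MM, GOAL_MM)
]


def end_effector_reference(t):
    """Reference (X, Y, Z) position in mm at time t, one quintic per axis."""
    return tuple(evaluate(c, t)[0] for c in axis_coeffs)


if __name__ == "__main__":
    for step in range(5):
        t = step * TF_S / 4.0
        print(t, end_effector_reference(t))
```

Under the rest-to-rest assumption each axis passes through the midpoint of its stroke at half the terminal time, which is a quick sanity check on the coefficients.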

Conclusions
In this paper, the developed SURF and HRI image algorithms are integrated with a 3-DOF pneumatic parallel manipulator so that the manipulator can identify objects from the image feature information extracted by the scale- and rotation-invariant SURF algorithm, and can then automatically move to the object, grasp it, and finally move it to the desired location.
In the feature matching, all feature correspondences are matched through an image plane transformation (homography) estimated with RANSAC outlier rejection. The center of the object in image coordinates can therefore be estimated as the average of the four corners of the reference image.
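The center estimate described above can be sketched in a few lines of Python. The sketch assumes that the four corners of the reference image are first mapped into the scene image by the 3×3 homography and then averaged; that reading, the function names and the example matrix are illustrative assumptions rather than details taken from the experiments.

```python
def apply_homography(H, point):
    """Map a 2-D point through a 3x3 homography (nested lists), including
    the division by the homogeneous coordinate."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)


def object_center(H, ref_width, ref_height):
    """Project the reference-image corners into the scene and average them."""
    corners = [
        (0.0, 0.0),
        (ref_width, 0.0),
        (ref_width, ref_height),
        (0.0, ref_height),
    ]
    projected = [apply_homography(H, c) for c in corners]
    cx = sum(p[0] for p in projected) / 4.0
    cy = sum(p[1] for p in projected) / 4.0
    return (cx, cy), projected


# Made-up homography for illustration: scale by 2, then shift by (10, 20).
H_EXAMPLE = [
    [2.0, 0.0, 10.0],
    [0.0, 2.0, 20.0],
    [0.0, 0.0, 1.0],
]

if __name__ == "__main__":
    center, corners = object_center(H_EXAMPLE, 100.0, 60.0)
    print(center)
    print(corners)
```

In practice the homography would come from the RANSAC inlier correspondences, and the averaged corner position would be the pixel at which the depth camera is queried for the 3-D location.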
The Xtion Pro Live depth camera was used to measure the 3-D locations of the target points. Furthermore, we developed a coordinate transform calibration method for eye-to-hand calibration using the least squares and pseudo-inverse methods.
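As an illustration of this kind of calibration, the sketch below fits the twelve parameters of a 3×4 camera-to-manipulator matrix (nine rotational and three translational entries) to ten point pairs with the least squares pseudo-inverse, $(A^T A)^{-1} A^T B$. It is a minimal sketch under stated assumptions: the function names, the synthetic point set and the ground-truth matrix are made up for the check and are not measurements or code from this study.

```python
def solve_linear(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting.
    A is n x n, B is n x m, both given as nested lists."""
    n = len(A)
    aug = [list(map(float, A[i])) + list(map(float, B[i])) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: calibration points are degenerate")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_camera_to_manipulator(cam_pts, manip_pts):
    """Least squares 3x4 matrix M such that manip ~= M @ [cam; 1]."""
    A = [[x, y, z, 1.0] for x, y, z in cam_pts]
    # Normal equations: (A^T A) M^T = A^T B, i.e. M^T = pinv(A) B.
    AtA = [[sum(r[i] * r[j] for r in A) for j in range(4)] for i in range(4)]
    AtB = [[sum(r[i] * p[k] for r, p in zip(A, manip_pts)) for k in range(3)]
           for i in range(4)]
    Mt = solve_linear(AtA, AtB)  # 4 x 3, the transpose of M
    return [[Mt[i][k] for i in range(4)] for k in range(3)]


def transform(M, p):
    """Apply a 3x4 matrix to a 3-D point in homogeneous form."""
    h = (p[0], p[1], p[2], 1.0)
    return tuple(sum(m * v for m, v in zip(row, h)) for row in M)


# Synthetic check with made-up numbers: a 90 degree rotation about Z plus an offset.
TRUE_M = [
    [0.0, -1.0, 0.0, 120.0],
    [1.0, 0.0, 0.0, -40.0],
    [0.0, 0.0, 1.0, -600.0],
]
cam_points = [
    (50.0 * (i % 3), 40.0 * ((7 * i) % 5), 100.0 + 3.0 * i * i)
    for i in range(10)
]
manip_points = [transform(TRUE_M, p) for p in cam_points]
M_est = fit_camera_to_manipulator(cam_points, manip_points)

if __name__ == "__main__":
    for row in M_est:
        print([round(v, 6) for v in row])
```

The fit needs at least four non-coplanar points to determine all twelve parameters; using more points, such as the ten poses mentioned in this paper, averages out measurement noise. Note that an unconstrained fit does not force the 3×3 block to be an exact rotation.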
Gesture recognition based on counting the active fingers was used to select the object to be grasped. Once each pick and place location is confirmed in the end-effector reference frame, the program generates the 5th order trajectories for the path tracking control.
All of the methods in this paper were derived and then verified in the experiments. The three-axial pneumatic parallel manipulator can recognize each target pattern in the workspace and then pick and place it successfully.