Hand–Eye Calibration Using a Tablet Computer

Abstract: Many approaches have been developed to solve the hand–eye calibration problem. The traditional approach involves a precise mathematical model, which has advantages and disadvantages. For example, mathematical representations can provide numerical and quantitative results to users and researchers, making it possible to explain and understand the calibration results. However, information about the end-effector, such as the position at which it is attached to the robot and its dimensions, is not considered in the calibration process. If there is no CAD model, additional calibration is required for accurate manipulation, especially for a handmade end-effector. A neural network-based method is one solution to this problem. By training a neural network model on data created via the attached end-effector, additional calibration can be avoided. Moreover, it is not necessary to develop a precise and complex mathematical model. However, it is difficult to provide quantitative information because a neural network is a black box. Hence, a method with both advantages is proposed in this study. A mathematical model was developed and optimized using the data created by the attached end-effector. To acquire accurate data and evaluate the calibration results, a tablet computer was utilized. The established method achieved a mean positioning error of 1.0 mm.


Introduction
Robots utilizing vision systems have been introduced at production sites to automate many assembly processes and supply industrial parts. For a robot to pick an object identified by a camera, calibration is required in advance to transform the object's position from the image coordinate system to the robot-based coordinate system. This is known as hand-eye calibration. Many studies have addressed this problem. One major approach is to develop a precise mathematical model and use a calibration board, such as a checkerboard, because the feature points of a checkerboard are easy to detect in a captured image; several classical studies have adopted it as a calibrator [1]. Using the captured images, the mathematical model is optimized.
However, this traditional approach has advantages and disadvantages. For example, mathematical models can provide numerical and quantitative results to users and researchers, who can then understand and analyze why the calibration results are good. However, information about the end-effector, such as the position at which it is attached to the robot and its dimensions, is not considered in the calibration process. If there is no CAD model, additional calibration is required for accurate manipulation, especially for a handmade end-effector. One solution is a neural network-based method [2]. A trained neural network model can directly transform a position in the image coordinate system to the robot-based coordinate system. Because the training data are created using an end-effector attached to the robot hand, neither a complex mathematical model nor additional calibration for the end-effector is necessary. However, it is difficult to understand and analyze the reason for good calibration results because the neural network is a black box.
Therefore, a method with both advantages is proposed in this study. A mathematical model was developed and optimized using data created through the same procedure as the neural network-based method. Because the optimized mathematical model can provide numerical and quantitative data and can transform a position in the image coordinate system to the robot-based coordinate system while considering the offset of the attached end-effector, detailed information, such as a CAD model, and an additional calibration to obtain the offset are not required. Hence, the proposed method overcomes the disadvantages of both approaches and also offers some unique advantages. To acquire accurate data and measure the positioning error for checking the calibration performance, a tablet computer was used.

Related Works
Many approaches have been proposed to solve the hand-eye calibration problem [3,4]. The most basic mathematical model is AX = XB [5], where A and B are homogeneous transformation matrices (HTMs) that represent the relative motions of the robot and an attached camera, and X is an estimated HTM that represents the relationship between the hand and the camera. Based on this model, Motai et al. proposed a method considering the distortion of a camera lens [6]. By estimating camera parameters from multiple viewpoints, active viewpoints can be generated to obtain three-dimensional (3D) models of objects. Many methods have been proposed to solve for the unknown parameters of X. According to reference [4], two approaches are represented in the relevant literature: separation and simultaneous methods. In the former, the rotation matrix of X and the translation vector are solved separately [7][8][9][10][11]; in the latter, both are solved simultaneously [12][13][14][15]. In addition to AX = XB, another model (AX = YB [16]) is used. Hand-eye calibration methods have been developed based on these mathematical models and approaches to solve for the unknown parameters of X.
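As a numerical illustration of the AX = XB relation, the following sketch builds a hypothetical hand-camera transform X and a robot motion A (all values are arbitrary placeholders, not the calibration result of this paper) and verifies that the corresponding camera motion B = X⁻¹AX satisfies the model:

```python
import numpy as np

def rot_z(theta):
    """3x3 rotation about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def htm(R, t):
    """Build a 4x4 homogeneous transformation matrix from R (3x3) and t (3,)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Hypothetical hand-to-camera transform X and relative robot motion A.
X = htm(rot_z(0.3), np.array([10.0, 5.0, 2.0]))
A = htm(rot_z(0.7), np.array([1.0, -2.0, 0.5]))

# The camera motion B consistent with AX = XB is B = X^{-1} A X.
B = np.linalg.inv(X) @ A @ X

residual = np.linalg.norm(A @ X - X @ B)  # should vanish up to round-off
```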
Mišeikis et al. proposed a rapid automatic calibration method using 3D cameras [17]. Even if the cameras and robots being calibrated are repositioned, this method can recalibrate rapidly. Koide et al. proposed a method based on reprojection error minimization [18]. Unlike traditional approaches, their method does not need to explicitly estimate the camera pose for each input image, and pose graph optimization is performed to deal with different camera models. Cao et al. proposed an approach using a neural network for error compensation [18]. Because no additional device is necessary for compensation, this method has a low cost.
These related studies employed a mathematical model to achieve hand-eye calibration. Hence, this approach can provide numerical and quantitative information on the calibration results to users and researchers. However, information on the end-effector, such as the position attached to the robot and its dimensions, is not considered in the calibration process. If there is no detailed information, such as a CAD model, additional calibration is required for accurate manipulation, especially for a handmade end-effector. Hua's approach provides an effective solution [2]. By training a neural network model using the training data, which have various errors and noises in a real environment, robust transformation from the image coordinate system to the robot-based coordinate system is directly possible. Because the training data are created using the attached end-effector, additional calibration is unnecessary. In addition, the neural network model has high representative power. Therefore, the development of a precise and complex mathematical model is not required. However, this approach cannot provide quantitative information because the neural network model is a black box. Therefore, it is difficult to understand and explain the calibration results.
In this paper, a method with both advantages is proposed. A mathematical model was developed, and the data to optimize the model were created through the same procedure as the neural network-based method. To acquire accurate data and evaluate the calibration results, a tablet computer was used.

Overview
The developed system is shown in Figure 1. A clear plastic box is attached to the tip flange of the robot hand. An RGB-D camera and a tablet pen holder, fabricated with a 3D printer, are attached to the box. Because an Intel SR300 camera is used, image distortion need not be considered [19]. In addition, the intrinsic parameters of the camera can be obtained easily from its software development kit (SDK). Both the pen and the camera are mounted at positions offset from the rotation center of the robot hand because the end-effector is handmade; naturally, there is no CAD model of it. The pen rotates when the hand rotates, and it is necessary to calibrate in this scenario. For hand-eye calibration using this system, some preparations are necessary, similar to the study of Hua [2]. Figure 2 shows the data processing procedure performed by the proposed method. First, nine landmarks (i.e., targets to be touched by the robot hand with the tablet pen) are displayed on the tablet, as shown in Figure 1a. The interior area surrounded by the landmarks is the considered workspace. Second, the tablet display is captured by the attached RGB-D camera, and the positions of all landmarks in the image coordinate system are obtained, as shown in Figure 3a. Third, the robot hand is manually operated so that the pen touches one landmark, and the hand position in the robot-based coordinate system is acquired. This data acquisition is repeated for all landmarks. During this process, the rotation angle of the sixth axis is gradually increased so that the first and ninth dots correspond to 0° and 180°, respectively; specifically, the sixth axis is rotated in steps of 22.5°. This is because the attached camera and pen are not aligned with the sixth axis of the robot hand, and calibrating in this scenario is necessary. Figure 3b shows an example of the acquired data.
Because the attached pen is not aligned and the hand rotates, the data distribution in Figure 3b is different from that in Figure 3a. The parameters of the homogeneous transformation matrices (HTMs) are optimized by two-stage optimization. Using the optimized matrices, the positions to touch in the image coordinate system (Figure 3a) are converted to those in the robot-based coordinate system (Figure 3b). To evaluate the calibration performance, the robot hand with the attached pen touches the nine displayed landmarks, and the mean touching error is calculated after the optimized parameters are introduced in the robot.


Coordinate System and Homogeneous Transformation Matrix (HTM)
The DENSO VP-6242 robot [20] used in this study has a range of motion with six degrees of freedom (DoF). The coordinate systems are shown in Figures 4 and 5. Σ b , Σ h , Σ c , Σ i , and Σ t denote the robot base, hand, camera, image, and tablet computer coordinate systems, respectively. b T h and h T c represent HTMs, which have rotation and translation parts, as expressed below, where R x (α c ), R y (β c ), and R z (γ c ) are 3 × 3 rotation matrices about the x, y, and z axes, respectively. The transformation from Σ i to Σ c can be achieved using the pinhole camera model.
where (u m , v m ) represents the mth black dot in Σ i . Let f x and f y be the focal lengths along the x and y axes, respectively, and let c x and c y be the coordinates of the principal point along the x and y axes. x c m and y c m are the transformed positions in Σ c , and z c m is the distance from the camera to the mth black dot, which is measured by the RGB-D camera. Finally, the position in Σ b ((x bi m , y bi m , z bi m ) ) can be obtained from the following equation. In the developed system, this equation is insufficient because the tablet pen and the camera are not aligned with the rotation axis of Σ h . Hence, the offset should be considered, which can be calculated as follows. As shown in Figure 4, t p = (x p , y p , z p ) is the translation vector from Σ h to the tip of the tablet pen, and θ h represents the rotation angle about the z-axis of Σ h . By combining Equations (5) and (7), the final equation is obtained.
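The pinhole back-projection of Equation (3) and the role of the rotated pen offset in Equation (6) can be sketched as follows; the intrinsics are placeholder values, and the exact form of the offset equation is an assumption, since only its purpose is described above:

```python
import numpy as np

def backproject(u, v, z, fx, fy, cx, cy):
    """Pinhole model: pixel (u, v) with measured depth z -> camera-frame point."""
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

def rotated_pen_offset(theta_h, x_p, y_p):
    """Pen-tip offset in the hand frame after rotating the hand's z-axis by theta_h.

    Plausible form only: the text states the offset must follow the hand
    rotation, but Equation (7) itself is not reproduced here.
    """
    c, s = np.cos(theta_h), np.sin(theta_h)
    return np.array([c * x_p - s * y_p, s * x_p + c * y_p])

# Placeholder SR300-like intrinsics and one measured dot with its depth (mm).
fx, fy, cx, cy = 615.0, 615.0, 320.0, 240.0
p_cam = backproject(400.0, 260.0, 300.0, fx, fy, cx, cy)
```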

Transformation from Σ t to Σ b
All positions of the black dots displayed on the tablet computer can be transformed from Σ t to Σ b . As the first step, positions in px should be converted to mm using the following equation, where l mm , l px , PPI, and s are the converted result in mm, the position in pixels, the pixels per inch, and the display scale of the tablet computer, respectively. The PPI and s depend on the tablet computer used. The transformation from Σ t to Σ b is then performed, where (x bt m , y bt m , z bt m ) is the transformed result of the mth black dot, which is converted from px to mm ((x t m , y t m , 0) ) using Equation (9), and α t , β t , and γ t are the rotation angles about the x, y, and z axes of Σ t , respectively.
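A minimal sketch of the px-to-mm conversion of Equation (9); whether the display scale s multiplies or divides here depends on whether the tablet reports logical or physical pixels, so the placement of s is an assumption:

```python
def px_to_mm(l_px, ppi, s=1.0):
    """Convert a display length from pixels to millimetres (25.4 mm per inch).

    Assumes l_px is in logical pixels, which the display scale s maps back
    to physical pixels before dividing by the pixels-per-inch density.
    """
    return l_px * s * 25.4 / ppi

# 267 physical pixels on a 267-PPI display span exactly one inch (25.4 mm).
length_mm = px_to_mm(267, ppi=267)
```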

Representation by DH Method
The relationships between the coordinate systems can be represented by the Denavit-Hartenberg (DH) method [21]. In contrast to the six-DoF HTMs mentioned above, four parameters a n−1 , α n−1 , d n , and θ n are used in this method. Here, let x n and z n be the x and z axes of the nth link, respectively. These four parameters denote, respectively, the length of the common normal between the n − 1th and nth links (link length), the angle of rotation around x n−1 from z n−1 to z n (link twist), the distance from the intersection of x n−1 and z n to the origin of the nth link's frame (link offset), and the angle of rotation around z n from x n−1 to x n (joint angle). Using this method, Equation (5) can be rewritten.
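One common (standard-convention) form of the per-link DH transform built from the four parameters can be sketched as follows; the paper's own frame assignments are not reproduced here:

```python
import numpy as np

def dh_matrix(a, alpha, d, theta):
    """Homogeneous transform of one link from the four DH parameters:
    link length a, link twist alpha, link offset d, joint angle theta."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    ct, st = np.cos(theta), np.sin(theta)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ])
```

With all four parameters zero the transform reduces to the identity, and a pure link length a produces a translation along the link's x-axis.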
The h T DH c is an HTM in the DH method that represents the relationship between Σ h and Σ c . The b T h can also be represented by the DH method; however, it can be acquired from the robot controller and it is considered a known quantity. The R xyz is a rotation matrix for each axis in the 3D space. Similarly, the relationship between the Σ b and Σ t can be rewritten as follows.
In addition to this approach, many other methods that represent the relationships between coordinate systems have been reported. This study focuses on comparing the representations of the six-DoF HTM and the DH method.

Parameters to be Optimized
The known parameters and the unknown parameters that should be optimized are listed in Table 1. In Equation (1), (x h , y h , z h ) are known because they can be obtained from the robot controller. In Equation (2), (x c , y c , z c ) can be approximately measured by hand; however, manual measurement involves errors that affect the final positioning error of the robot hand, so these parameters were optimized in this study. Because the robot hand is operated by bringing the pen and the tablet display into contact, the optimization of z c is unnecessary. In the same equation, α c , β c , and γ c are optimized. In Equation (3), f x , f y , c x , and c y are known because they can be obtained from the software development kit (SDK) of the RGB-D camera, and z c m is known because the camera measures the distance. In Equation (6), θ h is known because the user sets the angle to rotate the hand. (x p , y p , z p ) can be measured by hand; however, they should be optimized for the same reason, and z p can likewise be ignored. In Equation (11), α t , β t , γ t , x t , y t , and z t are unknown; however, z t can be ignored. Therefore, 12 parameters must be optimized for the six-DoF HTMs. For the DH method, the parameters θ c 1 , a c 1 , θ t 1 , and a t 1 must be optimized, while d c 1 and d t 1 can be ignored for the same reason as z c , z p , and z t .

Table 1. Known and unknown (to be optimized) parameters.

Equation Number | Known | Unknown (Six-DoF HTM) | Unknown (DH Method)

Two-Stage Optimization
To optimize the 12 unknown parameters and further minimize the positioning error of the robot hand, the developed method introduces a two-stage optimization. In the first stage, the 12 parameters are optimized based on the mathematical model described in Section 3.2. Using the optimized parameters, the positions of the black dots in Σ i (Figure 3a) can be converted to Σ b (Figure 3b). Thus, the robot hand with the tablet pen can touch the black dots of the tablet display. To further minimize the error, affine transformation-based optimization is introduced in the second stage.

First Optimization
Many optimization algorithms can be used. In this study, differential evolution (DE) [22] is adopted because of its ease of use. In DE, search points in a search space are referred to as individuals. Each individual includes a set of optimized parameters encoded as a vector. After the fitness of each individual is calculated using a fitness function, new individuals are generated for the next generation based on the calculated fitness, mutation, and crossover strategies. By iterating these procedures, the individuals gradually converge to an optimal solution.
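A minimal sketch of the DE variant used later in the experiments (DE/rand/1 mutation with binomial crossover and greedy selection); a toy sphere function stands in for the calibration fitness:

```python
import numpy as np

def de_rand_1_bin(fitness, bounds, pop_size=30, gens=150, F=0.5, CR=0.9, seed=0):
    """Minimal differential evolution: DE/rand/1 mutation, binomial crossover."""
    rng = np.random.default_rng(seed)
    bounds = np.asarray(bounds, dtype=float)
    dim = len(bounds)
    pop = rng.uniform(bounds[:, 0], bounds[:, 1], size=(pop_size, dim))
    fit = np.array([fitness(x) for x in pop])
    for _ in range(gens):
        for i in range(pop_size):
            # DE/rand/1: mutate from three distinct individuals other than i.
            r1, r2, r3 = rng.choice([j for j in range(pop_size) if j != i],
                                    size=3, replace=False)
            mutant = np.clip(pop[r1] + F * (pop[r2] - pop[r3]),
                             bounds[:, 0], bounds[:, 1])
            # Binomial crossover with at least one gene taken from the mutant.
            mask = rng.random(dim) < CR
            mask[rng.integers(dim)] = True
            trial = np.where(mask, mutant, pop[i])
            f_trial = fitness(trial)
            if f_trial <= fit[i]:  # greedy selection for the next generation
                pop[i], fit[i] = trial, f_trial
    best = int(np.argmin(fit))
    return pop[best], fit[best]

# Toy usage: minimise the 3-D sphere function over [-5, 5]^3.
best_x, best_f = de_rand_1_bin(lambda x: float(np.sum(x ** 2)), [(-5, 5)] * 3)
```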
In this study, the following equations are used for the fitness function.
where F 1st is the fitness function, which consists of f 1st 1 and f 1st 2 . f 1st 1 represents the mean Euclidean distance between the nine black dots transformed from Σ i to Σ b by Equation (5) and those transformed from Σ t to Σ b by Equation (10). This function is set because the black dots transformed by the different HTMs should match each other in Σ b if the unknown parameters in Equations (2) and (11) are optimized correctly.
To optimize the remaining unknown parameters in Equation (6), f 1st 2 is introduced. If all unknown parameters are optimized correctly, x bi m and y bi m , which are calculated from Equation (8), should match the hand positions in Σ b (Figure 3b). However, because errors occur when the hand-position data are generated, these data cannot be used as perfect ground truth. These errors arise from the difficulty of manually operating the robot hand so that the center of each displayed landmark and the tip of the tablet pen touch each other perfectly (with no distance error). To minimize this error as much as possible in the optimization process, Equation (20) is introduced. Let x r m and y r m be the created mth position of the robot hand in Σ b , where the pen and the mth black dot touch each other with a small distance error. ∆x r m and ∆y r m are the distance errors between the mth landmark and the touched position in Σ t . They can be obtained easily from the tablet computer in px, and the unit can be converted to mm using Equation (9); this conversion is necessary before applying Equation (20).
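The compensated fitness term f 1st 2 can be sketched as below; the function name and the sign with which the measured touch error corrects the recorded hand position are assumptions:

```python
import numpy as np

def f_1st_2(pred_xy, hand_xy, delta_xy):
    """Mean Euclidean error between model-predicted touch positions and
    manually recorded hand positions corrected by the measured touch errors.

    pred_xy, hand_xy, delta_xy: (9, 2) arrays in mm; the touch errors come
    from the tablet in px and must first be converted with Equation (9).
    """
    corrected = hand_xy + delta_xy  # sign convention is an assumption
    return float(np.mean(np.linalg.norm(pred_xy - corrected, axis=1)))
```

With zero touch errors and perfect predictions the term vanishes, as expected of a distance-based fitness.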

Second Optimization
The first optimization already yields a good calibration result for the six-DoF HTMs, as shown in Figure 6. To further minimize the error, affine transformation matrices are optimized in the second stage to match the two data distributions more closely. For this purpose, the following fitness function is set, where F 2nd is the fitness function, which consists of f 2nd 1 and f 2nd 2 . They are almost the same as f 1st 1 and f 1st 2 ; the difference is that an affine transformation of a target position ((A n (x tgt ), A n (y tgt )) ) is introduced. Let X n and Y n be the amounts of translation along the x and y axes, respectively. X cr n and Y cr n represent the center of rotation along the x and y axes, respectively; in this study, the position of the fourth black dot was used as the center of rotation. θ n is the angle of rotation, and S x n and S y n are the scaling factors of the x and y axes, respectively. The unknown parameters to be optimized are X n , Y n , θ n , S x n , and S y n . Because n ∈ {1, 2} (A 1 and A 2 ), ten parameters should be optimized to match the two data distributions as closely as possible. Figure 7 shows examples using the optimized affine transformation matrices and the six-DoF HTMs. The error decreases in both results. In the experiments, this effectiveness was evaluated quantitatively.
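The affine transformation A n, parameterized by translation, rotation about a chosen center, and per-axis scaling, can be sketched as follows; the order of scaling before rotation is an assumption:

```python
import numpy as np

def affine(xy, tx, ty, theta, sx, sy, cx, cy):
    """Apply per-axis scaling and a rotation about centre (cx, cy), then a
    translation (tx, ty), to an (N, 2) array of points."""
    p = (xy - np.array([cx, cy])) * np.array([sx, sy])
    c, s = np.cos(theta), np.sin(theta)
    rotated = p @ np.array([[c, s], [-s, c]])  # row-vector rotation by theta
    return rotated + np.array([cx + tx, cy + ty])
```

With identity parameters (zero translation and rotation, unit scaling) the points are returned unchanged, which is the neutral starting point for the second-stage search.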

Used Robot and Devices
In the experiments, a six-axis robot (DENSO VP-6242), a tablet computer (Microsoft Surface Pro 7), a tablet pen (Surface Pen), and an RGB-D camera (Intel SR300) were used. The positional repeatability of the robot was ±0.02 mm [20]. The resolution of the tablet was 267 ppi.

Data Creation
For the two-stage optimization, the positions of the nine (m ∈ [0, 8]) displayed black dots in Σ i ((u m , v m )) and the corresponding positions of the robot hand in Σ b ((x r m , y r m )) are necessary. To create both sets of data, first, nine black dots, each one pixel in size, were displayed (Figure 8a). Second, the tablet display was captured by the attached RGB-D camera. Third, the captured image was binarized, as shown in Figure 8b. Because one blob of a few pixels was obtained for each dot, the averaged pixel coordinates were used as (u m , v m ). Moreover, the corresponding depth data (z c m ) of (u m , v m ) were acquired from the RGB-D camera. Subsequently, the robot hand was operated such that the tip of the pen touched the displayed dots to create the data (x r m , y r m ) as the ground truth. At this time, the touching error (∆x r m , ∆y r m ) was obtained from the tablet computer to compensate for the error, as described in Section 3.6.1. Table 2 provides the created data.

Table 3 presents the set values for the known parameters. (x h , y h , z h ) represent the initial position of the robot hand used to capture the tablet display; this position was determined by the author. f x , f y , c x , and c y were acquired from the SDK of the RGB-D camera [19]. θ h is the rotation angle about the z-axis of Σ h used to touch each displayed dot.

Table 4 provides the set values for the hyperparameters of the DE. Let N and G be the population and generation sizes, respectively. To avoid premature convergence, sufficiently large sizes were set. The crossover probability (CR) and the scaling factor (F) were set to 0.9 and 0.5, respectively. Binomial crossover and DE/rand/1 were adopted as the crossover and mutation strategies, respectively. Because the DE performance depends on the random seed, five different random seeds were used and compared in the two-stage optimization. Tables 5 and 6 present the search ranges of the 12 optimized parameters for the six-DoF HTMs and the DH method.
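The dot-center extraction in the second and third data-creation steps, averaging the pixel coordinates of one binarized blob, can be sketched with a synthetic image standing in for the captured one:

```python
import numpy as np

def blob_centroid(binary):
    """(u, v) centre of one blob: mean column and row indices of foreground pixels."""
    vs, us = np.nonzero(binary)
    return float(us.mean()), float(vs.mean())

# Synthetic capture: a 2x2 blob with its top-left pixel at column 10, row 5.
img = np.zeros((20, 20), dtype=bool)
img[5:7, 10:12] = True
u_m, v_m = blob_centroid(img)
```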
First-Stage Optimization

Table 7 summarizes the optimization results of the six-DoF HTMs using the five different random seeds. The values of α t and β t differ in sign across seeds although all F 1st values are the same; hence, this optimization problem is multimodal. Because DE can exhibit good performance on multimodal problems [22], using this algorithm is reasonable. All f 1st 1 and f 1st 2 values were 0.64 and 1.04 mm, respectively. This remaining error is attributed to the limited representational power of the developed mathematical model or to measurement error in the depth information (z c m ).

Table 8 describes the optimization results of the DH method. Similar to the results of the six-DoF HTMs, all F 1st values were the same, but the values of some parameters were not identical; thus, this optimization problem was also multimodal. The acquired values of F 1st were larger than those of the six-DoF HTMs because the DH parameters were ill-conditioned. According to reference [21], adjacent joint axes of a real robot and end-effector are not perfectly parallel in practice owing to manufacturing tolerances and various types of errors; therefore, the link length (a n−1 ) can become extremely large. However, owing to the difficulty of precise prediction, the adjacent joint axes were assumed to be perfectly parallel in this experiment. This could cause F 1st values larger than those of the six-DoF HTMs.

Using the optimized values of seed 1 of the six-DoF HTMs, the mean touching error was measured by making the robot hand touch all displayed dots. Table 9 presents the result. Equation (9) with s = 2 and PPI = 267 was used to convert px to mm because a Microsoft Surface Pro 7 was used. A mean touching error of 1.25 mm was achieved.

Second-Stage Optimization
Using the optimized parameters of seed 1 from the first-stage optimization, the parameters of the two affine transformation matrices are optimized in the second-stage optimization. Table 10 summarizes the results for the six-DoF HTMs. Because F 2nd decreased compared to F 1st , the second optimization contributes to reducing the error. Although different random seeds were set, all results are identical; hence, the possibility of premature convergence is low, indicating that these optimization results are reliable. Table 11 presents the results of the DH method. Similar to the six-DoF HTMs, all values are the same. Because the result of the first-stage optimization was worse, the result of the second-stage optimization was also worse.
The DH method represents the relationship between reference frames using four parameters, whereas the HTM, which is often used in hand-eye calibration, uses six. Thus, the lower computational cost of the DH method is among its notable advantages. However, it involves a few disadvantages, as mentioned in reference [21]. As described above, the DH parameters exhibit ill-conditioned behavior because the link length becomes extremely large when adjacent joint axes are not perfectly parallel. Moreover, link frames must be assigned such that valid DH parameters exist, so arbitrary assignment is impossible. In contrast, six-DoF HTM frames can be assigned arbitrarily; thus, they are easy to use. As shown in the results of the two-stage optimizations, the six-DoF HTMs achieve better results, so this representation is suitable for hand-eye calibration. Using all optimized parameters for the six-DoF HTMs, the mean touching error is measured. The hand positions to touch are calculated using the equations below.
Because the robot hand always touches the tablet display, the calculation of z bi m is unnecessary. Table 12 presents the results. Compared to the previous result, the mean touching error decreases; thus, the affine transformation-based error minimization is effective. Because the mean touching errors decrease in all trials, this method is stable.

Conclusions
In this study, a method that combines the advantages of the traditional hand-eye calibration approach and the neural network-based approach is proposed. Simple mathematical models based on six-DoF HTMs and the DH method were developed and optimized using data created with the attached end-effector. Therefore, the proposed method can provide numerical and quantitative results to users and researchers. Because the data are created using the attached end-effector, additional calibration for the end-effector can be avoided even when no CAD model exists. A two-stage optimization was introduced to optimize the mathematical models. In the first stage, 12 parameters of the transformation matrices, which convert a position in the image coordinate system to the position to touch in the robot-based coordinate system, were optimized. To further minimize the error, ten parameters of the two affine transformation matrices were optimized in the second stage. Using these optimized parameters, a mean touching error of 1.0 mm was achieved by the six-DoF HTMs. Because the proposed method can optimize the mathematical model using the data generated by the attached end-effector without detailed information, such as CAD diagrams, it incorporates the advantages of both approaches, in contrast to conventional systems.
Developing a new method to achieve smaller errors is a future research direction. Additionally, similar to existing calibration methods, the proposed method requires recalibration if an end-effector with different dimensions is attached. As this is a tedious process, a method that utilizes the first calibration result should be developed in future research to reduce the effort required for recalibration.