Vision-Guided Hand–Eye Coordination for Robotic Grasping and Its Application in Tangram Puzzles

In this study, we present an autonomous grasping system that uses a vision-guided hand–eye coordination policy with closed-loop vision-based control to ensure a sufficient task success rate while maintaining acceptable manipulation precision. When facing a diversity of tasks in complex environments, an autonomous robot should use the concept of task precision, which covers both the accuracy of perception and the precision of manipulation, as opposed to just the grasping success rate typically used in previous works. Task precision combines the advantages of grasping behaviors observed in humans with the grasping methods applied in existing works. A visual servoing approach and a subtask decomposition strategy are proposed here to obtain the desired level of task precision. Our system performs satisfactorily on a tangram puzzle task. The experiments highlight the accuracy of perception, precision of manipulation, and robustness of the system. Moreover, the system is of great significance for improving the adaptability and flexibility of autonomous robots.


Introduction
In this study, we outline a general framework of robot control to solve a class of intelligent grasping tasks. For such tasks, the environment is variable, so the solution cannot be defined in advance as a fixed sequence of motions. For example, in order for a home-service robot to accomplish a cup-carrying task, it must be able to adapt to changes in the color, lighting, position, shape, background, and other parameters of a cup and its surrounding environment. The robot needs to reliably recognize the target object, accurately locate the object, and have a closed-loop control process. These three requirements are difficult for an intelligent robot to satisfy because they present challenging computer vision problems, such as invariant recognition against a complex background, optical measurement, and adaptive control with visual feedback.
In many existing works on robot grasping, the robot first perceives the scene and recognizes appropriate grasp locations, then plans a path to those locations, and finally follows the planned path [1,2]. These stages are respectively called perception, planning, and action. However, the grasping behaviors observed in humans are dynamical processes that interleave sensing and manipulation at every step [3]. Robot hand-eye coordination is a typical feedback-based, closed-loop control process similar to the grasping behaviors observed in humans. The core function of the process is to guide the motion of the robot based on the relative displacement between the target location and the end-effector of the robot [4,5]. This eliminates the work of calibration and transformation between multiple coordinate systems.
To achieve vision-guided hand-eye coordination, the core function must be redesigned to sense changes in both the environment and the target object and then provide a corresponding control strategy. We propose task precision as the core function, which can be calculated from the relative distance between the target object and the end-effector of the robot. The system is expected to have the ability to (1) recognize both the end-effector of the robot and the target object (recognition ability), (2) accurately describe their contours to determine their relative displacement (locating ability), and (3) track the movements of the end-effector and the target object (tracking ability).
Humans have the ability to precisely solve tangram puzzles (see Section 4.1 for more details on this task). However, since the game background and target layout are both changeable, it is difficult for a robot to achieve hand-eye coordination and solve tangram puzzles. The vision-guided hand-eye coordination system in this study can complete the autonomous decision-making process like a human can.
The paper is organized as follows. Related robotic grasping research is introduced in Section 2. The framework of the system is reported in Section 3. The application of the system to solving a tangram puzzle is described in Section 4. The experimental results are presented in Section 5, and the conclusion is given in Section 6. Our robot environment is described in Appendix A.
The main contributions of the work are as follows: (1) Vision-guided hand-eye coordination is analyzed, and a new concept of task precision in robot grasping tasks is proposed, taking inspiration from the original closed-loop hand-eye coordination method.
(2) The perception, planning, and action stages in existing works are integrated into modules of our system so that we can choose different policies when facing different grasping tasks. (3) Our experimental evaluation demonstrates the effectiveness of this approach, which can run on only a laptop, in terms of both a sufficient task success rate and acceptable manipulation precision.

Hand-Eye Coordination
Hand-eye coordination can be divided into two main types: calibrated and uncalibrated. Hill and Park first achieved closed-loop control in a calibrated hand-eye coordination method [4]. Calibrated hand-eye coordination methods [4,6–8] obtain the corresponding relationship between the camera and the robot base by using the three-dimensional space transformation principle and the known camera parameters. This kind of method requires substantial labor: the accuracy of the parameters depends on the calibration, and recalibration is necessary whenever the camera parameters change. Therefore, this kind of method is not suitable for manipulators operating in open environments.
Many studies have been conducted on the uncalibrated hand-eye coordination of robots. Herve [9] demonstrated the feasibility of mapping from image space to robot space. Su et al. proposed an uncalibrated robot hand-eye coordination system that improved the robot's environmental adaptability [5,10,11]. Levine et al. used deep learning on large-scale datasets to achieve hand-eye coordination [12,13]. Learning-based methods use multiple sensors and a large amount of pretraining data [14,15], while the other methods require a manual demonstration of the first few trial motions. Therefore, this type of approach is not well suited to open-environment tasks. Although these methods simplify the challenging calibration and location steps of hand-eye coordination, it becomes more challenging to guide the robot to make correct motion decisions on practical problems. For instance, if the goal of a robot with uncalibrated hand-eye coordination is to play chess, the robot needs to know where the pieces are placed before it can pick them up and place them in the correct position; the robot does not determine for itself where to place a piece.

Robotic Grasping
With the development of hardware technology, it has become possible to use mature three-dimensional laser ranging technology and point cloud processing to achieve three-dimensional reconstruction and complete robot grasping tasks. Several works on this topic have used open-loop planners to determine the best location at which to grasp [1,2,16]. In contrast, our system uses vision-guided hand-eye coordination, which enables closed-loop vision-based control. Therefore, our system can respond to dynamic disturbances and deal with complex environments.
In recent years, there have also been many studies on closed-loop grasping [12,17–28]. For example, the Google team proposed a vision-based deep reinforcement learning algorithm to realize robotic grasping with closed-loop control [10]. Their robot can grasp objects unknown to the model with a grasping success rate of 96%. These works all focus on the grasping success rate. However, most tasks faced by intelligent robots have hard constraints such as a variable environment, time sensitivity, and the need for precise manipulation, so focusing only on grasping is not enough, as in the chess example above. Moreover, solving a tangram puzzle requires not only grasp success (how to grasp) but also task success (how to rotate and put down). Our system, which uses task precision rather than the grasping success rate, adapts well to such tasks while maintaining acceptable manipulation precision.

Visual Feedback
The classic choice for visual feedback is visual features. In recent years, deep learning technology has seen great breakthroughs in automatic feature extraction with the development from the convolutional neural network (CNN) to Faster R-CNN and Mask R-CNN [29–31]. These algorithms can effectively complete instance segmentation and can be used for object detection and object key point detection. For example, they can be used to identify the tangram and its contours in a 2D picture. These algorithms are usually evaluated by recall, precision, and pattern classification indicators. The precision attained so far, however, is far from sufficient to guide the robot through delicate movements. Moreover, the good performance of these algorithms is generally based on large datasets and extensive training in advance, so the cost of data acquisition, manual annotation, high-performance hardware, and other aspects of the final product is very high. In contrast, our system uses simple contour-based methods that require no training, and it adapts to complex backgrounds with enough precision.

System Structure
In this study, we used a vision-guided hand-eye coordination system, as shown in Figure 1. The camera first acquires images of the task environment. Features are extracted from the image by the visual processing module. The problem-solving module confirms the current action according to the environmental information and determines the best task to carry out. The motion planning module guides the robot's manipulation. The robot itself is composed of a controller and a body, forming a closed loop. Thus, the whole system forms a typical double closed-loop system. The robot and its controller belong to the hardware system, which includes the robot hardware, kinematics, and communication systems. The hardware system is not the focus of this paper, but its structure is shown in Appendix A.
According to the principles of Sanderson and Weiss [32] and other related works, each module has multiple choices. The camera can be mounted on the arm of the robot (i.e., eye in hand) or somewhere in the environment (i.e., eye to hand). The visual processing module can have an RGB image or RGB-D image as the input and can select the feature and algorithm accordingly. The problem-solving module can use search, reasoning, or learning methods to achieve its function. The motion planning module can use a sampling-based approach (such as RRT) or a combinatorial approach.

Problem Solving
The robotic grasping task in our study has the following requirements: (1) good real-time performance, meaning the robot needs to respond to real-time changes in the environment; and (2) high manipulation precision, meaning that a large error leads to mission failure. Solving such a task requires a real-time, high-precision system, and we contend that the vision-guided robot hand-eye coordination system is the best choice.

Similar to hierarchical reinforcement learning [33], we propose that the whole task be composed of several subtasks t ∈ T and each subtask be composed of several actions a ∈ A. The key to problem solving is the Task Precision function f(t_i, p), which uses an input subtask t_i and precision p to obtain the number of dynamic divisions n. The precision p can be defined as the relative distance between the target object and the end-effector of the robot. The Task Precision function can be predefined or learned. Finally, we propose Algorithm 1 (Servoing) to guide the robot with continuous control.

Algorithm 1 Servoing
1: Given current image X and Task Precision f(t_i, p)
2: Get state s from visual processing module
3: Calculate p with image X and robot state
4: Get subtask t_i from problem-solving module and s
5: n = f(t_i, p)
6: for 1 ... n do
7:     Execute t_i with motion planning module
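A minimal Python sketch may clarify the control flow of this loop. The module interfaces (`visual_processing`, `problem_solving`, `motion_planning`) and the fixed step size inside `task_precision` are hypothetical stand-ins, not the paper's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class SceneState:
    # hypothetical minimal scene description produced by the visual module
    distance_to_target: float  # precision p (e.g., pixels)

def task_precision(subtask: str, p: float, step: float = 10.0) -> int:
    """Hypothetical predefined Task Precision f(t_i, p): split the remaining
    displacement p into n motion segments of at most `step` units each."""
    return max(1, int(p // step) + (1 if p % step else 0))

def servoing(image, visual_processing, problem_solving, motion_planning):
    """Algorithm 1 Servoing: one closed-loop pass of sense -> decide -> act."""
    s = visual_processing(image)        # state s from visual processing module
    p = s.distance_to_target            # precision p from image and robot state
    t_i = problem_solving(s)            # subtask t_i from problem-solving module
    n = task_precision(t_i, p)          # n = f(t_i, p)
    for _ in range(n):                  # execute t_i n times
        motion_planning(t_i)
    return n
```

In a real system, this pass would be repeated with a fresh image each time, so the visual feedback closes the loop.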

Task Description
The tangram puzzle is a toy composed of seven blocks: five isosceles right triangles, a square, and a parallelogram [34]. The puzzle is shown in Figure 2. In this study, our system framework is applied to the tangram puzzle task. The main process of the task is as follows. (1) Tangram blocks are laid down on the table randomly.
(2) The target state image with a given pattern (see Section 4.4) is known. (3) The robot selects the tangram blocks in the needed order and lays them down in accurate positions (see Section 4.5.4 for more detail).
As we are primarily concerned with our system having acceptable precision of manipulation, for this task, we can make the following assumptions: (1) the objects have one solid color and are not textured, and (2) the offset from the camera center to the gripper center is known. We use a preprogrammed motion to compensate for this offset.

Vision-Guided Hand-Eye Coordination for Tangram Task
In our system, the monocular camera for RGB imaging is mounted on the arm of the robot. We use the search policy in the problem-solving module and the combinatorial approach in the motion planning module.
The tangram task can be regarded as a 2D planar grasp, which means that the target object lies on a planar workspace and the grasp is constrained from one direction. In this case, the height of the gripper is fixed, and the gripper direction is perpendicular to one plane. Therefore, the essential information is simplified from 6D to 3D, including the 2D in-plane positions and the 1D rotation angle.
The key to solving the task lies in understanding the environment. We selected the centers of the shapes as image features. As the offset from the camera center to the gripper center is known, the error between the center of a shape and the center of the two-dimensional image (see Section 4.5.1) can be used to calculate p instead of the relative distance between the target object and the end-effector of the robot, as shown in Figure 3.
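As a small illustration of how p can be computed under this known-offset assumption, the image-center error is simply the pixel distance between the detected shape center and the image center (the coordinates below are hypothetical):

```python
import math

def center_error(shape_center, image_size):
    """Precision p: pixel distance between a detected shape's center and the
    image center. With a known camera-to-gripper offset, this serves as a
    proxy for the end-effector-to-target displacement."""
    cx, cy = image_size[0] / 2.0, image_size[1] / 2.0
    dx, dy = shape_center[0] - cx, shape_center[1] - cy
    return math.hypot(dx, dy)
```

Driving this error toward zero moves the gripper (via the fixed offset) over the target block.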
Additionally, we predefined the Task Precision function f(t_i, p) as follows:

Figure 2. The tangram blocks that need to be operated on in the task.

Visual Processing of the Tangram Puzzle
The complex background forces the system to perform effective preprocessing on the images. To ensure that the system can recognize and extract features of images such as shape, position, and pose, the shape recognition and rotation calculations are added to the visual processing module. Figure 4 illustrates the stages of precise object recognition.


Shape Recognition
Considering (1) the lack of a large number of training samples, (2) the requirement of minimal computing resources and fast computing speed, and (3) the need for fine-grained image analysis, the system adopts basic image processing methods and mathematical judgment instead of a neural network. First, simple image enhancement operations are applied to the acquired image, such as Gaussian de-noising and brightness and contrast adjustment [35]. To filter out different blocks according to color, the system converts the image from RGB to HSV for a more favorable segmentation interval. Simultaneously, morphological dilation and erosion operations are used to optimize regional boundaries [35].
The initial results were not satisfying because of the complexity of the background and the limitations of the color filtering algorithm. The system needs to remove the noise that disturbs the task, and this process runs through the entire shape recognition procedure. First, small areas (e.g., the areas in the red boxes of Figure 5b) are filtered out by area computation, and the remaining regions are distinguished by shape, as shown in Figure 5c. The system determines the vertices of each region (e.g., the regions in Figure 5c) by approximating its contour. According to the number of vertices, the regions are divided into two types. The triangles are then identified according to their angles and the relationships between the three sides. The square and the parallelogram can be distinguished by the aspect ratio, but the parallelogram needs to be further confirmed by checking whether its opposite sides are parallel.
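The vertex-based classification step can be sketched in pure Python as follows; in the real pipeline the vertex lists would come from contour approximation (e.g., OpenCV's `approxPolyDP`), and the tolerance thresholds here are illustrative assumptions rather than the paper's tuned values:

```python
import math

def classify_shape(vertices):
    """Classify an approximated contour by vertex count, then separate the
    quadrilaterals: square vs parallelogram via aspect ratio and parallel
    opposite sides. Vertices are (x, y) tuples in contour order."""
    n = len(vertices)
    if n == 3:
        return "triangle"
    if n == 4:
        sides = [(vertices[(i + 1) % 4][0] - vertices[i][0],
                  vertices[(i + 1) % 4][1] - vertices[i][1]) for i in range(4)]
        lengths = [math.hypot(dx, dy) for dx, dy in sides]
        # opposite sides parallel <=> cross product near zero
        cross = sides[0][0] * sides[2][1] - sides[0][1] * sides[2][0]
        if abs(cross) > 1e-6 * lengths[0] * lengths[2]:
            return "unknown"
        aspect = max(lengths) / min(lengths)
        dot = sides[0][0] * sides[1][0] + sides[0][1] * sides[1][1]
        # near-equal sides and right angles -> square, otherwise parallelogram
        if aspect < 1.1 and abs(dot) < 1e-6 * lengths[0] * lengths[1]:
            return "square"
        return "parallelogram"
    return "unknown"
```

The area-based noise filtering described above would run before this step, so only plausible block contours reach the classifier.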

Rotation Computation
Seven tangram blocks are placed on the table with random positions and poses. To arrange the scattered blocks into a preconceived pattern (i.e., a target state image), the system needs to recognize the current pose of each block and rotate it to fit the target state image. Based on the recognition process in Section 4.3.1, the rotation angle α can be calculated from the detected image and the target state image, as shown in Figure 6. The difficulty of rotation computation lies in the accuracy of the mathematical calculations, which depends on the accuracy of contour and corner recognition. The entire process therefore requires detailed and accurate information to be extracted from the image; accuracy here reduces the positioning error in the later stage of the task.
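One way to compute α, sketched here under our own assumption that corresponding reference edges of the detected and target contours are known, is to take the angle between those edges:

```python
import math

def rotation_angle(detected_vertices, target_vertices):
    """Rotation alpha (degrees) between a detected block and its target pose,
    taken as the angle between corresponding reference edges (the first two
    vertices of each contour). Assumes the vertex orderings correspond."""
    def edge_angle(v):
        dx, dy = v[1][0] - v[0][0], v[1][1] - v[0][1]
        return math.degrees(math.atan2(dy, dx))
    alpha = edge_angle(target_vertices) - edge_angle(detected_vertices)
    return (alpha + 180.0) % 360.0 - 180.0  # wrap into (-180, 180]
```

Because the result depends directly on the detected vertex positions, any contour or corner error propagates into α, which is exactly the accuracy concern noted above.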

Tangram Problem Solving
Our system uses the search algorithm to solve the problem and find the target state in the state space [36]. Newell and Simon put forward the means-ends analysis method in 1961 [37]. By analyzing the actions required by the environmental state to be achieved and the influence of actions on the environmental state, search-based problem solving can be realized.
To solve the problem in this study, we can use the means-ends analysis method, composed of the subtasks of the robot [38]. We do not need to care about micro-level details or details about the hardware. Instead, we can define subtasks at a higher level, such as the impact of actions on the task environment. At the same time, we need to describe whether a particular action changes the task environment and describe the state of the environment with a series of predicates.
Based on this concept, we can use the STRIPS approach [39], which uses three action elements (i.e., premise, add list, delete list) to represent the mission planning process. The first element is the set of premises (P), which must be satisfied when an operation is applied. The second element is the add list (A), the facts that applying an operation adds to the environment state. The third element is the delete list (D), the facts that applying an action removes from the environment state. Such a series of actions can be organized within a triangular table.
The environment state is described with a series of predicates:

gripping(X) - the robot holds tangram block X in its hand
ontable(X) - tangram block X is on the table
located(X) - tangram block X's position is known
rotated(X) - tangram block X has been rotated to the specified angle
inplace(X) - tangram block X is in the specified position
havepattern() - information about the target pattern is obtained
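These predicates and the P/A/D action elements can be sketched as a STRIPS-style operator table; the concrete premises and effects below are illustrative assumptions rather than the paper's exact triangular table:

```python
# Hypothetical STRIPS-style operators over the tangram predicates: each has
# premises (P), an add list (A), and a delete list (D). gripping() denotes
# an empty gripper, as in the initial state.
OPERATORS = {
    "Pickup(X)": {
        "P": {"ontable(X)", "located(X)", "gripping()"},
        "A": {"gripping(X)"},
        "D": {"ontable(X)", "gripping()"},
    },
    "Place(X)": {
        "P": {"gripping(X)", "rotated(X)", "havepattern()"},
        "A": {"inplace(X)", "gripping()"},
        "D": {"gripping(X)"},
    },
}

def apply_operator(state, op_name):
    """Apply an operator if its premises hold; return the new state set."""
    op = OPERATORS[op_name]
    if not op["P"] <= state:
        raise ValueError(f"premises of {op_name} not satisfied")
    return (state - op["D"]) | op["A"]
```

Representing the state as a set of predicate strings makes each action a pure set operation, which mirrors how a triangular table records which facts each step consumes and produces.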

Define the initial state of the tangram task as ontable(X_i) ∧ gripping() ∧ camera(), where X_i corresponds to the seven pieces of the tangram puzzle, i = 1-7.
The completed status is inplace(X_i). In Figure 7, there is no set of premises for the total task Tangram(), which indicates the start of the task. Arrows indicate branch structures (i.e., moves to a different action). In Nextpuzzle(), True indicates the presence of a disassembled tangram block, and Nextpuzzle() moves to Place(X), whereas False ends the task. In Evaluate(X), True means a correct placement, whereas False means a failed placement, and Evaluate(X) changes the environment state back to ontable(X).
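To make the operator representation concrete, the STRIPS triple over these predicates can be sketched as follows. This is our own illustration: the `Pickup` operator and its premise/add/delete sets are hypothetical, not the authors' exact operator definitions.

```python
# Minimal STRIPS-style operator: premises (P), add list (A), delete list (D).
# Predicate names follow the text; the data structures are illustrative.

def applicable(state, premises):
    """An operator may fire only when all its premises hold in the state."""
    return premises <= state

def apply_op(state, premises, add_list, delete_list):
    """Apply an operator: check the premises, remove D, then add A."""
    if not applicable(state, premises):
        raise ValueError("premises not satisfied")
    return (state - delete_list) | add_list

# Example: a hypothetical Pickup(X) operator for tangram block X.
X = "square"
state = {f"ontable({X})", f"located({X})"}
premises = {f"ontable({X})", f"located({X})"}
add_list = {f"gripping({X})"}
delete_list = {f"ontable({X})"}

new_state = apply_op(state, premises, add_list, delete_list)
# After Pickup, the robot holds the block and it is no longer on the table.
```

Chaining such operators until inplace(X_i) holds for all blocks is exactly the search-based planning described above.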
With the aid of the triangle table, a flowchart for solving the tangram puzzle can be drawn (see Figure 8). The implementation of ontable(X) uses a stack to determine whether a tangram block is on the desktop.
Before motion planning, the system analyzes the target state image, which we design in the client; Figure 9 illustrates an example (a dog pattern). The system uses the shape recognition algorithm introduced in Section 4.3.1 to process the target state image and obtain the color, shape, and center coordinates of each tangram block. After sorting the center coordinates as needed, the robot places each tangram block in turn.

Motion Planning
The motion planning module determines how to execute the atomic actions of the robot. For instance, locating the tangram blocks requires the robot to perform multiple moves, as discussed in the following subsection.

Locating the Tangram Blocks
The main task of the locating operation is to obtain the error between the image's center and the tangram block's center by using the visual feedback module so that the robot can accurately differentiate a specified tangram block from the randomly placed blocks. Figure 10 shows the algorithm flow of this task.
In Figure 10, cnt is the user-defined count of fault tolerances, and its initial value is 3. If the camera captures an image three times and the system still cannot correctly identify the tangram block with the specified color and shape, an error is reported.
After correctly identifying the tangram block, the system determines whether the block is in the center of the camera's field of view, where P0(x0, y0) is the coordinate of the tangram block's center in the image and P1(x1, y1) is the coordinate of the image's center. The coordinates are shown in Figure 11.
The system defines a threshold thresh (in pixels). When |x1 − x0| < thresh and |y1 − y0| < thresh, the system considers the block to be in the center of the camera's field of view. Otherwise, the system calculates the correct direction of movement from the vector P0P1 according to the relationship between the robot's world coordinates and the image coordinates. The end-effector then moves one step in that direction to gradually approach the target block. The value of thresh and the size of the step are reduced adaptively with the height of the robot's end-effector.
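The locating loop can be sketched as follows. Here `get_block_center` and `move_step` are hypothetical stand-ins for the visual feedback module and the robot motion interface, and the fixed `thresh`/`step` values are illustrative; the real system shrinks both with the end-effector's height.

```python
# Visual-servoing sketch: step the end-effector until the block center P0
# lies within `thresh` pixels of the image center P1.

def center_on_block(get_block_center, image_center, move_step,
                    thresh=20.0, step=10.0, cnt=3):
    """Iteratively center the camera on the block; cnt is the
    fault-tolerance count from Figure 10."""
    x1, y1 = image_center
    while True:
        p0 = None
        for _ in range(cnt):                     # up to cnt capture attempts
            p0 = get_block_center()
            if p0 is not None:
                break
        if p0 is None:
            raise RuntimeError("block not identified after %d captures" % cnt)
        x0, y0 = p0
        if abs(x1 - x0) < thresh and abs(y1 - y0) < thresh:
            return p0                            # block is centered
        dx, dy = x1 - x0, y1 - y0                # direction of the vector P0P1
        norm = max((dx ** 2 + dy ** 2) ** 0.5, 1e-9)
        s = min(step, norm)                      # do not overshoot the target
        move_step(s * dx / norm, s * dy / norm)

# Toy simulation: the block's apparent image position shifts as the arm moves.
pos = [100.0, 100.0]

def fake_move(dx, dy):
    pos[0] += dx
    pos[1] += dy

final = center_on_block(lambda: tuple(pos), (320.0, 240.0), fake_move)
```

The toy simulation converges because each step moves the apparent block center along P0P1 toward the image center.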

Picking up Tangram Blocks
After locating a specified block, the block is in the center of the camera's field of view. Through hand-eye coordination, the end-effector of the robot is then moved to the grasping position, and the block is picked up by the electric gripper.

Rotating Tangram Blocks
After correctly locating and grabbing the tangram block, we focus on how to accurately place it. First, the system calculates the rotation angle of the block by using the rotation algorithm introduced in Section 4.3.2. Then, the robot uses its end-effector to rotate the block to adjust the block's pose.

Putting down Tangram Blocks
This section describes how to calculate the position of tangram blocks in the planar world coordinate of the robot. We refer to the images with special patterns (see Figure 9) as goal state images. By proportionally magnifying the length and width of the goal state image, the system sets a puzzle area to put down tangram blocks in the robot's planar world coordinate. The coordinate value of the tangram block's center in the world coordinate is calculated according to the pixel coordinate values of the tangram block's center in the goal state image. Since the initial height of the robot's end-effector is known, after obtaining the coordinate value of the block's center, the robot can accurately put down blocks in the puzzle area. Finally, all blocks are placed in the pattern indicated in the goal state image.
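The mapping from a goal-image pixel coordinate to a put-down position is a proportional magnification plus an offset, as sketched below. The `origin` offset and the example coordinates are our assumptions; the text only states that one pixel corresponds to about 70/256 mm in this setup.

```python
# Sketch: map a block-center pixel coordinate in the goal state image to a
# put-down position in the robot's planar world frame.

def pixel_to_world(px, py, origin, scale):
    """Proportionally magnify goal-image pixel coordinates (px, py) into
    planar world coordinates, offset by the puzzle-area origin (in mm)."""
    wx = origin[0] + px * scale
    wy = origin[1] + py * scale
    return wx, wy

# Example: a block center at pixel (128, 64) with the 70/256 mm/pixel scale
# and a hypothetical puzzle-area origin at (200, 100) mm.
x, y = pixel_to_world(128, 64, origin=(200.0, 100.0), scale=70.0 / 256.0)
```

Since the initial height of the end-effector is known, these two coordinates fully determine the put-down motion.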

Evaluating Tangram Blocks
The evaluation section provides a function with which to check the tangram blocks placed by the robot (shown in Figure 12). The system defines a set to record the placed blocks. It checks whether each tangram block recorded in the collection is at the specified puzzle area. If not, the tangram block is pushed into the stack.
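The check can be sketched as follows; `in`-area testing against a rectangular puzzle region and the example data are our own illustration of the set-and-stack bookkeeping described above.

```python
# Sketch of the evaluation step: placed blocks are recorded, each is checked
# against the puzzle area, and failures are pushed onto a stack so the robot
# can place them again.

def evaluate_blocks(placed, positions, puzzle_area):
    """Return the stack of blocks that must be placed again.
    puzzle_area is (xmin, ymin, xmax, ymax) in world coordinates."""
    xmin, ymin, xmax, ymax = puzzle_area
    redo_stack = []
    for block in placed:                 # the recorded collection of placements
        x, y = positions[block]
        inside = xmin <= x <= xmax and ymin <= y <= ymax
        if not inside:
            redo_stack.append(block)     # push the misplaced block
    return redo_stack

# Example: one block has drifted outside the puzzle area.
placed = ["triangle1", "square"]
positions = {"triangle1": (50.0, 40.0), "square": (300.0, 40.0)}
stack = evaluate_blocks(placed, positions, puzzle_area=(0, 0, 100, 100))
```

Blocks left on the stack re-enter the ontable(X) state and are handled by the next planning cycle.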

Visual Feedback Statistical Experiment
To demonstrate the effectiveness of our visual feedback processing algorithm, we compare its performance with the widely used visual feedback algorithms (Faster R-CNN [21] and Mask R-CNN [22]). In our visual feedback experiment, the background is complex, while the position and pose of the object are changeable.

Visual Feedback Indicators
The square error in a picture is defined as E_square = (∑_i ‖p_i − g_i‖²)/n, where p_i is the position of the center of mass of the i-th tangram block in the picture, g_i is the ground-truth position of the center of mass of the i-th tangram block, and n is the number of blocks. In other words, we evaluate the accuracy of the visual feedback by calculating the deviation of the center of mass of each tangram block.
Figure 13 provides a visual illustration of the square error. The Mask R-CNN algorithm predicts the position of each tangram block and obtains the object's mask, from which the mass center of each block can be computed. Our algorithm predicts the center of mass from the contour of each tangram block.
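The indicator can be sketched directly from its definition; the two-block example values below are illustrative, not measured data.

```python
# Sketch of the square-error indicator: the average squared deviation of the
# predicted mass centers from the ground-truth centers in one picture.

def square_error(predicted, truth):
    """E_square = (sum over i of ||p_i - g_i||^2) / n for one picture."""
    assert len(predicted) == len(truth)
    n = len(predicted)
    total = 0.0
    for (px, py), (gx, gy) in zip(predicted, truth):
        total += (px - gx) ** 2 + (py - gy) ** 2
    return total / n

# Example with two blocks, each predicted one pixel off along one axis.
e = square_error([(10.0, 10.0), (20.0, 20.0)],
                 [(11.0, 10.0), (20.0, 21.0)])
```

Averaging this quantity over all pictures in a trial gives the per-picture result used in the statistics below.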

Visual Feedback Statistical Results
We count the square errors of 400 pictures and take the average error of all the blocks of each picture as an experimental result. As shown in Figure 14, our result is generally equal to or better than that of the comparison algorithm.

Tangram Experiment
The tangram blocks are placed in any position and pose in a complex background. Then, the robot makes reasonable decisions to complete the tangram puzzle based on the target state randomly assigned by us. During the execution of the task, we can change the task environment at will. For example, we can verify whether the robot can still make correct decisions when we remove a block that had been placed previously or is currently being identified.

Tangram Indicators
The gap error between tangram blocks is defined as E_gap = (∑_{i<j} |d_ij − r_ij|)/21, where 0 < i < j < 8, d_ij is the distance between the mass centers of any two tangram blocks placed by the robot after finishing the tangram puzzle, and r_ij is the distance between the center points of the corresponding two tangram blocks in the target state image. The difference between the two values reflects the gap between the two tangram blocks. In this experiment, one pixel corresponds to 70/256 ≈ 0.273 mm in the real environment.
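The indicator can be sketched as follows. We assume here that both center lists are already expressed in the same unit (e.g., after converting the target-image pixel distances to mm with the stated scale); the three-point example is illustrative.

```python
# Sketch of the gap-error indicator over all pairs of block centers:
# the placed pairwise distances d_ij are compared with the target-image
# pairwise distances r_ij.

from itertools import combinations
from math import hypot

def gap_error(placed, target):
    """E_gap = (sum over i<j of |d_ij - r_ij|) / number of pairs."""
    assert len(placed) == len(target)
    pairs = list(combinations(range(len(placed)), 2))
    total = 0.0
    for i, j in pairs:
        d = hypot(placed[i][0] - placed[j][0], placed[i][1] - placed[j][1])
        r = hypot(target[i][0] - target[j][0], target[i][1] - target[j][1])
        total += abs(d - r)
    return total / len(pairs)

# For the full 7-piece puzzle there are C(7, 2) = 21 pairs, matching the /21
# in the definition.  Illustrative 3-point example:
e = gap_error([(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)],
              [(0.0, 0.0), (3.0, 4.0), (6.0, 9.0)])
```

Using pairwise distances makes the indicator invariant to where the whole pattern sits on the table, which is why it measures gaps rather than absolute placement.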

Dog Pattern Experiment
One of our examples, as shown in Figures 15-17, consists of an input state, target state, and our subtasks. The experimental indicators of this example are shown in Table 1. We list all the gap errors between any two tangram blocks.

Table 1. Gap error (mm) between any two tangram blocks (target state vs. placed result).
(a) the task environment; (b) the process of grabbing after recognition.

Statistical Experiment
A large number of experiments are used to test the performance of the proposed hand-eye coordination system for the tangram task. The target state images used are given in Figure 18, and the errors are given in Figure 19.

In Figure 20, the solid red line represents the average gap error in one trial. The dotted red line shows the total average error; it is generally small, and the task accuracy is high. The dotted blue line is the total standard deviation, which shows that the error does not fluctuate very much.
For the 100 trials, the number of trials is counted according to a certain range, indicating the error distribution. Figure 18 shows that most trials are concentrated near the average value, and 99% of the trials are concentrated within the range of 0.5-2.5 mm. The data are normally distributed. Therefore, the entire experiment is dominated by accidental errors [40].

Conclusions
This study presents a vision-guided hand-eye coordination system for robotic grasping. Our system is a lightweight, intelligent decision-making system utilizing complete closed-loop control. Owing to the implementation of the concept of task precision, the robot was able to complete some complex tasks through hand-eye coordination in a dynamic and changing environment and achieve high precision similarly to how humans can.
In this study, the robust system showed high adaptability and fault tolerance in the tangram task. This is helpful for promoting research on intelligent tasks such as cup-carrying by a household-service robot.
The hand-eye coordination system in this study needs to know the hand-eye relationship and the distance between the end-effector and the operating surface in advance. In order to improve intelligence, the system's dependence on prior knowledge should be reduced to adapt to various hand-eye relationships. This is a possible direction for follow-up research.
Data Availability Statement: All data included in this study are available upon request by contacting the corresponding author.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
Appendix A.1. Robot Hardware
The STEP robot hardware system is composed of a body, robot controller, and control cabinet, as shown in Figure A1a. The body consists of six axes through which the robot moves, as shown in Figure A1b. The robot controller has basic software to control the robot operation with a series of control functions. The control cabinet contains the control hardware of the six axes, and it loads the control program written by CodeSys. The control program is the intermediary between the user and the robot controller.

Appendix A.2. Kinematic Modeling
In this system, because an electric gripper was added, the connecting rod parameters of the robot needed to be added to the kinematics equation. Thus, the electric gripper can be regarded as an ideal Cartesian element.
According to the improved D-H representation, the transformation matrix from frame {i − 1} to frame {i} can be written as follows, where the parameters are shown in Table A1:

i−1_iT =
[ cθ_i            −sθ_i            0          a_{i−1}        ]
[ sθ_i·cα_{i−1}    cθ_i·cα_{i−1}  −sα_{i−1}  −d_i·sα_{i−1}   ]
[ sθ_i·sα_{i−1}    cθ_i·sα_{i−1}   cα_{i−1}   d_i·cα_{i−1}   ]
[ 0                0               0          1              ]
where cθ i is the cosine of θ i , which is abbreviated as c i below, and sθ i is the sine of θ i , which is abbreviated as s i below.
The relationship between the end of the mechanical arm and the base is 0_6T = 0_1T · 1_2T · 2_3T · 3_4T · 4_5T · 5_6T. According to the general formula, the transformation matrix of each connecting rod can be obtained, where s_23 is sin(θ_2 + θ_3) and c_23 is cos(θ_2 + θ_3).
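The chain composition can be sketched as follows under the modified D-H convention; the link-parameter tuples passed in are placeholders, not the values of Table A1.

```python
# Sketch of composing 0_6T = 0_1T 1_2T ... 5_6T with modified D-H transforms.

import math

def dh_transform(alpha_prev, a_prev, d, theta):
    """Modified D-H transform from frame {i-1} to frame {i} as a 4x4
    row-major nested list."""
    ct, st = math.cos(theta), math.sin(theta)
    ca, sa = math.cos(alpha_prev), math.sin(alpha_prev)
    return [[ct,      -st,      0.0, a_prev],
            [st * ca,  ct * ca, -sa, -d * sa],
            [st * sa,  ct * sa,  ca,  d * ca],
            [0.0,      0.0,      0.0, 1.0]]

def matmul(A, B):
    """4x4 matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def forward_kinematics(params, thetas):
    """Chain the link transforms; params is a list of
    (alpha_{i-1}, a_{i-1}, d_i) tuples, one per joint."""
    T = [[float(i == j) for j in range(4)] for i in range(4)]
    for (alpha, a, d), theta in zip(params, thetas):
        T = matmul(T, dh_transform(alpha, a, d, theta))
    return T

# Example: a single link with a placeholder offset a = 2 mm, d = 3 mm.
T = forward_kinematics([(0.0, 2.0, 3.0)], [0.0])
```

With the six (alpha, a, d) rows of Table A1 and the joint angles θ_1 ... θ_6, the same function yields 0_6T.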

Appendix A.3. Sequential Instruction Communication Protocol
To solve the transmission problem of the sequential motion instruction of the robot on the half-duplex port, we adopted a sequential instruction communication protocol based on a semaphore. Six steps complete the motion command communication and execution. Send and Finish are used to control the two semaphores between CodeSys and the robot controller program, where Send indicates whether new instructions were sent and Finish indicates whether the robot completed an action. The CodeSys program communicates with the user's Python program through a socket and with the robot controller program through a port.
The steps are shown in Figure A2. Send and Finish have initial values of F (false). Send is transmitted on port 4, and Finish on port 5.
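The Send/Finish handshake can be sketched as follows. The port numbers 4 and 5 come from the text, but the shared-flag transport below is a stand-in for the real CodeSys socket/port plumbing, not the actual implementation.

```python
# Sketch of the semaphore-based Send/Finish handshake over the half-duplex
# link between the user program and the robot controller.

import threading

class InstructionChannel:
    """Send (port 4) and Finish (port 5) flags, both initially F (False)."""
    def __init__(self):
        self.lock = threading.Condition()
        self.send = False       # True when a new instruction is pending
        self.finish = False     # True when the robot completed the action
        self.instruction = None

    def send_instruction(self, instruction):
        """User side: post one motion instruction and wait until it is done."""
        with self.lock:
            self.instruction = instruction
            self.send = True
            self.lock.notify_all()
            while not self.finish:
                self.lock.wait()
            self.finish = False          # reset for the next instruction

    def run_robot_once(self):
        """Controller side: wait for Send, execute, then raise Finish."""
        with self.lock:
            while not self.send:
                self.lock.wait()
            self.send = False
            done = self.instruction      # "execute" the motion command here
            self.finish = True
            self.lock.notify_all()
            return done

# Usage: a controller thread services one instruction posted by the user side.
ch = InstructionChannel()
result = []
worker = threading.Thread(target=lambda: result.append(ch.run_robot_once()))
worker.start()
ch.send_instruction("MOVE J1 30")        # hypothetical motion command
worker.join()
```

Because each side blocks on the other's flag, at most one party drives the half-duplex link at a time, which is the point of the protocol.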