Design of Demonstration-Driven Assembling Manipulator

Abstract: Currently, a mechanical arm or manipulator must be programmed by humans in advance to define its motion trajectory before practical use. However, such programming is tedious and costly, which prevents these manipulators from performing various different tasks easily and quickly. This article focuses on the design of a vision-guided manipulator that requires no explicit human programming. The proposed demonstration-driven system mainly consists of a manipulator, a control box, and a camera. Instead of programming the detailed motion trajectory, one only needs to show the system manually how to perform a given task. Based on internal object recognition and motion detection algorithms, the camera captures the information of the task to be performed and generates the motion trajectories that let the manipulator copy the human demonstration. The movement of the manipulator's joints is given by a trajectory planner in the control box. Experimental results show that the system can imitate humans easily, quickly, and accurately for common tasks such as sorting and assembling objects. Teaching the manipulator how to complete the desired motion helps eliminate the complexity of programming for motion control.


Introduction
Assembly is required when a machine is built of many individual parts. In the past few decades, assembly has been manually completed by workers [1]. Thanks to the rapid development of industrial automation, more and more work is now completed by machines [2]. Peg-hole assembly, flat panel parts assembly, and many other automatic assembly machines have been invented for this purpose [3,4].
Though these assembly machines can accomplish the desired task satisfactorily, they are usually complicated and can only be used in specific scenarios. In a large-scale automobile assembly system, flexible and adaptable assembly technologies and strategies include robotic fixtureless assembly, self-reconfigurable assembly systems, and increased modularity in the assembly process [5]. In flexible manufacturing, the assembly machine is required to perform various tasks easily and quickly. However, it is expensive to adjust an existing assembly line, since the equipment must be replaced to perform different tasks. Moreover, different tasks impose different requirements on the robot's initial and final postures as well as on the placement locations of the workpieces. The traditional assembly system not only has difficulty sorting and assembling various different workpieces, but may also fail to grasp workpieces whose required positions change dynamically across tasks, which negatively influences production efficiency [6,7].
To further improve the universality of assembly machines, 6-DOF (Degree Of Freedom) robot manipulators are widely used in automated assembly because of their flexibility. However, how to control the manipulators efficiently, accurately, and easily is always a difficult and urgent problem.

Vision-guided manipulators have been proposed to tackle this problem [8][9][10]. To improve the performance of object recognition in such systems, Laptev proposed a method of boosted histograms [11]. Aivaliotis and Zampetis presented a method for the visual recognition of parts using machine learning algorithms to enable the manipulation of parts [12]. Zhang proposed an image representation method using boosted random contextual semantic spaces [13]. Tsarouchi proposed an online two-dimensional (2D) vision system combined with offline data from CAD (Computer Aided Design) files for the computation of three-dimensional (3D) POI (Points of Interest) coordinates [14].
In robotic assembly with a vision system, many new technologies are emerging. Corner detection is quite efficient in part identification and quick enough for online applications [15]. Part recognition using vision and an ultrasonic sensor installed at the end-effector of the robot has been used to recognize objects for a robotic assembly system [16]. Two coordinated cameras have been used to capture images for golf club head production; the system adopts automated mechanical arms matched with cameras for 3D space alignment [17]. For large-scale components, a robotic assembly system guided by two vision sensors and three one-dimensional (1D) laser sensors was realized to automatically localize and search for the best 3D position and orientation between the object and the installation location [18].
This article focuses on a demonstration-driven industrial manipulator based on the object recognition and vision-guided method. One can demonstrate how to assemble the workpieces manually to the manipulator in advance, and then the manipulator can understand the task based on visual information processing. Teaching the manipulator how to complete tasks by simple demonstration can eliminate the complexity of explicit programming for manipulator control for different tasks.

Methods
The proposed system shown in Figure 1 works in two phases: the demonstration phase and the automatic assembly phase.
First, in the demonstration phase, one manually shows the manipulator each step (e.g., moving one workpiece on top of another) of a specific task. The 3D camera obtains the spatial positions of the workpiece dynamically during the demonstration and then transforms these positions into the coordinate system of the industrial manipulator. According to the inverse kinematic model of the industrial manipulator, the control information of each manipulator joint is obtained for each position of the workpiece. Then, in the auto-assembly phase, the manipulator is driven to complete the task automatically based on the control information generated above. The proposed automatic assembly system consists of a manipulator, a control box, and a camera, as Figure 2 shows. NVIDIA Jetson TK1 was chosen as the image processing module for this system. YOLO (You Only Look Once) [19,20] was adopted and implemented for object detection. It is a fast algorithm based on a convolutional neural network that can accurately detect targets of interest in the input image, at up to 45 frames per second.
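The mapping of camera-frame workpiece positions into the manipulator's coordinate system described above amounts to applying a calibrated rigid-body (homogeneous) transform. The following sketch illustrates the idea in pure Python; the rotation, translation, and point values are hypothetical and for illustration only, not the system's actual calibration data.

```python
def make_transform(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    t = [[rotation[i][0], rotation[i][1], rotation[i][2], translation[i]]
         for i in range(3)]
    t.append([0.0, 0.0, 0.0, 1.0])
    return t

def apply_transform(t, point):
    """Map a 3D point from the camera frame into the manipulator base frame."""
    x, y, z = point
    return tuple(t[i][0] * x + t[i][1] * y + t[i][2] * z + t[i][3]
                 for i in range(3))

# Hypothetical calibration: camera frame rotated 180 deg about x, then shifted.
R = [[1, 0, 0],
     [0, -1, 0],
     [0, 0, -1]]
p0 = [0.10, 0.00, 0.50]            # camera origin in the robot base frame (m)
T_cam_to_robot = make_transform(R, p0)

workpiece_cam = (0.02, 0.03, 0.40)  # position observed by the 3D camera
workpiece_robot = apply_transform(T_cam_to_robot, workpiece_cam)
```

In the real system this transform would come from the calibration step described later in the paper, and the resulting robot-frame position would feed the inverse-kinematics solver.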
The robot control circuit uses an STM32F103 embedded controller to drive seven servomotors, six of which are identical joint servomotors, while the seventh is a claw servomotor. The execution time of the control algorithm on the STM32 microcontroller is about 3 ms. The dynamic parameters of the six joint servomotors are shown in Table 1. The theoretical safety factor can be obtained as S = M*/M, where M* is the maximum output torque of the X20-8.4-50 steering gear and M is the theoretical total torque calculated from the actual mass and length. The safety factor therefore provides sufficient allowance for the uncalculated friction and non-inertial forces of the mechanism.
The visual signal processing hardware (NVIDIA TK1), including a 3D camera, can recognize the objects in real-time and capture their motions. According to the signals given by TK1, the manipulator controller can grasp the object, move it to a specified place (as demonstrated by a human), and release it. A wireless communication module is used to increase the flexibility of the manipulator. The structure of the control box is shown in Figure 3.
The underlying models for robot control are shown in Figure 4. The boxes with solid lines next to the green arrows represent the input image containing target objects. The blue boxes represent the sensor and controller components (including a camera, upper computer, lower computer, and rudder). The boxes with solid lines next to the orange arrows represent the output of each component. The boxes with dotted lines next to the green arrows represent the transmission link.
The control interface of the upper computer is written in C#. It takes the position and shape of the target object as input. Given the arm lengths of the manipulator and the initial value of each joint, the rudder angles of each joint are solved by the D-H algorithm and sent to the lower computer. The upper computer can initialize and configure the minimum and maximum rotation angles of each joint rudder to ensure that there is no collision with the parts or the environment. The upper computer also has several other functions that enable fast adjustment of the manipulator's configuration. First, the rotation angle of each joint rudder can be adjusted and the initial values can be set manually. Second, aided by a human's demonstration, the manipulator can move along a specific trajectory. The angle values of the joint rudders are sent to the lower computer through the wireless module, and the rotation of each joint rudder is controlled through PWM (Pulse Width Modulation), which works at 10 Hz. The angle of each transformation can be set by the upper computer, and the velocity range is 1°/s to 180°/s. To prevent jitter, progressive acceleration and deceleration are used when starting and stopping.
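The progressive acceleration and deceleration used to prevent jitter can be sketched as a simple velocity ramp evaluated at the 10 Hz PWM update rate. The function below is an illustrative assumption, not the firmware's actual code; the speed and acceleration limits are hypothetical values within the stated 1°/s to 180°/s range.

```python
def ramp_profile(total_deg, v_max, a, dt=0.1):
    """Per-tick joint angles with gradual acceleration and deceleration.

    total_deg: total rotation (deg); v_max: speed limit (deg/s);
    a: acceleration (deg/s^2); dt = 0.1 s matches the 10 Hz PWM update rate.
    """
    angles, pos, v = [], 0.0, 0.0
    while pos < total_deg:
        remaining = total_deg - pos
        # Decelerate once the stopping distance v^2/(2a) reaches the
        # remaining angle; keep a minimum creep speed so motion finishes.
        if v * v / (2 * a) >= remaining:
            v = max(v - a * dt, a * dt)
        else:
            v = min(v + a * dt, v_max)
        pos = min(pos + v * dt, total_deg)
        angles.append(pos)
    return angles

# Hypothetical 90-degree joint move at up to 60 deg/s, 120 deg/s^2.
profile = ramp_profile(90.0, v_max=60.0, a=120.0)
```

The profile ramps up from rest, cruises at the speed limit, and ramps back down, so the servo never receives a step change in commanded velocity.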

Kinematics Modeling of Manipulator
The D-H kinematics model (as shown in Figure 5) of the 6-DOF manipulator is solved, and the forward and inverse kinematics equations are obtained, which provide the design parameters for intelligent control of the manipulator. According to the actual grasping situation, the degrees of freedom of the manipulator are simplified, and the complex mathematical model is transformed into a simple kinematic equation. From the parameters of the D-H linkage model of the manipulator joints [21,22], the transformation matrix between adjacent joints is obtained. The pose expression of the end-effector (Equation (3)) follows directly from Equation (2). The joint angles in the D-H coordinate system are shown in Table 2.
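The chaining of adjacent-joint transformation matrices into the end-effector pose can be sketched as follows. The D-H matrix is the standard Denavit-Hartenberg form; the two-link parameters at the end are a hypothetical planar check case, not the manipulator's actual Table 2 values.

```python
import math

def dh_matrix(theta, d, a, alpha):
    """Standard D-H homogeneous transform between adjacent joint frames."""
    ct, st = math.cos(theta), math.sin(theta)
    ca, sa = math.cos(alpha), math.sin(alpha)
    return [[ct, -st * ca,  st * sa, a * ct],
            [st,  ct * ca, -ct * sa, a * st],
            [0.0,      sa,       ca,      d],
            [0.0,     0.0,      0.0,    1.0]]

def matmul(A, B):
    """4x4 matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def forward_kinematics(dh_rows):
    """Chain the per-joint transforms into the end-effector pose."""
    T = [[float(i == j) for j in range(4)] for i in range(4)]
    for row in dh_rows:                 # row = (theta, d, a, alpha)
        T = matmul(T, dh_matrix(*row))
    return T

# Hypothetical check: two unit-length planar links, both joints at 90 deg;
# the end-effector should land at (-1, 1, 0).
pose = forward_kinematics([(math.pi / 2, 0.0, 1.0, 0.0),
                           (math.pi / 2, 0.0, 1.0, 0.0)])
```

The last column of the resulting matrix is the end-effector position; the upper-left 3x3 block is its orientation, matching the role of Equation (3).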

Automatic Control of the Manipulator
The 3D camera obtains the 2D image and the depth information at each position. After being instructed and trained, the manipulator completes the automatic assembly task according to the teaching process. During the teaching process, the workpiece is held by hand and moved from the initial assembly posture to the final assembly posture, as shown in Figure 6. Firstly, the coordinate systems of the manipulator and the 3D camera are calibrated, and the transition matrix from the coordinate system of the 3D camera to that of the manipulator is calculated.
Secondly, the 3D camera takes a picture of the working scene, with the workpiece in its original pose, and the SIFT (scale-invariant feature transform) [23,24] features of the scene image are calculated. The position of the workpiece is then identified by its color, and the SIFT features of the workpiece are obtained at that position. Two feature points A and B are picked, namely the two points farthest away from each other; the vector AB is taken as the original attitude of the workpiece, and point A is taken as the original position of the workpiece. The feature points A and B and the vector AB are all in the coordinate system of the 3D camera.
Thirdly, the workpiece is adjusted from the original pose to the initial assembly pose, and the SIFT features are calculated again in this pose. These features are matched with features A and B of the original pose to find the corresponding features A1 and B1 in the initial pose. The vector A1B1 is the initial attitude of assembly, and point A1 is taken as the initial assembly position of the workpiece. The feature points A1 and B1 and the vector A1B1 are all in the coordinate system of the 3D camera. (Because of the scale invariance of SIFT features, the points A1 and B1 can still be found in the initial assembly pose even if the workpiece rotates or the lighting changes relative to the original pose.)
Fourthly, the workpiece is adjusted from the initial pose to the final pose, and the SIFT features are calculated again in the final assembly pose and matched with features A and B of the original pose to find the corresponding features A2 and B2. The vector A2B2 is the final attitude of assembly, and point A2 is taken as the final assembly position of the workpiece. The feature points A2 and B2 and the vector A2B2 are all in the coordinate system of the 3D camera.
Lastly, the original attitude AB and position A are transformed into control information for the industrial manipulator to grip the workpiece in its original position and orientation. The initial assembly attitude A1B1 and position A1 are transformed into control information for moving the workpiece to the initial assembly pose. The final assembly attitude A2B2 and position A2 are transformed into control information for moving the workpiece to the final assembly pose. Each pose contains the workpiece's position coordinates and attitude vector.
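The attitude-vector bookkeeping in the steps above can be sketched in a 2D simplification: the workpiece pose is the point A plus the unit vector AB, and the motion between two poses is a translation of A plus an in-plane rotation of that vector. The matched point coordinates below are hypothetical, standing in for real SIFT matches.

```python
import math

def pose_from_points(a, b):
    """Workpiece pose: position = point A, attitude = unit vector A -> B."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    norm = math.hypot(dx, dy)
    return a, (dx / norm, dy / norm)

def motion_between(a, b, a1, b1):
    """Translation and in-plane rotation taking pose (A, B) to pose (A1, B1)."""
    _, v = pose_from_points(a, b)
    _, v1 = pose_from_points(a1, b1)
    rotation = math.atan2(v1[1], v1[0]) - math.atan2(v[1], v[0])
    translation = (a1[0] - a[0], a1[1] - a[1])
    return translation, math.degrees(rotation)

# Hypothetical matched SIFT points (camera-frame coordinates, metres):
A, B = (0.10, 0.20), (0.14, 0.20)      # original pose
A1, B1 = (0.30, 0.25), (0.30, 0.29)    # initial assembly pose
move = motion_between(A, B, A1, B1)    # translation plus a 90-degree turn
```

The real system works with full 3D positions and passes the result through the camera-to-robot transform before solving the inverse kinematics, but the pose representation is the same: one anchor point and one attitude vector.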

Object Recognition with YOLO
Due to the end-to-end characteristic of deep convolutional neural networks, YOLO directly predicts the locations and the classes of objects in a given input image. The model of YOLO consists of many layers. The first several layers (typically convolutional layers) are used for feature extraction and the last several layers are used for the prediction of the final detection results.
In detail, YOLO divides an input image into S × S grid cells. Each cell is responsible for outputting B bounding boxes, each of which has four location variables (x, y, l, w), where (x, y) is the predicted center coordinate and (l, w) the length and width, as well as one confidence score Pr(Object) predicting whether the bounding box contains an object. Additionally, C conditional class probabilities Pr(Class|Object) are generated for each cell to indicate the class label of the object. The final output layer therefore has S × S × (5B + C) units [20]. The class-specific confidence score of each bounding box can be calculated as Pr(Class|Object) × Pr(Object) × IOU = Pr(Class) × IOU. If the predicted bounding box of the object and its corresponding ground truth are denoted as Bp and Bg, respectively, both rectangles, the IOU (Intersection Over Union) is defined as the ratio of the area of the intersection of Bp and Bg to the area of their union.
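The output-layer size and the IOU definition above can be illustrated with a short sketch. S, B, and C below are the standard values reported in [20]; the box coordinates and probabilities are illustrative.

```python
def iou(box_p, box_g):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_p[0], box_g[0]), max(box_p[1], box_g[1])
    ix2, iy2 = min(box_p[2], box_g[2]), min(box_p[3], box_g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    return inter / (area_p + area_g - inter)

S, B, C = 7, 2, 20                     # grid size, boxes per cell, classes [20]
output_units = S * S * (5 * B + C)     # S x S x (5B + C) units in total

# Class-specific confidence: Pr(Class|Object) x Pr(Object) x IOU.
pr_class_given_obj, pr_obj = 0.8, 0.9  # illustrative predicted probabilities
confidence = pr_class_given_obj * pr_obj * iou((0, 0, 2, 2), (1, 1, 3, 3))
```

With the standard settings the network therefore emits 7 × 7 × 30 = 1470 values per image.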
Notably, the loss function should be designed carefully in the training process. The objective of the loss function is to achieve a good balance among the three terms, namely, the location of objects, the prediction of object existence, and the classification of object class. This function can be expressed as follows in Equation (5) and Table 3. Table 3. Definition of the loss function.

The loss function in Equation (5) has five parts: (1) the loss of the center coordinates of the bounding boxes; (2) the prediction loss of the length and width of the bounding boxes; (3) the confidence prediction loss of bounding boxes containing an object; (4) the confidence prediction loss of bounding boxes without an object; and (5) the conditional class prediction loss. For the first part, a square loss is taken between the predicted center coordinates and their ground truth for all real objects. For the second part, the square root is used: the same absolute deviation matters more for a small box than for a large one, and taking the square root prevents the deviations of small bounding boxes from being drowned out. The third and fourth parts are the confidence-score losses of all bounding boxes, and the last part is the class prediction loss.
λcoord weights the losses related to coordinate predictions; increasing this weight can help improve the average accuracy. λnoobj decreases the penalty for bounding boxes without an object, since many bounding boxes contain no object. Here, the default values λcoord = 5 and λnoobj = 0.5 are used, following Reference [20].
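The effect of the square root in the width/height term can be checked numerically. The pixel values below are illustrative, chosen only to compare the same absolute error on a small and a large box.

```python
import math

def wh_loss(pred, true):
    """Square loss on square-rooted width/height, as in the second term of Equation (5)."""
    return (math.sqrt(pred) - math.sqrt(true)) ** 2

# The same 10-pixel width error is penalized far more for a small box:
small_box = wh_loss(20, 10)    # 10-px error on a 10-px-wide box
large_box = wh_loss(310, 300)  # 10-px error on a 300-px-wide box
```

Without the square root both cases would incur the identical penalty 10², even though the relative error of the small box is thirty times larger.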
YOLO is first pretrained on the Pascal VOC 2007 and 2012 datasets and then fine-tuned on our own dataset to adapt it to our scenarios. Our training images are collected by the camera of the system and labeled using labelImg (https://github.com/tzutalin/labelImg), a widely used annotation tool for object detection. YOLO is run with CUDA (Compute Unified Device Architecture) to ensure real-time performance. The whole application process consists of six steps, as Figure 7 shows. The aim is to detect and identify common objects, such as the person, bird, and plane in the images. Firstly, the training datasets are prepared in advance: the locations and class labels of all possible objects are labeled manually. Secondly, the hyperparameters are changed according to the meta-information of our dataset. The number of categories and the category names are modified in src/yolo.c; corresponding changes are made to the penultimate layer of YOLO (with parameters modified in src/yolo_kernels.cu). The hyperparameters relevant to the output and the number of classes are also modified in the configuration file cfg/yolo.cfg. The initial parameters of the network are set to the pretrained values from the Pascal VOC 2007 and 2012 datasets. Finally, YOLO is trained on our dataset, and the trained model identifies our target objects with rectangles of different colors.

Experimental
The proposed demonstration-driven system was tested experimentally, as shown in Figure 8. There were three workpieces (cubes with a side length of 1 cm) colored red, green, and blue. One key step of the assembly task is taking these workpieces from an initial position to a target position quickly and accurately. As a demonstration, we placed each workpiece by hand from the initial position to the target position, which was captured by the camera. Then the manipulator was controlled to repeat this process automatically. In the target position, the three cubes were stacked in order stably. The position error was within 5% of the side length of the cube. The whole process is shown in Figure 9.

Two additional experiments were performed to demonstrate that our system can quickly adapt to multiple new tasks. In the workpiece placement experiment shown in Figure 10a, the three workpieces were moved from the initial position (side by side) to the target position (45° oblique angle) in a certain order. The first two images show the human's demonstration and the last two show the manipulator repeating the task. The position error of workpiece placement was less than 5%. In the coin manipulation experiment shown in Figure 10b, the manipulator grabs the coin from the initial position and quickly drops it into a box through a narrow gap. The thickness of the coin is 1.9 mm and the gap width of the box is 2 mm, which means that the position error of the manipulator is within 0.1 mm.
We also developed a graphical user interface for operating the manipulator. The vision-guided manipulator was controlled to take an object (a blue toy figure) to its target position (white circle), as shown in Figure 11.

Figure 11. Testing the whole process of target object detection.

Discussion and Conclusions
This paper introduces a demonstration-driven approach for a vision-guided automatic assembly system, enabling it to adapt easily to various different tasks. The manipulator serves as the "hand" of the system and the camera as its "eyes". Thus, the manipulator is able to learn a motion trajectory demonstrated by a human and reproduce it quickly using the processed visual information and the generated control signals. In flexible manufacturing, demonstration-driven manipulators can perform more complex tasks under the guidance of vision and may potentially cooperate with other manipulators. The system can also be applied to intelligent sorting, where the accuracy and efficiency of cargo picking may be improved by the proposed approach. The application and extension of the proposed approach and system will be an important future research direction and will greatly help make such robots more intelligent.
Author Contributions: Q.W. and C.Y. conceived and designed the experiments; Q.W., W.F. and Y.Z. performed the experiments; Q.W. wrote the paper.