Robot Imitation Learning of Social Gestures with Self-Collision Avoidance Using a 3D Sensor

To effectively interact with people, social robots need to perceive human behaviors and in turn display their own behaviors using social communication modes such as gestures. The modeling of gestures can be difficult due to the high dimensionality of the robot configuration space. Imitation learning can be used to teach a robot to implement multi-jointed arm gestures by directly observing a human teacher’s arm movements (for example, using a non-contact 3D sensor) and then mapping these movements onto the robot arms. In this paper, we present a novel imitation learning system with robot self-collision awareness and avoidance. The proposed method uses a kinematical approach with bounding volumes to detect and avoid collisions with the robot itself while performing gesticulations. We conducted experiments with a dual arm social robot and a 3D sensor to determine the effectiveness of our imitation system in being able to mimic gestures while avoiding self-collisions.


Introduction
Social robots are currently being developed to interact with humans by perceiving human behavior [1] and displaying natural human-like gestures such as waving to greet humans or pointing to objects to guide their visual focus of attention [2][3][4]. Trajectories to generate these human-like gestures are often programmed and implemented by expert roboticists. Imitation learning provides a natural mode of human-robot interaction for non-experts to teach a robot arm gestures by having a robot observe, using visual sensors, the teacher's physical demonstrations of the gestures and then map these observed motions onto the robot's own embodiment [5][6][7].

Imitation Learning
A common approach for imitation learning is to sense the 3D arm joint positions of a demonstrator and solve the inverse kinematics problem to determine the joint angles between each of the demonstrator's arm segments [8]. The demonstrator joint angles can then be directly mapped to the robot's arms for robots with the same degrees of freedom (DOF) as a human (e.g., the Nao robot). The benefit of utilizing an inverse kinematics model for trajectory representation is that all arm segments are represented at once. These models can be solved using an analytical approach [9], an iterative Jacobian [10], or a neuro-fuzzy system. However, existing self-collision avoidance techniques used with such approaches have the potential to reduce the similarity of the robot's poses to the demonstrator's and can result in jerky motions due to rapid changes in speed when avoiding collisions.
In [17], a Kinect sensor-based imitation learning system with self-collision avoidance was developed for the humanoid robot Nao. Namely, an inverse-kinematics-based approach was applied to determine demonstrator joint angles from sensed joint positions and directly map them to the Nao robot's joints. A self-collision avoidance technique was also presented that models the robot links as cylinders with spherical ends. The criterion for a collision was defined as situations where the shortest distance between two line segments was less than the sum of the cylinder radii. When a collision was detected, the robot simply stopped moving. However, for such a collision avoidance technique, the demonstrator's poses need to be sampled at a fast enough rate so that the motion between two collision-free poses does not itself introduce a collision. In [18], a similar Kinect sensor-based imitation learning system with self-collision avoidance was developed for the humanoid robot Nao. Namely, the robot's workspace was first empirically modeled off-line by manually sampling robot joint poses to identify whether a collision would occur. The workspace model was then utilized online to solve for collision-free joint poses that imitate a demonstrator's poses by determining whether a solution was within the collision-free workspace of the robot. However, the accuracy was limited by the number of samples used to model the workspace and could not account for all possible collision/collision-free poses.
To date, imitation learning systems with self-collision avoidance are limited as they are: (1) capable of modelling only collisions between two body parts [14]; (2) computationally expensive and therefore cannot be used in real-time [15]; (3) unable to account for all collisions/collision-free poses during a demonstration [17,18]; and (4) unsuitable for gesture demonstrations due to rapid joint speed changes as a result of the collision avoidance technique [16]. Therefore, in this paper, we propose a real-time gesture imitation learning system that can address these limitations by detecting all self-collisions for a robot with multiple body parts. The novelty of this approach is that it optimizes a collision-free motion trajectory while minimizing the distance from the demonstrator's trajectory. The system utilizes a geometric-based approach to map human joint angles to robot joint angles and a kinematical-based approach for self-collision avoidance.

Gesture Imitation Learning System
The proposed gesture imitation learning system is presented in Figure 1. The gesture imitation learning system has two main modules: (1) a skeleton motion mapping module; and (2) a self-collision detection and avoidance module. Namely, the skeleton motion mapping module takes as input a sequence of demonstrator poses (i.e., joint positions in 3D space), herein referred to as a demonstration trajectory, and solves for the sequence of demonstrator joint angles executed to attain this sequence of poses. The self-collision detection and avoidance module then checks for potential self-collisions when the sequence of joint angles is directly mapped to the robot. If a collision is detected for a robot joint angle configuration in the sequence, the module solves for a collision-free joint angle configuration that closely imitates the human demonstrator. Finally, the sequence of collision-free joint angle configurations is output to be executed by the robot.
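To make the data flow concrete, the following minimal sketch traces one demonstration trajectory through the two modules. The function names and signatures are illustrative placeholders rather than the authors' implementation; the per-pose operations are passed in as callables so the control flow stands on its own.

```python
def imitate_trajectory(demonstration_trajectory, map_pose_to_joint_angles,
                       has_self_collision, find_collision_free_pose):
    """Convert a sequence of demonstrator poses into collision-free robot joint angles."""
    collision_free_trajectory = []
    for x_human in demonstration_trajectory:          # 3D joint positions of one demonstrator pose
        q_mimic = map_pose_to_joint_angles(x_human)   # skeleton motion mapping module
        if has_self_collision(q_mimic):               # self-collision detection submodule
            q = find_collision_free_pose(q_mimic)     # self-collision avoidance submodule
        else:
            q = q_mimic
        collision_free_trajectory.append(q)
    return collision_free_trajectory
```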

Skeleton Motion Mapping Module
The skeleton motion mapping module first obtains the demonstration trajectory and applies a geometric-based approach to each of the demonstrator's poses, x_human, to obtain a mapping to the robot's pose, q_mimic. As shown in Figure 2, the demonstrator's arm is modeled with four degrees of freedom (DOFs): three DOFs in the shoulder and one DOF in the elbow. We first determine the shoulder roll (θ1), pitch (θ2), and yaw (θ3) angles using Equation (1), where l1x, l1y, and l1z are the projections of l1 onto the x, y, and z axes, respectively, and l2x, l2y, and l2z are the projections of l2 onto the x, y, and z axes, respectively. The vectors l1, l2, and l3 are obtained from the positions of the wrist, elbow, and shoulder, Pw, Pe, and Ps, where x_human = {Pw, Pe, Ps}. We then determine the elbow roll angle, θ4, from |l1| and |l2|, the lengths of the upper arm and forearm of the left/right arm, respectively, and |l3|, the length from the shoulder to the wrist, as shown in Figure 2.
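Because Equation (1) and the elbow equation are not reproduced in this excerpt, the sketch below gives one common geometric formulation of the mapping from sensed joint positions to the four arm angles. The axis conventions, the yaw reference direction, and the law-of-cosines elbow angle are assumptions for illustration, not the authors' exact equations.

```python
import numpy as np

def arm_joint_angles(P_s, P_e, P_w):
    """Estimate the four arm joint angles from shoulder, elbow and wrist positions."""
    P_s, P_e, P_w = map(np.asarray, (P_s, P_e, P_w))
    l1 = P_e - P_s                      # upper arm vector
    l2 = P_w - P_e                      # forearm vector
    l3 = P_w - P_s                      # shoulder-to-wrist vector

    # Shoulder roll and pitch from the projections of l1 onto the x, y and z axes
    # (assumed axis convention).
    theta1 = np.arctan2(l1[1], l1[0])
    theta2 = np.arctan2(l1[2], np.hypot(l1[0], l1[1]))

    # Shoulder yaw: rotation of the forearm about the upper-arm axis, measured
    # against the world z-axis projected into the plane normal to the upper arm
    # (assumed reference; degenerate when the upper arm is parallel to z).
    u = l1 / np.linalg.norm(l1)
    ref = np.array([0.0, 0.0, 1.0]) - u * u[2]
    l2_perp = l2 - u * np.dot(l2, u)
    theta3 = np.arctan2(np.dot(np.cross(ref, l2_perp), u), np.dot(ref, l2_perp))

    # Elbow roll from the triangle formed by |l1|, |l2| and |l3| (law of cosines);
    # theta4 = 0 corresponds to a fully extended arm.
    cos_elbow = (np.dot(l1, l1) + np.dot(l2, l2) - np.dot(l3, l3)) / (
        2.0 * np.linalg.norm(l1) * np.linalg.norm(l2))
    theta4 = np.pi - np.arccos(np.clip(cos_elbow, -1.0, 1.0))

    return theta1, theta2, theta3, theta4
```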

Self-Collision Detection and Avoidance Module
The self-collision detection and avoidance module consists of two submodules: (1) the self-collision detection submodule, which utilizes a bounding-volume-based approach to detect collisions between any two body parts for a robot pose; and (2) the collision avoidance submodule, which solves for the closest collision-free joint angles that mimic the demonstrator's poses.
To detect possible collisions, the module first checks the distance between any two components. Using a bounding-volume-based approach, the components are modeled as spheres or capsules with a radius r. Our approach obtains the position of each joint from q_mimic using forward kinematics and computes the distance D between two components whose radii are r1 and r2. If D > r1 + r2, there is no collision between the two components. If there is a collision between two components, the module finds a new set of joint angles, q_collision-free, that has a minimum difference from q_mimic while ensuring that the distance D between the two volumes exceeds the sum of their radii.
Finally, q_collision-free is sent to the robot for execution.

Self-Collision Detection
The self-collision detection submodule is used to determine whether two body parts of the robot will be in collision given a robot pose q_mimic. Herein, we model a robot's body parts as either capsules (e.g., arms, torso) or spheres (e.g., head). A sphere is defined by its centroid and radius. A capsule is a cylinder with two half-spheres at its ends, defined by the positions of its two end points and the radius of the cylinder. We define a robot pose to have no self-collisions when the distance between every pair of body parts is larger than the sum of the radii of the two body part models:

Dkj − (rk + rj) > 0,  k = 1, ..., n; j = 1, ..., n,

where Dkj is the shortest distance between the two modeled robot body parts k and j, rk and rj are the radii of robot body parts k and j, respectively, and n is the total number of robot body parts.
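A minimal sketch of the pairwise check implied by this criterion is given below, assuming the body-part geometries and a shortest-distance routine (as sketched in the following subsections) are supplied; the data structures are placeholders rather than the implementation used on the robot.

```python
from itertools import combinations

def pose_is_collision_free(body_parts, shortest_distance):
    """Check D_kj - (r_k + r_j) > 0 for every pair of modeled body parts.

    `body_parts` is a list of (geometry, radius) pairs (spheres or capsules) for one
    robot pose, and `shortest_distance` is a callable returning the shortest distance
    between two geometries (point-segment or segment-segment).
    """
    for (geom_k, r_k), (geom_j, r_j) in combinations(body_parts, 2):
        # Adjacent links that share a joint would normally be excluded here,
        # since they touch by construction.
        if shortest_distance(geom_k, geom_j) - (r_k + r_j) <= 0:
            return False   # body parts k and j are in collision
    return True
```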

Collision Checks between a Sphere and a Capsule
The collision check between a sphere and a capsule reduces to computing the distance between a point and a line segment in 3D space. Let P s, M t, and M b represent the center of sphere S and the two end points of capsule M, respectively, as shown in Figure 3. The distance is calculated by considering whether the perpendicular projection of P s onto the line through M t and M b falls within the segment, in which case the distance is the perpendicular distance, or beyond one of its end points, in which case the distance is measured to the nearest end point.
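A sketch of this point-to-segment distance computation is shown below, assuming the capsule axis is the segment between M t and M b; clamping the projection parameter handles both cases uniformly.

```python
import numpy as np

def point_segment_distance(p, m_t, m_b):
    """Shortest distance between sphere center p and the capsule axis segment m_t-m_b."""
    p, m_t, m_b = map(lambda v: np.asarray(v, dtype=float), (p, m_t, m_b))
    axis = m_b - m_t
    t = np.dot(p - m_t, axis) / np.dot(axis, axis)   # projection parameter along the axis
    t = np.clip(t, 0.0, 1.0)                         # clamp to the segment end points
    closest = m_t + t * axis
    return np.linalg.norm(p - closest)
```

The sphere and the capsule are then considered in collision when this distance is less than the sum of their radii.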

Collision Checks between Two Capsules
The collision check between two capsules reduces to computing the distance between two line segments in 3D space.
Let M t, M b, N t, and N b represent the two end points of capsule M and capsule N, respectively.
The following cases are considered when calculating the distance between two capsules:
(1) If the two lines through M t M b and N t N b are skew, their common perpendicular intersects them at M h and N h, as shown in Figure 4. There are two situations: (i) if both intersections M h and N h lie on the two segments, the shortest distance is the length of the common perpendicular M h N h; (ii) if one of the two intersections, M h or N h, does not lie on its segment, the shortest distance is the distance from the nearest end point of that segment to the other segment.
(2) If the two lines are parallel, as shown in Figure 5, the shortest distance is the smaller of the two distances d1 and d2.
(3) If the two lines through M t M b and N t N b are neither skew nor parallel, they intersect at a point P, as shown in Figure 6. There are two situations: (i) if P lies on both segments, the capsule axes intersect and the shortest distance is zero; (ii) if P does not lie on one of the two segments, the shortest distance is the distance from the nearest end point of that segment to the other segment.
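The following sketch computes the shortest distance between two capsule axis segments using a standard closest-point formulation that covers the skew, parallel, and intersecting cases above by clamping the closest points to both segments; it assumes non-degenerate segments and is an illustrative implementation rather than the authors' exact procedure.

```python
import numpy as np

def segment_segment_distance(p1, q1, p2, q2, eps=1e-9):
    """Shortest distance between segments p1-q1 and p2-q2 (capsule-capsule check)."""
    p1, q1, p2, q2 = map(lambda v: np.asarray(v, dtype=float), (p1, q1, p2, q2))
    d1, d2, r = q1 - p1, q2 - p2, p1 - p2
    a, e, f = np.dot(d1, d1), np.dot(d2, d2), np.dot(d2, r)  # assumes a > 0 and e > 0
    c = np.dot(d1, r)
    b = np.dot(d1, d2)
    denom = a * e - b * b            # always >= 0; zero when the segments are parallel
    if denom > eps:
        s = np.clip((b * f - c * e) / denom, 0.0, 1.0)
    else:
        s = 0.0                      # parallel case: start from an end point of the first segment
    t = (b * s + f) / e
    # Clamp t to the second segment and, if clamping occurred, recompute s accordingly.
    if t < 0.0:
        t = 0.0
        s = np.clip(-c / a, 0.0, 1.0)
    elif t > 1.0:
        t = 1.0
        s = np.clip((b - c) / a, 0.0, 1.0)
    closest1 = p1 + d1 * s
    closest2 = p2 + d2 * t
    return np.linalg.norm(closest1 - closest2)
```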

Self-Collision Avoidance
Given that a self-collision has been detected, the self-collision avoidance submodule determines robot joint angles that best mimic the human demonstrator's pose. Namely, we define self-collision avoidance as an optimization problem that solves for the set of robot joint angles which is collision-free, q_collision-free, and has the minimum joint angle difference from the demonstrated pose q_mimic. The objective function, e, for the optimization is defined over the joint angle differences, where i is the index of the robot joint being optimized, u is the total number of robot joints, and θi and θi′ are, respectively, the demonstrated joint angle and the collision-free joint angle of joint i. The constraint for the optimization is that all the robot body parts (i.e., head, arms, upper torso, and lower torso) must be collision-free, i.e., Dkj − (rk + rj) > 0 for k = 1, ..., n; j = 1, ..., n, where Dkj is the shortest distance between the two modeled robot body parts k and j, rk and rj are the radii of robot body parts k and j, respectively, and n is the total number of robot body parts.
In this work, we used the constrained optimization by linear approximation (COBYLA) algorithm [19] to solve the optimization problem. COBYLA is a derivative-free optimization method for constrained problems [20]. The specific algorithms we have designed for this optimization are presented in Figure 7.
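As an illustration, the optimization can be posed with SciPy's COBYLA implementation as sketched below. The collision_margins helper and the squared-difference objective are assumptions introduced for the example; the paper's exact objective function and constraint formulation follow the algorithms in Figure 7.

```python
import numpy as np
from scipy.optimize import minimize

def find_collision_free_pose(q_mimic, collision_margins):
    """Solve for the collision-free joint angles closest to the demonstrated pose.

    `collision_margins(q)` is a hypothetical helper returning the array of
    D_kj - (r_k + r_j) values for all body-part pairs at configuration q; each
    entry must be non-negative for a collision-free pose.
    """
    q_mimic = np.asarray(q_mimic, dtype=float)

    def objective(q):
        # One natural reading of "minimum joint angle difference":
        # the sum of squared deviations from the demonstrated pose.
        return np.sum((q - q_mimic) ** 2)

    n_pairs = len(collision_margins(q_mimic))
    constraints = [{"type": "ineq", "fun": lambda q, i=i: collision_margins(q)[i]}
                   for i in range(n_pairs)]

    result = minimize(objective, x0=q_mimic, method="COBYLA", constraints=constraints)
    return result.x
```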

Imitation Learning with the Tangy Robot
We implemented our gesture imitation learning system on the humanoid robot Tangy, shown in Figure 8. The robot consists of two six-DOF arms, an upper torso with a tablet screen, and a lower torso connected to a mobile base. The configuration and parameters of the robot are detailed in Appendix A. Following the bounding-volume-based approach presented in Section 2.2.1, the robot's upper torso is modeled as five capsules (see Figure 8a), with one capsule located at the bottom of the screen and two spheres located at the two bottom corners of the screen to incorporate the USB cables (the circles shown in Figure 8b). The lower torso, the upper arms, and the forearms are also modeled as capsules, and the head is modeled as a sphere, as shown in Figure 8a. We used an ASUS Xtion 3D sensor (Figure 9) to track the positions of the demonstrator's joints using the OpenNI skeleton tracker [21]. In our real-world implementation of the gesture imitation learning system, we include angular velocity joint constraints to filter noisy sensor data that does not correspond to actual human demonstrator joint movements; namely, an empirically identified threshold of 1 rad/s was used. We also include speed constraints that prevent the robot manipulator from exceeding 250 mm/s, in compliance with ISO/TS 15066 for collaborative robots, to ensure safety when the robot works in close proximity to a human [22].
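A minimal sketch of the 1 rad/s angular-velocity filter described above is given below; the exact filtering rule used on the robot may differ, and the function name and interface are illustrative.

```python
import numpy as np

def filter_joint_angles(q_new, q_prev, dt, max_joint_speed=1.0):
    """Reject sensed joint angles that imply implausibly fast joint motion.

    If any joint would have to move faster than `max_joint_speed` (rad/s) to reach
    the newly sensed angles within the sampling interval dt, the new reading is
    treated as sensor noise and the previous pose is kept.
    """
    q_new, q_prev = np.asarray(q_new), np.asarray(q_prev)
    joint_speeds = np.abs(q_new - q_prev) / dt
    if np.any(joint_speeds > max_joint_speed):
        return q_prev        # discard the noisy reading
    return q_new
```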

Experimental Section
We evaluate the performance of our proposed gesture imitation learning system both in simulation and in the real-world on the Tangy robot. Namely, the experiments focused on evaluating the imitation learning system to investigate the performance of the system in: (1) avoiding self-collisions during a human demonstration; and (2) generating collision-free robot arm trajectories that are similar to the demonstrator's motions.
Human demonstrators presented five separate motions to the robot, motions I to V, Figure 10. Motion I has the demonstrator extend his/her arms forward horizontally from the vertical position (from pose I1 to pose I2) and then back to the vertical position (from pose I2 to pose II1). Motion II has the demonstrator extend his/her arms horizontally sideways from the vertical position (from pose II1 to pose II2) and then back to the vertical position (from pose II2 to pose III1). Motion III has the demonstrator extend his/her arms forward horizontally from the vertical position (from pose III1 to pose III2), then sideways (from pose III2 to pose III3), and finally back to the vertical position (from pose III3 to pose IV1). Motion IV has the demonstrator fold his/her elbow (from pose IV1 to pose IV2) and then back to the vertical position (from pose IV2 to pose V1). Motion V has the demonstrator fold his/her arms towards the chest (from pose V1 to pose V2) and then back to the vertical position (from pose V2 to pose V3). These motions were selected as they evaluate the entire range of motion for the robot's shoulder and elbow joints. Similar motions were also evaluated in [18]. Demonstrators were asked to repeat each motion twice.

Simulation Results
The demonstrator's five sets of motions are captured using the ASUS Xtion 3D sensor. Following the geometric-based approach presented in Section 2.1, the demonstrator's poses are mapped to the robot. Figure 11 shows the positions of the robot's end-effector and elbow for the aforementioned five sets of motions. The blue circles represent the workspace of the robot's end-effector and elbow when the joint angles of the human demonstrator are directly mapped to the robot, q_mimic, without any self-collision avoidance. The red circles represent the robot's collision-free poses determined with self-collision avoidance, q_collision-free. It can be observed from Figure 11 that some of the blue circles collide with the robot, while none of the red circles do. It is noted that there is a collision if the upper arm (the segment between the shoulder and elbow) or the forearm (the segment between the elbow and wrist) contacts any of the robot body parts. For example, without self-collision avoidance, the robot arms collide with the tablet screen in motions I, III, IV, and V, and the two wrists collide with each other in motion V.
Figure 12a,b show the q_mimic and q_collision-free trajectories for each motion, respectively. It is observed that motion I (extending the arms forward) is mainly actuated by the shoulder roll angle θ1; motions II (extending the arms horizontally) and III (extending the arms forward and then horizontally) are actuated by all the shoulder joints, θ1, θ2, and θ3. Different from motions I, II, and III, motion IV (folding the elbows) is mainly actuated by the elbow joint θ4. Motion V (a random movement) is actuated by all the joints. Figure 12c shows which joints are actuated to avoid the potential collisions. ∆ is used to represent the difference between q_mimic and q_collision-free, i.e., ∆ = |q_mimic − q_collision-free|. For example, for motions I and III, the robot arms collide with the side of the robot's tablet screen without self-collision avoidance; thus, the collision-free trajectory stretches the arms out to move them away from the side of the screen (Figure 11a,c), which is achieved by changing the shoulder pitch angle θ2, as denoted by ∆θ2 for motions I and III in Figure 12c. For motions IV and V, there are collisions with the screen surface without self-collision avoidance and, furthermore, in motion V between the two wrists (Figure 11d,e). To avoid these collisions, the collision-free trajectory lifts the robot arms up (via the shoulder yaw angle θ3), straightens the arms (via the elbow roll angle θ4), and stretches the arms out (via the shoulder pitch angle θ2), as can be seen from ∆θ3, ∆θ4, and ∆θ2 for motions IV and V in Figure 12c. As there were no self-collisions for motion II (Figure 11b), the difference between q_mimic and q_collision-free was 0, as can be seen in Figure 12c.
Figure 12d shows the similarity between q_mimic and q_collision-free. We define similarity, W, as follows [15]: where θi and θi′ are, respectively, the demonstrated joint angle and collision-free joint angle of joint i, u is the total number of robot joints, and θi_max and θi_min are the maximum and minimum angles of joint i. A larger W represents a higher similarity of q_collision-free to q_mimic; namely, if W = 1, the collision-free pose is the same as the demonstrated pose. From Figure 12d, it can be seen that the similarity for all the motions is close to 1, that is, the collision-free trajectory is close to the trajectory that is directly mimicked.
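Since the exact similarity expression from [15] is not reproduced in this excerpt, the sketch below assumes one common form: one minus the mean joint-angle difference normalized by each joint's range, which equals 1 when the two poses coincide.

```python
import numpy as np

def pose_similarity(q_mimic, q_collision_free, q_min, q_max):
    """Similarity W between a demonstrated pose and its collision-free counterpart.

    The normalization below (per-joint range q_max - q_min) is an assumed form;
    the paper's exact expression from [15] may differ.
    """
    q_mimic, q_cf = np.asarray(q_mimic), np.asarray(q_collision_free)
    q_min, q_max = np.asarray(q_min), np.asarray(q_max)
    normalized_diff = np.abs(q_mimic - q_cf) / (q_max - q_min)
    return 1.0 - np.mean(normalized_diff)
```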

Real-World Results with Tangy
The Tangy robot's poses during human demonstration using the imitation learning system are also shown in Figure 10 for all five motions. In particular, the figure shows the robot's arm poses during one entire demonstration session. In general, all potential collisions were avoided in real-time for all five demonstrators.

Conclusions
In this paper, we developed a gesture imitation learning system with self-collision avoidance using a 3D sensor. The performance of our imitation learning system with self-collision avoidance was evaluated both in simulation and on the physical robot Tangy. System performance results, comparing trajectories generated by the imitation learning system to demonstrator trajectories directly mapped without self-collision avoidance, showed that our proposed imitation learning system was capable of avoiding all self-collisions and generated collision-free arm trajectories that were similar to the human demonstrator motions. Furthermore, the physical robot was able to mimic the arm motions of several demonstrators in real-time when using the proposed imitation learning system.
Our current imitation learning system does not mimic the demonstrated joint angular accelerations of a gesture. As a future step, we will consider learning these accelerations from the human demonstrations, because a study has shown that the level of acceleration exhibited by a robot is related to how a human interprets the affect of the robot [23].

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Kinematics Model of the Tangy Robot Arm
The kinematics of the Tangy robot's arms are described by assigning a coordinate system to each joint with its z-axis along the axis of rotation, following the rules of the Denavit-Hartenberg convention, as shown in Figure A1. Frame {0}, frame {1}, frame {2}, and frame {3} are the frames assigned to the shoulder roll, shoulder pitch, shoulder yaw, and elbow roll joints, respectively. Frame {4} and frame {m} are fixed frames assigned to the end-effector and the shoulder, respectively. Frame {w} is the world frame assigned at the center of the torso, as shown in Figure A1b.
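For reference, forward kinematics under the Denavit-Hartenberg convention can be sketched as below; the per-joint link parameters (d, a, alpha) for Tangy are those listed in this appendix and are not reproduced here, so the dh_params argument is a placeholder.

```python
import numpy as np

def dh_transform(theta, d, a, alpha):
    """Homogeneous transform for one joint using standard Denavit-Hartenberg parameters."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([[ct, -st * ca,  st * sa, a * ct],
                     [st,  ct * ca, -ct * sa, a * st],
                     [0.0,      sa,       ca,      d],
                     [0.0,     0.0,      0.0,    1.0]])

def forward_kinematics(joint_angles, dh_params):
    """Chain the per-joint transforms to obtain the end-effector pose in the base frame.

    `dh_params` is a list of (d, a, alpha) tuples for the shoulder roll, shoulder
    pitch, shoulder yaw and elbow roll joints, in order.
    """
    T = np.eye(4)
    for theta, (d, a, alpha) in zip(joint_angles, dh_params):
        T = T @ dh_transform(theta, d, a, alpha)
    return T
```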