Learning, Improving, and Generalizing Motor Skills for the Peg-in-Hole Tasks Based on Imitation Learning and Self-Learning

We propose a framework based on imitation learning and self-learning to enable robots to learn, improve, and generalize motor skills. The peg-in-hole task is important in manufacturing assembly work. Two motor skills for the peg-in-hole task are targeted: “hole search” and “peg insertion”. The robots learn initial motor skills from human demonstrations and then improve and/or generalize them through reinforcement learning (RL). An initial motor skill is represented as a concatenation of the parameters of a hidden Markov model (HMM) and a dynamic movement primitive (DMP) to classify input signals and generate motion trajectories. Reactions are classified as familiar or unfamiliar (i.e., modeled or not modeled), and initial motor skills are improved to solve familiar reactions and generalized to solve unfamiliar reactions. The proposed framework includes processes, algorithms, and reward functions that can be used for various motor skill types. To evaluate our framework, the motor skills were performed using an actual robotic arm and two reward functions for RL. To verify the learning and improving/generalizing processes, we successfully applied our framework to different shapes of pegs and holes. Moreover, the execution time steps and path optimization of RL were evaluated experimentally.


Introduction
In this paper, we propose a framework for learning, improving, and generalizing the motor skills for a peg-in-hole task. Here, we define a motor skill as a tuple of model parameters that can both classify input signals and generate appropriate motion trajectories. The peg-in-hole task plays an important role in assembly work and is frequently encountered in the manufacturing industry [? ]. It is often performed in conditions where the exact positions/postures of the hole or peg are unknown due to errors in vision sensors and robot actuators. To solve this problem, robots need to repeat reaction classification and reaction generation continuously while the peg and hole remain in contact, until the task is complete. Therefore, robots need to be able to classify reaction force/moment types and generate the corresponding reaction motion trajectories in real time. In this paper, we focus on obtaining these optimal motor skills for the peg-in-hole task.
Numerous researchers have proposed various approaches for obtaining motor skills from human demonstrations [? ? ? ]. This type of approach is an effective way for robots to capture the characteristics of human demonstrations.
1. A general method is proposed that concatenates the parameters of two different models, one for classification (hidden Markov models (HMMs)) and one for motion generation (dynamic movement primitives (DMPs)), both learned from human demonstrations. Robots are then able to select an appropriate motor skill from a library of motor skills. This method is used to classify various types of reaction force/moment signals and generate their corresponding reaction motion trajectories for the peg-in-hole task.
2. The policy learning by weighting exploration with the returns (PoWER) algorithm is used in the reinforcement learning (RL) process. Using this algorithm, the RL process improves and/or generalizes motor skills. It not only optimizes the parameters of a DMP to reduce the number of execution time steps and improve its path, but also re-estimates the parameters of the corresponding HMM to improve the motor skill. Furthermore, the RL process estimates new targets and initial parameters of new motor skills for generalization.
Studies on obtaining motor skills can be broadly divided into three types: (i) predefined strategies; (ii) imitation learning; and (iii) RL methods. In Method (i), motor skills can be used immediately without any training costs. However, it is not easy to design motor skills manually and ensure that they are optimal. Numerous researchers, such as Xu et al. [? ], Park et al. [? ], Zhang et al. [? ], and Jokesch et al. [? ], have proposed methods of this type for solving the peg-in-hole problem. These studies are not learning-based approaches; instead, they use predesigned strategies obtained by analyzing the peg-in-hole task. In Method (ii), motor skills can be learned from a few human demonstrations. However, this provides near-optimal solutions, not optimal ones. Calinon et al. [? ], Ude et al. [? ], and Kyrarini et al. [? ] proposed methods to obtain motor skills based on imitation learning. Their motor skills are modeled from human demonstration datasets. In contrast, in Method (iii), motor skills can be optimized through the RL process, but a large number of trial-and-error iterations are required to obtain optimal solutions. Yun [? ] and Inoue et al. [? ] proposed methods for learning the peg-in-hole task based on RL. Their aim was to enable robots to learn the motor skills for the peg-in-hole task through random exploration without human demonstrations.
To reduce training costs and obtain optimal solutions, many researchers have proposed ways to combine imitation learning and RL. These methods start from initial motor skills learned by imitation learning and reach optimal solutions through several trial-and-error iterations of the RL process. Kober et al. obtained DMPs based on imitation learning and improved their parameters using RL [? ]. They dealt with the tasks of placing a ball into a cup and paddling a ball. Kormushev et al. used DMP-like dynamic models based on imitation learning and RL to enable a robot to learn and improve a pancake-flipping task [? ]. Kroemer et al. used imitation learning and RL to divide a task into multiple phases and enable a robot to learn their transitions and motor skills [? ]. The robots performed a bimanual grasping task with contact force. Levine et al. trained a neural network using a policy parameter optimization method, referred to as guided policy search, to optimize the initial model parameters [? ]. They verified their method for various tasks, including the peg-in-hole task, in virtual environments. None of these researchers considered including classification functions in their models for multiple motor skills; they focused only on improving model parameters through imitation learning and RL. This differs from our definition of a motor skill. Moreover, they did not explicitly consider the generalization of motor skills, or the joint optimization of time steps and path.
The remainder of this paper is organized as follows. Section ?? presents the details of the proposed framework. We describe how we represent a parameter tuple by concatenating the parameters of two different models for classification and motion generation. We present the slightly modified PoWER algorithms as well as a general reward function to reduce the number of execution time steps, optimize/generalize motion trajectories, and estimate new targets for new motor skills during the improvement and generalization processes. Section ?? presents the experimental results obtained for the peg-in-hole task. In that section, two reward functions are also derived for the peg-in-hole task from the general reward function. Section ?? discusses the proposed framework and fundamental techniques and provides a guideline for applying our RL algorithm to other tasks. Finally, in Section ??, we present the conclusions of this study and directions for future research.
Figure ?? illustrates the overall process of our framework for learning, improving, and generalizing motor skills based on a mixture of imitation learning and self-learning. The framework consists of three processes: learning initial motor skills from human demonstrations, improving (initial) motor skills through RL, and generalizing motor skills (i.e., adding new motor skills) through RL. Here, the RL process is referred to as a self-learning process because the robot itself can determine and perform the process of improving and generalizing motor skills. As mentioned above, a motor skill is represented by concatenating the parameters of a HMM and a DMP. A threshold model (TM) is used to distinguish whether input signals are familiar (i.e., modeled) or unfamiliar (i.e., not modeled) with respect to the existing HMMs [? ]. An existing motor skill is optimized using the improvement process when its HMM likelihood is higher than those of both the other HMMs and the TM. When the TM likelihood is higher than the likelihoods of all the HMMs, a new motor skill is created from a similar motor skill, namely the one belonging to the HMM with the next highest likelihood. The details are presented in Section ??.

Learning (Initial) Motor Skills through Imitation Learning
In this process, a human provides a demonstration dataset for learning initial motor skills for classification and motion generation; after this process, a robot can generate appropriate motion trajectories given its current situation. In this work, classification is also used to select a suitable motor skill from a library. The specific motor skill obtained by the classification result is used to generate motion trajectories. To achieve this, a motor skill is represented as a parameter tuple that concatenates the parameters of a HMM and a DMP. A HMM is a model that is suitable for classifying time-varying signals. The parameters of a HMM are defined as $\lambda = \{\pi_i, a_{ij}, b_i\}_{i,j=1}^{N}$, where $\pi_i$, $a_{ij}$, and $b_i$ denote the initial probability distribution of the $i$th hidden state, the transition probability distribution from the $i$th hidden state to the $j$th hidden state, and the observation probability distribution of the $i$th hidden state, respectively. In this case, the number of hidden states $N$ is determined using the Bayesian information criterion [? ]. Moreover, parameter $b_i$ is modeled as a Gaussian mixture model to represent continuous non-linear trajectories. The parameters of a HMM are estimated by employing the Baum-Welch algorithm using training data $\{X_{d,t}\}$, with $d = 1, 2, ..., D$ and $t = 1, 2, ..., T_c$, where $D$ and $T_c$ denote the total number of demonstrations and the number of data points for classification, respectively [? ].
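The likelihood-based selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each hidden state uses a single 1-D Gaussian instead of a Gaussian mixture, the HMM parameters are written by hand rather than estimated with Baum-Welch, and the names `log_likelihood`, `classify`, and `library` are hypothetical.

```python
import math

def safe_log(p):
    """log(p) that returns -inf for p == 0 instead of raising."""
    return math.log(p) if p > 0.0 else -math.inf

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian observation model (assumption: no mixture)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_likelihood(obs, hmm):
    """Forward algorithm in the log domain: log P(obs | lambda)."""
    n = len(hmm["pi"])
    alpha = [safe_log(hmm["pi"][i]) + log_gauss(obs[0], hmm["means"][i], hmm["variances"][i])
             for i in range(n)]
    for x in obs[1:]:
        alpha = [logsumexp([alpha[i] + safe_log(hmm["A"][i][j]) for i in range(n)])
                 + log_gauss(x, hmm["means"][j], hmm["variances"][j])
                 for j in range(n)]
    return logsumexp(alpha)

def classify(obs, library):
    """Return the name of the motor skill whose HMM best explains obs."""
    return max(library, key=lambda name: log_likelihood(obs, library[name]))

# Hypothetical two-skill library with hand-written left-to-right HMMs.
library = {
    "rising": {"pi": [1.0, 0.0], "A": [[0.8, 0.2], [0.0, 1.0]],
               "means": [0.0, 1.0], "variances": [0.1, 0.1]},
    "falling": {"pi": [1.0, 0.0], "A": [[0.8, 0.2], [0.0, 1.0]],
                "means": [0.0, -1.0], "variances": [0.1, 0.1]},
}
signal = [0.0, 0.1, 0.9, 1.0]
selected = classify(signal, library)
```

In the framework, the selected name would index the DMP that generates the reaction motion; here it only identifies the best-matching toy model.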
A DMP is similar to a linear spring-damper system that depends on an external force term and ensures convergence to the final goal (or target) [? ]. The DMP is defined as
$$\tau \dot{V} = K (X^g - X) - D V + (X^g - X_0) f \tag{1}$$
and
$$\tau \dot{X} = V, \tag{2}$$
where $X$, $V$, $X_0$, and $X^g$ represent the position, velocity, initial position, and target position, respectively. All these variables are multidimensional vectors. Moreover, $\tau$, $K$, and $D$ indicate the constants for adjusting the time-scale, spring, and damping terms. The external force term must be learned from the demonstration dataset, and is defined using $f$, which is expressed as
$$f(s) = \frac{\sum_{i=1}^{N_b} \psi_i(s)\, w_i}{\sum_{i=1}^{N_b} \psi_i(s)}\, s, \tag{3}$$
where $\psi_i(s) = \exp(-h_i (s - c_i)^2)$ is a Gaussian basis function with center $c_i$ and width $h_i$. Parameter $w_i$ is its weighting value. Parameter $N_b$ represents the number of Gaussian basis functions. Term $f$ is directly dependent on the phase variable $s$, which is monotonically reduced from 1 to 0, independent of time, and is obtained by
$$\tau \dot{s} = -\alpha s, \tag{4}$$
where $\alpha$ is a predefined constant, and this differential equation is referred to as a canonical system. To learn a DMP from demonstrations, the robot learns the average path of several demonstrations. The average path $X(t)$ is recorded and its derivatives $V(t)$ and $\dot{V}(t)$ are computed for each time step $t = 0, ..., T$. Next, the canonical system $s(t)$ is computed for an appropriately adjusted temporal scaling parameter $\tau$. Based on Equations (1) and (2), $f_{target}(s)$ is computed according to
$$f_{target}(s) = \frac{\tau \dot{V} + D V - K (X^g - X)}{X^g - X_0}, \tag{5}$$
where $X_0$ and $X^g$ are set to $X(0)$ and $X(T)$, respectively. Estimating $f$ is a linear regression problem, which is solved by estimating $w_i$ in Equation (3) using the training data $\{X_{d,t}\}$, with $d = 1, 2, ..., D$ and $t = 1, 2, ..., T$, to minimize the error criterion $J = \sum_s (f_{target}(s) - f(s))^2$ for motion generation.
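The learning procedure above (record a path, differentiate it, compute the phase, regress the weights) can be sketched for a single degree of freedom. This is a hedged illustration that assumes the commonly used DMP formulation with spring constant `K`, damping `D`, and an exponentially decaying phase; the gains, the basis placement, the minimum-jerk demonstration, and the function names are all assumptions rather than values from the paper.

```python
import numpy as np

def phase(n_steps, dt, tau=1.0, alpha=4.0):
    """Closed-form solution of tau * ds/dt = -alpha * s with s(0) = 1."""
    return np.exp(-alpha * np.arange(n_steps) * dt / tau)

def features(s, centers, widths):
    """Normalized Gaussian basis functions multiplied by the phase s."""
    psi = np.exp(-widths * (s[:, None] - centers[None, :]) ** 2)
    return psi * s[:, None] / psi.sum(axis=1, keepdims=True)

def learn_dmp(x, dt, n_basis=20, K=100.0, D=20.0, tau=1.0, alpha=4.0):
    """Fit the forcing-term weights w to one demonstrated 1-D path x."""
    v = tau * np.gradient(x, dt)          # from tau * dx/dt = v
    v_dot = np.gradient(v, dt)
    x0, g = x[0], x[-1]                   # requires g != x0
    s = phase(len(x), dt, tau, alpha)
    f_target = (tau * v_dot + D * v - K * (g - x)) / (g - x0)
    duration = (len(x) - 1) * dt
    # Assumption: centers equally spaced in time, widths from neighbor gaps.
    centers = np.exp(-alpha * np.linspace(0.0, duration, n_basis) / tau)
    gaps = np.abs(np.diff(centers))
    widths = 1.0 / np.append(gaps, gaps[-1]) ** 2
    w, *_ = np.linalg.lstsq(features(s, centers, widths), f_target, rcond=None)
    return {"w": w, "centers": centers, "widths": widths, "x0": x0, "g": g,
            "K": K, "D": D, "tau": tau, "alpha": alpha}

def rollout(dmp, n_steps, dt):
    """Integrate the spring-damper system with explicit Euler steps."""
    s = phase(n_steps, dt, dmp["tau"], dmp["alpha"])
    f = features(s, dmp["centers"], dmp["widths"]) @ dmp["w"]
    x, v = dmp["x0"], 0.0
    path = [x]
    for k in range(n_steps - 1):
        v_dot = (dmp["K"] * (dmp["g"] - x) - dmp["D"] * v
                 + (dmp["g"] - dmp["x0"]) * f[k]) / dmp["tau"]
        x = x + dt * v / dmp["tau"]
        v = v + dt * v_dot
        path.append(x)
    return np.array(path)

# Hypothetical demonstration: a minimum-jerk path from 0 to 1 sampled at 100 Hz.
t = np.linspace(0.0, 1.0, 101)
demo = 10 * t**3 - 15 * t**4 + 6 * t**5
dmp = learn_dmp(demo, dt=0.01)
path = rollout(dmp, len(demo), dt=0.01)
```

Because the forcing term is scaled by the decaying phase, the reproduced path ends near the target regardless of how well the weights fit the middle of the motion.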
Finally, using these two types of parameters, the parameter tuple of an initial motor skill is concatenated as
$$\Theta = \{\lambda, \mathbf{w}, X^g, T\}, \tag{6}$$
where $\lambda$ and $\mathbf{w}$ indicate the parameters of a HMM and a DMP, respectively. In this tuple, the target $X^g$ and total length of policy $T$ are additionally inserted to optimize/generalize the parameters of motor skills in the RL process. As mentioned above, the parameters of the HMM are used to estimate the likelihood under the current situation to select an appropriate DMP from a library of DMPs. Moreover, the parameters of the DMP associated with the appropriate HMM are used to generate motion trajectories. For reference, target $X^g$ is provided by an external (vision) sensor at execution time, and it tends to be expressed as relative information between a robot and a target. The DMP generates a motion trajectory to reach target $X^g$ during the length of policy $T$.

Improving and Generalizing Motor Skills through RL
The learned motor skills are improved or generalized through RL using Algorithm ??. In this algorithm, existing motor skills are optimized in the improvement process and new motor skills are added to the library in the generalization process. To distinguish between the improvement and generalization processes, a TM is generated from the HMMs. This TM is used to calculate a threshold based on the likelihood of the input signals [? ? ]. As mentioned above, it is used to distinguish whether the input signals are familiar or unfamiliar. We consider the current input signals to be unfamiliar when the likelihood of the TM is higher than the likelihoods of all the HMMs. In contrast, the current input signals are familiar when the likelihood of at least one HMM is higher than that of the TM. The TM is created by fully connecting all hidden states in all the HMMs. The observation probability distributions of the hidden states are used without any modification, and their transition probability distributions are uniformly assigned for all connections. The details of the TM can be found in [? ]. Here, the likelihood of the TM is denoted by $P(X \mid \bar{\lambda})$. In Algorithm ??, all parameters $\Theta$ of the motor skills and parameter $\bar{\lambda}$ of the TM are used to select between the improvement and generalization processes. First, the improvement process is performed to optimize the parameters ($\Theta^*$) of the corresponding motor skill using the iPoWER algorithm (i.e., the PoWER algorithm for improving motor skills; Algorithm ??) when the likelihood of one HMM is higher than those of the other HMMs and the TM. The iPoWER algorithm optimizes all parameters of the existing motor skills except for their targets. In contrast, the generalization process is used to learn the parameters of new motor skills using the gPoWER algorithm (i.e., the PoWER algorithm for generalizing motor skills; Algorithm ??) when the likelihood of the TM is higher than those of all the HMMs. The gPoWER algorithm can estimate the parameters ($\Theta_{M+1}$) of new motor skills, including their new targets. Here, new targets are the configurations of robots optimized by their reward functions. These new motor skills are added to the library of motor skills, after which they can be improved to $\Theta^*$ in the improvement process according to the classification results.
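The familiar/unfamiliar decision can be sketched as below. This is an illustrative outline only: the dictionary layout of an HMM, the uniform initial distribution of the TM, and the names `build_threshold_model` and `choose_process` are assumptions; the likelihood values are taken as given rather than computed.

```python
def build_threshold_model(hmms):
    """Fully connect the hidden states of all HMMs with uniform transitions.

    The observation models are copied unchanged. A uniform initial
    distribution is an assumption; the text only specifies the transitions.
    """
    means = [m for h in hmms for m in h["means"]]
    variances = [v for h in hmms for v in h["variances"]]
    n = len(means)
    return {
        "pi": [1.0 / n] * n,
        "A": [[1.0 / n] * n for _ in range(n)],
        "means": means,
        "variances": variances,
    }

def choose_process(hmm_logliks, tm_loglik):
    """Pick the RL process from the HMM and TM log-likelihoods.

    Returns ("improve", skill) when the best HMM beats the TM (familiar
    signal), otherwise ("generalize", skill), where skill is the most similar
    existing motor skill used to seed the new one.
    """
    best = max(hmm_logliks, key=hmm_logliks.get)
    if hmm_logliks[best] > tm_loglik:
        return "improve", best
    return "generalize", best

# Hypothetical log-likelihoods for two existing motor skills.
tm = build_threshold_model([
    {"means": [0.0, 1.0], "variances": [0.1, 0.1]},
    {"means": [-1.0], "variances": [0.2]},
])
familiar = choose_process({"skill_a": -10.0, "skill_b": -5.0}, tm_loglik=-7.0)
unfamiliar = choose_process({"skill_a": -10.0, "skill_b": -5.0}, tm_loglik=-3.0)
```

In the unfamiliar case the returned skill corresponds to the HMM with the next highest likelihood after the TM, which is the seed for the new motor skill.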
Algorithm 1. Overall algorithm for improving and/or generalizing motor skills
1: Input: a set of initial parameters $\Theta = \{\Theta_1, \Theta_2, ..., \Theta_M\}$ of all motor skills and the parameters $\bar{\lambda}$ of the TM.
8: Add the parameters $\Theta_{M+1}$ of the new motor skill to the motor skill library.
9: end if
10: Output: the parameters $\Theta^*$ of the (new) motor skill as well as the parameters $\bar{\lambda}$ of the updated TM.

13: Reweighting: Reweight rollouts and discard low-reward rollouts.
14: Update the parameters of the DMP to obtain $\mathbf{w}_{k+1}$.
Update the parameters $\lambda_{k+1}$ of the HMM using the reaction force/moment recorded during the motion generation $a = (\mathbf{w}_{k+1} + \varepsilon_t)^\top \Psi(X, t)$.

13: Reweighting: Reweight rollouts and discard low-reward rollouts.
14: Update the parameters of the DMP.
The iPoWER and gPoWER algorithms use a deterministic policy $\bar{a} = \mathbf{w}^\top \Psi(X, t)$ with the weighting parameters $\mathbf{w}$ and basis functions $\Psi$ of a DMP [? ]. When optimizing and generalizing a DMP, this policy is transformed into a stochastic policy using additive exploration $\varepsilon(X, t)$; to perform model-free RL, we always use a policy $\pi(a_t \mid X_t, t)$, which can be written in the form $a = \mathbf{w}^\top \Psi(X, t) + \varepsilon(\Psi(X, t))$. Here, $\varepsilon(\Psi(X, t)) = \varepsilon_t^\top \Psi(X, t)$ with $\varepsilon_t \sim \mathcal{N}(0, \sigma^2)$ is used, where $\sigma$ is a meta-parameter of the exploration that can also be optimized in these algorithms. In the iPoWER algorithm, the target $X^g$ of a motor skill is used without any changes or updates. The length $T$ of the corresponding DMP can be reduced by the stop signal when $X_t = X^g$ and $t < T$. This means that during the RL process, robots can arrive at the target more quickly than the humans who demonstrated the motion. The stop signal is generated when the robot reaches the target within extremely small margins $|X^g \pm \epsilon|$ or when the number of execution time steps reaches a pre-selected length $(T + \delta)$. The policy parameters of a DMP are optimized, after which the parameters of its corresponding HMM are re-estimated while generating the motion trajectories from the optimized DMP.
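The exploration, reweighting, and stop-signal steps can be sketched as follows. This uses the simplified episode-based form of the PoWER update (a return-weighted average of the exploration noise over the best rollouts), which is an assumption and not necessarily the modified update used by iPoWER; the toy return function and all names are hypothetical.

```python
import numpy as np

def power_update(theta, epsilons, returns, n_best=10):
    """Reward-weighted update over the n_best rollouts (simplified PoWER)."""
    theta = np.asarray(theta, dtype=float)
    epsilons = np.asarray(epsilons, dtype=float)
    returns = np.asarray(returns, dtype=float)
    keep = np.argsort(returns)[::-1][:n_best]      # discard low-reward rollouts
    weights = returns[keep]
    step = (weights[:, None] * epsilons[keep]).sum(axis=0) / (weights.sum() + 1e-12)
    return theta + step

def stop_signal(x, target, t, max_len, margin=1e-3):
    """True when the target is reached within the margin or time runs out."""
    reached = np.all(np.abs(np.asarray(x) - np.asarray(target)) <= margin)
    return bool(reached or t >= max_len)

def improve(theta, return_fn, n_iter=50, n_rollouts=10, sigma=0.1, seed=0):
    """Repeat: explore with Gaussian noise, evaluate, reweight, update."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    for _ in range(n_iter):
        eps = rng.normal(0.0, sigma, size=(n_rollouts, theta.size))
        rets = np.array([return_fn(theta + e) for e in eps])
        theta = power_update(theta, eps, rets, n_best=5)
    return theta

# Toy problem (assumption): the return peaks when theta equals a fixed goal.
goal = np.array([1.0, -1.0])

def toy_return(theta):
    return float(np.exp(-np.sum((np.asarray(theta) - goal) ** 2)))

theta_final = improve(np.zeros(2), toy_return)
```

On a robot, `return_fn` would execute one rollout of the DMP with perturbed weights and accumulate the reward, ending the rollout early whenever `stop_signal` fires.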
In the generalization process, the gPoWER algorithm finds a new target and the policy parameters for achieving it. To do this, parameter $\tilde{t}$ is set to the time step ($\arg\max_t r(t)$) at which the reward value over all rollouts reaches its highest value $r_{max}$, and the new target is set to the robot configuration at that time step, $X(\tilde{t})$. Parameters $X^g$ and $T$ are thus determined by $r_{max}$, the highest reward in all rollouts, and by the time step in the rollout at which $r_{max}$ occurs. Parameters $\mathbf{w}$, $X^g$, and $T$ are incrementally updated until the parameters converge: $\mathbf{w}_{k+1} = \mathbf{w}_k$. Next, the parameters $\lambda$ of a HMM are estimated during motion generation with $\mathbf{w}$, $X^g$, and $T$. In this algorithm, the length of the new motor skill may increase relative to that of the initial motor skill. Therefore, the length $T$ of the motor skill is updated while estimating the new target and the parameters of the new motor skill. The length of the new motor skill can be optimized during the improvement process after the generalization process is complete.
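The target-estimation step can be sketched as a search for the highest-reward time step across all rollouts. The data layout (one list of configurations and one list of rewards per rollout) and the convention that the new length equals the index of that time step plus one are assumptions made for illustration; `estimate_new_target` is a hypothetical name.

```python
def estimate_new_target(rollouts):
    """Find the configuration and time step with the highest reward.

    rollouts: list of (configs, rewards) pairs, one pair per rollout, where
    configs[t] is the robot configuration and rewards[t] the reward at step t.
    Returns (new_target, new_length, r_max).
    """
    best = None
    for configs, rewards in rollouts:
        for t, r in enumerate(rewards):
            if best is None or r > best[0]:
                best = (r, t, configs[t])
    r_max, t_tilde, target = best
    return target, t_tilde + 1, r_max

# Two hypothetical rollouts of a 1-D configuration.
rollouts = [
    ([0.0, 0.1, 0.2], [0.1, 0.4, 0.3]),
    ([0.0, 0.2, 0.5], [0.1, 0.2, 0.9]),
]
new_target, new_length, r_max = estimate_new_target(rollouts)
```

Repeating this after every batch of rollouts gives the incremental update of the target and length described above.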
To calculate the expected return values for the improvement and generalization processes, a reward function should be defined. Its general equation is defined as
$$r(t) = \exp\left(-\left(\alpha\,(X^g - X(t)) + \beta\,\frac{1}{Y^s - Y(t)}\right)\right), \tag{7}$$
where $X$ and $Y$ indicate the values that can be measured from the robot and/or sensors. Here, these are only used to represent different variables. The superscripts $g$ and $s$ denote the target and starting values of each variable, which depend on the given task. Here, the term $(X^g - X(t))$ is used to obtain a high return value when the robot configuration is close to the target values, and the term $\frac{1}{Y^s - Y(t)}$ is used to obtain a high return value when it is far from the starting value. Parameters $\alpha$ and $\beta$ are constants to adjust the importance of each term. Equation (7) is designed to take the form of $\exp(-(\cdot))$; therefore, a lower value of either term provides a higher return value. In this framework, we can easily obtain the value of $X^g$ that the robot must finally achieve from human demonstrations. The robot can use the first term in Equation (7) to obtain the optimal path and optimal execution time step while achieving the target $X^g$. In contrast, the robot may need to be as far from the initial value $Y^s$ as possible depending on the given task. We define this case as the second term in Equation (7). Here, the initial value $Y^s$ can be easily obtained from a robot or its sensors.
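A reward of this shape can be sketched in a few lines. The use of summed absolute differences for both terms and the small constant `eps` that avoids division by zero at the starting value are assumptions added for the illustration; `general_reward` and its default weights are hypothetical.

```python
import math

def general_reward(x, x_goal, y, y_start, alpha=1.0, beta=1.0, eps=1e-6):
    """exp(-(alpha * closeness-to-target + beta / distance-from-start)).

    x, x_goal, y, y_start are equal-length sequences of per-axis values.
    """
    closeness = sum(abs(g - v) for g, v in zip(x_goal, x))
    distance = sum(abs(s - v) for s, v in zip(y_start, y))
    return math.exp(-(alpha * closeness + beta / (distance + eps)))

# The reward grows as x approaches its target and as y leaves its start.
near_target = general_reward(x=[0.0], x_goal=[0.0], y=[1.0], y_start=[0.0])
far_from_target = general_reward(x=[1.0], x_goal=[0.0], y=[1.0], y_start=[0.0])
far_from_start = general_reward(x=[0.0], x_goal=[0.0], y=[2.0], y_start=[0.0])
```

The weights `alpha` and `beta` trade off the two terms; setting one of them to zero removes the corresponding term for tasks that only need the other.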

Description of the Peg-in-Hole Task
To evaluate the proposed method, we applied our framework to the peg-in-hole task. This task consists of the following two motor skills: "hole search" (a parameter tuple that classifies the direction of a hole and generates the motion trajectories for searching for the hole based on reaction force/moment signals, given inaccurate current positions/postures of the peg and hole) and "peg insertion" (a parameter tuple that classifies reactions according to the direction of the hole and generates the motion trajectories for inserting the peg into the hole). Although the "hole search" and "peg insertion" motions can be performed using various strategies [? ], humans perform the "hole search" demonstrations by adopting a tilt-search strategy, as illustrated in Figure ??a, and they perform the "peg insertion" demonstrations by employing a two-point contact strategy, as indicated in Figure ??b-e. In the "hole search" motor skill, suitable tilting angles should be learned and then generated: if the tilting motion trajectories are too small, the hole may not be found because reaction force/moment signals cannot be measured; if they are too large, the "peg insertion" motor skill may not be performed because the peg is caught in the hole, even though the hole has been found by the "hole search" motor skill. Moreover, in the "peg insertion" motion, the robot must learn and generate suitable insertion motion trajectories for the peg according to the reaction force/moment signals, which depend on the relative positions and directions of the hole and peg. To achieve this, it is important to classify the relative positional and directional relationship between the peg and hole and then to generate the appropriate motion trajectories from the selected motor skills, as shown in Figure ??.
For this purpose, the peg-in-hole task was performed using the experimental setup illustrated in Figure ??. We used the UR3 robotic arm (developed by Universal Robots, Denmark) and the FT300 F/T sensor, 2-finger gripper, and a wrist camera (all developed by Robotiq, Canada). We conducted the experiments using five pegs and five holes with triangle, rectangle, pentagon, hexagon, and star shapes, as indicated in Figure ??a-e. The clearance between the pegs and holes was approximately 200 µm. The vision solution of the robot allows it to recognize the approximate peg and hole locations as well as their exact shapes. However, the peg cannot simply be inserted into the hole due to errors in the vision system of the wrist camera. The errors in the position and posture obtained in the experiments using this vision solution were approximately 5-10 mm and between 2° and 3°, respectively. Therefore, both types of motor skills are needed to complete the peg-in-hole task despite these errors.

Reward Functions of Imitation Learning and Self-Learning in the Peg-in-Hole Task
To calculate the expected return values for the "hole search" and "peg insertion" motor skills, two reward functions (i.e., $r_{hs}$ for "hole search" and $r_{pi}$ for "peg insertion") are, respectively, defined as
$$r_{hs}(t) = \exp\left(-\left(\alpha\, F_{(x,y,z)}(t) + \beta\, M_{(x,y,z)}(t) + \gamma\, R_{(x,y,z)}(t)\right)\right) \tag{8}$$
and
$$r_{pi}(t) = \exp\left(-\left(\alpha\, F_{(x,y,z)}(t) + \beta\, M_{(x,y,z)}(t) + \gamma\, P_z(t)\right)\right), \tag{9}$$
where $F$, $M$, $R$, and $P$ indicate the force, moment, rotation, and position measured from the robot, respectively. In particular, the variables $R$ and $P$ represent the robot configuration measured from the reference axis of the tool coordinate system. In addition, $F$ is calculated by $F_{(x,y,z)}(t) = |F_x^g - F_x(t)| + |F_y^g - F_y(t)| + |F_z^g - F_z(t)|$, and $M$ and $P$ are calculated using equations with a similar form. In contrast, $R_{(x,y,z)}(t)$ is calculated from the starting value $R^s$, following the reciprocal form of the second term in Equation (7). Here, superscripts $g$ and $s$ indicate the target and starting values of each variable depending on the given task, respectively. Subscripts $x$, $y$, and $z$ denote the axes of each measure. Further, parameters $\alpha$, $\beta$, and $\gamma$ are constants, which are used to adjust the weight of each term. In Equation (8), the motor skills should be able to determine the minimum tilting angle needed to quickly distinguish whether or not a hole exists. This is determined using reward function $r_{hs}$ for the "hole search" motor skill. Here, it increases when all axes of $F_{(x,y,z)}(t)$ and $M_{(x,y,z)}(t)$ at every time step are closer to all targets $F^g$ and $M^g$. In contrast, in Equation (9), the position of the z-axis is incorporated as a reward term for inserting the peg into the hole. The reward increases when the z-axis position $P_z(t)$ at every time step is closer to target $P_z^g$. It is possible for robots to determine the optimal motions. In these two equations, the reference values (targets or starting values) of $F$, $M$, $R$, and $P$ are set to zero for the peg-in-hole task.
In the imitation learning process, human demonstrations are provided by employing a kinesthetic teaching method, which conveys the motor skills of human performers to robots easily and rapidly. However, kinesthetic teaching by itself is not suitable for the peg-in-hole task, in which reaction classification is important for achieving the goal, because unintended reaction force/moment signals may be included in the human demonstrations. When it reproduces unintended reaction forces/moments, the robot does not achieve the goal of the motor skill. Thus, such unintended signals should be eliminated from the human demonstrations through robot self-reproduction. That is, the robot acquires the targets of the motor skills as well as the reward functions through this self-reproduction. Even after self-reproduction, robots may still not obtain an optimal solution. This can be addressed through RL during the improvement process.
A demonstration dataset must be modeled to enable reaction classification and motion generation. In this experiment, the robot should generate different motions depending on the relative positions/directions of the peg and the hole, as indicated in Figure ??. The states of both motor skills are defined as $X = \{F_{(x,y,z)}, M_{(x,y,z)}\}$ to enable the HMMs to classify the reaction force/moment. In the DMPs, the state of the "hole search" motor skill is defined as $X = \{R_{(x,y,z)}\}$, owing to the fact that a tilt search is performed without changing the peg's position. In contrast, the state of the "peg insertion" motor skill is defined as $X = \{F_{(x,y,z)}, R_{(x,y,z)}\}$ to control the force and rotation. Finally, the states of the two reward functions for RL are defined as $X = \{F_{(x,y,z)}, M_{(x,y,z)}, R_{(x,y,z)}\}$ (to calculate Equation (8)) and $X = \{F_{(x,y,z)}, M_{(x,y,z)}, P_z\}$ (to calculate Equation (9)). The states for HMMs, DMPs, and reward functions are summarized in Table ??.
Table 1. States for HMMs, DMPs, and reward for RL for "hole search" and "peg insertion" motor skills.

Results
First, we performed human demonstrations on the rectangle shape (Figure ??a) to learn and acquire the initial motor skills based on imitation learning. The different hole and peg shapes (Figure ??b-e) were used to evaluate the generalization of the motor skills. The results of these experiments can be seen in the supplemental video clip. Human performers provided demonstrations for the "hole search" and "peg insertion" motor skills, as shown in Figure ??. Figure ?? illustrates an example of the different demonstrations that were performed according to the initial hole and peg positions. Figure ??a shows the initial positions of the pegs with respect to the hole, and Figure ??b shows their clustering results. In this case, the points were clustered with the k-means clustering algorithm using the reaction force/moment measured at the initial positions. The robot acquired four motor skills for the "hole search" and four motor skills for the "peg insertion". The motion trajectories of the robot were extracted at 50 Hz using the kinesthetic teaching method, following which the training data were acquired through self-reproduction. Eight motor skills (i.e., the parameter tuples for classifying reaction signals and generating motion trajectories for four types of "hole search" and four types of "peg insertion") were learned using the training dataset. We configured the reaction classification and motion generation processes to be independent for rational task execution in the RL implementation. In this case, the classification of the reaction force/moment was configured to use five pairs of robot and sensor signals, and the classification rate was set to 50 Hz to ensure real-time performance.
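The clustering of initial contacts into four direction groups can be sketched as below. The experiments used the k-means implementation in scikit-learn; this block is a small NumPy-only stand-in with a deterministic farthest-point initialization so that it is self-contained, and the 2-D sample points are invented for illustration rather than measured data.

```python
import numpy as np

def kmeans(points, k, n_iter=50):
    """Lloyd's algorithm with a deterministic farthest-point initialization."""
    points = np.asarray(points, dtype=float)
    centers = [points[0]]
    for _ in range(1, k):
        # Distance of every point to its nearest already-chosen center.
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical reaction moments (two axes) measured at initial contact,
# two samples for each of four hole directions.
samples = [(1.0, 0.1), (0.9, -0.1), (-1.0, 0.1), (-1.1, 0.0),
           (0.0, 1.0), (0.1, 0.9), (0.0, -1.0), (-0.1, -1.1)]
labels, centers = kmeans(samples, k=4)
```

Each resulting cluster would correspond to one motor skill, so four clusters yield the four "hole search" and four "peg insertion" skills mentioned above.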
Next, the eight initial motor skills were improved using the RL process. For self-improvement of the "hole search" motor skills, the RL rollouts and their rewards were generated according to the following two steps: (i) acquiring rollouts and calculating rewards in a hole-free (blocked) location; and (ii) verifying rollouts and updating policy parameters at the hole position. The robot verified which of the rollouts actually found the hole at its location and set the rewards of the rollouts that failed to find the hole to zero. It then updated the policy parameters using all the rollouts. The verification process in Step (ii) is necessary because the rollouts and their rewards are obtained at a location where no hole exists. As mentioned in Section ??, the purpose of improving the "hole search" motor skill is to enable robots to identify whether a hole is present using the tilting angle. The self-improvement of the "peg insertion" motor skill was performed at the hole location. This process did not require a specific verification step for the acquired rollouts because it was performed at the hole location. The robot was able to identify the optimal path for inserting the peg into the hole within a short time. The return values increase with the number of iterations, and the number of execution time steps is reduced when using the reward functions and the iPoWER algorithm, as indicated in Table ??. For the "hole search" motor skill, the expected return rose from 0.7412 over 64 steps of motion in the first iteration to 0.9611 over 9 steps of motion after 300 iterations. For the "peg insertion" motor skill, it rose from 0.6233 over 75 steps of motion in the first iteration to 0.9249 over 10 steps of motion after 300 iterations.
This table illustrates that the reward functions and the iPoWER algorithm were effectively designed in terms of time and path optimization. Figure ?? compares the results of the iPoWER algorithm with those of the original PoWER algorithm: fewer execution time steps were needed when the iPoWER algorithm was used.
Figure 8. Performance with respect to two reward functions and the original PoWER and iPoWER algorithms in the peg-in-hole task for the motor skills: (a) "hole search"; and (b) "peg insertion". The upper and lower rows illustrate the return values and number of execution time steps, respectively. The red lines indicate the results using the iPoWER algorithm, and the blue lines indicate the results using the original PoWER algorithm.
Figure ?? illustrates the RL performance of our framework for the following cases: (a) the initial motor skill was represented by a HMM only; (b) the initial motor skill was represented by a DMP only; and (c) the initial motor skill was represented by both a HMM and a DMP (our framework). To evaluate Case (a), we performed RL after only learning the initial parameters $\lambda$ for a HMM and target $X^g$ from human demonstrations. However, the initial weight parameters for a DMP were randomly assigned without imitation learning (yellow line in Figure ??). This case can classify the reaction forces/moments, but it cannot initially generate their appropriate motion trajectories. The policy parameters of the DMP were improved by the iPoWER algorithm (Algorithm ??). In Case (b), we performed the RL process after only learning the initial weight parameters and the initial target $X^g$ for a DMP from human demonstrations. Here, the initial parameters $\lambda$ for a HMM were randomly assigned without imitation learning. In this case, the policy parameters and target of the DMP as well as the parameters of the HMM were generalized through the generalization process. Nevertheless, an unsuitable direction was used to generalize the DMP policy parameters and target, because the parameters of the HMM were not properly assigned. Both Cases (a) and (b) started with inappropriate policy parameters and targets. Here, we confirm that the RL process of Case (b) converged more quickly than that of Case (a), because it has the appropriate policy parameters to perform the peg-in-hole task, as illustrated by the blue and yellow lines of Figure ??, respectively. In contrast, in Case (c), our framework converged to the highest reward values in the fewest iterations, because it started with the parameters and target of a suitable DMP and the parameters of a suitable HMM, as illustrated by the red line of Figure ??.
Figure 9. Expected returns of policy parameters with respect to number of iterations in the peg-in-hole task: (a) (yellow line) using only a HMM (without a DMP learned from human demonstrations); (b) (blue line) using only a DMP (without a HMM learned from human demonstrations); and (c) (red line) using both a HMM and a DMP. The averages and variances of multiple RL trials from four different directions for the "hole search" and the "peg insertion" motor skills are shown.
Figures ?? and ?? illustrate the generalization of the motor skills learned for the rectangle shape to other shapes. Figure ?? presents the successful cases in which the motor skills learned for the rectangle shape could be used for other shapes without any generalization process. In contrast, Figure ??a presents the failure cases, in which the learned motor skills could not be applied to the other shapes. The failures usually occurred when determining the directions of holes in different shapes. The "peg insertion" motor skills, by comparison, could be used without any RL process, even for the other shapes.
The gPoWER algorithm (Algorithm ??) was used to solve this problem. Figure ??b illustrates the successful results obtained after new motor skills were added through the RL-based generalization process. Table ?? demonstrates the need for both the improvement and generalization processes. The initial motor skills converged after 35-41 iterations during the improvement process. Generalizations of the improved motor skills converged approximately two to three times faster than generalizations of the initial motor skills. Furthermore, the generalization of the improved motor skills was efficient even in the absence of initial motor skills acquired from human demonstrations. These results confirm that the robot can generalize motor skills even for unfamiliar shapes in the peg-in-hole task.
In these experiments, we generated a stop signal when the robot did not reach its target within X ± 0.001 or when the policy parameters did not converge within 300 iterations. When only the RL process was used without imitation learning, the targets of the RL process were not satisfied, as shown in Figure ??c. We implemented our programs in Python by modifying open-source code for the DMP, HMM, and PoWER algorithms (available from [? ? ? ]). In addition, the k-means and BIC algorithms were run using the scikit-learn library [? ]. Finally, the TM was manually created based on [? ]. All of these experiments were performed on a PC (CPU: Intel i7-6700 3.40 GHz, RAM: 32.0 GB) running Windows 10 and Python 3.6.
Table 3. Comparison of the number of iterations used in the improvement and generalization processes.

Improvement/Generalization | Motor skill | # of Iterations
Improvement of motor skills for the rectangle shape from initial motor skills for the rectangle shape | | 35
Improvement of motor skills for the triangle shape from initial motor skills for the triangle shape | | 37
Improvement of motor skills for the star shape from initial motor skills for the star shape | | 41
Generalization of motor skills for the triangle shape from initial motor skills for the rectangle shape | "hole search" | 95
Generalization of motor skills for the triangle shape from improved motor skills for the rectangle shape | "hole search" | 46
Generalization of motor skills for the star shape from initial motor skills for the rectangle shape | "hole search" | 146
Generalization of motor skills for the triangle shape from improved motor skills for the rectangle shape | "hole search" | 59
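The stop condition used in these experiments (target tolerance of 0.001 and a budget of 300 iterations) can be sketched as follows. This is only an illustrative reading of the description, not the code used in the experiments: the function names, the per-coordinate tolerance check, and the assumption that the target check is made at the end of a rollout are all hypothetical.

```python
from typing import Sequence

TOLERANCE = 0.001      # tolerance around the target X, as stated in the text
MAX_ITERATIONS = 300   # iteration budget, as stated in the text


def reached_target(position: Sequence[float], target: Sequence[float],
                   tol: float = TOLERANCE) -> bool:
    """Return True if every coordinate lies within +/- tol of the target.

    Assumption: the tolerance is applied per coordinate.
    """
    if len(position) != len(target):
        raise ValueError("position and target must have the same dimension")
    return all(abs(p - t) <= tol for p, t in zip(position, target))


def stop_signal(position: Sequence[float], target: Sequence[float],
                iteration: int, converged: bool) -> bool:
    """Return True when the trial should be stopped.

    Assumption: `position` is the robot state at the end of a rollout and
    `converged` is supplied by the RL loop (hypothetical flag).
    """
    missed_target = not reached_target(position, target)
    out_of_budget = iteration >= MAX_ITERATIONS and not converged
    return missed_target or out_of_budget
```

For example, `stop_signal([0.102, 0.2], [0.1, 0.2], iteration=10, converged=False)` signals a stop because the first coordinate is 0.002 away from the target.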

Discussion
The proposed framework for the peg-in-hole task uses a mixture of imitation learning and RL. The peg-in-hole task requires reaction force/moment classification and reaction motion generation because of errors in the sensors and robot actuators. Therefore, a robot should be able to continuously classify reactions and generate appropriate motion trajectories. To achieve this, motor skills are represented by concatenating the model parameters for reaction classification and motion generation. In general, an HMM and a DMP exhibit superior capabilities for time-varying classification and motion generation, respectively, as mentioned in [? ]. Therefore, we used both models in this framework to exploit their respective advantages. We refer to these concatenated parameter tuples as motor skills. HMMs alone or DMPs alone can be used for classification and motion generation, but they perform worse than the combination of HMMs and DMPs (see Figure ??).
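As a minimal sketch of this representation, a motor skill can be held as one record that groups the HMM parameters with the DMP weight parameters and target X, and a library of motor skills as a mapping from names to such records. The class name, field names, and numeric values below are hypothetical, and the HMM emission parameters are omitted for brevity; this is not the data structure used in our implementation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MotorSkill:
    """One motor skill: classifier (HMM) plus generator (DMP) parameters."""
    hmm_initial: List[float]           # HMM initial-state probabilities
    hmm_transition: List[List[float]]  # HMM state-transition matrix
    dmp_weights: List[float]           # DMP policy (weight) parameters
    dmp_target: List[float]            # DMP target X

    def concatenated(self) -> List[float]:
        """Flatten all parameters into a single concatenated vector."""
        flat = list(self.hmm_initial)
        for row in self.hmm_transition:
            flat.extend(row)
        flat.extend(self.dmp_weights)
        flat.extend(self.dmp_target)
        return flat


# Hypothetical library with a single entry; values are placeholders.
library: Dict[str, MotorSkill] = {
    "hole search": MotorSkill(
        hmm_initial=[1.0, 0.0],
        hmm_transition=[[0.9, 0.1], [0.0, 1.0]],
        dmp_weights=[0.0, 0.0, 0.0],
        dmp_target=[0.0, 0.0, 0.0],
    )
}
```

Keeping both parameter sets in one record means that adding a new motor skill during generalization amounts to adding one more entry to the library.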
The proposed framework was evaluated for the peg-in-hole task; however, it can be used for various robotic tasks. The algorithms presented in Algorithms ??-?? are task-independent and can be used without modification to improve and generalize the motor skills required for such tasks. Only two elements need to be prepared for a new task: (i) the reward functions; and (ii) human demonstrations. Designing the reward functions is the most important and difficult process in RL design. The target configurations of robots and/or objects obtained from human demonstrations can be useful in designing reward functions. It is also necessary to acquire the initial parameters of the motor skills (that is, the parameters of the HMMs and DMPs) from human demonstrations of the target task. The robot can acquire the initial parameters of a motor skill once at least one demonstration has been performed [? ]. Thereafter, it can use the proposed algorithms to create various motor skills that are automatically optimized by the improvement and generalization processes. After these two elements have been provided, the robot can obtain motor skills for which the number of execution time steps and the path have been optimized over several iterations.
For example, this framework could be considered for use in a variety of industrial applications such as polishing, machine tending, soldering, painting, cutting, grinding, deburring, and inspection. First, a human provides demonstrations to a robot. Next, the robot learns initial motor skills and collects the parameters of X and Y for the reward functions from the human demonstrations. Here, the motor skills and the parameters of X and Y can be modeled and extracted using various types of information (e.g., joints, positions, postures, velocities, forces, and/or torques) from the human demonstrations, depending on the purpose of the motor skills.
Finally, the robot can optimize or generalize motor skills through the improvement/generalization processes. However, it is necessary to determine which information from the human demonstrations is to be modeled or used in this process. Automating this decision is not considered in this paper; it must be made by a human.
In general, a robot needs to perform a sequence of motor skills to complete its task. In other words, motor skills must be selected from a library of multiple motor skills. However, many researchers have focused on a single motor skill [? ? ? ? ]. They proposed ways to improve the performance of a motor skill but did not consider generalization, that is, reusing the learned motor skills for other similar tasks (e.g., from the "rectangle" to the "triangle" peg-in-hole motor skills). In contrast, our proposed framework can handle a library of multiple motor skills based on the concatenated parameters. Furthermore, Algorithms ??-?? provide a way to improve existing motor skills (optimizing paths and reducing execution time steps) as well as to add new motor skills.
In the human demonstrations, we adopted tilt search and two-point contact strategies for the "hole search" and "peg insertion" motor skills. This is because humans tend to accomplish the peg-in-hole task by tilting a peg into a hole, as analyzed in [? ]. The authors of [? ] also determined the most appropriate tilting angle for inserting the peg. In this study, a suitable tilting angle for the tilt search was learned through RL. Moreover, our aim was to predict and select appropriate motor skills from current reaction signals using learned motor skills. By contrast, reaction classification is difficult to use with other peg-in-hole strategies, such as the spiral path and spray paint strategies, because of the uncertainties in the initial positions/poses of holes and pegs. These strategies tend to generate motions according to a set of predefined rules without any classification process.
The reward functions include terms for the force, moment, z-axis robot position, robot rotation, and time step. Three meta-parameters control the weights of these terms individually because the terms deal with different information types (see Equations (8) and (9)). We can change the importance of each term in the reward function by adjusting these meta-parameters. We assigned the largest weighting values to the force/moment terms in the "hole search" motor skill and to the z-position term in the "peg insertion" motor skill. In other words, the largest weights should be assigned to the terms that have the most significant impact on the success of the task.
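The effect of such weighting can be illustrated with a generic weighted sum of penalty terms. The sketch below does not reproduce Equations (8) and (9): the functional form, the per-term weights, the term names, and all numeric values are hypothetical and serve only to show how reweighting changes which rollout is preferred.

```python
from typing import Dict


def weighted_penalty(terms: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of penalty terms (lower is better).

    Assumption: each term is already a non-negative, normalized penalty.
    """
    return sum(weights[name] * value for name, value in terms.items())


# Two hypothetical rollouts: A has small contact forces but a large
# z-position error; B has large contact forces but a small z-position error.
rollout_a = {"force": 0.2, "moment": 0.1, "z_position": 0.8,
             "rotation": 0.1, "time_step": 0.3}
rollout_b = {"force": 0.7, "moment": 0.4, "z_position": 0.1,
             "rotation": 0.1, "time_step": 0.3}

# Hypothetical weights emphasizing force/moment ("hole search") ...
hole_search_weights = {"force": 5.0, "moment": 5.0, "z_position": 1.0,
                       "rotation": 1.0, "time_step": 1.0}
# ... and emphasizing the z-position ("peg insertion").
peg_insertion_weights = {"force": 1.0, "moment": 1.0, "z_position": 5.0,
                         "rotation": 1.0, "time_step": 1.0}

prefers_a_for_search = (weighted_penalty(rollout_a, hole_search_weights)
                        < weighted_penalty(rollout_b, hole_search_weights))
prefers_b_for_insertion = (weighted_penalty(rollout_b, peg_insertion_weights)
                           < weighted_penalty(rollout_a, peg_insertion_weights))
```

With the force-heavy weights rollout A is preferred, whereas with the z-heavy weights rollout B is preferred, which mirrors the different emphases chosen for the two motor skills.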

Conclusion and Future Work
We proposed a framework for learning, improving, and generalizing the "hole search" and "peg insertion" motor skills for the peg-in-hole task. In this framework, motor skills are acquired using a mixture of imitation learning and RL. Reaction classification and motion generation are required in the peg-in-hole task owing to errors in the sensors and actuators. The robot learns the initial motor skills for classifying the reactions and generating the appropriate trajectories from human demonstrations. We designed a motor skill parameter tuple by concatenating the HMM parameters for reaction classification and the DMP parameters for motion generation. The initial motor skills learned through imitation learning are then improved for familiar reaction signals or generalized for unfamiliar reaction signals by means of RL. We distinguish the improvement and generalization processes as follows: improvement updates the policy parameters using the RL process without changing the target of the DMP, whereas generalization adds new parameter tuples after modifying and updating the policy parameters and target of a DMP. The choice between these processes is determined by the HMMs and the TM. The generalization process is selected when the likelihood of the TM is higher than those of all the HMMs, and the improvement process is selected when the likelihood of one of the HMMs is higher than that of the TM.
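The selection rule between the two processes can be sketched as a comparison of likelihoods. The function name, the use of log-likelihoods, and the tie-breaking choice (a tie is treated as improvement) are assumptions made for illustration; the sketch also assumes that at least one HMM is available.

```python
from typing import Dict, Optional, Tuple


def select_process(hmm_loglik: Dict[str, float],
                   tm_loglik: float) -> Tuple[str, Optional[str]]:
    """Choose between improvement and generalization.

    hmm_loglik maps each learned motor skill to the log-likelihood of the
    current reaction signal under its HMM; tm_loglik is the log-likelihood
    under the TM. Returns the process name and, for improvement, the
    motor skill whose HMM scored highest.
    """
    best_skill = max(hmm_loglik, key=hmm_loglik.get)
    if tm_loglik > hmm_loglik[best_skill]:
        # The TM beats every HMM: the reaction is unfamiliar.
        return "generalization", None
    # At least one HMM scores as high as the TM: the reaction is familiar.
    return "improvement", best_skill
```

For instance, with hypothetical scores `{"hole search": -10.0, "peg insertion": -25.0}` and a TM score of `-18.0`, the rule selects improvement of the "hole search" motor skill.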
We evaluated these processes by applying them to different peg and hole shapes. The "hole search" and "peg insertion" motor skills learned for the rectangle shape were generalized to triangle, pentagon, hexagon, and star shapes. The algorithms and reward functions improved the paths of the initial motor skills and reduced the number of execution time steps.
In the future, we will analyze the manner in which various humans perform peg-in-hole tasks and compare our reward functions with those learned through inverse RL to enable interpretation. We will verify our framework on industrial applications and various other robotic tasks. In addition, we will propose a method to determine which information in the human demonstrations should be noted and to use it in modeling motor skills and reward functions.