Motion Planning of Robot Manipulators for a Smoother Path Using a Twin Delayed Deep Deterministic Policy Gradient with Hindsight Experience Replay

Abstract: To enhance the performance of robot systems in the manufacturing industry, it is essential to develop motion and task planning algorithms. In particular, the motion plan should be generated automatically so that various working environments can be handled. Although PRM (Probabilistic Roadmap) provides a feasible path when the starting and goal positions of a robot manipulator are given, the path might not be smooth enough, which can lead to inefficient performance of the robot system. This paper proposes a motion planning algorithm for robot manipulators based on the twin delayed deep deterministic policy gradient (TD3), a reinforcement learning algorithm tailored to MDPs (Markov Decision Processes) with continuous actions. Since path planning for a robot manipulator is an MDP with sparse reward, hindsight experience replay (HER) is combined with TD3 to enhance sample efficiency. The proposed algorithm is applied to 2-DOF and 3-DOF manipulators, and it is shown that the designed paths are smoother and shorter than those designed by PRM.


Introduction
In the Industry 4.0 era, robots and related technologies are fundamental elements of assembly systems in manufacturing; for instance, efficient robot manipulators for various tasks in assembly lines, control of robots with high accuracy, and optimization methods for task scheduling [1,2].
When a task is assigned by a high-level task scheduler, the manipulator has to move its end-effector from the starting point to the goal point without colliding with any obstacles or other robots. To this end, motion planning algorithms tell robot manipulators how to change their joint angles so that the end-effector reaches the goal point without collision. Currently, in practice, human experts teach robot manipulators how to move in order to conduct various predefined tasks; that is, a robot manipulator learns from human experts how to change its joint angles for a given task. However, when the tasks or the working environments change, such a manual teaching procedure has to be repeated. The other downside of the current approach is optimality, or efficiency: even though a robot manipulator that learns from human experts can perform a given task successfully, it is not clear whether it moves optimally. Therefore, it is important to teach robot manipulators an optimal path automatically when a task is given. Using policy search-based reinforcement learning, this paper presents a motion planning algorithm for robot manipulators which makes it possible to generate an optimal path automatically; the resulting path is smoother compared with existing results [3][4][5].
For robot path planning, sampling-based algorithms find feasible paths for the robot manipulator using a graph consisting of randomly sampled nodes and the edges connecting them in the given configuration space [6,7]. PRM (Probabilistic Roadmap) and RRT (Rapidly-exploring Random Tree) are two representative sampling-based planning algorithms. PRM consists of two phases. The learning phase samples nodes randomly from the collision-free subset of the configuration space and makes edges by connecting the sampled nodes, thereby constructing a graph. The query phase finds the optimal path connecting the starting node and the goal node in the graph [8][9][10][11][12][13]. Note that the resulting PRM path is made by connecting the sampled nodes in the configuration space; usually, it is not smooth and might be longer than the optimal path. RRT samples nodes from the neighborhood of the starting point in the configuration space, constructs a tree by finding a feasible path from the starting node, and expands the tree until the goal point is reached. It works for various environments and can generate a path quickly, but its optimality is not guaranteed in general [14,15]. More recently, Fast Marching Methods (FMMs) using level sets have been proposed for path planning. FMMs are mainly about efficiently solving the Eikonal equation, whose solution provides the optimal path. It has been shown that FMMs lead to an asymptotically optimal path and faster convergence than PRM and RRT [16]. Since FMMs, PRM, and RRT are sampling-based approaches, they need a large number of sampled nodes in high-dimensional configuration spaces in order to obtain a smoother path, which means that they are computationally demanding when calculating the optimal path for arbitrary given starting and ending points. They can also suffer from memory shortage in high-dimensional spaces.
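The two PRM phases described above can be illustrated with a minimal sketch on a 2-D configuration space. The helpers `sample_free` (draws a collision-free configuration) and `is_edge_free` (checks a straight-line edge for collision) are hypothetical user-supplied callables, not code from the paper:

```python
import heapq
import math
import random

def prm(sample_free, is_edge_free, n_nodes=200, k=8):
    """Learning phase: sample nodes, connect each to its k nearest
    neighbors when the edge is collision-free, and return the roadmap
    as (node list, adjacency lists with Euclidean edge costs)."""
    nodes = [sample_free() for _ in range(n_nodes)]
    dist = lambda p, q: math.hypot(p[0] - q[0], p[1] - q[1])
    adj = {i: [] for i in range(n_nodes)}
    for i, p in enumerate(nodes):
        nbrs = sorted(range(n_nodes), key=lambda j: dist(p, nodes[j]))[1:k + 1]
        for j in nbrs:
            if is_edge_free(p, nodes[j]):
                w = dist(p, nodes[j])
                adj[i].append((j, w))
                adj[j].append((i, w))
    return nodes, adj

def query(adj, start, goal):
    """Query phase: Dijkstra search for the shortest path (as a list
    of node indices) between two roadmap nodes, or None if unreachable."""
    best, prev = {start: 0.0}, {start: None}
    pq = [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > best.get(u, float("inf")):
            continue  # stale heap entry
        if u == goal:
            break
        for v, w in adj[u]:
            nd = d + w
            if nd < best.get(v, float("inf")):
                best[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    if goal not in prev:
        return None
    path, u = [], goal
    while u is not None:
        path.append(u)
        u = prev[u]
    return path[::-1]
```

Note how the returned path simply chains sampled nodes, which is exactly why PRM paths tend not to be smooth.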
In the proposed method, by contrast, once the TD3 agent has been trained, computing a path for new starting and goal points only requires evaluating the trained neural network, which is computationally cheap.
Reinforcement learning is a machine learning approach which finds an optimal policy for an MDP (Markov Decision Process). The agent applies an action according to its policy to the environment, and the environment returns the next state and a reward. The agent seeks the optimal policy such that the sum of rewards over the horizon is maximized [17,18]. There are two typical approaches to finding the optimal policy. Value-based approaches estimate the optimal state (or action) value function and derive the corresponding policy from that estimate [19][20][21]. Policy gradient approaches, on the other hand, search for the optimal policy directly from state and reward data, and often show better performance, especially for continuous action spaces [22][23][24][25][26][27][28]. Recently, deep learning-based control and operation of robot manipulators have drawn much attention. In [29,30], robot path planning methods are proposed using a deep Q-network algorithm with emphasis on learning efficiency. In [31], a stereo image is used to train DDPG (Deep Deterministic Policy Gradient) for path planning. In [32], a real robot is trained using reinforcement learning for its path planning. This paper presents a policy gradient-based path planning algorithm. To this end, RAMDP (Robot Arm Markov Decision Process) is defined first. In RAMDP, the state is the joint angle of the robot manipulator and the action is the variation of the joint angle. Then, TD3 (twin delayed DDPG) with HER (Hindsight Experience Replay) is employed to search for the optimal policy [24,33]. A DDPG-type method is applied since the action in RAMDP is continuous and DDPG is devised for MDPs with continuous actions; TD3 enhances DDPG with better convergence properties and reduced overestimation.
HER is well suited to robot path planning since RAMDP is an MDP with sparse reward. Sparse reward means that, when an MDP has finite-length episodes with a specific goal state and episodes frequently end at non-goal states (failed episodes), the agent receives little reward. Since reinforcement learning finds the optimal policy by maximizing the sum of rewards, sparse reward is a critical difficulty. As episodes are saved in the memory, HER relabels the last state of a failed episode as a new goal; the failed episode then becomes a normal episode which ends at its (new) goal state. Hence, HER enhances sample efficiency and fits robot path planning well. In the proposed algorithm, when the next state is computed after applying the action, collision with obstacles and goal attainment are checked. It turns out that many episodes end at non-goal states in the middle of learning, which is why conventional reinforcement learning algorithms do not work well for robot path planning. Using HER, however, those episodes can be converted into normal episodes which end at goal states. The contribution of this paper is thus a path planning algorithm using TD3 with HER. The proposed method is applied to 2-DOF and 3-DOF robot manipulators in simulation; experimental results are also shown for a 3-DOF manipulator. In both cases, it is quantitatively demonstrated that the resulting paths are shorter than those obtained by PRM.

Configuration Space and Sampling-Based Path Planning
In sampling-based path planning, the configuration space Q (also called the joint space for robot manipulators) represents the space of possible joint angles and is a subset of the n-dimensional Euclidean space R n , where n denotes the number of joints of the manipulator. The values of the joint angles of a robot manipulator are denoted as a point in Q [1,2,34]. The configuration space consists of two subsets: the collision-free space Q free , and Q collide , in which the robot manipulator collides with obstacles or itself. For motion planning, a discrete representation of the continuous Q free is generated by random sampling, and a connected graph (roadmap) is obtained. The nodes in the graph denote admissible configurations of the robot manipulator, and the edges connecting any two nodes denote feasible paths (trajectories) between the corresponding configurations. Finally, when the starting and goal configurations q 0 and q goal are given, a graph search algorithm is employed to find the shortest path connecting q 0 and q goal . Such a shortest path exists because q 0 and q goal are two nodes on the connected graph.

Reinforcement Learning
Reinforcement learning is an optimization-based method to solve an MDP (Markov Decision Process) [18]. An MDP is described by {S, A, P, R, γ}, where S is the set of states and A the set of actions. P consists of P(s′|s, a), the probability that the current state s ∈ S transitions to the next state s′ ∈ S under action a ∈ A. R stands for the reward function and γ ∈ [0, 1] is the discount factor. The agent's policy π(a|s) is the distribution of the action a given the state s. In reinforcement learning, the agent takes action a t according to the policy π at time t and state s t , and the environment returns the next state s t+1 and reward r t+1 according to the transition P and the reward function R. By repeating this, the agent updates its policy so as to maximize its expected return E π [∑ ∞ k=0 γ k r t+k+1 ]. The resulting optimal policy is denoted by π * . To find the optimal policy, value-based methods like DQN (Deep Q-Network) estimate the optimal value function (i.e., the maximum return) and derive the corresponding policy [19]. Policy gradient methods, on the other hand, compute the optimal policy directly from samples; REINFORCE, the actor-critic method, DPG (Deterministic Policy Gradient), DDPG (Deep DPG), A3C (Asynchronous Advantage Actor-Critic), and TRPO (Trust Region Policy Optimization) are representative examples [22,35,36]. Training performance in reinforcement learning depends heavily on the samples, i.e., the collected sets of state, action, and next state. Hence, in addition to the various reinforcement learning algorithms, much research has studied how to use episodes efficiently for better agent learning, for example, replay memory [19] and HER (Hindsight Experience Replay) [33].
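The return being maximized above can be illustrated with a tiny helper that evaluates the discounted sum of rewards of one finite episode (a generic sketch, not code from the paper):

```python
def discounted_return(rewards, gamma):
    """Compute sum_k gamma^k * r_{t+k+1} for a finite list of rewards.
    Iterating backwards turns the sum into a running recursion
    g <- r + gamma * g, avoiding explicit powers of gamma."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, three rewards of 1 with gamma = 0.5 give 1 + 0.5 + 0.25 = 1.75, showing how the discount factor weights near-term reward more heavily.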
In this paper, for the sake of designing a motion planning algorithm, a policy gradient method called TD3 (Twin Delayed Deep Deterministic Policy Gradient) is used for path planning.

RAMDP (Robot Arm Markov Decision Process) for Path Planning
In order to develop reinforcement learning (RL) based path planning, the robot arm MDP (RAMDP) needs to be defined properly first [17]. The state q t of the RAMDP is the vector of joint angles of the manipulator, which belongs to the configuration space Q. Hence, for collision-free operation, the state q t is restricted to Q free ⊂ R n , where n is the number of joints of the manipulator. The action of the RAMDP is the joint angle variation. Unlike MDPs with discrete states and actions such as frozen-lake [17], the RAMDP under consideration has a continuous state (joint angles) and a continuous action (joint angle variations). Because of this, DDPG or its variants are suited to finding the optimal policy for the agent in the RAMDP. In this paper, TD3 (twin delayed DDPG) is employed to find an optimal deterministic policy µ for the continuous action in the RAMDP. Figure 1 describes how the sample (q t , a t , q t+1 , r t+1 ) is generated in the RAMDP. Suppose that an arbitrary initial state q 0 and goal state q goal are given, and that the maximum length of an episode is T. At state q t , if the action a t is applied to the RAMDP, a tentative next state q̂ t+1 is produced. Then, according to the reward function in (1), the next state q t+1 and the reward r t+1 are determined. In view of (1), if q̂ t+1 does not belong to Q free , the next state is set to the current state. Furthermore, if the next state is the goal point, the reward is 0; otherwise the reward is −1. The whole procedure is repeated until the episode ends. Note that this reward function leads to as short a path as possible, since TD3 tries to maximize the sum of rewards and a longer path accumulates more −1 rewards.
There are two possibilities for an episode to end: (1) the next state becomes the goal state, i.e., e = [(q 0 , a 0 ), (q 1 , r 1 , a 1 ), · · · , (q goal , r g )], or (2) the length of the episode reaches T, i.e., e = [(q 0 , a 0 ), (q 1 , r 1 , a 1 ), · · · , (q T , r T )] with q T ≠ q goal . In both cases, every sample of the form (q t , a t , r t , q t+1 ) is saved in the memory. Note that when q̂ t+1 ∉ Q free , the next state q t+1 is set to the current state q t in order to avoid collision. If this happens repeatedly, the agent is trained such that the corresponding episode does not occur. Based on the samples in the memory, the optimal policy is determined by TD3, which is explained in the next subsection.
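The transition and reward logic described above can be sketched as follows. Here `in_free_space` stands in for an assumed collision checker, and the goal tolerance `tol` is an illustrative assumption (the paper does not specify one):

```python
import numpy as np

def ramdp_step(q_t, a_t, q_goal, in_free_space, tol=1e-2):
    """One RAMDP transition: the tentative next state is q̂ = q_t + a_t.
    If q̂ leaves Q_free, the state stays put to avoid collision.
    The sparse reward is 0 at the goal and −1 everywhere else."""
    q_hat = q_t + a_t
    q_next = q_hat if in_free_space(q_hat) else q_t
    reached = np.linalg.norm(q_next - q_goal) < tol
    reward = 0.0 if reached else -1.0
    return q_next, reward, reached
```

Repeating this step until `reached` is true (or a step budget T is hit) produces exactly the two episode types listed above.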

TD3 (Twin Delayed Deep Deterministic Policy Gradient)
In this section, TD3 [24], which is used to search for the optimal policy in the proposed algorithm [17,22,35,36], is introduced. It is assumed that a sufficient number of samples (s t , a t , r t , s t+1 ) are stored in the memory. Figure 2 describes the structure of TD3, whose basic structure is the actor-critic network. However, unlike the standard actor-critic network, there are two critic deep neural networks (DNNs), one actor DNN, and a target DNN for each critic and actor DNN. Hence, there are 6 DNNs in TD3 (2 critic DNNs and their target DNNs, 1 actor DNN and its target DNN). As seen in Figure 3, the critic DNNs generate estimates Q θ i (s t , a t ), i = 1, 2, of the state-action value function. The input of each critic DNN is a mini-batch from the memory, i.e., a finite set of samples (s t , a t , r t , s t+1 ), and its output is Q θ i (s t , a t ). In TD3, two critic DNNs are used in order to remove overestimation bias in the approximation of the Q function; overestimation can take place when bad states are overestimated due to accumulated noise. To cope with this, TD3 chooses the smaller of the two critic target DNN outputs as the target value. In Figure 3, θ 1 and θ 2 are the parameters of the two critic DNNs, and θ′ 1 and θ′ 2 those of the corresponding target DNNs. To train the critic DNNs, the quadratic cost of the temporal difference error δ := Q θ (q, a) − y is minimized, where Q θ (q, a) stands for the state-action value function parameterized by θ, y = r + γ min i=1,2 Q θ′ i (q′, ã) is the target value of Q θ (q, a), and the target action (i.e., the action used in the critic target DNNs) is defined as ã = µ φ′ (q′) + ε̄, where the noise ε̄ follows the clipped normal distribution clip(N (0, σ), −c, c), c > 0. That is, ε̄ is a random variable drawn from N (0, σ) and restricted to the interval [−c, c].
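The target-value computation above (smoothed target action plus the clipped double-Q minimum) can be sketched as follows. The actor and critic target networks are placeholder callables standing in for trained DNNs, not the paper's implementation:

```python
import numpy as np

def td3_target(r, q_next, actor_target, critic1_target, critic2_target,
               gamma=0.99, sigma=0.2, c=0.5, rng=None):
    """Clipped double-Q target of TD3:
    y = r + gamma * min_i Q'_i(q', mu'(q') + clipped noise)."""
    if rng is None:
        rng = np.random.default_rng(0)
    mu = actor_target(q_next)
    # Target policy smoothing: clipped Gaussian noise on the target action.
    eps = np.clip(rng.normal(0.0, sigma, size=np.shape(mu)), -c, c)
    a_tilde = mu + eps
    # Taking the minimum of the two critics suppresses overestimation bias.
    y = r + gamma * min(critic1_target(q_next, a_tilde),
                        critic2_target(q_next, a_tilde))
    return y
```

The `min` over the two critic outputs is the key step: whichever critic overestimates, the smaller (more conservative) value is used as the regression target for both critics.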
The inputs of the actor DNN are both Q θ 1 (q t , a t ) from the critic DNN and the mini-batch from the memory, and its output is the action. Precisely, the action is given by a t = µ φ (q t ) + ε, where φ is the parameter of the actor DNN and µ φ (q t ) is its output, a deterministic and continuous value. The noise ε follows the normal distribution N (0, σ) and is added for exploration. To tune the parameter φ, the following cost function is minimized.
where d µ (q) denotes the distribution of the state. Note that the gradient ∇J µ (φ) (equivalently, ∇Q θ (q, a)) is used to update the parameter φ; this is why the method is called a policy gradient. Of the two outputs from the two critic target DNNs, the common target value in (3) is used to update both critic DNNs. Also, TD3 updates the actor DNN and all three target DNNs only every d steps in order to avoid premature convergence. Note that the policy µ φ (q) is updated proportionally to Q θ 1 (q, µ φ (q)) only [24]. The parameters of the critic target DNNs and the actor target DNN are updated according to θ′ ← τθ + (1 − τ)θ′ every d steps, which not only keeps the temporal difference error small but also makes the target DNN parameters change slowly.
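The soft target update θ′ ← τθ + (1 − τ)θ′ can be illustrated with a minimal sketch over plain parameter lists (an illustrative helper, not the paper's code):

```python
def soft_update(target_params, params, tau=0.005):
    """Polyak averaging: blend each online parameter into its target
    copy, theta' <- tau * theta + (1 - tau) * theta'. A small tau makes
    the target networks track the online networks slowly."""
    return [tau * p + (1.0 - tau) * tp
            for tp, p in zip(target_params, params)]
```

With a small τ the targets move only a fraction of the way toward the online parameters at each delayed update, which stabilizes the regression target y.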

Hindsight Experience Replay
Since the agent in reinforcement learning is trained from samples, it is of utmost importance to have helpful samples, i.e., state-action pairs that improve the action value function. However, many unhelpful samples are generated in RAMDP since many episodes end without reaching the goal; in other words, RAMDP is an MDP with sparse reward. To enhance sample efficiency, HER (Hindsight Experience Replay) is employed in this paper. For an episode e = [(q 0 , a 0 ), (q 1 , r 1 , a 1 ), · · · , (q T , r T )] in the memory where q T is not the goal state, HER resets q T as q goal . This means that even though the episode did not end at the original goal state, after modification by HER it ends at the (new) goal state. The failed episodes can then be used to train the agent, since the modified episodes are goal-achieving episodes. Algorithm 1 summarizes the proposed algorithm introduced in this section.

Algorithm 1 (excerpt, HER relabeling):
for t = 0 to T − 1 do
    Sample a transition (q t ||q goal , a t , r t , q t+1 ||q goal ) from L
    r t := r(q t , a t , q goal )
    Store the transition (q t ||q goal , a t , r t , q t+1 ||q goal ) in R
end for
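The relabeling idea can be sketched as follows, with a hypothetical `reward_fn` implementing the sparse reward (0 at the goal, −1 otherwise). The episode format and helper names are illustrative, not taken from Algorithm 1 verbatim:

```python
def her_relabel(episode, reward_fn):
    """Hindsight relabeling: treat the episode's final state as the new
    goal and recompute every reward, so a failed episode becomes a
    goal-achieving one. `episode` is a list of (q_t, a_t, q_next) tuples;
    each output transition is (q_t, goal, a_t, r_t, q_next)."""
    new_goal = episode[-1][2]  # last visited state becomes the goal
    relabeled = []
    for q_t, a_t, q_next in episode:
        r = reward_fn(q_next, new_goal)
        relabeled.append((q_t, new_goal, a_t, r, q_next))
    return relabeled
```

The final transition of the relabeled episode always earns reward 0 (the goal is reached by construction), so every stored episode now contributes a non-trivial learning signal.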

Case Study for 2-DOF and 3-DOF Manipulators
In order to show the effectiveness of the proposed method, it is applied to robot manipulators. Table 1 shows the information of the 2-DOF and 3-DOF manipulators used. For easy visualization, the proposed algorithm is applied to the 2-DOF manipulator first. Table 2 summarizes the tuning parameters of the designed TD3 with HER. To train TD3 with HER for the 2-DOF manipulator, 8100 episodes are used. Figure 4 describes the success ratio of the episodes with HER while the network is being trained. In other words, when the network is learning with arbitrary starting and goal points, some episodes end at the given goal point and some end before reaching it. In Figure 4, the thin lines denote the success ratio of every 10 episodes and the thick lines stand for their moving average. Figure 5 shows the reward as the number of episodes increases, i.e., as training proceeds; the reward converges as the number of episodes increases. In view of the results in Figures 4 and 5, we can see that learning completes successfully.
To test the trained TD3, it is verified whether optimal paths are generated when random starting and goal points are given to the trained network. For testing, only the actor DNN, without its target DNN, is used. The input to the trained actor DNN is (q t , q goal ), the output is (a t , q t+1 ), and this is repeated with q t ← q t+1 , as depicted in Figure 6. When this is applied to the 2-DOF manipulator, the resulting paths are shown in Figure 7. In Figure 7, the green areas represent obstacles in the configuration space, the rhombuses denote the starting points, and the circles the goal points. For comparison, PRM is also used to generate paths with the same starting and goal points. As shown in Figure 7, the proposed method generates smoother paths in general, which is confirmed by many other testing data as well. In Figure 7, the red lines are the paths generated by PRM and the blue lines those by the proposed method. On average, the path generated by the proposed method is 3.45% shorter than the path by PRM.
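The test-time rollout described above (actor DNN only, applied repeatedly) can be sketched as follows; `actor` and `step_fn` are placeholders for the trained deterministic policy and the RAMDP transition, and the step budget and tolerance are illustrative assumptions:

```python
import numpy as np

def rollout(actor, q_start, q_goal, step_fn, max_steps=200, tol=1e-2):
    """Path generation with the trained actor only (no critics, no
    exploration noise): feed (q_t, q_goal) to the policy, apply the
    action through the environment, and repeat with q_t <- q_{t+1}
    until the goal or the step budget is reached."""
    q = np.asarray(q_start, dtype=float)
    path = [q]
    for _ in range(max_steps):
        a = actor(q, q_goal)   # deterministic action from the policy
        q = step_fn(q, a)      # next joint configuration
        path.append(q)
        if np.linalg.norm(q - q_goal) < tol:
            break
    return path
```

The returned list of joint configurations is the planned path; because each step comes from a single forward pass of the actor, planning cost at test time is just repeated network evaluation.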
In order to test the proposed method on a real robot manipulator, the 3-DOF open manipulator (for details, see http://en.robotis.com/model/page.php?co_id=prd_openmanipulator) is considered. For easy understanding, Figure 8 shows the workspaces in Matlab and in Gazebo in ROS (Robot Operating System), respectively, and the configuration space of the open manipulator with four arbitrary obstacles.
The tuning parameters of the TD3 with HER are also shown in Table 2. For training, 140,000 episodes are used. In view of the converged reward and success ratio in Figures 9 and 10, we can see that learning again completes successfully. With this result, random starting and goal points are given to the trained network in order to obtain feasible paths between them. Figure 11 shows several examples of paths generated by the trained actor DNN for arbitrary starting and goal points. The red lines are the paths generated by PRM and the blue lines those by the proposed method. As seen in the figure, the proposed method results in smoother and shorter paths in general. For quantitative comparison, 100 arbitrary starting and goal points are used to generate paths with PRM and with the proposed method. Figure 12 shows the lengths of the resulting 100 paths; in light of Figure 12, it is clear that the proposed method generates shorter paths in general. Note that, on average, the path generated by the proposed method is 2.69% shorter than the path by PRM.
The proposed method is also implemented on the real open manipulator, and the experimental result agrees with the simulation. A video of the experiment is available at https://sites.google.com/site/cdslweb/publication/td2-demo.

Conclusions
For the purpose of enhancing efficiency in the manufacturing industry, it is important to improve the performance of robot path planning and task scheduling. This paper presents a reinforcement learning-based motion planning method for robot manipulators, with focus on generating smoother and shorter paths, which means better operational efficiency. To this end, the motion planning problem is reformulated as an MDP (Markov Decision Process), called RAMDP (Robot Arm MDP). Then, TD3 (twin delayed deep deterministic policy gradient, twin delayed DDPG) with HER (Hindsight Experience Replay) is designed. A DDPG-type algorithm is used since the action in RAMDP is continuous and DDPG is a policy gradient method tailored to MDPs with continuous actions. Moreover, since many failed episodes, i.e., episodes ending at non-goal states mainly due to collision, are generated in the RAMDP, HER is employed to enhance sample efficiency.
Future research topics include motion planning for multiple robot arms whose common workspace is non-empty. Configuration space augmentation might be a candidate solution; since the augmented configuration space becomes high-dimensional, it would be interesting to compare the performance of the proposed reinforcement learning-based approach with that of sampling-based approaches such as FMMs, PRM, and RRT. Moreover, reinforcement learning-based motion planning for dynamic environments is also a challenging problem.