Path Planning for Multi-Arm Manipulators Using Deep Reinforcement Learning: Soft Actor–Critic with Hindsight Experience Replay

Since path planning for multi-arm manipulators is a complicated high-dimensional problem, effective and fast path generation is not easy for the arbitrarily given start and goal locations of the end effector. Especially, when it comes to deep reinforcement learning-based path planning, high-dimensionality makes it difficult for existing reinforcement learning-based methods to have efficient exploration which is crucial for successful training. The recently proposed soft actor–critic (SAC) is well known to have good exploration ability due to the use of the entropy term in the objective function. Motivated by this, in this paper, a SAC-based path planning algorithm is proposed. The hindsight experience replay (HER) is also employed for sample efficiency and configuration space augmentation is used in order to deal with complicated configuration space of the multi-arms. To show the effectiveness of the proposed algorithm, both simulation and experiment results are given. By comparing with existing results, it is demonstrated that the proposed method outperforms the existing results.


Introduction
In the Industry 4.0 era, one of the important elements in the manufacturing industry for a smart factory is automation via collaboration of robot manipulators, and the manufacturing industry has been less affected by human workforce [1]. Especially, multi-arm manipulators have been drawing more attention due to its increasing application such as assembly line in a factory, space research, customer service, exploration, even rescue mission [2]. Hence, it is utmost important to improve the efficiency of the operation of the multi-arm manipulator.

Motivation
For manipulators in manufacturing industry, when a specific task (e.g., moving an object) is given, human experts find manually a collision-free path from the start and goal locations for the end-effector in order to perform the task and teach the path to the manipulator. Hence, if the start and goal locations of the end-effector vary owing to task change, this procedure has to be repeated. Moreover, in the case of multi-arm manipulators, such a procedure becomes much more difficult [3,4]. Hence, it is important to make such a procedure carried out automatically. In addition to this, for the purpose of making the whole manufacturing process efficient and optimal, the multi-arm manipulator has to learn the optimal (i.e., shortest) path rather than a feasible path when arbitrary start and goal locations are given [5,6].

Background and Related Work
A representative of path planning for multi-arm manipulators is the sampling-based algorithm which computes the path after building a graph using sampled points of the workspace [7]. The examples of sampling-based algorithms are fast marching method (FMM) [8], probabilistic road map (PRM) method [9], and rapid exploring random trees (RRT) method [10][11][12][13]. In PRM, the Dijkstra algorithm is applied to find the shortest paths from the constructed graph [14,15]. The path generated by the sampling-based method may not be the optimal path since the resulting path is heavily dependent on the sampling methods. The other approach to solve the path planning problem for multi-arm manipulators is to use the artificial potential function to create a motion equation that can guide the robot arm to the goal point [16,17]. By computing the gradient of the equation, the direction of the optimal path can be attained [18]. However, it can be trapped in the local minimum of the potential field and fail to find the right path [19]. This situation can be worse if the path planning problem is high-dimensional. Note that the path planning under consideration is complicated and high-dimensional by nature. Due to this reason, an effective path planning algorithm for multi-arm manipulators has to be developed. In the literature about path planning, there are already some deep learning-based approaches implemented for robot applications such as mobile manipulation [20,21], unmanned ship [22] and even for multi-mobile robot [23]. These imply that deep learning-based approach can be promising in path planning for single arm manipulator [24] and also for multi-arm manipulators [25]. Especially, the focus of this paper is placed on devising a deep reinforcement learning-based path planning algorithm for multi-arm manipulators [26].
There are mainly three difficulties in reinforcement learning-based path planning for the multi-arm manipulator. First, in order to design a path planning algorithm, although it is essential to deal with the configuration space properly, it is not easy to define the configuration space which consists of collision-free space and collision space for the multi-arm manipulator since there are multiple arms, which makes the problem nontrivial. Second, when reinforcement learning-based path planning is designed, efficient exploration is indispensable to find the shortest path. However, since path planning for the multi-arm manipulator is a high-dimensional problem, existing reinforcement learning-based path planning suffers from poor exploration performance, which makes the resulting path planning lead to a suboptimal solution (i.e., not the shortest path). Third, when path planning is designed using reinforcement learning, the agent can get the reward in a sparse manner. In other words, when arbitrary start and goal locations are given for the reinforcement learning-based path planning, there can be many cases where the agent fails to find the shortest path due to physical limitations or high dimensionality of the configuration space. In this case, the agent can not get the reward contributing to training.

Proposed Solution
In this paper, a recently reported policy gradient type reinforcement learning algorithm called SAC (Soft Actor-Critic)-based path planning is proposed to overcome the previously described difficulties [27]. To this end, the configuration spaces of each arm are augmented as if the whole multi-arm manipulator is viewed as a virtual single-arm manipulator whose configuration space has higher dimension. Since SAC is known to show superior exploration performance due to the entropy term in the objective function, this paper proposes how to apply SAC to designing of a path planning algorithm. In order to overcome the sparse reward problem, HER (Hindsight Experience Replay) [28] is employed to deal with generated episodes efficiently in training. Note that HER can convert the episodes without reward to episodes with reward by changing the target location.
The performance of the proposed SAC-based path planning is validated not only simulation but also experiment using two real open manipulators. The results show that the proposed method finds a shorter and smoother path for most scenarios due to enhanced exploration performance by SAC, and outperforms over the existing results such as PRM [29] and TD3 (Twin Delayed Deep Deterministic Policy Gradient)-based path planning [30].

Sampling-Based Path Planning for Robot Manipulator and Configuration Space
In sampling-based path planning for robot manipulators, the space of possible joint angles is referred to as configuration space Q (also called joint space) and is a subset of n-dimensional vector space R n where n is the number of the joints of the robot manipulator. A point in the configuration space Q corresponds to values of joint angles of a robot manipulator [6,31,32]. The configuration space is comprised of two subsets: Q collide and Q free . The robot manipulator does not collide with any obstacles or itself if its joint angles belong to Q free . If the values of the joint angles are in Q collide , it means that there is a collision between the robot manipulator and obstacles or itself.
For sampling-based path planning, random sampling for the continuous Q free is done in order to obtain its discrete representation. Then, the discrete representation is modeled as a connected graph. From this construction, the nodes in the graph imply admissible configurations of the robot manipulators, and feasible paths between any two configurations are represented by connected edges in the graph.
For the purpose of describing the path planning problem, letq t ∈ Q ∈ R n denote the values of the joint angles of the manipulator. Also, let T represent the maximum number of the iteration in any path planning algorithm. This means that the algorithm either finds the shortest path successfully within T iterations for a given starting configurationq init ∈ Q free and goal configurationq goal ∈ Q free , or decides that it is not possible to compute the shortest path for those configurations. With these definitions, whenq init ∈ Q free andq goal ∈ Q free are given, path planning is to compute a valid continuous and shortest path on the graph linkingq init andq goal . Since the sampled nodes of Q free are modeled as a connected graph, there always exists such a shortest path between any two nodes in Q free . Note that the path means the sequence of the stateq t such thatq 0 =q init , · · · ,q T 1 =q goal with T 1 ≤ T, and both the sequence {q 0 , · · · ,q T 1 } and the line segment connecting any two consecutive stateq t andq t+1 belonging to {q 0 , · · · ,q T 1 } have to belong to Q free .

Collision Detection in Workspace Using the Oriented Bounding Box (OBB)
Even though the path planning algorithm mostly works in the configuration space Q, the path generated by the path planning algorithm has to make sure that the collision does not occur when the robot moves in workspace W. The workspace W is a space where an actual robot works, and is the subset of 3-dimensional Euclidean R 3 . It is important to have a collision detection method to confirmq t ∈ Q free in the middle of running the path planning algorithm [6,33]. To detect the collision in the workspace W, the oriented bounding boxes (OBB) [34] are employed for modeling all the 3D objects and finding the boundary in the 3D environment of the workspace including robots, obstacles, and ground. Simply speaking, any objects in the workspace are represented as boxes surrounding them in order to check collision.
Since objects in the workspace are modeled in the form of the OBB, the collision checking between two boxes has 15 cases: 3 faces of the first box, 3 faces of the second box, and 9 edge combinations between the first and second boxes [35]. By checking these 15 cases made by two OBBs, the collision detection between two boxes can be tested. Based on such a collision check for two boxes, the higher level collision checking between one robot and one obstacle can be implemented by repeated application of collision checking for two OBBs [36]. For the multi-arm manipulator case, each arm is considered as a moving obstacle to the others. Hence, again, collision checking in the workspace of the multi-arm manipulator can be tested using collision checking of OBBs for obstacles and robots.

Reinforcement Learning
Reinforcement learning is a decision making procedure for Markov decision process (MDP) [37] which is defined by the tuple {S, A, P, r, γ} where S is the set of states, A the set of actions, P the transition probability, r(s, a) the reward function, and γ the discount factor [38]. The transition probability P(s |s, a) is the probability that the current state s ∈ S moves to the next state s when the action a ∈ A is applied to the environment at the current state s ∈ S. The policy of the agent denoted by π(a|s) is the distribution of action a for each state s ∈ S. At each time step t, the agent chooses action a t ∈ A based on the policy π : S → A. When the decided action is applied to the environment, the environment returns both the next state s t+1 ∈ S and reward r t+1 ∈ R from the transition probability P : S × A → P (A) and the reward function r : S × A → R. Conceptually, when the agent learns the optimal policy, the policy is determined such that it maximizes the expected return E π [∑ ∞ k=0 γ k r t+k+1 ]. This learning procedure is repeated until the expected return converges. However, since it is not possible to compute the expected return in practice, its estimated value, called optimal value function, is maximized. With this framework, there are two approaches: value-based and policy-based approaches. In the value-based approach, the optimal value function is estimated. For instance, deep Q-network (DQN) approximates the optimal value function using deep neural networks. When the optimal value function is estimated, the corresponding optimal policy is derived from the approximated value function. On the other hand, the policy-based approach (also called the policy gradient) computes the optimal policy directly from the agent's experience. Representatives of the policy gradient are REINFORCE, actor-critic method, deterministic policy gradient (DPG), deep DPG (DDPG), asynchronous advantage actor-critic (A3C), trust region policy optimization (TRPO), maximum a posteriori policy optimization (MPO) and distributed distributional DDPG (D4PG) [39][40][41][42][43]. In general, it is known that the policy gradient outperforms the value-based approaches, especially in the continuous action task [41,44].
In the reinforcement learning, the training performance depends heavily on sampled data consisting of the current state s t , action a t , next state s t+1 and reward r t+1 . To enhance sample efficiency, the agent saves the samples in the memory first and use the saved samples taken from the memory. Replay memory [45] and HER [28] are tailored methods for this. In this paper, a state-of-the-art policy gradient method, known as soft actor-critic (SAC) [27] with HER, is employed for the multi-arm manipulator path planning algorithm.

Path Planning for the Multi-Arm Manipulator and Augmented Configuration Space
For the path planning of a multi-arm manipulator, it is assumed that there are m identical robot manipulators and each robot has n joints. Let q i t ∈ R n denote the value of the joint angles of the ith manipulator at the tth iteration in the proposed deep reinforcement learning-based path planning algorithm, and q i t,j ∈ R represents the jth element of q i t ∈ R n , i.e., the value of the jth joint of the ith manipulator. Similarly to the path planning for a single robot manipulator, the path planning problem for the multi-arm manipulator is to find the shortest and collision-free path for a given starting configuration q 0 and goal configuration q goal where init and q i goal represent starting and goal configuration of the ith manipulator, respectively. In order to design a path planning algorithm for the multi-arm manipulator, it is difficult to define Q free and Q collide for each arm if each arm is considered independently since the multi-arm collaborate (i.e., move together) in the same workspace and each arm is an obstacle to the others in the workspace. To cope with this situation, one remedy is to view the multi-arm manipulator as one virtual single-arm manipulator whose number of the joint is nm. Then, the virtual manipulator's movement is described by augmenting the configuration of each arm. In other words, the state of the virtual manipulator is defined by . . .
Consequently, the resulting augmented configuration space is defined by Q a free denotes the collision-free space for q t and Q a collide is the corresponding collision space. Then, the path planning problem for the multi-arm manipulator can be redefined by the path planning problem for the one virtual single-arm manipulator with a given starting configuration q 0 ∈ Q a free and goal configuration q goal ∈ Q a free .

Multi-Arm Manipulator Markov Decision Process (MAMMDP)
For the sake of applying deep reinforcement learning to path planning algorithm design for multi-arm manipulators, a multi-arm manipulator MDP (MAMMDP), i.e., the tuple {S, A, P, r, γ}, is defined. The structure of MAMMDP can be seen in Figure 1. The current state of m manipulators q t ∈ Q a free ⊂ R nm is the n joint values of the m manipulators. The current action a t ∈ A is defined as a change of the state in the augmented configuration space Q a free . The action is generated by a t = f (e t , q t ) where e t follows the Gaussian distribution N (0, σ t ) with σ t being the variance, and f (e t , q t ) generates a stochastic action using noise e t and state q t . Actually, function f (·) is a sampler. In other words, the action is sampled from a probability distribution. Then, the next state is calculated as the sum of the current state and action such asq t+1 = q t + αa t + e where tuning parameter α is the geometric mean of the maximum variations of each joint at a time and e ∼ N (0, σ e ) and e is an environment noise. Whenq t+1 ∈ Q a collide occurs, the next state becomes the previous state, i.e., it stays at the current state. Such an iteration of computing, implementing the action, and obtaining the state and reward are repeated until the next state reaches the goal point q goal , namely q t+1 = q goal , or stops when the number of the iteration hits T where T is the predefined maximum number of the iteration. However, in practice, q t+1 = q goal rarely happens, at least, due to numerical problems. Hence, instead of that, a relaxed terminal condition q t+1 − q goal ≤ η · α is used where · denotes the norm of a vector and tuning parameter η meets 0 < η < 1. Note that the terminal condition means that the iteration stops when the next state reaches a small neighborhood of the goal configuration.
When the next state is not the goal state, the corresponding reward is −1 while the reward is 0 if the agent reaches the goal. Hence, at the end of the iteration, if the agent does not reach the goal, then the total reward is −T. On the other hand, if the iteration ends at a certain iteration T 1 < T, then the total reward becomes −(T 1 − 1). In summary, the state transition and reward function r(q t , a t ) are defined as follows.
The goal of reinforcement learning is to find the optimal policy maximizing the total reward that the agent gets in path planning iteration. If the iteration ends in a finite time T 1 < T, the agent can learn to find a path to reach the goal by maximizing the total reward since the graph is connected. Besides, by trying to reduce the ending T 1 during training, the agent can seek an as short path as possible. However, when the agent is not trained sufficiently, it is hard to find a goal since the action is determined mainly by only the exploration method (e.g., random action sampling). This means that there are many iterations which fail to reach the goal until the agent is well trained. In this case, there are only few state, action, and reward trajectories which can contribute to learning (i.e., higher total reward). MDP with such a problem is called MDP with sparse reward and the path planning problem for the robot arm manipulator exactly falls into MDP with sparse reward. To overcome this problem in this paper, a path planning algorithm is designed based primarily on both recently proposed soft actor-critic (SAC) [27] and hindsight experience replay (HER) [28]. SAC is famous for its outstanding exploration performance and HER enhances the sample efficiency in deep reinforcement learning for MDP with sparse reward. The details of the proposed algorithm are presented in the next subsection.

Soft Actor-Critic (SAC)
MAMMDP has a continuous action, i.e., the agent's action of MAMMDP is a continuous joint value and there are several policy gradient-based reinforcement learning methods for MDP with continuous action such as [27,41,46]. This section reviews briefly the main result of [27,47] and it is presented how to design a SAC-based path planning for MAMMDP. SAC is a maximum entropy reinforcement learning which maximizes not only the expected sum of rewards but also the entropy of the policy. Namely, SAC computes the optimal policy by maximizing the expected sum of entropy augmented reward defined by where π is the policy which is updated to find the maximum total reward. β is the temperature parameter and it regulates the entropy term against the reward. H(π(a t |q t )) denotes entropy. Usually, the distribution of the policy has small variance and its center is near a specific action that leads to the high return. On the other hand, the entropy in the objective of SAC makes the variance of the policy distribution increases. In the policy, the increased variance of the distribution means that the policy has more various actions that can be chosen. Accordingly, more explorations are carried out in this method. Because the multi-arm manipulator path planning is a multi-dimensional problem, enhancing the exploration plays a pivotal role in obtaining an efficient path planning algorithm. In SAC, the agent maximizes not only the expected return but also the entropy that leads to a better exploration. Because of this, the agent can acquire the optimal path of the multi-dimensional problem.

Remark 1.
As the policy is described by probability distribution in the case of a stochastic policy, the distribution of the policy should match that of Q-value. Besides, such a stochastic policy is heavily affected by the exploration. Because SAC is a maximum entropy reinforcement learning framework, it not only enhances the exploration but also finds multiple optimal policies, which is possible due to the fact that the distribution functions of the Q-value and the policy are getting match as the training is going on by SAC. In the path planning problem, in general, there can be multiple solutions for the optimal path.
For example, in Figure 2a, there are the initial point, goal point, and an obstacle in between them. In such a setting, it is obvious that there are two shortest paths between q init and q goal . It is difficult for a deterministic policy to capture these two solutions. In contrast, the problem with multiple optimal solutions can be tackled by a stochastic policy with entropy augmented reward. Then, the probability distribution of the policy represents the information of two optimal solutions like in Figure 2b. Because of this, the stochastic policy such as SAC can find all optimal paths of a problem with multiple optimal solutions. This is why the SAC-based path planning can be effective for the path planning for a multi-arm manipulator which is highly dimensional. With the augmented reward function, the soft value function and soft Q-value function (or soft action value function) are defined respectively as follows: These functions are evaluated at q t and a t , and tell about how much reward can be obtained in the future. Soft Q-value Q(q t , a t ) is used to train the agent's policy and the soft state value V(q t ) is needed to estimate the soft Q-value. In SAC, the policy, the soft state value, and the soft Q-value functions are approximated by deep neural networks (DNNs). Each of them is parameterized by φ, ψ, and θ, respectively. Besides, there exists another target network with parameterψ. The target network is used to improve learning performance for the soft Q-value and makes the learning more stable. In addition, the technique using the double Q-value functions is employed [27,46,48]. Consequently, there are five DNNs in SAC: the parameterized policy π φ (a t |q t ), the soft state value V ψ (q t ) with target Vψ(q t ) and the two soft Q-values Q θ 1,2 (q t , a t ). Finally, the action is sampled from f φ (e t , s t ), i.e., a t = f φ (e t , s t ).
To find the optimal policy, soft-Q value and soft state value, the stochastic gradient descent method is applied to their objective functions. The DNN for the soft state value is trained to minimize the mean squared error with estimated soft state value given by In this function, the minimum of the soft Q-value is taken over two Q-value functions parameterized by θ 1 and θ 2 . This helps to avoid overestimation that gives high ratings to inappropriate Q-values [46]. The soft Q-value is trained to predict how much return can be obtained from pair (q t , a t ) and this is also done by minimizing the Bellman equation given by Both θ 1 and θ 2 are trained with their Q-values. The minimum of these Q-values is also used for the policy network learning. The objective function of the policy network is the information projection between the distribution of the current policy and the distribution of the minimum Q-value as follows.
As a matter of fact, this objective function is a simplified Kullback-Leibler (KL) divergence between the policy and Q-value [27]. After minimizing this function with respect to parameter φ, the distribution of the policy becomes proportional to the distribution of Q-value. At each training step, after training the other parameters, the parameter of the target soft state valueψ is updated according toψ ← τψ + (1 − τ)ψ by a tuning parameter τ ∈ [0, 1]. Since SAC is based on off-policy actor-critic [27,44], we do not need to use the current policy for training the networks. Instead, the transition tuples (q t , a t , q t+1 , r(q t , a t )) are collected in every iteration and then, stored in experience replay memory D [45]. At the beginning of the training procedure, a bundle of these tuples is sampled and they are used to compute the expectation in the objective functions.

Hindsight Experience Replay (HER)
In the MAMMDP under consideration, the agent does not receive negative rewards only when the goal is reached. As mentioned before, since the MAMMDP is an MDP with sparse reward, the agent suffers from low sample efficiency. To improve the sample efficiency, the hindsight experience replay (HER) [28] method is used to deal with the measured samples. Suppose that the iteration in the algorithm ends without reaching the goal state, and let the corresponding trajectory becomes a failed episode e = [(q 0 , a 0 ), (q 1 , r 1 , a 1 , · · · , (q T , r T ))]. Since the agent does not arrive at the goal state in this episode, it follows that q T = q goal , which means that the agent does not learn much from this episode. To resolve this problem, HER defines a new goal state q goal , chooses one state in the failed episode, let say q t , t ∈ {1, 2, · · · , T}, and sets q goal = q t . Then, the modified episode e = [(q 0 , a 0 ), (q 1 , r 1 , a 1 , · · · , (q t , r t ))] becomes a successful episode since q t = q goal is achieved where r t ≡ 0. Then, this modified episode can improve the learning performance and this is how HER enhances the sample efficiency.
The proposed SAC-based path planning algorithm is depicted in Figures 3 and 4. Also, the details of the implementation of the proposed method can be found in Appendix A.

Results and Discussion
In this section, the proposed SAC-based path planning algorithm is applied to two real 3-DOF open manipulators. The parameters of the 3-DOF manipulator are given in Table 1 45) The parameters used in the SAC-based path planning are described in Table 2. For the experiment, the workspace in Figure 5 is considered. The size of the workspace is 90 cm × 60 cm × 47 cm. The bar in between two manipulators is an obstacle. Figure 5a represents the visualization of the workspace in Matlab and Figure 5b shows the actual workspace. The problem to be solved is to devise a path planning algorithm using the proposed method for each arm in this setting such that the shortest paths for each arm are computed when arbitrary start and goal positions for the end-effect of each arm are given. Note that the path planning for this problem setup is nontrivial since there is a common workspace in workspaces of each arm in addition to the obstacle. In the training process, 300,000 episodes are implemented to train the networks in the proposed SAC-based path planning algorithm. In each episode, arbitrary start and goal locations belonging to the workspace are given and the agent is trained to find the shortest path between them by the proposed algorithm. Figure 6 shows that success ratio of the 300,000 episodes. The success means that the path planning algorithm finds the shortest path successfully. In Figure 6, the green line denotes the moving average of success ratio of every 10 episodes where the success ratio is set to 1 when the shortest path is computed successfully, and 0 otherwise. The thick line is the moving average of the green lines. As seen in Figure 6, the proposed path planning algorithm finds the path without fail in most cases. In addition to the success ratio, the reward is also an important aspect to evaluate the training performance of any reinforcement learning-based algorithm. In the proposed algorithm, the reward value can be interpreted as the optimal time generating the path due to the definition of the reward Function (2). Figure 7 depicts the reward for the 300,000 episodes.
In Figure 7, the light blue line denotes the actual reward and the thick blue line describes the moving average of the light blue line. In view of Figure 7, it can be seen that the reward converges very quickly as the episode increases, which implies the algorithm finds the shortest path efficiently and consistently. Figures 6 and 7 demonstrate that the agent is trained well by the proposed SAC-based path planning algorithm such that it not only reaches the given goal mostly but also maximizes the return due to the quickly converged reward, which means that the training process is done successfully.
To see the advantage of the proposed method more, we implement other algorithms in the training process and plot the data in one graph for comparison. Figure 8 shows the training results made by three different path planning algorithms based on deep reinforcement learning. The light blue line is the actual reward from TD3 (Twin Delayed Deep Deterministic Policy Gradient) and the thick blue line is the moving average of the light blue line. Then, the orange line indicates SAC without entropy (β = 0) and the purple line is the original SAC with entropy. For SAC without entropy, the moving average is in the thick orange line and the moving average of SAC is in the thick purple line. Basically, because β is equal to zero in SAC without entropy, the objective function of SAC without entropy is exactly the same as the objective function of TD3. At the beginning of the training, we can see that SAC has a late transient response than TD3 and SAC without entropy. This is because SAC has an entropy factor in the objective function. Moreover, the TD3 and SAC without entropy have the same objective function that only maximizes the expected reward without the entropy. However, at the steady-state, SAC can reach a higher reward value than the other algorithms. This means that the trained actor from SAC can find shorter path than the other algorithms. This is because SAC aims to maximize not only the expected reward but also the entropy.  Once the actor is trained well, the actor-network can be used to compute the shortest path for a given arbitrary start and goal location in the workspace. As the output of the trained actor is only the mean of the stochastic policy, the output does not have any noise from the sampling distribution. To be specific, in testing step, (q init , q goal ) is injected into the trained actor in the beginning. Then, its output is the mean action (without noise) which is applied to the environment. The measurement, including the environment noise, is the next state which is injected into the actor-network together with q goal . Then, the actor-network generates the action again. This procedure is repeated until the goal state is reached. In other words, the trained actor generates the following path to the goal state.
(q init , q goal ) Input to the actor → a 0 Output of the actor Input to the actor → a 1 → q 2 · · · (q t , q goal ) → · · · → (q goal , q goal ).   Figure 10 shows an example of the start position q init and goal position q goal in the workspace where the green robot position is the initial position and the light yellow denotes the goal position. As explained before, if q init and q goal are given to the trained agent, the shortest path is computed.   Figure 11 shows the configuration space of the first arm and the right figure that of the second arm. Figure 11 shows several examples of resulting paths computed by the actor for given arbitrary start and goal locations in the configuration space. The two configuration spaces of each robot are used for visualizing the generated path. The lines in Figure 11 represent the joint movement of robots. In the configuration space, there are two kinds of collision space: the static collision space as a static obstacle and dynamic collision space as a moving obstacle (the other robot). In Figure 11, the static obstacle is represented by the brown area of the configuration space. The video clips for the simulation and experiment can be found at https://sites.google.com/site/cdslweb/publication/sacpath.
For comparison, the other method like PRM (35,000 sampled point graph) and TD3 with HER also generate a path with the same task. In Figure 11, the red lines denote the result by PRM method and the blue lines by TD3 with HER, and the green lines by the proposed method. The yellow points describe the goal position of each robot and the black points tell the start position of each robot. As shown in Figure 11, the proposed path planning using SAC with HER method leads to smoother and shorter paths than the other algorithms. To see the quality of the proposed method, 100 simulations of the arbitrary initial and goal points are implemented using the PRM, TD3 with HER, and SAC with HER simultaneously. Since the result paths can be interpreted as performance, the 100 generated paths are used for comparison in Figure 12. As seen in Figure 12, the proposed path planning using SAC with HER always finds the shortest path for all simulation cases. For the result shown in Figure 12, it takes 0.1385 s on average computing the shortest path for a start point and goal point. This means that the proposed SAC-based path planning can generate the optimal path quickly so that it works for the multi-arm manipulator in real-time such as industrial application. Note that the training is done off-line. For better comparison, the paths are calculated for 100 simulations using different algorithms and are drawn in Figure 13.  From Figure 13, we can see that SAC-or TD3-based path planning is better than PRM, and SAC-based algorithm is better than TD3-based algorithm. In Table 3, three path planning methods are compared. The average path cost is the average of the path length from testing simulation in Figure 12, and the cost percentage is another ratio of the path cost where PRM's average path cost is defined as 100%. As shown in Table 3, SAC leads to the lowest cost compared with others. The roughness is a metric representing how smooth the path is [49]. It is evaluated by the mean squared error with second derivative of the path. In other words, the roughness is defined by l roughness (q 1 , q 2 , · · · , q T ) = 1 T ∑ T t=1 ( d 2 q t dt 2 ) 2 . As shown in Figure 11 and Table 3, the generated path by SAC is smoothest among the others.
Such good performance of the proposed SAC-based algorithm is due to its better exploration performance coming from maximizing the entropy. In the maximum entropy framework, the agent explores the environment to maximize the return and the entropy. Because of this reason, the algorithm can find the optimal path in any task. However, in deterministic policy like TD3, the agent only maximizes the expected reward, which can lead to poor exploration performance.
In view of all simulation and experimental results, it is confirmed that the proposed method generates a shorter and smoother path than other algorithms. To be specific, notice that the optimal path of the proposed method is shorter by 9.66% than that of PRM on average. Compared with the TD3 with HER, on average, the paths by the proposed method are 7.6% shorter than those by TD3 with HER.

Conclusions and Future Work
This paper presents a deep reinforcement learning-based path planning algorithm for the multi-arm manipulator. In order to solve the high-dimensional path-planning problem, SAC (Soft Actor-Critic)-based algorithm is proposed. To deal with the multi-arm efficiently in configuration space, configuration spaces of each arm are augmented, and HER (Hindsight Experience Replay) is employed to enhance sample efficiency. Both the simulation and experiment show that the proposed algorithm finds the shortest path for arbitrary start and goal positions and the generated paths are shorter and smoother than those generated by existing results. Such results are made due to better exploration performance yielded by the entropy term in the objective function of SAC. Since the agent is trained off-line, the trained agent can generate the shortest path for an arbitrary start and goal positions quickly which means that the proposed algorithm can be used in real-time application.
This paper assumed that the work environment is static. Namely, the obstacles in the workspace do not change during the robot arm manipulator operation. Hence, a natural future work is the path planning for dynamic environment. For instance, if obstacles are moving in workspace, it is nontrivial to define Markov decision process in order to design a path planning algorithm using reinforcement learning.

Appendix A
In this section, the proposed algorithm is explained in terms of implementation. The proposed reinforcement learning-based path planning algorithm starts from goal-conditioned reinforcement learning. In this setting, the policy of the agent is defined like π(a t | q t ||q goal ) which is probability of the action a t where current state q t and goal state q goal are given. q t ||q goal is the concatenated vector of the joint values and is the input of the policy π(·, ·) where || denotes the concatenation. This setting is the same as that of the original HER (Hindsight Experience Replay) algorithm [28]. Also, the soft Q-value function and the soft state-value function depend not only on a state-action pair but also on a goal where the soft Q-value function is Q(q t ||q goal , a t ) and the soft state-value function is V(q t ||q goal ).
In the HER, the additional goal q goal ∈ {q 1 , q 2 , · · · , q T } is selected to generate a new sample q t ||q goal at the end of every episode. The additional goal is randomly chosen from {q 1 , q 2 , · · · , q T } ⊂ Q a free and q goal / ∈ Q a collide . When the agent is trained with this new sample, the sample efficiency of training process can be improved. Since the reward function of MAMMDP is based on a sparse reward setting and has no information on the direction or distance to the goal state, it is extremely difficult to train the agent without HER algorithm. This gets worse as the dimension of the path planning problem increases. By using this technique, we can increase the number of additional goals at each episode to speed up the training. However, when we compare the performance between TD3 and SAC in Figure 8, only one additional goal is used in every episode.
By utilizing reparameterized sampler f φ ( t , q t ||q goal ) and without using original stochastic policy π φ (a t | q t ||q goal ), SAC produces a continuous action as its output [27]. Then, for the sampling noise, t is given from the Gaussian distribution where t ∼ N (0, σ t ) and σ t is a variance. To be specific, the outputs of the policy network are mean µ φ and variance σ t . According to the objective Functions (6) and (8), we need to calculate E a t log π φ (a t | q t ||q goal ) where a t = f φ ( t , q t ||q goal ). This is nothing but log-likelihood with the probability of action π φ (a t | q t ||q goal ). Since a t is sampled from Gaussian distribution, this log-likelihood is given by where l is the number of samples. Variance σ t can be viewed as a design parameter or can be tunable via learning. In this paper, it is fixed as a constant. Figure A1 describes how the action is sampled. Soft Actor-Critic, unlike the original policy gradient algorithm, maximizes the expected sum of entropy augmented reward in the Equation (3). The entropy in the reward function can be evaluated by the Equation (A1) and this entropy augmented reward Function can be seen in the objective Functions (6)- (8). The order of optimizing objective Functions (6) and (7) is not important, but the optimization of objective Function (8) must be done after optimization of objective Functions (6) and (7). That is because the policy has to be updated towards the distribution of new soft Q-function. Moreover, these three optimizations are done by stochastic gradient descent with Adam optimizer and fixed learning rates. The gradients of the objective Functions (6)-(8) are defined as follows: Q θ k (q t ||q goal , a t )−β log π φ (a t | q t ||q goal )]) 2 ], (A2) ∇ θ k=1,2 J Q (θ k=1,2 ) = E q t ,a t [ 1 2 (∇ θ k=1,2 Q θ k=1,2 (q t ||q goal , a t ) − (r t+1 + Vψ(q t+1 ||q goal ))) 2 ], ∇ φ J π (φ) = E q t ,a t [β∇ φ log π φ (a t |q t ||q goal ) − ∇ φ min k=1,2 Q θ k (q t ||q goal , a t )] (A4) ≈ β∇ φ log π φ (a t |q t ||q goal ) + (β∇ a t logπ φ (a t |q t ||q goal ) (A5) − ∇ a t min k=1,2 Q θ k (q t ||q goal , a t ))∇ φ f φ ( t , q t ||q goal ).
Note that Equation (A5) is an approximated and decomposed form of Equation (A4) where a t = f φ ( t , q t ||q goal ). These three gradients are used to solve each of the minimization problems. The complete algorithm is described in Algorithm 1.

Algorithm 1
Proposed SAC-based path planning algorithm for multi-arm manipulator. 1: Define MAMMDP and the augmented state q t and the state and goal state q init and q goal 2: Initialize network parameters ψ, θ 1,2 , φ 3: Initialize the parameter values of the target networkψ ← ψ 4: Initialize global replay memory D 5: 6: for e = 1 to M do 7: Initialize local buffer L Memory for an episode 8: for t = 0 to T − 1 do 9: Randomly choose the goal and initial positions q goal , q init ∈ Q a free 10: a t = f φ ( t , q t ||q goal ), t ∼ N (0, σ t ) 11:q t+1 = q t + α · a t + e , e ∼ N (0, σ e ) 12: