Deep Reinforcement Learning-Based Path Planning for Multi-Arm Manipulators with Periodically Moving Obstacles

: In the workspace of robot manipulators in practice, it is common that there are both static and periodic moving obstacles. Existing results in the literature have been focusing mainly on the static obstacles. This paper is concerned with multi-arm manipulators with periodically moving obstacles. Due to the high-dimensional property and the moving obstacles, existing results suffer from ﬁnding the optimal path for given arbitrary starting and goal points. To solve the path planning problem, this paper presents a SAC-based (Soft actor–critic) path planning algorithm for multi-arm manipulators with periodically moving obstacles. In particular, the deep neural networks in the SAC are designed such that they utilize the position information of the moving obstacles over the past ﬁnite time horizon. In addition, the hindsight experience replay (HER) technique is employed to use the training data efﬁciently. In order to show the performance of the proposed SAC-based path planning, both simulation and experiment results using open manipulators are given.


Introduction
In the fourth industrial revolution, the operation of an autonomous multi-robot in a complicated workspace has been an important challenge for modern smart factories [1]. It is important to replace the human workforce with robots, to collaborate with robots, and to deploy robots in an efficient manner [2,3]. This paper presents a deep reinforcement learning-based path planning algorithm to deal with periodically moving obstacles.

Background and Motivation
In the autonomous robot framework, there are three fundamental concepts such as perception, planning, and control [4]. The perception aims to obtain information about the environment using various sensors. The planning schedules a sequence of valid configurations for the robot arms or wheels in order to perform a given task. When the path achieving the goal is given, the controller adjusts actuators so that, for example, the configurations of the robot manipulators follow the planned path as close as possible.
Although the robot manipulator operation is quite diverse in the workspace, most operations can simply be interpreted as moving the robot from a starting point to a goal point. Hence, it is utmost essential for the manipulator to move without collision with any obstacles in the workspace. In practice, there are typically two kinds of obstacles: static and periodically moving obstacles. In the case of static obstacles, their locations do not change in the middle of the robot operation, and most existing path planning methods focus on the static obstacles. For the robot operation with the static obstacles, it is a common situation in practice that human experts design the path for the robot manipulators in order to perform a given task. Recent research effort is directed to make this procedure in an automatic manner. In other words, the human experts' path planning is replaced with various automatic algorithms [5]. The other is the periodically moving obstacle. For the purpose of dealing with the moving obstacles, this paper considers the situation where every dynamic object in the workspace moves by plan. Consequentially, it is assumed in this paper that there are no unexpected obstacles, and that all dynamic obstacles are moving periodically. In the case of industrial robots in a real factory, this is the case. Moreover, since one robot manipulator is viewed as an obstacle to the other robots, and the entire operation on a factory is usually batch production, many obstacles can be regarded as periodically moving obstacles if they are not static obstacles. Since existing path planning algorithms deal with such a moving obstacle using any methods for static obstacles, the results are conservative. In other words, to deal with the moving obstacles, those algorithms view the whole area where the obstacle moves as one big artificial static obstacle, which is inevitably conservative.
In robot path planning research, the sampling-based algorithm is representative [6]. In order to compute the path connecting the initial and goal points, the sampling-based algorithm samples the nodes from the collision-free space and connects those nodes. Hence, the method relies on how to sample the nodes. In the case of multi-arm manipulators, the problem dimension increases quadratically as the number of arms increases, which makes it difficult for the planner to find an optimal path. Moreover, if the obstacle moves, it is not easy to use the sampling-based algorithms. Existing path planning algorithms, especially sample-based approaches, work well for a single robot manipulator. However, if the target manipulator is a multi-arm manipulator with moving obstacles, the path planning problem becomes difficult since the problem dimension is high and it is not trivial to sample the configuration space.
Hence, the focus of this paper is placed on devising a deep reinforcement learningbased path planning algorithm which can generate collision-free paths when the multi-arm manipulator has both static and periodically moving obstacles.

Related Work
Fast marching method (FMM) [7], probabilistic road map (PRM) method [8], and rapid exploring random trees (RRT) method [9][10][11] are representatives of sampling-based algorithms. In PRM, the shortest path is computed from the graph made by the sampled points using the Dijkstra algorithm [12,13]. Artificial potential field methods are also popularly used to design the path planning algorithm which leads the robot manipulator from the starting point to the goal point [14,15]. In the derivation, the gradient of the potential function plays a key role in computing the direction of the optimal path [16]. Since it is based on the gradient descent method, it can suffer from trapping in the local minimum, which makes it difficult to be applied to high-dimensional problems [17]. In addition, artificial intelligence like reinforcement learning is also widely used to design collisionavoiding path planning [18]. In [19], Real-Time RRT* (RT-RRT*) is proposed to solve the path planning problem with a dynamic obstacle in a two-dimensional environment.
This paper improves the result in [5,20] in such a way that the deep reinforcement learning-based path planning for multi-arm manipulators with both static and periodically moving obstacles is proposed using the SAC algorithm with hindsight experience replay (HER).

Proposed Method
This paper proposes a deep reinforcement learning-based path planning algorithm for multi-arm manipulators with periodically moving obstacles. In particular, in order to deal with the high-dimensional property of the multi-arm manipulators, the SAC-based path planning algorithm is devised. Since the SAC-based algorithm uses an entropy term in its objective function, it can find the optimal solution of the high-dimensional problem, which makes the SAC-based algorithm outperform the existing results. Moreover, for the purpose of handling periodically moving obstacles in the path planning, the neural networks in the SAC are designed such that the position information of the obstacles over the past finite time horizon is used as an input to the neural networks together with the joint information of the robot. Note that the SAC employs five neural networks in order to estimate the value function and optimal policy. Hence, it is crucial to design the neural networks appropriately depending on the given problem. After the proposed SAC-based path planning is trained offline for various starting and goal points in the workspace, it can find the optimal path for arbitrary starting and goal points online by computing the forward path of the trained actor-network, which means that the proposed method is computationally cheap even for the high-dimensional problems. Both simulation and experiment results using real robot manipulators show the efficiency of the proposed method.

Path Planning for Robot Manipulator and Configuration Space
In the path planning problem for a robot manipulator, the configuration is defined as a position representation of the robot in the workspace. In other words, the configuration space Q (also called joint space in a robot manipulator) is a set of all possible joint angles of the robot manipulator [3,21,22]. To be specific, Q is defined as a subset of n-dimensional vector space R n , where n is the number of the joints of the robot manipulator. Because of this reason, the position of the robot can be represented as pointq ∈ R n that has n angle values (q 1 , ...,q n ), where each value indicates the joint angle of the robot. The set Q consists of two subsets: the first subset is the collision-free space Q free that is comprised of all possible configurations, and in which the robot does not collide with any obstacles or itself. The second subset is called the collision space Q collide which is the complement of Q free in Q and the subset of Q. In Q collide , the robot arm manipulator collides with obstacles or itself [21,23]. The robot manipulator does not collide with any obstacles or itself if its joint angles belong to Q free . However, if there is a collision between the robot manipulator and obstacles or itself, the values of the joint angles must be in Q collide .
In the sampling-based method, the discrete representation of Q free is obtained by random sampling in Q free . The valid path is defined by a connected line between the sampling points. Such sampled points and paths can establish a graph. Because of this construction, the nodes in the graph describe possible configurations of the robot manipulators. In addition, the collision-free paths between any two nodes are represented by connected edges in the graph. For the notation of the path planning problem, letq t ∈ Q ∈ R n mean the values of the joint angles of the manipulator at the tth iteration, and T describe the maximum number of iterations in the algorithm. Thus, the algorithm needs to find the path within T iterations for a given starting configurationq init ∈ Q free and goal configurationq goal ∈ Q free .
Based on this concept, whenq init ∈ Q free andq goal ∈ Q free are acquired, the algorithm computes a valid continuous and shortest path linkingq init andq goal . The path is defined as a sequence of the stateq t such thatq 0 =q init , · · · ,q T 1 =q goal where T 1 ≤ T. In addition, the sequence {q 0 , · · · ,q T 1 } and the line segment connecting any two neighbors stateq t and q t+1 from {q 0 , · · · ,q T 1 } have to belong to Q free .

Collision Detection in Workspace Using the Oriented Bounding Box (OBB)
The definition of workspace W is a space where an actual robot operates. In many cases including the robot manipulator, W space is in three-dimensional Euclidean space R 3 . To confirm thatq t ∈ Q free , collision detection methods have to be applied in workspace W [21,23]. To this end, the oriented bounding boxes (OBB) [24] is employed for collision detection. In OBB, both the robot arm and obstacles are modeled as a box. Because the modeling is implemented in OBB, the collision checking between 2 boxes only has 15 cases: three faces of the first box, three faces of the second box, and nine edge combinations between the first box and second box [25]. Based on two box checking, the higher level checking between one robot and one obstacle can be implemented as a repeated two OBB checking [26].

Reinforcement Learning
As a standard solution to the MDP problem, the fundamental method in the reinforcement learning algorithm uses an optimization-based method to maximize the sum of the reward signal by letting the agent interact with a stochastic environment and choose the action over sequences of discrete-time steps [27]. The MDP is defined by the tuple {S, A, P, r, γ}, where S denotes the set of the state, A the set of action, P the transition probability, r(s, a) the reward function, and γ the discount factor [28]. The transition probability P(s |s, a) represents the probability of moving the current state s to the next state s when action a ∈ A is applied. Regarding the agent, the distribution of action a for the given state s from the environment is implied by the policy that is denoted by π(a|s). At each time step t, the agent selects action a t ∈ A based on the policy π : S → A and state s t ∈ S, and applies it to the environment. After that, the environment returns both the next state s t+1 ∈ S and reward r t+1 ∈ R according to the transition probability P : S × A → P (S), where P (S) means that the probability of the next state and the reward function r : S × A → R. By repeating this procedure, the learning process updates the policy until it reaches the optimal policy, defined by π * , which maximizes the expected return E π [∑ ∞ k=0 γ k r t+k+1 ]. In order to acquire the optimal policy, the optimal value function (i.e., estimate of the maximum return) is approximated by value-based methods like deep Q-network (DQN) as function approximation [29]. The other way to get the optimal policy is using the policy gradient methods to compute the optimal policy directly from the agent experience. Representative methods of the policy gradient are REINFORCE, actor-critic method, deterministic policy gradient (DPG), deep DPG (DDPG), asynchronous advantage actor-critic (A3C), trust region policy optimization (TRPO), maximum a posteriori policy optimization (MPO), and distributed distributional DDPG (D4PG) [30][31][32][33][34]. In the reinforcement learning, the training performance mostly depends on the set of samples of state s t , action a t , reward r t+1 , and next state s t+1 . In addition to the various reinforcement learning algorithms, efforts are directed at devising techniques to utilize the samples (called episode) effectively for better learning, for example replay memory [29] and HER [35]. In this paper, for the purpose of tackling the path planning problem for the multi-arm manipulators with periodically moving obstacles, a soft actor-critic (SAC) [36] algorithm is employed.

Soft actor-critic Based Path Planning for Periodically Moving Obstacles
In this section, in order to design a SAC-based path planning, the MDP for the problem is defined, and the proposed deep learning network used in the SAC is presented. Although the considered MDP is quite similar to that in [20], since both deal with multi-arm manipulators, the MDP is explained here in a self-contained manner.

Path Planning for the Multi-Arm Manipulator and Augmented Configuration Space
The multi-arm manipulator under consideration consists of m identical robot-arm manipulators that have n number of joints. Let q i t ∈ R n determine the joint angles of the ith robot-arm at the tth iteration in the proposed SAC-based method, and q i t,j ∈ R denotes the jth component of q i t ∈ R n , i.e., the value of the jth joint angle of the ith robot-arm. In the path planning algorithm for the multi-arm manipulator, if the configuration space is defined for each robot, it is difficult to define Q free and Q collide during the robot operation since Q free for each arm is affected by the other arms, i.e., one robot serves as an obstacle to the others in the workspace. Taking this into account, it is convenient to define one big configuration space by augmenting the configuration space of each manipulator. This corresponds to regarding the multi-arm manipulator as one virtual manipulator.
The dimension of the augmented configuration space is nm. Hence, the state of the virtual manipulator is represented by Therefore, the resulting augmented configuration space is described by Q a free defines the collision-free space for q t and Q a collide denotes the corresponding collision space. Finally, the path planning problem of the multi-arm manipulator can be redefined as the path planning for the (virtual) single-arm manipulator with a given starting point q 0 ∈ Q a free and goal point q goal ∈ Q a free being where q i init and q i goal denote the starting and goal point of the ith robot-arm.

Multi-Arm Manipulator Markov Decision Process Considering Periodically Moving Obstacle
In order to apply deep reinforcement learning to the path planning algorithm for the multi-arm manipulator considering a periodically moving obstacle, a multi-arm manipulator MDP (MAMMDP mo ), i.e., the tuple {S, A, P, r, γ}, needs to be defined. For a better understanding, the diagram of MAMMDP mo is depicted in Figure 1. In principle, MAMMDP mo is similar to MAMMDP in [20]. The only difference is that the state in MAMMDP mo is defined such that the information on the moving obstacle is related to the state. For this purpose, it has to be determined which information on the obstacle is used in defining MAMMDP mo . In this paper, the location of the obstacle in the workspace for the past n d sampling times is used where n d > 0 denotes the window size. Namely, the following information on the obstacle is used: where d t represents the location of the obstacle in the workspace at the sampling time t. This n d can be viewed as the number of the features obtained from the obstacle. In addition to this time window, it is also assumed that D t is available only when the obstacle belongs to a predefined region such as a neighborhood of the multi-arm manipulator. When the manipulator does not belong to the region, D t is not available to the MAMMDP mo .
With this in mind, the current state s t ∈ R nm is defined as s t = (q t D t ). In other words, the state contains the information of not only the current configuration of the multi-arm manipulator but also D t (i.e., obstacle location over the past n d sampling times). The action is calculated by a t = f (e t , s t ), where e t follows the Gaussian distribution N (0, σ t ) with σ t being the variance, and f (·, ·) generates a stochastic action using noise e t and state s t . As a matter of fact, function f (·) serves as a sampler. Thus, the action is sampled from a probability distribution and variation of the configuration. Then, the next configuration is calculated as the sum of the current configuration and the action such aŝ q t+1 = q t + αa t + e , where tuning parameter α is defined as the geometric mean of the maximum variations of each joint and e is an environmental noise, where e ∼ N (0, σ e ). Whenq t+1 ∈ Q a collide happens, the next configuration becomes the previous configuration meaning that it stays at the current configuration. The following summarizes how the configuration is updated: Using this, the next state is denoted as s t+1 = (q t+1 D t+1 ), where D t+1 is the updated information of the obstacle position. Applying the action and obtaining the state and reward are repeated until the next configuration reaches the goal point q goal . Owing to the practical limitation like numerical errors, instead of q t+1 = q goal , the relaxed condition q t+1 − q goal ≤ η · α is employed for the termination in an implementation where · defines the norm of a vector and tuning parameter η satisfies 0 < η < 1.
If the next configuration is not the goal point, the corresponding reward is −1. However, when the agent meets the terminal condition, the corresponding reward becomes 0. If the agent cannot reach the goal until the maximum number of iteration, the total reward is given by −T. On the other case, if the iteration ends at a certain iteration T 1 < T, the total reward amounts to −(T 1 − 1). This reward function r(s t , a t ) is summarized as follows: In view of the goal of reinforcement learning, the objective of the proposed algorithm is to find the optimal policy that maximizes the total reward. If the iteration can finish before reaching T, the sum of reward is more than −T such as −(T 1 − 1) > −T. By trying to maximize the total reward in training process, the agent finds the shortest path.
In the next subsection, it is presented how to update the policy in the middle of training using deep neural networks.

Soft actor-critic (SAC) Based Path Planning Considering Periodically Moving Obstacles
Since our MAMMDP mo has a continuous action value, there are several reinforcement learning methods for continuous action MDP such as those in [32,36,37]. Soft actor-critic (SAC), especially, is the best known for its performance for the MDP with the continuous action and high-dimensional state. SAC is based on maximum entropy reinforcement learning that maximizes not only the expected sum of rewards but also the entropy of the policy of action [36,38].
When the policy π is updated by maximizing only the total reward, the center of the distribution of the policy is formed around a specific action. However, by maximizing policy entropy together with the reward, the SAC method can make the distribution spread more and encourage the agent to explore a wider range of the actions. The soft state-value function and soft Q-value function are defined as follows: where β denotes the temperature parameter of the entropy function. The soft state value function V(s t ) defined in (4) is the expected sum of the entropy augmented reward for the given current state s t . In addition, the soft Q-value function or soft action-value function Q(s t , a t ) are defined in (5) and is the expected sum of augmented reward for a given pair of state s t and action a t . For the purpose of estimating the soft value and Q-value functions, and finding the optimal policy, the SAC-based algorithm consists of five neural networks: two for the soft value function, two for the soft Q-functions, and one for the policy. The proposed SAC-based path planning utilizes the structure described in Figure 2. The structure is quite similar to that in [20] except for the information of the moving obstacle. As the proposed path planning has to take the periodically moving obstacle into account, it has to be decided properly how to use the obstacle information D t for the path planning. Since information D t has to be considered when the policy is trained at a given state in order for the manipulator to enable to avoid the obstacle, information D t is also used as the input to the network for the policy. In addition, since the policy network is affected by the value functions, as a result, all the neural networks for the proposed SAC-based path planning have to have information D t as an input.  Figure 3 illustrates how the deep neural network (DNN) estimates the state value function. The DNNs N1 and N2 are parameterized by ψ andψ. The network parameterized byψ is the target network which makes the training stable and enhances the learning performance of DNNs N3 and N4 for estimating the Q-value function. The inputs to DNN N1 are the configuration q t , goal state q goal , and D t . In order to train DNN N1, the following objective function and its gradient are used: where the minimum of the soft Q-function is determined between Q θ 1 and Q θ 2 , which are the output of the DNN N3 and N4. It is reported that the use of such a minimum makes it possible to avoid overestimation of Q-value [37]. The parameterized policy π φ comes from DNN N5. To find the optimal parameter ψ of DNN N1, the network N1 for the soft state value is trained by minimizing J V (ψ) using ∇J V . After updating the parameter ψ using the gradient descent, the target parameterψ is updated according toψ ← τψ + (1 − τ)ψ at each training step where τ ∈ [0, 1]. In order to estimate the action value function, two DNNs are again used as depicted in Figure 4. The input to N3 and N4 are the same as those of N1 and N2. The network parameters θ 1 and θ 2 are trained by minimizing the following objective function: where Vψ comes from DNN N2, and the double Q-value method is applied as it is suggested in [36,37,39]. The minimum Q-function is also used in the DNN for the objective function (8).
where Q θ k is provided by N3 and N4. In fact, this function is a kind of the Kullback-Leibler (KL) divergence between the policy and Q-value [36]. The output of DNN N5 is f φ (e t ; s t ), and the action is sampled from f φ (e t ; s t ) such that a t = f φ (e t ; s t ).
Since the SAC is a type of the off-policy actor-critic method [36,40], the training gathers transition tuples (s t , a t , s t+1 , r(s t , a t )) in every action step and stores them in experience replay memory D [29]. At the start of the training, a collection of these tuples is sampled from the reply memory and used to calculate the expectation in the objective functions. In addition to the SAC, Hindsight Experience Replay (HER) is used to enhance the sample efficiency [20,35]. In summary, the proposed SAC-based path planning algorithm is presented in the form of the pseudo-code in Algorithm 1.

Case Study: Two 3DOF Manipulators with a Periodically Moving Obstacle on a Line
This section presents how to implement the proposed SAC-based path training for the multi-arm manipulator with a periodically moving obstacle, and shows the result using not only simulation but also experiment.

Simulation
For the implementation of the proposed SAC-based path planning algorithm, two 3DOF open-manipulators are used. The detailed information about the manipulator can be seen at http://en.robotis.com/model/page.php?co_id=prd_openmanipulator (accessed on 14 March 2021). The entire workspace is 70 cm × 120cm × 100 cm. For the moving obstacle, a hexagon is assumed to move along a line segment back and forth periodically in the workspace, the length of the line segment is 40 cm, and the average speed of the moving obstacle is 9 cm/sec. In the simulation, it is always possible for the agent to know the location information of the obstacle. For the training, n d is set to 5 meaning that the location of the obstacle over the past five sampling times are given to the networks. Moreover, a square prism is located in the workspace as a static obstacle. The two manipulators are located such that the intersection of the workspace of both is not empty. The parameters of the robot manipulators and environment are summarized in Table 1.

Value Notation Unit
The number of manipulator joints The hyper-parameters of the SAC-based path planning are presented in Table 2.  Figure 6a, obstacle 1 moves back and forth along the line segment defined by D ob = {(x, y, z)|x = 150 mm, −25 mm ≤ y ≤ 375 mm, and z = 300 mm}. The obstacle 2 depicts the static obstacle. It is assumed that the manipulator 1 can detect the obstacle only when it is in {x = 150 mm, 0 mm ≤ y ≤ 175 mm, and z = 300 mm}, and the manipulator 2 can do only when it is in {x = 150 mm, 175 mm ≤ y ≤ 350 mm, and z = 300 mm}. Namely, the manipulators can detect the moving obstacle only when it is in a predefined area. In this simulation, the location of the moving obstacle is directly extracted as the input of the network. Due to that, the network can detect the obstacle location based on the input value of the network. Figure 6b describes how the PRM deals with the moving obstacle. As seen in Figure 6b, the PRM views entire D ob as a static obstacle to deal with the moving obstacle. In the learning process, the five neural networks in the proposed SAC-based path planning are trained by 70,000 episodes using GPU Nvidia GeForce RTX 2080Ti and CPU Intel i7-9700F (https://www.nvidia.com, https://www.intel.com (accessed on 14 March 2021). For the training, it took 36 h and the proposed algorithm was implemented using Tensorflow 2.1. In each episode, the agent explores and learns the environment based on the previous experience that was stored in the memory. To visualize the success of the agent, the success ratio from every 10 episodes is calculated and recorded. Figure 7 depicts the success ratio from 70,000 episodes. In Figure 7, the green chart describes the actual success ratio. However, for better visualization, the moving average of the success ratio is calculated and shown as the dark green line. At the end of the training, the success ratio reaches 81.76 %. Not only the success ratio but also the reward value is monitored during the training. Because the agent receives -1 for the reward at every step if the next configuration is not the goal, the total reward value can be interpreted as the required total time step for the agent to complete the task. The total reward of every episode is shown in Figure 8. The light blue line tells the total reward in each episode and the dark blue line describes the moving average of the total reward. From the repeated learning, similar convergent results on the success ratio and total reward are obtained. In view of Figures 7 and 8, it can be concluded that the agent is well trained, which means the agent can find the optimal path for arbitrary starting and goal configurations. In view of the learning results, the success rate of the training is not satisfactory if n d is set to a value less than 5. On the other hand, when it is higher than 5, the success rates of most learning reach a similar level to that with n d =5. This is why n d is set to 5. To support this, Figure 9 presents the learning performance depending on various values of n d . When the training is over, the actor-network is used to generate the optimal path for an arbitrary starting and goal configurations. Figure 10 shows an example of initial and goal configurations.   To be specific, (q goal , q 0 , D 0 ) is given to the trained actor-network in the beginning and the actor-network calculates the variations of each joint denoted by f φ (s t ||q goal ) in Figure 11. Then, each joint changes its angle by the variations and the resulting new configuration q 1 is generated. This is one iteration. In the second iteration, (q goal , q 1 , D 1 ) is injected into the trained actor-network with the updated obstacle information D 1 , and the goal configuration is reached by repeating the same procedure.
One testing result is shown in Figure 12. In Figure 12, the areas in dark green denote Q collide . In this result, it is important to note that the path generated by the proposed SCA-based planning can penetrate D ob without collision if necessary. On the other hand, the paths by other methods go around D ob . This shows that the proposed method generates indeed an efficient path when there is a moving obstacle. For comparison, the paths computed by PRM and TD3 are also presented. Hence, there are three generated paths in Figure 12. The red lines are the resulted paths by the PRM method (10,000 sampled points graph), the blue lines by TD3 with HER, and the green lines by the proposed method path. As seen in Figure 12, the proposed SAC-based method generates the smoothest and shortest path.
To analyze quantitatively, 100 simulations with the arbitrary initial point and goal point are carried out using PRM, TD3 with HER, and the proposed method simultaneously. Then, the lengths of all 100 generated paths are given in Figure 13. As seen in Figure 13, the proposed SAC-based path planning generates the shortest path. Since, inevitably, PRM and TD3 with HER deal with the moving obstacle in a conservative manner, the result is quite natural actually. In Table 3, the optimal path by the proposed method is shorter by 14.86% than the path by PRM. If the path by the SAC with HER is compared with the path from TD3 with HER, the path by the proposed method is 17.40% better than that by TD3.

Experiment
In order to see if the proposed SAC-based path planning is implementable in real manipulators, it is tested using two open manipulators. Figure 14 shows the real experimental setup. In real implementation, a camera is used to detect the location of the moving obstacle and installed on the top of the workspace. The location of the moving obstacle is sampled every 0.35 s using the camera. The simulation and experimental results are displayed on the webpage https://sites.google.com/site/cdslweb/publication/dynamicsacpath (accessed on14 March 2021). In view of the experimental results, it is confirmed that the proposed SAC-based path planning works in real multi-arm manipulators effectively.

Conclusions
This paper presents a deep reinforcement learning-based path planning algorithm for multi-arm manipulators with both static and periodically moving obstacles. To this end, the recently developed soft actor-critic (SAC) is employed for the deep reinforcement learning since the SAC can compute the optimal solution for the high-dimensional problem and the path planning problem for the multi-arm manipulators is essentially high dimensional. To be specific, the Markov decision process is properly defined for the path planning problem, and the input and output structure of the neural networks in the SAC is designed such that the information of the moving obstacles plays an important role in the training of the neural networks. In both the simulation and experimental results, it is confirmed that the proposed SAC-based path planning generates shorter and smoother paths compared with existing results when there is a periodically moving obstacle.
The role of the neural networks in the SAC is to estimate the future location of the moving obstacles. If the obstacle moves periodically, it might be possible to estimate the future location better using other deep learning algorithms tailored to estimation of the periodic signals like Recurrent Neural Networks (RNN). Hence, possible future research includes how to improve the estimation of the moving obstacle using deep learning algorithms for better path planning.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: 1: Define MAMMDP and the current state s t = (q t D t ) and the initial and goal point q init and q goal 2: Initialize network parameters ψ, θ 1,2 , φ 3: Initialize the parameter values of the target networkψ ← ψ 4: Initialize global replay memory D 5: for e = 1 to M do 6: Initialize local buffer L Memory for an episode 7: for t = 0 to T − 1 do 8: Randomly choose obstacle position D t and the goal and initial positions q goal , q init ∈ Q a free 9: a t = f φ ( t , s t ||q goal ), s t = (q t D t ), t ∼ N (0, σ t ) 10:q t+1 = q t + α · a t + e , e ∼ N (0, σ e ) 11: ifq t+1 ∈ Q a free then Get next state and reward 12: q t+1 ←q t+1