Efficient Path Planning for Mobile Robot Based on Deep Deterministic Policy Gradient

When a traditional Deep Deterministic Policy Gradient (DDPG) algorithm is used in mobile robot path planning, due to the limited observable environment of mobile robots, the training efficiency of the path planning model is low, and the convergence speed is slow. In this paper, Long Short-Term Memory (LSTM) is introduced into the DDPG network, the former and current states of the mobile robot are combined to determine the actions of the robot, and a Batch Norm layer is added after each layer of the Actor network. At the same time, the reward function is optimized to guide the mobile robot to move faster towards the target point. In order to improve the learning efficiency, different normalization methods are used to normalize the distance and angle between the mobile robot and the target point, which are used as the input of the DDPG network model. When the model outputs the next action of the mobile robot, mixed noise composed of Gaussian noise and Ornstein–Uhlenbeck (OU) noise is added. Finally, the simulation environment built by a ROS system and a Gazebo platform is used for experiments. The results show that the proposed algorithm can accelerate the convergence speed of DDPG, improve the generalization ability of the path planning model and improve the efficiency and success rate of mobile robot path planning.


Introduction
With the rise and continuous development of robot technology, mobile robots are becoming more and more widely used and are playing an important role in more and more fields. Mobile robots can perform tasks in various scenarios, such as package pick-up and delivery in warehouses and guiding patients in hospitals [1][2][3]. As one of the key technologies of mobile robot applications, path planning has become an indispensable part of mobile robots moving towards artificial intelligence. Its purpose is to find an optimal or suboptimal collision-free path from the starting point of the mobile robot's movement to the target point in the application scenario so as to ensure the rapid and safe movement of the robot and improve the work efficiency. Ideal path planning can greatly save the movement time of mobile robots on the road, help mobile robots complete tasks efficiently and accurately, and provide favorable guarantees for the applications of mobile robots in various industries.
Algorithm design is the core of mobile robot path planning, according to the different application environment and the degree of intelligence, and the current path planning algorithms can be divided into traditional methods and intelligent methods [4][5][6]. The traditional methods mainly include the Dijkstra algorithm, A* algorithm, Artificial Potential Field and Genetic algorithm and so on [7]. The Dijkstra algorithm is a path planning algorithm that was proposed very early. The algorithm takes the starting point as the center origin and spreads layers outward until the shortest paths to all nodes are found [8]. However, the Dijkstra algorithm uses the path length as the weight factor to search for good reward function. However, the training effect is not very ideal. Peng Li et al. [33] proposed a new DDPG algorithm, which used Rectified Adam (RAdam) to replace the neural network optimizer in DDPG and combined this with curiosity algorithm to improve the training effect, but the convergence speed was not ideal.
Compared with traditional path planning methods, the path planning methods based on deep reinforcement learning do not need to build the whole environment model and can realize self-learning from the environment state to action mapping, which has high flexibility. Through the continuous interaction between the mobile robot and the environment, the deep reinforcement learning uses the corresponding action strategy to determine the next action of the mobile robot according to the state of the robot and combines the reward function to continuously optimize the action strategy. As one of the typical algorithms of deep reinforcement learning applied to path planning, DDPG can train the model in self-constructed simulation environment and be directly applied to the actual environment, with strong generalization ability. However, due to the adoption of deterministic policy, when the robot is in the same state, the actions given are also the same. This will lead to a single action of mobile robots that cannot fully explore the environment and may not be able to reach the target point with the optimal path. Especially in complex environments, due to the limited observable range of mobile robots and the lack of previous "memory" of DDPG, it is unable to collect enough state information to train the algorithm, resulting in low efficiency, slow convergence speed and low success rate in the training of the algorithm.
This paper fully analyzes the advantages and disadvantages of DDPG. Based on reference [31], the efficiency and success rate of DDPG algorithm in path planning are further improved by introducing LSTM, optimal design reward function and mixed noise. The main contributions are as follows: (1) The "memory" ability of LSTM is utilized to optimize the DDPG network structure, and Batch Norm layer is added after each layer of the Actor network to improve network stability and speed up algorithm convergence.
(2) By optimizing the reward function, the mobile robot is guided to move faster towards the target point. (3) Different normalization methods are used to normalize the distance and angle between mobile robot and target point to improve the learning efficiency of the path planning model. (4) A mixed noise composed of Gaussian noise and Ornstein-Uhlenbeck (OU) noise is designed to make the learning process of mobile robot have higher randomness, avoid falling into local optimum, and improve the exploration efficiency of mobile robot.
The structure of this paper is as follows, Section 2 describes and analyzes the correlation algorithms. Section 3 introduces the core method of this paper in detail, namely the efficient path planning algorithm for mobile robots based on DDPG, including the improvement of DDPG network structure, the optimization of reward function, the preprocessing of state information and the design of mixed noise. Section 4 is the simulation experiment and detailed comparison and analysis of the experimental results. Section 5 is a further summary of this paper.

Deep Deterministic Policy Gradient (DDPG) Algorithm
The DDPG algorithm is based on the Actor-Critic architecture and draws on the experience replay mechanism and target network idea of DQN to solve the continuous action problem. Its network consists of the current network and target network of the Actor, and the current network and target network of the Critic. The role of the experience replay mechanism is to collect samples and sample them randomly in batches from the experience pool at each training session to reduce the correlation between samples [34]. The target network will fix the parameters in the network within a certain period of time, so as to eliminate the model oscillation caused by the same parameters between the current network and the target network [35]. The DDPG algorithm has strong fitting ability and generalization ability of deep neural network, as well as the advantage of continuous action space. Additionally, it can learn the optimal action strategy in the current state through continuous training and adjustment of neural network parameters. The method is applied to the path planning process of the mobile robot, so that the mobile robot has more continuous action output and less decision error in the process of motion. In the process of path planning, the mobile robot obtains the state S according to the surrounding environment information and its own state data, and the current network of Actor outputs the action a of the mobile robot according to S. After the mobile robot performs an action, it will obtain the reward r from the environment. According to S and a, the current network of Critic outputs the Q value as the evaluation of the action, and constantly adjusts its value function. The current network of Actor continuously improves the action strategy according to the Q value. The target network of Actor and Critic is mainly used for the subsequent update process. The structure of the DDPG algorithm is shown in Figure 1. network and the target network [35]. The DDPG algorithm has strong fitting ability and generalization ability of deep neural network, as well as the advantage of continuous action space. Additionally, it can learn the optimal action strategy in the current state through continuous training and adjustment of neural network parameters. The method is applied to the path planning process of the mobile robot, so that the mobile robot has more continuous action output and less decision error in the process of motion. In the process of path planning, the mobile robot obtains the state S according to the surrounding environment information and its own state data, and the current network of Actor outputs the action a of the mobile robot according to S . After the mobile robot performs an action, it will obtain the reward r from the environment. According to S and a , the current network of Critic outputs the Q value as the evaluation of the action, and constantly adjusts its value function. The current network of Actor continuously improves the action strategy according to the Q value. The target network of Actor and Critic is mainly used for the subsequent update process. The structure of the DDPG algorithm is shown in Figure 1.
is the target value, γ is the discount coefficient, and i is the sample number of the sample.
5. Update the current network parameters of the Actor according to Formula (2), 1.
Initialize the current network of Actor µ(S θ µ ) and the current network of Critic Q(S, a θ Q ) , as well as the corresponding target networks θ µ ← θ µ , θ Q ← θ Q and experience pool D.

2.
Input the current state S t of the mobile robot into the current network of Actor to obtain action a t , receive the reward r by performing the action, and obtain the next new state S t+1 of the next step.

3.
Put S t , a t , r, S t+1 in experience pool D. When the number of samples in the experience pool reaches a certain number, N samples will be randomly sampled from the experience pool D for network training.

4.
Calculate the current network loss function of Critic according to Formula (1), is the target value, γ is the discount coefficient, and i is the sample number of the sample.

5.
Update the current network parameters of the Actor according to Formula (2), where ∇J is the gradient. 6. According to Formula (3), the target network parameter θ µ of Actor and the target network parameter θ Q of Critic are updated using the soft update strategy, where τ is the constant coefficient, which is used to adjust the soft update factor. 7.
Repeat the above steps until DDPG algorithm training is complete.
As one of the mainstream algorithms of deep reinforcement learning, the DDPG algorithm is widely used in mobile robot path planning. Since the algorithm adopts continuous state space and action space, it is especially suitable for the actual motion process of mobile robots, showing great potential in complex environments.

Long Short-Term Memory
Long Short-Term Memory (LSTM) is a special kind of Recurrent Neural Network (RNN). On the basis of fully connected neural networks, RNN adds the sequential relationship before and after and endows the network with the ability of "memory" [36]. The output of LSTM at the current moment should take into account not only the input at the current moment, but also the previous information. However, too much "memory" will also increase the computing burden of the network. LSTM introduces three gating mechanisms on the basis of RNN, namely forget gate, input gate and output gate. Through the gating mechanisms, the information at every moment is judged and adjusted in a timely fashion and updated to determine the retention degree of input information [37,38], thus reducing the burden of network computing. The structure of LSTM is shown in Figure 2.
where J ∇ is the gradient. ( 1 ) ' where τ is the constant coefficient, which is used to adjust the soft update factor. 7. Repeat the above steps until DDPG algorithm training is complete.
As one of the mainstream algorithms of deep reinforcement learning, the DDPG algorithm is widely used in mobile robot path planning. Since the algorithm adopts continuous state space and action space, it is especially suitable for the actual motion process of mobile robots, showing great potential in complex environments.

Long Short-Term Memory
Long Short-Term Memory (LSTM) is a special kind of Recurrent Neural Network (RNN). On the basis of fully connected neural networks, RNN adds the sequential relationship before and after and endows the network with the ability of "memory" [36]. The output of LSTM at the current moment should take into account not only the input at the current moment, but also the previous information. However, too much "memory" will also increase the computing burden of the network. LSTM introduces three gating mechanisms on the basis of RNN, namely forget gate, input gate and output gate. Through the gating mechanisms, the information at every moment is judged and adjusted in a timely fashion and updated to determine the retention degree of input information [37,38], thus reducing the burden of network computing. The structure of LSTM is shown in Figure 2. LSTM can make full use of previous information and is suitable for processing and predicting applications with long time sequences. Mobile robot path planning is a typical long-sequence decision-making problem. In reference [31], when the DDPG algorithm is used to carry out path planning for mobile robots, each layer in the network structure adopted is a fully connected layer. It can navigate the mobile robot to the desired target position without colliding with any obstacles. However, due to the limited range of environments that mobile robots can observe and the lack of previous "memory", path planning can only rely on the current state of the mobile robot, resulting in the planned path being too tortuous, which seriously affects the efficiency of the robot. In this paper, LSTM is introduced into the DDPG network structure proposed in reference [31], and the path LSTM can make full use of previous information and is suitable for processing and predicting applications with long time sequences. Mobile robot path planning is a typical long-sequence decision-making problem. In reference [31], when the DDPG algorithm is used to carry out path planning for mobile robots, each layer in the network structure adopted is a fully connected layer. It can navigate the mobile robot to the desired target position without colliding with any obstacles. However, due to the limited range of environments that mobile robots can observe and the lack of previous "memory", path planning can only rely on the current state of the mobile robot, resulting in the planned path being too tortuous, which seriously affects the efficiency of the robot. In this paper, LSTM is introduced into the DDPG network structure proposed in reference [31], and the path planning is carried out by making comprehensive use of the past and current states of mobile robots.

The Proposed Path Planning
In order to solve the shortcomings of traditional DDPG algorithms in path planning, LSTM is introduced to optimize the structure of DDPG network, and the reward function is redesigned to speed up network training. Then, the states of the mobile robot are preprocessed by different normalization methods, such as the input of DDPG network model, and the mixed noise composed of Gaussian noise and OU noise is added to the actions output from the network model to improve the exploratory nature of the mobile robot.

Introduction of LSTM
When using the DDPG algorithm for path planning, the action of the mobile robot can only be determined by the current state of the robot, which will easily lead to confusion in the exploration trajectory. On the basis of reference [31], we take advantage of LSTM's ability to memorize past states of mobile robots and introduce it into the learning process of mobile robots. To be specific, in the Actor network, the first fully connected layer is replaced by the LSTM network. When the Actor network receives the input state of the mobile robot, it is processed by the LSTM first, then processed by the two fully connected layers, and finally outputs the actions of the robot. In the Critic network, we replace the fully connected layer that processes the state with an LSTM network. When the Critic network receives the input state and action of mobile robot, the state is processed by the LSTM network, and the action is processed by the fully connected layer. The results of the above two layers are processed by a fully connected layer, and then the Q value is output. By comprehensively considering the current state and past state of the mobile robot, the action output by Actor network can be evaluated more accurately. In this way, the actions of the robot are controlled not only by the current state of the robot, but also by the previous state, so that the actions of the robot have time correlation, which can effectively avoid the planned path being too tortuous. The LSTM-DDPG network structure designed in our paper is shown in Figure 3.
planning is carried out by making comprehensive use of the past and current states of mobile robots.

The Proposed Path Planning
In order to solve the shortcomings of traditional DDPG algorithms in path planning, LSTM is introduced to optimize the structure of DDPG network, and the reward function is redesigned to speed up network training. Then, the states of the mobile robot are preprocessed by different normalization methods, such as the input of DDPG network model, and the mixed noise composed of Gaussian noise and OU noise is added to the actions output from the network model to improve the exploratory nature of the mobile robot.

Introduction of LSTM
When using the DDPG algorithm for path planning, the action of the mobile robot can only be determined by the current state of the robot, which will easily lead to confusion in the exploration trajectory. On the basis of reference [31], we take advantage of LSTM's ability to memorize past states of mobile robots and introduce it into the learning process of mobile robots. To be specific, in the Actor network, the first fully connected layer is replaced by the LSTM network. When the Actor network receives the input state of the mobile robot, it is processed by the LSTM first, then processed by the two fully connected layers, and finally outputs the actions of the robot. In the Critic network, we replace the fully connected layer that processes the state with an LSTM network. When the Critic network receives the input state and action of mobile robot, the state is processed by the LSTM network, and the action is processed by the fully connected layer. The results of the above two layers are processed by a fully connected layer, and then the Q value is output. By comprehensively considering the current state and past state of the mobile robot, the action output by Actor network can be evaluated more accurately. In this way, the actions of the robot are controlled not only by the current state of the robot, but also by the previous state, so that the actions of the robot have time correlation, which can effectively avoid the planned path being too tortuous. The LSTM-DDPG network structure designed in our paper is shown in Figure 3. In the Actor network and Critic network of DDPG, the target network has the same structure as the current network, and both of them adopt the structure of LSTM and fully connected layer. In the Actor network, the first layer is LSTM, the second layer is the fully In the Actor network and Critic network of DDPG, the target network has the same structure as the current network, and both of them adopt the structure of LSTM and fully connected layer. In the Actor network, the first layer is LSTM, the second layer is the fully connected layer with 400 nodes, and the third layer is the fully connected layer with 300 nodes. ReLU is used as the activation function of the fully connected layer, and Batch Norm layer is added after each layer to ensure the stability of the algorithm. In the Critic network, states are input to the LSTM layer, actions are input to the fully connected layer with 400 nodes, and then both of them are processed through the fully connected layer with 300 nodes. ReLU is also used as the activation function of the fully connected layer.

Design of the Reward Function
The reward function is a benchmark to evaluate the action taken by the mobile robot and plays a guiding role in the whole learning process. The design of the reward function should not only consider that the mobile robot can reach the target point through the optimal path, but also consider the safety of the mobile robot. After the current network of Actor outputs the action according to the robot state, the state should be updated according to the execution result of the action, and the reward value is calculated. If the mobile robot reaches the target point, the maximum positive reward will be given. If the mobile robot encounters an obstacle during its movement, it should be punished. If the mobile robot neither encounters the obstacle nor reaches the target point, the reward value should be calculated according to the distance between the mobile robot and the starting point and the target point, so that the mobile robot can keep approaching the target point. In other words, every action of the mobile robot should receive timely feedback, so as to speed up the convergence of algorithm. To achieve this goal, the reward function designed in this paper is shown in Formula (4), Reach target point C 2 Hit an obstacle −rel_dis + ori_dis Other , where C 1 is a positive constant, C 2 is a negative constant, rel_dis is the distance from the mobile robot to the target point, and ori_dis is the distance from the mobile robot to the starting point. In this paper, C 1 and C 2 are set to 150 and −100, respectively.

State Normalization of Mobile Robot
The state space is the feedback of the whole environment of the mobile robot and is the basis for the mobile robot to select actions. The mobile robot mentioned in this paper interacts with the environment through the laser sensor. The detection range of the laser sensor is from −90 degrees to 90 degrees straight ahead, and the detection distance is at least 0.2 m. If the distance between the mobile robot and the obstacle is less than 0.2 m, it is considered to have collided with the obstacle. The data detected by the laser sensor contain 10 dimensions, as shown in Formula (5). The detection range of the laser sensor is shown in Figure 4. scan_range = [scan1, scan2, scan3, scan4, scan5, scan6, scan7, scan8, scan9, scan10], (5) where scan_range is the detection range of the laser sensor installed on the mobile robot, and scani is the data detected in the ith orientation.  In order to complete the path planning of the mobile robot, it is necessary to know whether the mobile robot will encounter obstacles, and also other states of the mobile robot, such as the action of the mobile robot at the previous time step (including linear velocity and angular velocity), the relative distance and angle between the mobile robot and the target point, the yaw angle, and the difference angle between the mobile robot and the target point, etc. Among them, the difference angle is shown in Formula (6). In order to In order to complete the path planning of the mobile robot, it is necessary to know whether the mobile robot will encounter obstacles, and also other states of the mobile robot, such as the action of the mobile robot at the previous time step (including linear velocity and angular velocity), the relative distance and angle between the mobile robot and the target point, the yaw angle, and the difference angle between the mobile robot and the target point, etc. Among them, the difference angle is shown in Formula (6). In order to improve the learning efficiency of the mobile robot, the states of the robot are preprocessed in different normalization methods, as shown in Formulas (7)-(10), respectively.
where di f f _angle is the difference angle between the mobile robot and the target point, rel_eheta is the relative angle between the mobile robot and the target point, and yaw is the yaw angle of the mobile robot.
where rel_dis is the relative distance between the mobile robot and the target point, and diagonal_dis is the diagonal length of the simulation map.
where rel_eheta is the relative angle between the mobile robot and the target point.
where yaw is the yaw angle of the mobile robot.
where di f f _angle is the difference angle between the mobile robot and the target point.
To sum up, in the DDPG algorithm, the state space of the mobile robot is set as 16-dimensional data and defined as the input of the neural network. The normalized state S t can be defined as: Considering the motion smoothness of mobile robot and the continuity of output actions, the output of the DDPG network model proposed in this paper is continuous linear velocity and angular velocity to guide the movement of the mobile robot. Since the limits of angular velocity and linear velocity should not be too large in the simulation environment, the maximum angular velocity is set to 0.5 rad/s and the maximum linear velocity is set to 0.25 m/s. The output action of the model is shown as Formula (12), where a t is the action of mobile robot at time t, while v t and ϕ t are linear velocity and angular velocity, respectively.

Mixed Noise Design
DDPG adopts deterministic policy with poor exploration of the environment. In order to increase the randomness of the learning process, DDPG will add a certain amount of noise to the output actions to improve the exploration ability of the mobile robot. At present, Gaussian noise and OU noise are commonly used in DDPG. Gaussian noise produces irrelevant exploration in a time sequence; that is, the selection of the front and rear actions are independent. The OU noise is a random process, and its calculation formula is shown in Formula (13). It can produce time-sequence-related exploration; that is, the action of the next step will be affected by the action of the previous step. Different from Gaussian noise, OU noise does not make the actions of two adjacent steps of the mobile robot very different, but makes the mobile robot explore near the mean of action sampling. Although this allows the mobile robot to continuously explore in one direction, it will increase the movement time of the robot when the action taken is not optimal in the current view.
where θ is the learning rate of the random process, a t is the action at time t, a is the average value of the action sampling data, δ is the random weight of OU, and W t is the Wiener process.
In order to optimize the DDPG exploration policy and improve the exploration efficiency of the robot, we combine Gaussian noise and OU noise to form mixed noise. The Actor network output action a t based on the mixed noise is shown in Formula (14), where var is the Gaussian variance to ensure that the mobile robot has uniform and stable detection ability in each episode. At the same time, with the progress of the training process, the mobile robot begins to adapt to the task scene, which requires the exploration rate to be gradually reduced, as shown in Formula (15), This paper proposes a path planning algorithm for a mobile robot based on improved DDPG, which can solve the problems of a long training time and slow convergence of traditional path planning model. The algorithm flow is shown in Figure 5.

Environment Construction of Simulation Experiment
ROS is selected as the simulation experimental platform in this paper, Python and

Environment Construction of Simulation Experiment
ROS is selected as the simulation experimental platform in this paper, Python and TensorFlow frameworks are used to realize the proposed algorithm, and Gazebo7 is used to establish the simulation environment, as shown in Figure 6. The black dot is the mobile robot, the blue part is the detection range of the laser sensor, and the gray part is the obstacle. Figure 6a shows the established square-shaped simulation environment without obstacles, which is mainly used to train mobile robots to realize path planning in a limited space. Figure 6b adds large obstacles to the environment of Figure 6a to train the mobile robot to realize path planning in an environment with obstacles. Figure 6c adds random small obstacles to the environment of Figure 6a, which is mainly used to test the training effect of the mobile robot in the above two environments.

Environment Construction of Simulation Experiment
ROS is selected as the simulation experimental platform in this paper, Python and TensorFlow frameworks are used to realize the proposed algorithm, and Gazebo7 is used to establish the simulation environment, as shown in Figure 6. The black dot is the mobile robot, the blue part is the detection range of the laser sensor, and the gray part is the obstacle. Figure 6a shows the established square-shaped simulation environment without obstacles, which is mainly used to train mobile robots to realize path planning in a limited space. Figure 6b adds large obstacles to the environment of Figure 6a to train the mobile robot to realize path planning in an environment with obstacles. Figure 6c adds random small obstacles to the environment of Figure 6a, which is mainly used to test the training effect of the mobile robot in the above two environments.  We test the proposed path planning algorithm from three aspects of convergence speed, training time and success rate and analyze the experimental results in detail. The convergence speed and training time are used as evaluation criteria of the training efficiency of the algorithm, and the success rate is used to verify the effectiveness of the algorithm. In the training process of the algorithm, the convergence speed and training time can reflect how many episodes are needed to obtain the optimal solution. The faster the convergence speed is, the shorter the training time is, which means the higher the training efficiency is. The success rate refers to the percentage of mobile robots that can successfully reach the target point from the starting point according to the path planning algorithm adopted. The higher the success rate is, the better the performance of the algorithm.

Effect Analysis of Our Algorithm
In order to verify the performance of our algorithm, the DDPG algorithm proposed in reference [31], our algorithm with an improved network structure (LSTM-DDPG) and our algorithm after adding mixed noise further (MN-LSTM-DDPG) are all trained in the simulation environment, respectively.
Firstly, 2000 episodes of training are conducted in the simulation Figure 6a, and the reward value of the mobile robot is recorded after each episode of training, as shown in Figure 7. The results in Figure 7a show that in an environment without obstacles, the reward value of the algorithm proposed in [31] gradually tends to be stable with the increase in training episodes, but it still fails to converge after 2000 episodes of training. Moreover, the reward value fluctuates greatly in the first 800 episodes and is mostly negative, indicating that the mobile robot is learning how to approach the target point but fails many times. After 800 episodes, the reward value gradually tends to be positive, indicating that the mobile robot can reach the target point through training, but it still collides with obstacles. Figure 7b is the training result after LSTM is introduced into the network of the algorithm proposed in [31]. As can be seen, with the increase in training time, the reward value obtained by the mobile robot gradually increases from a negative value to a positive value, and finally tends to be stable, but the convergence rate is slow. After LSTM is introduced into the algorithm, the mixed noise composed of Gaussian noise and OU noise is further introduced into the output results of the network, and the reward function is optimized. Figure 7c shows the training result of the algorithm. As can be seen, with the increase in training time, the reward value obtained by the mobile robot gradually increases and finally tends to be stable. Compared with Figure 7b, the convergence speed is significantly faster.
the reward value fluctuates greatly in the first 800 episodes and is mostly negative, indicating that the mobile robot is learning how to approach the target point but fails many times. After 800 episodes, the reward value gradually tends to be positive, indicating that the mobile robot can reach the target point through training, but it still collides with obstacles. Figure 7b is the training result after LSTM is introduced into the network of the algorithm proposed in [31]. As can be seen, with the increase in training time, the reward value obtained by the mobile robot gradually increases from a negative value to a positive value, and finally tends to be stable, but the convergence rate is slow. After LSTM is introduced into the algorithm, the mixed noise composed of Gaussian noise and OU noise is further introduced into the output results of the network, and the reward function is optimized. Figure 7c shows the training result of the algorithm. As can be seen, with the increase in training time, the reward value obtained by the mobile robot gradually increases and finally tends to be stable. Compared with Figure 7b, the convergence speed is significantly faster.    Figure 8 shows the average rewards returned every 10,000 steps by the algorithm proposed in reference [31] and LSTM-DDPG and MN-LSTM-DDPG proposed in this paper in the path planning training in simulation Figure 6a. The blue line represents the algorithm proposed in [31], the green line represents the LSTM-DDPG algorithm, and the red line represents the MN-LSTM-DDPG algorithm. As can be seen, MN-LSTM-DDPG has the fastest convergence speed in the path planning of mobile robots, requiring only 120,000 steps, while the algorithm proposed in [31] requires 200,000 steps to converge, and the convergence is not stable. algorithm proposed in [31], the green line represents the LSTM-DDPG algorithm, and the red line represents the MN-LSTM-DDPG algorithm. As can be seen, MN-LSTM-DDPG has the fastest convergence speed in the path planning of mobile robots, requiring only 120,000 steps, while the algorithm proposed in [31] requires 200,000 steps to converge, and the convergence is not stable.   In order to verify the success rate and generalization ability of the model trained in simulation Figure 6a, 200 tests are performed on the three algorithms in simulation Figure 6a and simulation Figure 6c, respectively. Figures 9 and 10 show the movement process of the mobile robot in these two environments when MN-LSTM-DDPG is used for path planning. The black dot is the mobile robot, the blue part is the detection range of the laser sensor, the gray part is the obstacle, and the green circle is the final target point. As can be seen, the mobile robot can avoid obstacles from the starting point and reach the target point accurately with the optimal path.   Table 2 that the path planning success rate of the algorithm proposed in [31] is only 86%, while the success rate of the MN-LSTM-DDPG algorithm proposed in this paper can reach 100%. In terms of time, compared with the algorithm proposed in [31], the path planning time of the MN-LSTM-DDPG algorithm proposed in this paper is shortened by 21.48%. However, when the model trained in simulation Figure 6a is applied to simulation Figure 6c, the testing effect of each algorithm is not ideal, as shown in Table 3. Therefore, it is necessary to train the algorithm in an environment with obstacles.   Table 2 that the path planning success rate of the algorithm proposed in [31] is only 86%, while the success rate of the MN-LSTM-DDPG algorithm proposed in this paper can reach 100%. In terms of time, compared with the algorithm proposed in [31], the path planning time of the MN-LSTM-DDPG algorithm proposed in this paper is shortened by 21.48%. However, when the model trained in simulation Figure 6a is applied to simulation Figure 6c, the testing effect of each algorithm is not ideal, as shown in Table 3. Therefore, it is necessary to train the algorithm in an environment with obstacles.  Table 3. Comparison of test results of the improved algorithm in simulation Figure 6c.

Algorithm Success Rate (100%) Testing Time (h)
Algorithm in [31] 108 In order to verify the effect of the proposed algorithm in an obstacle environment, the algorithm proposed in [31], the LSTM-DDPG algorithm and the MN-LSTM-DDPG algorithm proposed in this paper were respectively trained for 2000 episodes in simulation Figure 6b, and the average reward returned by the mobile robot every 10,000 steps was recorded, as shown in Figure 11. As can be seen from the figure, the improved MN-LSTM-DDPG algorithm in this paper can achieve convergence after 110,000 steps of training in the path planning of mobile robots in the simulation Figure 6b However, the algorithm proposed in [31] needs 200,000 training steps to converge, and the convergence is unstable; the convergence speed is significantly slower than the algorithm proposed in this paper. Additionally, from the training results in Table 4, it can be seen that the path planning training time of the algorithm proposed in the [31] is 21.73 h, while the path planning training time of the improved MN-LSTM-DDPG algorithm in this paper is only 19.70 h, and the training speed shows a significant improvement. In terms of the number of training steps, the number of training steps of the algorithm proposed in [31] is 374,316, while the number of training steps of the improved MN-LSTM-DDPG algorithm in this paper is only 346,667, which significantly improves the convergence speed of the algorithm. planning training time of the algorithm proposed in the [31] is 21.73 h, while the path planning training time of the improved MN-LSTM-DDPG algorithm in this paper is only 19.70 h, and the training speed shows a significant improvement. In terms of the number of training steps, the number of training steps of the algorithm proposed in [31] is 374,316, while the number of training steps of the improved MN-LSTM-DDPG algorithm in this paper is only 346,667, which significantly improves the convergence speed of the algorithm. Figure 11. Average reward returned by reference [31] and the improved algorithm every 10,000 steps during training in simulation Figure 6b.   After training the model in an environment with obstacles, 200 tests were performed in simulation Figure 6b and simulation Figure 6c, respectively, to verify the generalization ability of the model. The test process is shown in Figures 12 and 13. The two figures respectively show the process of mobile robot avoiding obstacles from the starting point to reach the target range.     Table 5, in terms of success rate, the success rate of the algorithm proposed in [31] is 76%, while the success rate of the MN-LSTM-DDPG algorithm proposed in this paper can reach 87%, which is significantly higher than that in reference [31]. In terms of time, the test time of the algorithm in this paper is also 17.96% faster than the algorithm in [31]. It can be seen from Table 6 that the test effect of each algorithm in simulation Figure 6c is better than that in simulation Figure 6b. This is because when the obstacle is too large, the number of steps that the mobile robot needs to move increases. Since the maximum number of steps of the mobile robot is limited in the simulation environment, when the maximum number of steps has not reached the target point, it is regarded as a failure.

Comparison and Analysis with Other Algorithms
In order to fully evaluate the performance of the proposed algorithm, experiments are conducted to compare the proposed algorithm with those in references [32,33]. In the simulation Figure 6a, each algorithm is trained for 2000 episodes, and the average reward returned by the mobile robot every 10,000 steps is recorded. The results are shown in Figure 14. As can be seen from the figure, in the barrier-free environment, when the algorithm proposed in [32] carries out path planning training, it tends to converge at about 110,000 steps, but the reward value after convergence still shows a downward trend, and the training effect is not good. The algorithm proposed in [33] tends to stabilize after 210,000 steps of training. However, the algorithm proposed in this paper can converge and become stable after 120,000 steps of training, and the training effect is the best. Table 7 records the comparison of training time and the number of steps of each algorithm. As can be seen from the table, in terms of training time, the training time of the algorithm proposed in [32] is 39.63 h. The training time of the algorithm proposed in [33] is 24.67 h. However, the training time of the algorithm proposed in this paper is only 22.75 h, which is significantly faster than the algorithm proposed in [32] and better than the algorithm proposed in [33]. In terms of the number of training steps, the training steps of the algorithm proposed in references [32,33] are 395,344 and 446,596, respectively, while the training steps of the algorithm proposed in this paper are 417,701. The algorithm proposed in [32] will fall into local optimum during training, which leads to the mobile robot turning in place, and in this paper, the proposed algorithm can effectively avoid the phenomenon. Compared with reference [32], this paper proposed that although the algorithm steps of training increased, the training time is significantly reduced, and the training speed is still faster than the algorithm in [32]. However, the algorithm proposed in this paper can effectively avoid this phenomenon. Compared with the algorithm proposed in [32], although the number of training steps is increased, the training time is significantly reduced, and the training speed is still faster than the algorithm proposed in [32]. This indicates that the training effect of the algorithm proposed in this paper is better than the algorithm proposed in the references [32,33] in a barrier-free environment. speed is still faster than the algorithm proposed in [32]. This indicates that the training effect of the algorithm proposed in this paper is better than the algorithm proposed in the references [32,33] in a barrier-free environment.  To verify the success rate and generalization ability of the model trained in simulation Figure 6a Tables 8 and 9. It can be seen from Table 8 that the path planning success rate of the algorithm proposed in [32] is 87.5%, and that of the algorithm proposed in [33] is 90%. However, the path planning success rate of the algorithm proposed in this paper can reach 100%, which is significantly higher than that in references [32,33]. In terms of time, the algorithm proposed in this paper takes the same time as the algorithm proposed in [33], which is significantly shorter than that in [32]. However, it can be seen from the test results in Table 9 that the model trained by each algorithm in a barrier-free environment is not ideal when tested in an obstacle environment, but the algorithm proposed in this paper still performs better than the other two algorithms.  To verify the success rate and generalization ability of the model trained in simulation  Tables 8 and 9. It can be seen from Table 8 that the path planning success rate of the algorithm proposed in [32] is 87.5%, and that of the algorithm proposed in [33] is 90%. However, the path planning success rate of the algorithm proposed in this paper can reach 100%, which is significantly higher than that in references [32,33].
In terms of time, the algorithm proposed in this paper takes the same time as the algorithm proposed in [33], which is significantly shorter than that in [32]. However, it can be seen from the test results in Table 9 that the model trained by each algorithm in a barrierfree environment is not ideal when tested in an obstacle environment, but the algorithm proposed in this paper still performs better than the other two algorithms. In order to verify the advantages of the algorithm in this paper in an environment with obstacles, the algorithm in this paper and the algorithm proposed in references [32,33] were trained for 2000 episodes in simulation Figure 6b. The average rewards returned by the mobile robot every 10,000 steps were recorded, and the results are shown in Figure 15. As can be seen from the figure, when the algorithm proposed in [32] performs path planning in simulation Figure 6b, it needs 240,000 steps of training to converge. The algorithm proposed in [33] needs 140,000 training steps to become stable, while the algorithm proposed in this paper can achieve convergence after 110,000 training steps, and the convergence effect is obviously better than that in references [32,33]. It can also be seen from Table 10 that the training time and number of training steps taken by the algorithm proposed in this paper are significantly less than those in references [32,33]. In order to verify the advantages of the algorithm in this paper in an environment with obstacles, the algorithm in this paper and the algorithm proposed in references [32] and [33] were trained for 2000 episodes in simulation Figure 6b. The average rewards returned by the mobile robot every 10,000 steps were recorded, and the results are shown in Figure 15. As can be seen from the figure, when the algorithm proposed in [32] performs path planning in simulation Figure 6b, it needs 240,000 steps of training to converge. The algorithm proposed in [33] needs 140,000 training steps to become stable, while the algorithm proposed in this paper can achieve convergence after 110,000 training steps, and the convergence effect is obviously better than that in references [32,33]. It can also be seen from Table 10 that the training time and number of training steps taken by the algorithm proposed in this paper are significantly less than those in references [32] and [33].

Algorithm
Training Time (h) Training Steps (Step) Algorithm in [32] 21.03 365,978 Algorithm in [33] 21.67 379,379 Our proposed 19.70 346,667 Figure 15. Average reward returned by references [32,33] and our proposed algorithm every 10,000 steps during training in simulation Figure 6b. Table 10. Comparison of training time and steps of each algorithm in simulation Figure 6b.

Algorithm Training Time (h) Training Steps (Step)
Algorithm in [32] 21.03 365,978 Algorithm in [33] 21.67 379,379 Our proposed 19.70 346,667 In order to verify the success rate of the trained model in the obstacle environment, each algorithm was tested 200 times each in simulation Figure 6b and simulation Figure 6c, and the results are shown in Tables 11 and 12. As can be seen from the test results in Table 11, in terms of success rate, the algorithm proposed in [32] is 82.5%, the algorithm proposed in [33] is 81%, and the algorithm proposed in this paper can reach 87%, which is significantly higher than the algorithm proposed in references [32,33]. In terms of test time, the test time of the algorithm proposed in this paper is longer than that proposed in [33]. In the testing process, since the starting and ending points of the mobile robot are randomly selected, their relative positions will have a certain influence on the testing time. It can be seen from Table 12, in simulation Figure 6c, the success rate of the algorithm proposed in this paper can reach 90.5%, higher than the algorithm proposed in references [32,33], which indicates that the algorithm proposed in this paper has obvious advantages in path planning in an environment with obstacles. Table 11. Comparison of test results of each algorithm in simulation Figure 6b.

Conclusions
Since LSTM has the ability of "memory", this paper uses LSTM to optimize the DDPG network structure. By designing mixed noise and more reasonable reward function, mobile robot path planning models can be rapidly trained. This effectively improves the exploration efficiency of a mobile robot in a complex environment and ensures that the mobile robot can reach the target point in a shorter time and by a better path. However, the algorithm in this paper only considers static obstacles in the environment, and dynamic obstacles are also important factors to be considered in many application scenarios. How to effectively avoid the impact of dynamic obstacles on path planning is important research content for the future.