Path Planning of a Mobile Robot for a Dynamic Indoor Environment Based on an SAC-LSTM Algorithm

This paper proposes an improved Soft Actor–Critic Long Short-Term Memory (SAC-LSTM) algorithm for fast path planning of mobile robots in dynamic environments. To achieve continuous motion and better decision making by incorporating historical and current states, a long short-term memory network (LSTM) with memory was integrated into the SAC algorithm. To mitigate the memory depreciation issue caused by resetting the LSTM’s hidden states to zero during training, a burn-in training method was adopted to boost the performance. Moreover, a prioritized experience replay mechanism was implemented to enhance sampling efficiency and speed up convergence. Based on the SAC-LSTM framework, a motion model for the Turtlebot3 mobile robot was established by designing the state space, action space, reward function, and overall planning process. Three simulation experiments were conducted in obstacle-free, static obstacle, and dynamic obstacle environments using the ROS platform and Gazebo9 software. The results were compared with the SAC algorithm. In all scenarios, the SAC-LSTM algorithm demonstrated a faster convergence rate and a higher path planning success rate, registering a significant 10.5 percentage point improvement in the success rate of reaching the target point in the dynamic obstacle environment. Additionally, the time taken for path planning was shorter, and the planned paths were more concise.


Introduction
Mobile robot path planning is a crucial technique that enables robots to navigate through an environment while avoiding obstacles.This is achieved by planning a collisionfree path [1] from the robot's current position to a desired destination based on environmental information obtained through sensors.Traditional path planning methods encompass a variety of algorithms with different principles and application scenarios.These methods can be classified into global and local path planning algorithms, including graph search methods, random sampling methods, bionic algorithms, artificial potential field methods, simulated annealing algorithms, neural network methods, and dynamic window methods.
Classical graph search algorithms include Dijkstra [2], A* [3], and D* [4].The Dijkstra algorithm solves the shortest path problem from a single point to all other vertices in a directed graph using a greedy algorithm.Although it can obtain the optimal path, it has a high computational cost and low efficiency.The A* algorithm, based on the Dijkstra algorithm, improves efficiency by adding the heuristic information, but its effectiveness in complex environments is not guaranteed.Moreover, this algorithm is only suitable for static environments and it performs poorly in dynamic environments.The D* algorithm is an improved version of the A* algorithm that can be applied in dynamically changing scenarios.
Classical random sampling methods such as Probabilistic Roadmap (PRM) [5] and Rapidly Exploring Random Tree (RRT) [6] have been widely used in robot path planning and motion control fields.The Lazy PRM [7] algorithm is an improved version of the PRM algorithm that enhances efficiency by reducing the number of calls to the local planner.Liu et al. [8] improved the RRT algorithm by using a goal-biased sampling strategy to determine the nodes and introduced an event-triggered step length extension based on the hyperbolic tangent function to improve node generation efficiency.Euclidean distance and angle constraints were used in the cost function of node connection optimization.Finally, the path was optimized further using path pruning and Bezier curve smoothing methods, leading to an improved convergence and accuracy.
Bionic algorithms are heuristic optimization algorithms based on the evolution and behavior of biological organisms in nature, mainly including genetic algorithms [9] and ant colony algorithms [10].Liang et al. [11] integrated the ant colony algorithm and genetic algorithm to propose a hybrid path planning algorithm, which used a genetic algorithm to generate initial paths and then used an ant colony algorithm to optimize them, significantly improving the accuracy and efficiency of path planning.
The artificial potential field method [12] was first applied in the mobile robot path planning field in 1986.Zha et al. [13] improved the artificial potential field method by adding a distance factor between the target point and the vehicle in the repulsive force function, and a safety distance within the influence range of obstacles.The experimental results showed that when the vehicle was driving within the safety distance of obstacles, it would be subject to increased repulsive force, ensuring the safety of the vehicle during driving.Zhao et al. [14] proposed a multi-robot path planning method based on an improved artificial potential field and a fuzzy inference system, which overcame the problem of smooth path planning existing in traditional artificial potential field methods by using the incremental potential field calculation method.
Afifi et al. [15] proposed a vehicle path planning method based on a simulated annealing algorithm, which solved the vehicle path planning problem under time window constraints, using the simulated annealing algorithm for path optimization and search.Jun et al. [16] proposed a particle swarm optimization combined with simulated annealing (PSO-ICSA) to self-adaptively adjust the coefficients, enabling high-dimension objects to enhance the global convergence ability.
Zhang et al. [17] proposed an indoor mobile robot path planning method based on deep learning.They used deep learning models to extract features from sensor data to predict the robot's motion direction and speed, achieving good results in experiments.During robot path planning, when the environment changes, the robot re-executes the algorithm, significantly increasing the time to find the optimal path.Dimensionality reduction seems to be a solution to this problem.Ferreira et al. [18] proposed a path planning algorithm based on a deep learning encoder model.They built a CNN encoder that uses nonlinear correlations to reduce data dimensions, eliminating unnecessary information and accelerating the efficiency of finding the shortest path.
Bai et al. [19] combined a dynamic window algorithm with the A* algorithm to propose an unmanned aerial vehicle path planning method.Experiments proved that the improved algorithm significantly reduced both path planning length and time.Lee et al. [20] proposed a dynamic window approach based on finite distribution estimation.Experimental results showed that this method could achieve a reliable obstacle avoidance effect for mobile robots and exhibited good performance in multiple scenarios.
Most of the above methods heavily rely on environmental map information.When faced with unknown environments, these methods may not achieve ideal results.When mobile robots are in unknown environments, due to the lack of environmental cognition, they must have certain exploration and autonomous learning abilities to efficiently complete path planning tasks.Therefore, studying mobile robot path planning algorithms that rely on little or no map information of the environment and have autonomous learning abilities has become one of the current key research topics.
Recently, artificial intelligence technologies represented by deep learning and reinforcement learning have developed rapidly.Reinforcement learning algorithms do not rely on map information and can learn path planning strategies in unknown environments by Sensors 2023, 23, 9802 3 of 28 interacting with the environment through trial and error.However, reinforcement learning is prone to the problem of dimensionality disaster in the path planning process.Deep learning is an end-to-end model that can fit the mapping relationship between high-dimensional input and output data, and is suitable for dealing with high-dimensional data problems.Deep reinforcement learning combines the advantages of deep learning and reinforcement learning and has a huge advantage over other path planning methods when dealing with complex unknown environments.Nevertheless, path planning methods based on deep reinforcement learning still face problems such as sparse rewards, a slow learning rate, and difficult convergence in application process.
This paper focuses on the indoor mobile robot path planning problem in complex and unknown dynamic environments with several static and dynamic obstacles, using the Soft Actor-Critic (SAC) algorithm as the main method.The SAC algorithm, while powerful, exhibits certain limitations, namely (1) difficulty in processing complex or dynamic environmental information; (2) inadequacy in handling long-term dependencies in path planning; and (3) lack of predictive capability for future environmental states.To address these shortcomings, an improved SAC-LSTM algorithm is proposed.The main contributions of this paper are as follows.First, the LSTM network with memory capability is introduced into the SAC algorithm, allowing the agent to make more reasonable decisions by combining historical and current states and predict the dynamic changes in the environment, such as the future positions of moving obstacles.Second, the burn-in training mechanism is introduced to solve the problem of memory impairment caused by the hidden state being zeroed during the training process of the LSTM network, stabilizing the learning process, especially in the early stages of training.Third, by combining the prioritized experience replay mechanism, the problem of low sampling efficiency of the algorithm is solved, and the convergence speed of the algorithm is accelerated.Fourth, a complex dynamic test scenario is constructed for indoor mobile robots, featuring multiple stationary and moving obstacles of various sizes and shapes, as well as different motion trajectories, making the test scenario more realistic.
The rest of this paper is organized as follows: Section 2 provides an overview of related works regarding path planning methods in dynamic environments.Section 3 describes the proposed SAC-LSTM system framework and algorithms in detail.Section 4 presents the experimental results and performance analysis.Finally, Section 5 concludes the paper and discusses future research directions.

Related Work
In recent years, researchers have started to apply deep reinforcement learning (DRL) algorithms to the field of path planning to solve complex problems.In 2016, Tai et al. [21] first applied the DQN algorithm to indoor mobile robots, which could complete path planning tasks in indoor scenarios, but the algorithm had low generalization.Wang et al. [22] introduced an improved DQN algorithm combined with artificial potential field methods to design reward functions, improving the efficiency of mobile robot path planning.However, it could not achieve continuous action output for robots.
Lei et al. [23] proposed a path planning algorithm using a DDQN framework with environment information obtained through LiDAR.It designed a new reward function to address the instability issue during training, improving algorithm stability, but its application was limited to simple scenarios with no guarantee of efficiency in complex situations.Tai et al. [24] used an asynchronous deterministic policy gradient algorithm to build a mapless path planner with input from the mobile robot's LiDAR-scanned environment information.After a period of training, the mobile robot could successfully reach the designated target, but the planned path was relatively tortuous.
In [25], a deep reinforcement learning-based online path planning algorithm was proposed, successfully achieving path planning for drones in dynamic environments.Zhang et al. [26] combined the advantages of DRL and interactive RL algorithms and proposed a deep interactive reinforcement method for autonomous underwater vehicle (AUV) path tracking, achieving path tracking in a Gazebo-simulated environment.In [27], the DDPG algorithm was extended to a parallel deep deterministic policy gradient algorithm (PDDPG) and applied to multi-robot mapless collaborative navigation tasks.
Wang et al. [28] proposed an end-to-end modular DRL architecture that decomposed navigation tasks in complex dynamic environments into local obstacle avoidance and global navigation subtasks, using DQN and dual-stream DQN algorithms to solve them, respectively.Experiments demonstrated that this modular architecture could efficiently complete navigation tasks.Gao et al. [29] introduced an incremental training mode to address the low training efficiency in DRL path planning.In [30], incorporating a curiosity mechanism into the A3C algorithm provided an additional reward for the exploration behavior of mobile robots, addressing the reward sparsity issue to some extent.
De Jesus et al. [31] proposed a mobile robot path planning algorithm based on SAC, achieving path planning in different scenarios built on ROS.However, the algorithm still faced the problem of sparse environmental rewards.Park, K.-W.et al. [32] employed the SAC algorithm to solve the path planning problem for multi-arm manipulators to avoid fixed and moving obstacles, and used the LSTM network to predict the position of the moving obstacles.The simulation and experimental results showed the optimal path and good prediction of obstacle position.Although the simulation and experimental scenario considered moving obstacles, there was only one moving obstacle, lacking verification for multiple obstacles.Additionally, the multi-arm manipulators used a camera in conjunction with the OpenCV vision algorithm to detect obstacles, which has limited effectiveness in detecting and predicting multiple moving obstacles.
In addition to LSTM, the metaheuristic-based recurrent neural network (RNN) [33] has also been applied to control mobile robotic systems.The metaheuristic-based RNNs often integrate optimization algorithms (such as genetic algorithms and particle swarm optimization) with neural network characteristics.The Beetle Antennae Olfactory Recurrent Neural Network (BAORNN) [34] is a metaheuristic-based control framework used for simultaneous tracking control and obstacle avoidance in redundant manipulators.A key feature of this framework is that it unifies tracking control and obstacle avoidance into a single constrained optimization problem, actively rewarding the optimizer to avoid obstacles by introducing a penalty term in the objective function.The distance calculation is based on the Gilbert-Johnson-Keerthi algorithm, which calculates the distance between the manipulator and obstacles by directly using their three-dimensional geometric shapes.In contrast, the RNN may offer more flexibility in solving specific control and optimization problems, but might not match the LSTM's proficiency in handling complex sequential data and long-term dependencies.
As seen from the above literature, DRL-based path planning methods have the advantages of not relying on map information and autonomous learning capabilities, making them highly suitable for path planning tasks in unknown environments.However, during application, there are still issues such as sparse rewards, slow learning rates, and difficulty in converging the algorithm.In particular, in dynamic or complex environments, the SAC algorithm may require more accurate environmental models for effective learning.This could be difficult to achieve in practice, particularly in environments with high uncertainty or rapid changes.The SAC algorithm might be insufficient in effectively dealing with highly dynamic or non-stationary environments, where environmental states and dynamics can change rapidly.

Framework of Path Planning
The task of the mobile robot path planning algorithm is to plan an optimal path from the starting point to the destination, while minimizing the robot's movement cost, avoiding obstacles, and enabling the robot to reach the destination as quickly as possible.The deep reinforcement learning algorithm is designed to learn a strategy that enables the agent to maximize its reward through interactions with the environment.Applying this algorithm to the mobile robot path planning task essentially transforms the path planning problem into a reinforcement learning problem.The strategy that allows the robot to quickly reach the target point is learned by having the mobile robot try different actions and interact with the environment.The process of interacting with the environment is a Markov Decision Process (MDP), which requires defining the state space, action space, and reward function.This article proposes the application of the SAC-LSTM algorithm to the mobile robot path planning task, as depicted in Figure 1.In this framework, the SAC-LSTM algorithm combines historical and current states to select an action and controls the mobile robot to execute the action in the environment.After interacting with the environment, the mobile robot receives a reward value and a state value.During the training process, important experience samples are prioritized using the prioritized experience replay, and the algorithm is trained using the burn-in mechanism to obtain an optimal strategy for quickly reaching the target point.

Framework of Path Planning
The task of the mobile robot path planning algorithm is to plan an optimal path from the starting point to the destination, while minimizing the robot's movement cost, avoiding obstacles, and enabling the robot to reach the destination as quickly as possible.The deep reinforcement learning algorithm is designed to learn a strategy that enables the agent to maximize its reward through interactions with the environment.Applying this algorithm to the mobile robot path planning task essentially transforms the path planning problem into a reinforcement learning problem.The strategy that allows the robot to quickly reach the target point is learned by having the mobile robot try different actions and interact with the environment.The process of interacting with the environment is a Markov Decision Process (MDP), which requires defining the state space, action space, and reward function.This article proposes the application of the SAC-LSTM algorithm to the mobile robot path planning task, as depicted in Figure 1.In this framework, the SAC-LSTM algorithm combines historical and current states to select an action and controls the mobile robot to execute the action in the environment.After interacting with the environment, the mobile robot receives a reward value and a state value.During the training process, important experience samples are prioritized using the prioritized experience replay, and the algorithm is trained using the burn-in mechanism to obtain an optimal strategy for quickly reaching the target point.

SAC-LSTM Algorithm
The Soft Actor-Critic (SAC) algorithm is an off-policy Actor-Critic algorithm based on the maximum entropy model framework.Off-policy learning improves sample efficiency and reduces training time by utilizing data generated from a behavioral policy to train the target policy.The Actor-Critic framework allows the algorithm to be applied to continuous state and action spaces, thus expanding the range of actions and states that can be selected.To prevent premature convergence and avoid local optima, the SAC algorithm uses a stochastic policy.Additionally, introducing entropy during the learning process improves the algorithm's ability to explore the environment and enhances its performance and robustness.

SAC-LSTM Algorithm
The Soft Actor-Critic (SAC) algorithm is an off-policy Actor-Critic algorithm based on the maximum entropy model framework.Off-policy learning improves sample efficiency and reduces training time by utilizing data generated from a behavioral policy to train the target policy.The Actor-Critic framework allows the algorithm to be applied to continuous state and action spaces, thus expanding the range of actions and states that can be selected.To prevent premature convergence and avoid local optima, the SAC algorithm uses a stochastic policy.Additionally, introducing entropy during the learning process improves the algorithm's ability to explore the environment and enhances its performance and robustness.

Maximum Entropy Principle
Entropy is a measure of uncertainty or randomness in a system, used in information theory.It is inversely proportional to system certainty and directly proportional to system randomness.Let x be a random variable with probability density function P(X).The entropy H(P) of X can be calculated according to Classical reinforcement learning algorithms adopt a learning strategy that seeks to maximize the expected value of cumulative rewards.This objective is expressed in the following expression where the action-value function Q(s, a) represents the expected return obtained by taking a certain action a in a state s based on the policy function π.The reward value R(s, a) is the expected value of the sum of immediate rewards for all possible actions at any given time t, denoted as R(s, a in which s and a represent the state and action at time t, respectively.The objective of the SAC algorithm in reinforcement learning is to maximize the entropy-regularized reward, which is the sum of cumulative reward and policy entropy.In other words, the SAC algorithm incorporates policy entropy in addition to maximizing the cumulative reward in classical reinforcement learning algorithms.The process of finding the optimal policy in the SAC algorithm is represented by Equation (3).
In Equation (3), α is the entropy regularization coefficient, which adjusts the relative importance of reward and entropy.A higher value of α indicates a larger proportion of entropy, prompting the intelligent agent to explore the environment and employ diverse actions to achieve its objectives.Conversely, a lower value of α implies a reduced emphasis on entropy, causing the intelligent agent to rely more on existing actions to accomplish its objectives.A trajectory τ is a sequence of states and actions, where τ = (s 0 , a 0 , s 1 , a 1 , . . .). γ ∈ (0, 1) is the discount factor.Similarly, compared to classical reinforcement learning algorithms, the SAC algorithm adds entropy rewards to both the value function V π and the action-value function Q π .The specific definitions are shown in Equations ( 4) and (5), The relationship between V π (s) and Q π (s) is as follows: and the Bellman equation for Based on the above steps, the policy network can iteratively update its network parameters using the iterative Bellman equation in order to obtain the optimal policy.

Updating Process of SAC Algorithm
This paper utilizes an entropy-weighted SAC algorithm with automatic adjustment of α.The algorithm consists of two Q networks (Critic networks) and one Actor network (Policy network).The subsequent sections will present the updating processes of the Q networks and the Actor network.

1.
Updating Process of the Q Networks The Q networks are updated by sampling experiences (s t , a t , r t , s t+1 , d) from the experience replay buffer.The estimation of the state-action value for the Q networks is calculated by where a t+1 is the prediction of the action a t+1 by the Actor network.The SAC algorithm employs the mean squared loss function as its loss.Define D = {τ i } i=1,...,N as a set of trajectories, where each trajectory is obtained by letting the agent act in the environment using the policy π θ .The loss function for the Q networks is defined as follows: where |D| = N is the number of trajectories in D.
The SAC algorithm incorporates the clipped double-Q technique during the training of the Q networks.Consequently, Q target (r t , s t+1 , d) takes the minimum Q-value between the two Q approximators.The specific definition is as follows:

Updating Process of the Actor Network
The SAC algorithm utilizes a squashed Gaussian policy to select actions, which means that action samples are obtained according to a θ (s, ξ) = tanh(µ θ (s) + σ θ (s) ξ), ξ ∼ N (0, I), (11) where θ represents the parameters of the Actor network π θ , µ θ (s) and σ θ (s) respectively correspond to the mean and standard deviation of the action distribution outputted by the Actor network π θ , and ξ denotes a random noise that follows a normal distribution.The reparameterization technique is used to optimize the policy in order to rewrite the expectation over actions into an expectation over noise, as follows: To obtain the policy loss, the final step is to substitute Q π θ (s, a θ (s, ξ)) with one of two function approximators.The policy of the Actor network π θ is thus optimized according to where min is the minimum of the two Q approximators.

LSTM Network
During the process of mobile robot path planning, individual states often fail to provide sufficient information to guide the robot in making optimal decisions.As a result, this paper proposes enhancements to the neural network of the SAC algorithm by incorporating LSTM (Long Short-Term Memory) neural networks [35].By introducing LSTM networks, the algorithm gains the ability to retain and utilize past states alongside the current state, enabling it to make more informed and rational decisions.
Conventional fully connected neural networks are limited in their ability to effectively address time-dependent problems, as their outputs are solely determined by the input at the current time step.However, the Recurrent Neural Network (RNN) provides a solution to this challenge.By incorporating recurrent networks within their architecture, RNNs have demonstrated remarkable efficacy in handling time-dependent problems.Nevertheless, during the training process, RNNs are susceptible to issues such as gradient vanishing or exploding.LSTM networks introduce the gated mechanisms on the foundation of RNNs, allowing for selective information retention and addressing the limitations of traditional RNNs.The LSTM structure comprises three essential gates: the forget gate f, the input gate i, and the output gate o.The forget gate f determines how much information from the previous time step's memory cell C t−1 should be retained for the current time step's memory cell C t .The input gate i regulates the amount of current time step's information to be stored in the candidate state C t .The output gate o controls the extent to which information from the current time step's memory cell C t should be conveyed to the current hidden state Y t .The diagram depicting the structure of the LSTM neural networks is illustrated in Figure 2.
During the process of mobile robot path planning, individual states often fail to provide sufficient information to guide the robot in making optimal decisions.As a result, this paper proposes enhancements to the neural network of the SAC algorithm by incorporating LSTM (Long Short-Term Memory) neural networks [35].By introducing LSTM networks, the algorithm gains the ability to retain and utilize past states alongside the current state, enabling it to make more informed and rational decisions.
Conventional fully connected neural networks are limited in their ability to effectively address time-dependent problems, as their outputs are solely determined by the input at the current time step.However, the Recurrent Neural Network (RNN) provides a solution to this challenge.By incorporating recurrent networks within their architecture, RNNs have demonstrated remarkable efficacy in handling time-dependent problems.Nevertheless, during the training process, RNNs are susceptible to issues such as gradient vanishing or exploding.LSTM networks introduce the gated mechanisms on the foundation of RNNs, allowing for selective information retention and addressing the limitations of traditional RNNs.The LSTM structure comprises three essential gates: the forget gate f, the input gate i, and the output gate o.The forget gate f determines how much information from the previous time step's memory cell  2. In Figure 2, t refers to the current time.Xt−1, Xt, and Xt+1 correspond to the inputs of the previous, current, and next time steps, respectively.Similarly, Yt−1, Yt, and Yt+1 denote the outputs of the previous, current, and next time steps, respectively.In addition, Ct−1, Ct, and Ct+1 represent the memory cells at the previous, current, and next time steps, respectively.σ is the sigmoid function, defining output values between 0 and 1.The output values of the function tanh range from −1 to 1.
The workflow of LSTM is as follows.To begin with, the forget gate selectively discards information from the memory cell Ct−1 of the previous time step.In order to obtain the forget coefficient, the previous hidden state information Yt−1 and the present input information Xt are both passed through the sigmoid function.The forget gate ft is obtained by In Figure 2, t refers to the current time.X t−1 , X t , and X t+1 correspond to the inputs of the previous, current, and next time steps, respectively.Similarly, Y t−1 , Y t , and Y t+1 denote the outputs of the previous, current, and next time steps, respectively.In addition, C t−1 , C t , and C t+1 represent the memory cells at the previous, current, and next time steps, respectively.σ is the sigmoid function, defining output values between 0 and 1.The output values of the function tanh range from −1 to 1.
The workflow of LSTM is as follows.To begin with, the forget gate selectively discards information from the memory cell C t−1 of the previous time step.In order to obtain the forget coefficient, the previous hidden state information Y t−1 and the present input information X t are both passed through the sigmoid function.The forget gate f t is obtained by where W f and b f represents the weight and bias of the layer network, respectively.The second step involves generating the information required to update the current memory unit C t .This process is divided into two parts.Firstly, the update value i t is generated through the sigmoid layer of the input gate.Secondly, a new candidate value C t is generated using the tanh layer.The specific calculations are defined by Equations ( 15) and ( 16).
The current memory unit C t at the current time step is defined by Equation (17).It is computed as the sum of two products: the product of the memory unit at the previous Sensors 2023, 23, 9802 9 of 28 time step C t−1 and the forget gate control signal f t , and the product of the input gate value i t and the candidate value C t .
The final step involves determining the output of the LSTM model.Firstly, the control signal o t of the output gate is obtained using the sigmoid function.Secondly, the current memory unit C t at the current time step is scaled by applying the tanh function.Multiplying these two values yields the current output value Y t .The specific calculations are defined by Equations ( 18) and (19).
Figure 3 illustrates the network structure of the SAC algorithm after incorporating LSTM.This diagram represents the final network architecture of the SAC-LSTM algorithm, as subsequent enhancements do not pertain to the network structure.
[ ] ( ) [ ] ( ) The current memory unit t C at the current time step is defined by Equation (17).It is computed as the sum of two products: the product of the memory unit at the previous time step Ct−1 and the forget gate control signal ft, and the product of the input gate value t i and the candidate value t C  .
The final step involves determining the output of the LSTM model.Firstly, the control signal t o of the output gate is obtained using the sigmoid function.Secondly, the current memory unit t C at the current time step is scaled by applying the tanh function.Multi- plying these two values yields the current output value t Y .The specific calculations are defined by Equations ( 18) and (19).
[ ] ( ) ( ) Figure 3 illustrates the network structure of the SAC algorithm after incorporating LSTM.This diagram represents the final network architecture of the SAC-LSTM algorithm, as subsequent enhancements do not pertain to the network structure.The network architecture of the SAC-LSTM algorithm consists of two components: the Actor network and the Critic network.The Actor network begins with a fully connected layer, FC1, comprising 256 neurons.It takes the state feature vector s as input and utilizes the ReLU activation function to enhance feature extraction.The subsequent layer is a memory-capable LSTM layer with 256 neurons, enabling the algorithm to incorporate historical and current states for improved decision making.Following the LSTM layer, there is a fully connected layer, FC2, with 256 neurons that applies the ReLU activation function to process the LSTM layer's outputs.The final layer is another fully connected layer with four neurons, producing the mean µ and standard deviation σ, which are used to resample from a Gaussian distribution N = (µ, σ).The resulting action a is obtained by applying the tanh activation function.
The Critic network receives both the state vector s and the action vector a as inputs.The FC1 and LSTM layers mirror those of the Actor network.The FC2 layer, a fully connected layer with 16 neurons, extracts the features from a, using the ReLU activation function.The FC3 layer takes the concatenated features from the LSTM and FC2 layers as input, comprising 272 neurons and utilizing the ReLU activation function.Finally, the output layer consists of a single neuron, which outputs the Q-value for updating the Actor network.

Burn-In Mechanism
Updating LSTM networks requires a series of consecutive sequential samples.However, the high correlation among consecutive sequence samples can lead to increased variance in parameter updates.Therefore, the currently predominant approach is to employ the random order update method, as utilized in DRQN.This approach involves selecting a complete episode of experiences from the experience replay buffer and randomly choosing a fixed-length continuous sequence from that episode to train the algorithm.Prior to each training iteration, the LSTM's hidden state h int is initialized to zero. Figure 4 provides a schematic representation of the random order update method, wherein the orange circles represent individual experience tuples (s t , a t , r t+1 , s t+1 , d), the black boxes represent sequence experiences of length L, and DRL refers to a deep reinforcement learning algorithm incorporating an LSTM network.This updating method offers the advantage of simplicity and low complexity.However, resetting the LSTM network's hidden state to zero before training can lead to impaired memory within the LSTM network, subsequently impacting the algorithm's performance.
used to resample from a Gaussian distribution ( , ) N μ σ

=
. The resulting action a is obtained by applying the tanh activation function.
The Critic network receives both the state vector s and the action vector a as inputs.The FC1 and LSTM layers mirror those of the Actor network.The FC2 layer, a fully connected layer with 16 neurons, extracts the features from a, using the ReLU activation function.The FC3 layer takes the concatenated features from the LSTM and FC2 layers as input, comprising 272 neurons and utilizing the ReLU activation function.Finally, the output layer consists of a single neuron, which outputs the Q-value for updating the Actor network.

Burn-in Mechanism
Updating LSTM networks requires a series of consecutive sequential samples.However, the high correlation among consecutive sequence samples can lead to increased variance in parameter updates.Therefore, the currently predominant approach is to employ the random order update method, as utilized in DRQN.This approach involves selecting a complete episode of experiences from the experience replay buffer and randomly choosing a fixed-length continuous sequence from that episode to train the algorithm.Prior to each training iteration, the LSTM's hidden state int h is initialized to zero. Figure 4 pro- vides a schematic representation of the random order update method, wherein the orange circles represent individual experience tuples , the black boxes represent sequence experiences of length L, and DRL refers to a deep reinforcement learning algorithm incorporating an LSTM network.This updating method offers the advantage of simplicity and low complexity.However, resetting the LSTM network's hidden state to zero before training can lead to impaired memory within the LSTM network, subsequently impacting the algorithm's performance.Therefore, the burn-in mechanism utilized in the R2D2 algorithm [36] is introduced in this paper.The burn-in mechanism serves as a warm-up mechanism, initializing the LSTM network's hidden state int h with a portion of historical data prior to training.Therefore, the burn-in mechanism utilized in the R2D2 algorithm [36] is introduced in this paper.The burn-in mechanism serves as a warm-up mechanism, initializing the LSTM network's hidden state h int with a portion of historical data prior to training.Figure 5 depicts the conceptual diagram illustrating the application of the burn-in period in deep reinforcement learning algorithms.The black box represents a sequence of data, while the green circles represent the burn-in data comprising l b items.The red circles represent the training data consisting of l t items.When the sequence data are sampled by the DRL algorithm, the first l b items are used to update the LSTM's hidden state h t within the DRL framework.Subsequently, the remaining l t items are utilized to train the DRL algorithm.By incorporating the burn-in mechanism, the hidden state of the LSTM network is updated before training, thereby circumventing the issue of impaired memory capacity caused by zeroing the LSTM network's hidden state prior to training.Consequently, this integration enhances the performance of the algorithm.

Training DRL
This study treats a fixed-length sequence (s t , a t , r t+1 , s t+1 , d),. .., (s t+L , a t+L , r t+1+L , s t+1+L , d) of L as a single experience.As the experience tuples during the burn-in period are not utilized for network training, an approach is adopted to prevent experience wastage.Specifically, the storage format of experiences is designed in such a way that there is a duplication of half the experience tuple between two adjacent experiences.Figure 6 of L as a single experience.As the experience tuples during the burn-in period are not utilized for network training, an approach is adopted to prevent experience wastage.Specifically, the storage format of experiences is designed in such a way that there is a duplication of half the experience tuple between two adjacent experiences.Figure 6 provides a visual representation of the specific storage format, with each circle representing an individual experience tuple .

Prioritized Experience Replay
During the exploration of the environment, the agent accumulates training data, which is then stored in an experience replay buffer in the form of experience tuples.However, these samples tend to be sparse, and there exists a strong correlation among consecutively collected samples.Hence, in the context of deep reinforcement learning training, This study treats a fixed-length sequence { } of L as a single experience.As the experience tuples during the burn-in period are not utilized for network training, an approach is adopted to prevent experience wastage.Specifically, the storage format of experiences is designed in such a way that there is a duplication of half the experience tuple between two adjacent experiences.Figure 6 provides a visual representation of the specific storage format, with each circle representing an individual experience tuple .

Prioritized Experience Replay
During the exploration of the environment, the agent accumulates training data, which is then stored in an experience replay buffer in the form of experience tuples.However, these samples tend to be sparse, and there exists a strong correlation among consecutively collected samples.Hence, in the context of deep reinforcement learning training,

Prioritized Experience Replay
During the exploration of the environment, the agent accumulates training data, which is then stored in an experience replay buffer in the form of experience tuples.However, these samples tend to be sparse, and there exists a strong correlation among consecutively collected samples.Hence, in the context of deep reinforcement learning training, it is essential to employ random sampling of the samples within the experience replay buffer.This approach serves to enhance the efficiency of data utilization, disrupt the inter-sample correlation, and effectively mitigate the occurrence of overfitting in neural networks.
The SAC algorithm employs random sampling during training, which enhances the efficiency of data utilization and mitigates neural network overfitting to a certain extent.However, the assumption of equal importance for all experience samples in random sampling is not aligned with reality.In practice, different experience samples possess varying levels of significance.For instance, experiences with high success rates or frequent failures hold greater value for the algorithm's learning process, as they can expedite convergence.Consequently, this paper integrates the concept of prioritized experience replay with the SAC algorithm.This integration allows the agent to discern the importance of experience samples and prioritize frequent sampling of high-value samples, thus accelerating the convergence speed of the algorithm.
In the context of reinforcement learning, TD-error is commonly utilized to quantify the importance of samples, with a higher TD-error value indicating a greater degree of significance for the respective sample.By prioritizing the learning from samples with larger TD-errors, the algorithm can expedite the rate of learning.Specifically, in the prioritized experience replay DQN algorithm, the TD-error of each experience tuple (s t , a t , r t , s t+1 , d) is defined as the error δ t between the current Q-value and the target Q-value, as illustrated in Equation (20), in which Q target and Q respectively represent the target Q-network and the current Q-network.
Unlike the DQN algorithm, the SAC algorithm incorporates two Q-networks.This paper defines the absolute value |δ t | of the TD-error for an experience tuple as the average of the absolute TD-errors from the two Q-networks, as precisely specified in Equation (21).
As the LSTM network is introduced, the storage format of experiences has transformed into the form depicted in Figure 6.Moreover, in conjunction with the burn-in mechanism, the TD-error of an experience sample {(s t , a t , r t+1 , s t+1 , d), . . . ,(s t+L , a t+L , r t+1+L , s t+1+L , d)} is defined as the absolute TD-error of the subsequent l t -term experience tuples.This specific definition is presented in Equation (22).
The sampling probability [37] for an experience sample is expressed by In Equation (23), the exponent α p serves as the coefficient for regulating prioritization.When α p = 0, the sampling method reverts to uniform sampling.p i > 0 is the priority of transition i based on TD-error, employing a proportional prioritization approach defined in Equation (24).
In Equation (24), ε is typically a small positive value that ensures the inclusion of experience samples with a TD-error of 0. However, prioritizing samples with larger TDerror values can disrupt the probability distribution of training samples.This approach may introduce bias and potentially hinder the convergence of the neural network.Therefore, it is necessary to incorporate importance sampling to adjust the learning rate of the samples.The specific definition is provided in Equation (25).
In Equation (25), N represents the capacity of the experience replay buffer, while β is a hyperparameter for error correction that ranges between 0 and 1.By distinguishing the importance of experiences using the aforementioned approach, it enhances the learning efficiency of the algorithm.

SAC-LSTM Algorithm Workflow
Based on the SAC algorithm, this paper introduces the SAC-LSTM algorithm by incorporating LSTM neural networks, burn-in, and prioritized experience replay.The workflow of the algorithm and its corresponding pseudocode are presented in Figure 7 and Algorithm 1, respectively.The agent interacts with the environment, generating experience tuples (s t , a t , r t , s t+1 , d) that are subsequently stored in the experience replay buffer.By employing prioritized experience replay, the algorithm effectively samples experiences.
The integration of the burn-in mechanism, as indicated by the green circle in Figure 7, enhances the efficiency of the training process.
Algorithm 1: SAC-LSTM algorithm pseudocode 1: randomly initialize the parameter θ of the actor network and the parameters φ 1 , φ 2 of the Critic network, clear the experience replay buffer D 2: initialize the target networks φ targ,1 ← φ 1 , φ targ,2 ← φ 2 , set the length of the burn-in data and the lengths l b , l t of the training data, set the capacity N of the experience replay buffer 3: for episode = 1 to M do 4: initialize the observation s 1 and the hidden state h 1 5: for t = 1 to T do 6: obtain the observation s t and select an action a t using the current policy network 7: perform action a t , obtain next observation s t+1 , receive reward r t determine whether the current state is a terminal state through the signal d 8: store (s t , a t , r t , s t+1 , d) into D 9: end for 10: assign priority P t = max i<t P i to experience [(s t , a t , r t , s t+1 , d), . . . ,(s t+L , a t+L , r t+L , s t+1+L , d)] 11: sample N experiences from D based on their priority P(j) and reset the hidden state to zero 12: Calculate the importance weight w i for each experience sample 13: Scan the previous l b experiences for each sample and obtain the initial hidden state h t 14: Calculate the target Q-function values y i 1 , . . ., y i l t using the last l t experiences: Update the priority p i ← |δ i | based on the TD-error 16: Update Q network using gradient descent method with the following formula: 17: Update policy network using gradient descent method with the following formula: In Equation ( 24),  is typically a small positive value that ensures the inclusion of experience samples with a TD-error of 0. However, prioritizing samples with larger TDerror values can disrupt the probability distribution of training samples.This approach may introduce bias and potentially hinder the convergence of the neural network.Therefore, it is necessary to incorporate importance sampling to adjust the learning rate of the samples.The specific definition is provided in Equation ( 25).
( ) In Equation ( 25), N represents the capacity of the experience replay buffer, while  is a hyperparameter for error correction that ranges between 0 and 1.By distinguishing the importance of experiences using the aforementioned approach, it enhances the learning efficiency of the algorithm.

SAC-LSTM Algorithm Workflow
Based on the SAC algorithm, this paper introduces the SAC-LSTM algorithm by incorporating LSTM neural networks, burn-in, and prioritized experience replay.The workflow of the algorithm and its corresponding pseudocode are presented in Figure 7  that are subsequently stored in the experience replay buffer.By employing prioritized experience replay, the algorithm effectively samples experiences.The integration of the burn-in mechanism, as indicated by the green circle in Figure 7, enhances the efficiency of the training process.

Motion Model of Mobile Robot
Turtlebot3 is a cost-effective and open-source mobile robot that offers a simple yet powerful design.It boasts high expandability and the ability to easily integrate additional sensors as needed.Consequently, this paper selects Turtlebot3 as the platform for algorithm deployment.Operating on a differential drive system, this robot can execute turns, maintain Sensors 2023, 23, 9802 14 of 28 a constant velocity, and rotate in place by controlling the differential motion of its two rear wheels.The model structure is depicted in Figure 8.

Motion Model of Mobile Robot
Turtlebot3 is a cost-effective and open-source mobile robot that offers a simple yet powerful design.It boasts high expandability and the ability to easily integrate additional sensors as needed.Consequently, this paper selects Turtlebot3 as the platform for algorithm deployment.Operating on a differential drive system, this robot can execute turns, maintain a constant velocity, and rotate in place by controlling the differential motion of its two rear wheels.The model structure is depicted in Figure 8.The pose of a mobile robot in the world coordinate system is represented by the coordinates [ , , ] T x y ψ .Here, x and y correspond to the central point of the robot in the Xand Y-axes of the world coordinate system.The parameter ψ denotes the orientation of the robot, indicating the angle between its forward direction (aligned with the linear velocity v ) and the X-axis of the world coordinate system.The distance between the two drive wheels of the robot is denoted as W. Assuming the left and right wheel velocities are L v and R v , respectively, the linear velocity v of the mobile robot can be determined us- ing the following equation: The lateral angular velocity ω of the mobile robot is given by The instantaneous radius R of robot during motion is given by The motion equation in global coordinates is

State Space and Action Space
The state represents the agent's perception of the environment and serves as the foundation for action selection.Designing a well-defined state space is crucial for the agent to The pose of a mobile robot in the world coordinate system is represented by the coordinates [x, y, ψ] T .Here, x and y correspond to the central point of the robot in the Xand Y-axes of the world coordinate system.The parameter ψ denotes the orientation of the robot, indicating the angle between its forward direction (aligned with the linear velocity v) and the X-axis of the world coordinate system.The distance between the two drive wheels of the robot is denoted as W. Assuming the left and right wheel velocities are v L and v R , respectively, the linear velocity v of the mobile robot can be determined using the following equation: The lateral angular velocity ω of the mobile robot is given by The instantaneous radius R of the mobile robot during motion is given by The motion equation in global coordinates is

State Space and Action Space
The state represents the agent's perception of the environment and serves as the foundation for action selection.Designing a well-defined state space is crucial for the agent to learn an optimal strategy.In the context of path planning tasks performed by a mobile robot, the robot relies on sensor input to perceive the surrounding environment.By leveraging this information, the algorithm generates a collision-free path from the initial position to the goal.Consequently, the chosen state space in this paper revolves around two key aspects: sensor perception information and the position of the goal.
For our experimental setup, we employed the Turtlebot3 mobile robot equipped with a laser scanner for environment perception.The laser scanner provides a detection range of [0 • , 360 • ], resulting in a 360-dimensional data representation.To mitigate the issue of high dimensionality without compromising the effectiveness of capturing environmental information, we adjusted the detection range of the laser scanner to [−90 • , 90 • ] and limited the detection distance to the range of [0.1 m, 3.5 m].Specifically, the laser scanner's scan data is represented by 20 dimensions, as illustrated in Figure 9.
learn an optimal strategy.In the context of path planning tasks performed by a mobile robot, the robot relies on sensor input to perceive the surrounding environment.By leveraging this information, the algorithm generates a collision-free path from the initial position to the goal.Consequently, the chosen state space in this paper revolves around two key aspects: sensor perception information and the position of the goal.
For our experimental setup, we employed the Turtlebot3 mobile robot equipped with a laser scanner for environment perception.The laser scanner provides a detection range of [0°, 360°], resulting in a 360-dimensional data representation.To mitigate the issue of high dimensionality without compromising the effectiveness of capturing environmental information, we adjusted the detection range of the laser scanner to [−90°, 90°] and limited the detection distance to the range of [0.1 m, 3.5 m].Specifically, the laser scanner's scan data is represented by 20 dimensions, as illustrated in Figure 9.In order to expedite the attainment of the designated target point, it is imperative to provide guidance to the robot regarding its trajectory.Thus, this paper adopts the utilization of the heading angular deviation In order to expedite the attainment of the designated target point, it is imperative to provide guidance to the robot regarding its trajectory.Thus, this paper adopts the utilization of the heading angular deviation Rad t di f f between the frontal orientation of the mobile robot and the target point and the distance between the robot and the target point as the fundamental states for this purpose.By employing the mobile robot's odometry, the coordinates (x t robot , y t robot ) of the robot can be determined, along with the coordinates (x goal , y goal ) of the target point as shown in Figure 10.Consequently, the distance dis t between the robot and the target point can be computed.The heading angle yaw t can be obtained from the odometer, but the quadratic data obtained cannot be used directly, and have to be converted into a Eulerian angle first.
The angle Rad t goal between the mobile robot and the target point can be obtained by the inverse tangent function.
The heading angular deviation is the difference between the angle Rad t goal and the orientation angle ϕ.
In this paper, we enhance the state representation of a mobile robot by incorporating the linear velocity and angular velocity from the previous time step.This addition allows for a correspondence between the rewards provided by the environment and the actions performed by the mobile robot.Ultimately, the state space of the mobile robot is defined as a vector comprising radar detection data, heading angular deviation, linear velocity, and angular velocity from the previous time step, as well as the distance between the target point and the mobile robot.
Many previous deep reinforcement learning algorithms have used a discretized action space in which the linear and angular velocities of mobile robots were divided into several orders of magnitude.Although this approach is relatively simple, it ignores the fact that mobile robots output continuous actions.To make the simulation more realistic, this paper presents a network model that produces continuous angular and linear velocities.The linear velocity has a range of 0 to 0.22 m/s, while the angular velocity ranges from −2 to 2 rad/s, where positive angular velocity indicates clockwise rotation and negative angular velocity indicates counterclockwise rotation.The proposed approach is intended to bring the simulation closer to real-world scenarios.

Reword Function
The reward function plays a critical role in the success of deep reinforcement learning algorithms as it serves as the benchmark for evaluating agent performance.Similar to constraints used in traditional path planning tasks, the reward function guides agents by indicating which actions to avoid and which ones to pursue given the current state.In this paper, the proposed reward function is composed of two distinct components whose sum constitutes the overall reward: The specific expressions of R a and R b are shown in Equations ( 36) and (37).d o represents the distance threshold for reaching the target point, while d min represents the threshold for avoiding collisions with obstacle.d max is the safe distance threshold from the obstacle.Furthermore, d t denotes the current distance between the moving robot and the target point, and d t−1 represents the previous time's distance.Additionally, d t smin is the minimum distance recorded from the radar scan at the current time.The reward coefficients are indicated as η 1 , η 2 , and η 3 .
This paper proposes a design approach where mobile robots reaching a designated target point or encountering obstacles are associated with sparse rewards, while dense rewards are assigned for other scenarios.This particular design methodology ensures algorithmic stability throughout the training process.
The sparse reward setting is relatively straightforward, involving the assignment of an immediate reward to the mobile robot upon reaching the target point or colliding with an obstacle.When the distance between the mobile robot and the target point is below a threshold value d o , it is considered to have successfully reached the target point, resulting in a positive reward r a .Conversely, if the distance between the mobile robot and the obstacle is below a threshold value d min , a collision is assumed, leading to a negative reward r c .
The dense rewards comprise the distance reward, orientation angle reward, and safety reward.The distance reward, denoted as η 1 (d t−1 − d t ), is contingent upon the mobile robot's velocity while moving towards the target point, with higher rewards for faster velocities and lower rewards for slower velocities.The orientation angle reward, denoted as η 2 (π − |ϕ|), is determined by the orientation angle ϕ, where the reward is maximized at 0 when the mobile robot is directly facing the target point, and minimized at π when the mobile robot is facing away from the target point.The sum η 1 (d t−1 − d t ) + η 2 (π − |ϕ|) of the distance reward and orientation angle reward aims to expedite the mobile robot's arrival at the target point.Additionally, the safety reward, represented by η 3 d t smin − d max , imposes a negative reward when the distance between the mobile robot and the obstacle is below the designated threshold d max .Moreover, the negative reward increases as the distance decreases.The purpose of the safety reward is to maintain a safe distance from obstacles during the path planning process.
The pseudocode for the reward function is presented in Algorithm 2.

Path Planning Algorithm Workflow
The path planning algorithm, based on deep reinforcement learning, essentially involves learning the optimal strategy for the mobile robot to reach the target point quickly through interactive exploration with the environment.The algorithm pseudocode for path planning, utilizing SAC-LSTM, is depicted in Algorithm 3. initialize system parameters 2: introduce LSTM network to optimize the network architecture 3: for N t < N tmax do 4: initialize the robot's position 5: get environmental information 6: perform actions through the policy network 7: obtain new state and reward 8: store experiences in the experience buffer 9: sample experiences based on their priority 10: train the network using the burn-in mechanism At the beginning of each episode, the mobile robot is positioned at a predefined starting point and subsequently navigates towards a randomly generated target destination.Throughout this process, the mobile robot relies on a laser radar to perceive its surrounding environment and utilizes the state information as input to the policy network of the SAC-LSTM algorithm, producing a corresponding robot action.Following interaction with the environment, the mobile robot transitions to the next state and receives a reward based on a predefined reward function.The acquired experiences are stored in the experience replay buffer, and significant experiences are extracted using a prioritized experience replay mechanism.The SAC-LSTM algorithm is trained using the burn-in strategy.Upon reaching the target point, a new target point is randomly generated in the environment, and the robot moves from its current location towards the newly selected target.The training episode terminates only when the robot collides with an obstacle or when the current step count N s reaches the designated maximum value N smax .The training process concludes when the number of training episodes N t surpasses the predetermined maximum value N tmax .

Simulation Platform
The experimental software environment for this paper included Ubuntu 18.04, CUDA 10.1, Pytorch 3.7, and ROS Kinetic.The hardware utilization comprised an AMD R7-5800H CPU and a GeForce GTX 3060 GPU with 6G of memory.The experiment employed the Turtlebot3 robot based on ROS, which obtains the environmental information around it with the help of laser radar.The 3D model of the mobile robot was loaded into the 'empty.world' in ROS, as depicted in Figure 11.
This paper utilizes Gazebo9 software to set up experimental environments.Three experimental environments, depicted in Figure 12, were built for testing the algorithm's effectiveness in mobile robot path planning in indoor environments.These environments are an obstacle-free environment, a static obstacle environment, and a dynamic obstacle environment; all three are square areas measuring 8 m in length.The obstacle-free environment has only four surrounding walls.The static obstacle environment features twelve stationary obstacles, consisting of four cylinders with a diameter of 0.3 m and a height of 0.5 m, four small cubes measuring 0.5 m in edge-length, and four large cubes measuring 0.8 m in edge-length, which are added to the obstacle-free environment.In contrast, the dynamic obstacle environment features moving obstacles.The four cylinders rotate counterclockwise at a speed of 0.5 rad/s around the origin of the coordinate system (the center of the environment), as illustrated by the smaller white dashed circles in Figure 12.Concurrently, four large cubes move clockwise at the same speed, following the trajectory shown by the larger white dashed circles with arrows in Figure 12.Meanwhile, the four small cubes within this environment remain stationary.
the current step count Ns reaches the designated maximum value Nsmax.The training process concludes when the number of training episodes Nt surpasses the predetermined maximum value Ntmax.

Simulation Platform
The experimental software environment for this paper included Ubuntu 18.04, CUDA 10.1, Pytorch 3.7, and ROS Kinetic.The hardware utilization comprised an AMD R7-5800H CPU and a GeForce GTX 3060 GPU with 6G of memory.The experiment employed the Turtlebot3 robot based on ROS, which obtains the environmental information around it with the help of laser radar.The 3D model of the mobile robot was loaded into the 'empty.world' in ROS, as depicted in Figure 11.This paper utilizes Gazebo9 software to set up experimental environments.Three experimental environments, depicted in Figure 12, were built for testing the algorithm's effectiveness in mobile robot path planning in indoor environments.These environments are an obstacle-free environment, a static obstacle environment, and a dynamic obstacle environment; all three are square areas measuring 8 m in length.The obstacle-free environment has only four surrounding walls.The static obstacle environment features twelve stationary obstacles, consisting of four cylinders with a diameter of 0.3 m and a height of 0.5 m, four small cubes measuring 0.5 m in edge-length, and four large cubes measuring 0.8 m in edge-length, which are added to the obstacle-free environment.In contrast, the dynamic obstacle environment features moving obstacles.The four cylinders rotate counterclockwise at a speed of 0.5 rad/s around the origin of the coordinate system (the center of the environment), as illustrated by the smaller white dashed circles in Figure 12.Concurrently, four large cubes move clockwise at the same speed, following the trajectory shown by the larger white dashed circles with arrows in Figure 12.Meanwhile, the four small cubes within this environment remain stationary.Deep reinforcement learning involves agents gathering training data by interacting with their real-world environments.In a 3D experimental environment created by Gazebo, data acquisition can become a time-consuming and computationally expensive process.Consequently, this paper employs a time acceleration technique to reduce the duration of training.By default, Gazebo's real-time update rate is 1000, and the max step size value is 0.001.When these settings are multiplied, the resulting ratio between the simulation time and the actual time is 1.However, in order to expedite simulations, the value for max step size is adjusted to 0.005, leading to a fivefold increase in simulation speed.
The specific experimental parameters are shown in Table 1.Deep reinforcement learning involves agents gathering training data by interacting with their real-world environments.In a 3D experimental environment created by Gazebo, data acquisition can become a time-consuming and computationally expensive process.Consequently, this paper employs a time acceleration technique to reduce the duration of training.By default, Gazebo's real-time update rate is 1000, and the max step size value is 0.001.When these settings are multiplied, the resulting ratio between the simulation time and the actual time is 1.However, in order to expedite simulations, the value for max step size is adjusted to 0.005, leading to a fivefold increase in simulation speed.
The specific experimental parameters are shown in Table 1.

Computational Complexity
To quantify the computational complexity of SAC and SAC-LSTM algorithms, the contribution of each component to the overall computational load can be analyzed based on five key factors, as follows: 1.

Computational Load of Neural Networks
Due to the larger batch size (512) and experience buffer capacity (20,000), the SAC algorithm requires processing more data in the network training steps, leading to an increased computational load.When it comes to the SAC-LSTM algorithm, the inclusion of LSTM layers introduces additional time-dependency computations.A smaller batch size (32) might reduce the computational load per training iteration, but the burn-in process and LSTM's time dependencies increase the computational load per batch.

Prioritized Experience Replay
Prioritized experience replay requires additional computations to maintain a priority queue and update it after each learning step, increasing the computational load, especially with a larger experience buffer.

Learning Parameters
The parameters, a learning rate of 0.001 and a discount factor of 0.99, mainly affect the convergence speed and stability of the algorithm, but have a relatively minor direct impact on computational load.

Reward Mechanism
The reward coefficients and thresholds influence the efficiency of learning, but have a limited direct impact on computational load.

Optimizer
The Adam optimizer is a computationally efficient optimizer but has a higher computational complexity compared to simpler ones such as SGD (Stochastic Gradient Descent).
Overall, the SAC-LSTM algorithm involves higher per-batch computational loads due to time-dependency processing with LSTM, smaller batch sizes, and burn-in processing.The SAC algorithm, although having larger data volumes per batch, might have a slightly lower per-batch computational load than SAC-LSTM, owing to the absence of complex time-series processing.When running both algorithms on the same computer platform, SAC-LSTM consumes approximately 10% more computation time per batch compared to SAC.This finding aligns closely with the qualitative analysis results mentioned above.

Obstacle-Free Experiment
In this paper, we chose the average reward as the evaluation metric for our algorithm.A higher value confirms better performance of the algorithm.The average reward is calculated from Equation (38).Within each episode, T = min(N smax , T) represents the number of steps taken by the agent, and r is the reward received for each step.Two reinforcement learning algorithms, SAC and SAC-LSTM, were trained for 1000 episodes in the obstacle-free environment.Their average reward curves are illustrated in Figure 13.During the initial training phase, both algorithms demonstrated a decreasing trend in their average reward.This was attributed to the robot frequently circling in place or moving away from the target point, which was observed within the first 10 episodes.However, after that point, both algorithms exhibited a remarkable increase in their average reward.The SAC-LSTM algorithm reached its reward peak and convergence after 130 episodes, whereas the SAC algorithm achieved the same after 180 episodes.Notably, the average reward of the SAC-LSTM algorithm surpassed that of the SAC algorithm after convergence.This observation suggests that the SAC-LSTM algorithm enabled the robot to reach the target point more frequently during each episode's path planning process, resulting in better path planning performance.Although both algorithms accomplished the task in the obstacle-free environment, the SAC-LSTM algorithm converged faster and yielded a higher average reward after convergence.Illustrated in Figure 14 is the motion process of the mobile robot towards the target point in the obstacle-free environment based on the SAC-LSTM algorithm.In the figure, the black dot represents the mobile robot, while the blue sector illustrates the range of the radar scan.The red square indicates the location of the target point.Notably, the mobile robot achieves optimal pathway and accurately reaches the target point from the starting point.

Static Obstacle Experiment
The two algorithms were trained for 1000 episodes in the obstacle environment illustrated in Figure 12b.The plotted average reward curves of the two algorithms are depicted in Figure 15.Notably, the static obstacle environment is significantly more complex than the obstacle-free environment since collision with the surrounding obstacles is more frequent during robot exploration.As expected, greater exploration time was required.The SAC-LSTM algorithm began to converge from the 20th episode.Interestingly, its converging process had a higher average reward than the SAC algorithm, finishing the process at the 380th episode.On the other hand, the SAC algorithm had reduced fluctuations and began to converge from the 30th episode, while the converging process completed at the 500th episode.Impressively, the SAC-LSTM algorithm yielded a faster convergence rate and a higher average reward after convergence than the SAC algorithm.Two algorithms were tested 200 times in the static obstacle environment, where the target points were randomly generated.The test results are presented in Table 3, indicating that the SAC algorithm's success rate was only 88.5%, while the proposed SAC-LSTM algorithm achieved a success rate of 95.5%.

Static Obstacle Experiment
The two algorithms were trained for 1000 episodes in the obstacle environment illustrated in Figure 12b.The plotted average reward curves of the two algorithms are depicted in Figure 15.Notably, the static obstacle environment is significantly more complex than the obstacle-free environment since collision with the surrounding obstacles is more frequent during robot exploration.As expected, greater exploration time was required.The SAC-LSTM algorithm began to converge from the 20th episode.Interestingly, its converging process had a higher average reward than the SAC algorithm, finishing the process at the 380th episode.On the other hand, the SAC algorithm had reduced fluctuations and began to converge from the 30th episode, while the converging process completed at the 500th episode.Impressively, the SAC-LSTM algorithm yielded a faster convergence rate and a higher average reward after convergence than the SAC algorithm.

Static Obstacle Experiment
The two algorithms were trained for 1000 episodes in the obstacle environment illustrated in Figure 12b.The plotted average reward curves of the two algorithms are depicted in Figure 15.Notably, the static obstacle environment is significantly more complex than the obstacle-free environment since collision with the surrounding obstacles is more frequent during robot exploration.As expected, greater exploration time was required.The SAC-LSTM algorithm began to converge from the 20th episode.Interestingly, its converging process had a higher average reward than the SAC algorithm, finishing the process at the 380th episode.On the other hand, the SAC algorithm had reduced fluctuations and began to converge from the 30th episode, while the converging process completed at the 500th episode.Impressively, the SAC-LSTM algorithm yielded a faster convergence rate and a higher average reward after convergence than the SAC algorithm.Two algorithms were tested 200 times in the static obstacle environment, where the target points were randomly generated.The test results are presented in Table 3, indicating that the SAC algorithm's success rate was only 88.5%, while the proposed SAC-LSTM algorithm achieved a success rate of 95.5%.Two algorithms were tested 200 times in the static obstacle environment, where the target points were randomly generated.The test results are presented in Table 3, indicating that the SAC algorithm's success rate was only 88.5%, while the proposed SAC-LSTM algorithm achieved a success rate of 95.5%.

Dynamic Obstacle Experiment
The dynamic obstacle environment was constructed as shown in Figure 12c to simulate a realistic path planning scenario.The SAC and SAC-LSTM algorithms underwent 1000 episodes of training, with the training results presented in Figure 17.As shown in the figure, the difficulty of path planning increased due to the moving obstacles, resulting in significant fluctuations in the learning curves of both algorithms.The SAC-LSTM algorithm began to converge after the 75th episode when the fluctuations decreased, and it achieved convergence by the 510th episode.The SAC algorithm began to converge after the 95th episode and achieved convergence by the 710th episode.The proposed SAC-LSTM algorithm demonstrated a faster convergence rate than the SAC algorithm, with a slightly higher average reward after convergence.

Dynamic Obstacle Experiment
The dynamic obstacle environment was constructed as shown in Figure 12c to simulate a realistic path planning scenario.The SAC and SAC-LSTM algorithms underwent 1000 episodes of training, with the training results presented in Figure 17.As shown in the figure, the difficulty of path planning increased due to the moving obstacles, resulting in significant fluctuations in the learning curves of both algorithms.The SAC-LSTM algorithm began to converge after the 75th episode when the fluctuations decreased, and it achieved convergence by the 510th episode.The SAC algorithm began to converge after the 95th episode and achieved convergence by the 710th episode.The proposed SAC-LSTM algorithm demonstrated a faster convergence rate than the SAC algorithm, with a slightly higher average reward after convergence.
A total of 200 tests were also conducted using both algorithms in the dynamic obstacle environment, with the target points generated randomly.Table 4 shows the success rates of both algorithms, with the SAC algorithm achieving a success rate of only 78.5%, while the SAC-LSTM algorithm achieved a success rate of 89%. the figure, the difficulty of path planning increased due to the moving obstacles, resulting in significant fluctuations in the learning curves of both algorithms.The SAC-LSTM algorithm began to converge after the 75th episode when the fluctuations decreased, and it achieved convergence by the 510th episode.The SAC algorithm began to converge after the 95th episode and achieved convergence by the 710th episode.The proposed SAC-LSTM algorithm demonstrated a faster convergence rate than the SAC algorithm, with a slightly higher average reward after convergence.Figure 18 illustrates the movement process of a mobile robot driven by the SAC-LSTM algorithm, from its starting point to the target point in a dynamic obstacle environment.As shown, the mobile robot was able to avoid the moving obstacles and reach the target point via the optimal path.
As completing path planning tasks in dynamic environments is more challenging and requires higher performance from path planning algorithms, and as dynamic environments better reflect the actual work environment of mobile robots, in order to better test the performance of the two algorithms, an additional set of experiments was conducted on top of the initial experiment measuring success rate, which tested the path length, planning time, and number of times the robot reached the target point.Ten target points were randomly generated in the dynamic obstacle environment, all located near the moving obstacles at the periphery, so that each time the robot moved to a target point, interference from the moving obstacles was encountered, increasing the reliability of the experiment.The specific locations of the target points are illustrated in Figure 19.A total of 200 tests were also conducted using both algorithms in the dynamic obstacle environment, with the target points generated randomly.Table 4 shows the success rates of both algorithms, with the SAC algorithm achieving a success rate of only 78.5%, while the SAC-LSTM algorithm achieved a success rate of 89%. Figure 18 illustrates the movement process of a mobile robot driven by the SAC-LSTM algorithm, from its starting point to the target point in a dynamic obstacle environment.As shown, the mobile robot was able to avoid the moving obstacles and reach the target point via the optimal path.As completing path planning tasks in dynamic environments is more challenging and requires higher performance from path planning algorithms, and as dynamic environments better reflect the actual work environment of mobile robots, in order to better test the performance of the two algorithms, an additional set of experiments was conducted on top of the initial experiment measuring success rate, which tested the path length, planning time, and number of times the robot reached the target point.Ten target  The target point numbers in Figure 19 were sorted according to the Euclidean distance between each target point and the starting point of the mobile robot, which is set as the origin.Thirty experimental trials were conducted for each of the target points, with the path length, planning time, and success rates recorded.The path length and planning time were averaged from the successfully completed path planning tasks, and specific test results can be found in Tables 5 and 6.Among the 10 target points tested, the SAC-LSTM algorithm achieved both shorter average path lengths and planning times, as well as higher success rates, compared to the SAC algorithm.This indicates that the path planning performance of the SAC-LSTM algorithm is superior to that of the SAC algorithm, enabling the mobile robot to reach its designated target point in less time and with a shorter route.Notably, for the 10th target point, the SAC algorithm was unable to guide the mobile robot to its target point based solely on the current state information, whereas the SAC-LSTM algorithm, which incorporates the LSTM network, has memory capability to consider both historical and current states to make better decisions, and thus guided the robot to complete its path planning task.The target point numbers in Figure 19 were sorted according to the Euclidean distance between each target point and the starting point of the mobile robot, which is set as the origin.Thirty experimental trials were conducted for each of the target points, with the path length, planning time, and success rates recorded.The path length and planning time were averaged from the successfully completed path planning tasks, and specific test results can be found in Tables 5 and 6.Among the 10 target points tested, the SAC-LSTM algorithm achieved both shorter average path lengths and planning times, as well as higher success rates, compared to the SAC algorithm.This indicates that the path planning performance of the SAC-LSTM algorithm is superior to that of the SAC algorithm, enabling the mobile robot to reach its designated target point in less time and with a shorter route.Notably, for the 10th target point, the SAC algorithm was unable to guide the mobile robot to its target point based solely on the current state information, whereas the SAC-LSTM algorithm, which incorporates the LSTM network, has memory capability to consider both historical and current states to make better decisions, and thus guided the robot to complete its path planning task.
Based on the results of the three simulation experiments, the trained mobile robot was able to successfully complete the path planning task in all three environments, and the improved SAC-LSTM algorithm demonstrated significant enhancements in both path planning success rate and convergence speed.In particular, in dynamic and complex scenarios, the SAC-LSTM algorithm exhibited shorter planning time, shorter planning paths, and a higher number of instances where the target point was reached.

Conclusions
This paper presents the SAC-LSTM algorithm and develops a path planning algorithm framework for mobile robots, addressing the limitations of the SAC algorithm in path planning tasks.The proposed algorithm incorporates an LSTM network with memory capability, a burn-in mechanism, and a prioritized experience replay mechanism.By integrating historical and current states, the LSTM network enables more effective path planning decisions.The burn-in mechanism preheats the LSTM network's hidden state before training, addressing memory depreciation and enhancing the algorithm's performance.The prioritized experience replay mechanism accelerates algorithm convergence by emphasizing crucial experiences.A motion model for the Turtlebot3 mobile robot was established, and the state space, action space, reward function, and overall process of the SAC-LSTM algorithm were designed.enhance the realism of the experimental scenarios, three environments were created, including obstacle-free, static obstacle, and dynamic obstacle scenarios.There are multiple stationary and moving obstacles of various sizes and shapes in the dynamic scenarios.The algorithm was subsequently trained and tested in these settings.The experimental results demonstrated that the SAC-LSTM algorithm outperformed the SAC algorithm in convergence speed and path planning success rate across all three scenarios with roughly the same computational cost.Furthermore, in an additional dynamic obstacle experiment, the SAC-LSTM algorithm exhibited shorter planning times, more efficient paths, and an increased number of instances where the target point was reached, indicating superior path planning performance.
Despite these advancements, certain limitations persist within this paper.The experiments relied solely on 2D lidar for environmental data, which may lead to inaccuracies when dealing with irregularly shaped obstacles.Future research could employ multi-sensor fusion to obtain more comprehensive environmental information.Additionally, the paper does not address the several sources of noise and uncertainty present in real-world environments, including sensor noise, environmental fluctuations, and uncertainty in obstacle movement.The algorithm needs sufficient robustness to handle these factors to perform well in practical settings.Moreover, the experiments were conducted exclusively in a simulated environment.To fully assess the effectiveness of the mobile robot in completing path planning tasks, it is essential to transfer the trained model to a real-world scenario.
Fig- ure 5 depicts the conceptual diagram illustrating the application of the burn-in period in deep reinforcement learning algorithms.The black box represents a sequence of data,
set the length of the burn-in data and the lengths , bt llof the training data, set the capacity N of the experience replay buffer 3 for episode = 1 to M do

Figure 8 .
Figure 8. Motion model of the differential drive robot.

Figure 8 .
Figure 8. Motion model of the differential drive robot.

Figure 9 .
Figure 9. Detection range of the laser sensor on the mobile robot.

Figure 9 .
Figure 9. Detection range of the laser sensor on the mobile robot.

dis t = x goal − x t robot 2 + y goal − y t robot 2 ( 30 ) 30 Figure 10 .Figure 10 .
Figure 10.State of the mobile robot.The heading angle
robot's starting point is the origin, and the target point is generated at random.During a single episode, once the robot reaches the target point, the environment generates a new target point randomly.As a result, the robot may reach multiple target points within a single episode.

Figure 14 .
Figure 14.Movement process of the mobile robot in the obstacle-free environment.

Figure 15 .
Figure 15.Average reward in static obstacle environment.

Figure 14 .
Figure 14.Movement process of the mobile robot in the obstacle-free environment.

Figure 14 .
Figure 14.Movement process of the mobile robot in the obstacle-free environment.

Figure 15 .
Figure 15.Average reward in static obstacle environment.

Figure 15 .
Figure 15.Average reward in static obstacle environment.

Figure 16
Figure 16 illustrates the movement process of a mobile robot based on the SAC-LSTM algorithm, from its starting point to the target point in the static obstacle environment.As shown, the mobile robot avoided the obstacles in the environment and reached the target point via the optimal path.

Figure 16
Figure 16 illustrates the movement process of a mobile robot based on the SAC-LSTM algorithm, from its starting point to the target point in the static obstacle environment.As shown, the mobile robot avoided the obstacles in the environment and reached the target point via the optimal path.

Figure 16 .
Figure 16.Movement process of the mobile robot in the static obstacle environment.

Figure 16 .
Figure 16.Movement process of the mobile robot in the static obstacle environment.

Figure 17 .
Figure 17.Average reward in the dynamic obstacle environment.

30 Figure 17 .
Figure 17.Average reward in the dynamic obstacle environment.

Figure 18 .
Figure 18.Movement process of the mobile robot in the dynamic obstacle environment.

Figure 18 .
Figure 18.Movement process of the mobile robot in the dynamic obstacle environment.

C
− should be retained for the current time step's memory cell t C .The input gate i regulates the amount of current time step's information to be stored in the candidate state t C  .The output gate o controls the extent to which information from the current time step's memory cell t C should be conveyed to the current hidden state t Y .The diagram depicting the structure of the LSTM neural net- works is illustrated in Figure provides a visual representation of the specific storage format, with each circle representing an individual experience tuple (s t , a t , r t+1 , s t+1 , d).by the DRL algorithm, the first b l items are used to update the LSTM's hidden state t h within the DRL framework.Subsequently, the remaining t l items are utilized to train the DRL algorithm.By incorporating the burn-in mechanism, the hidden state of the LSTM network is updated before training, thereby circumventing the issue of impaired memory capacity caused by zeroing the LSTM network's hidden state prior to training.Consequently, this integration enhances the performance of the algorithm.

Table 1 ,
and respectively.The agent interacts with the environment, generating experience tuples
SAC-LSTMAlgorithm: 1 randomly initialize the parameter  of the actor network and the parameters 1  , 2  of the Critic network, clear the experience replay buffer 2 initialize the target networks

Table 1 .
Experimental parameters of mobile robot.

Table 1 .
Experimental parameters of mobile robot.

Table 5 .
Test results of the SAC-LSTM algorithm.

Table 6 .
Test results of the SAC algorithm.

Table 5 .
Test results of the SAC-LSTM algorithm.

Table 6 .
Test results of the SAC algorithm.