Mobile Robot Navigation Based on Noisy N-Step Dueling Double Deep Q-Network and Prioritized Experience Replay

Abstract: Effective real-time autonomous navigation for mobile robots in static and dynamic environments has become a challenging and active research topic. Although the simultaneous localization and mapping (SLAM) algorithm offers a solution, it often relies heavily on complex global and local maps, resulting in significant computational demands, slower convergence rates, and prolonged training times. In response to these challenges, this paper presents a novel algorithm called PER-n²D³QN, which integrates prioritized experience replay, a noisy network with factorized Gaussian noise, n-step learning, and a dueling structure into a double deep Q-network. This combination enhances the efficiency of experience replay, facilitates exploration, and provides more accurate Q-value estimates, thereby significantly improving the performance of autonomous navigation for mobile robots. To further bolster stability and robustness, target "soft" updates and a gradient clipping mechanism are employed. Additionally, a novel target-oriented reshaping reward function is designed to expedite learning. The proposed model is validated through extensive experiments using the robot operating system (ROS) and the Gazebo simulation environment. Furthermore, to reflect the complexity of the simulation environment more specifically, this paper presents a quantitative analysis of that environment. The experimental results demonstrate that PER-n²D³QN exhibits heightened accuracy, accelerated convergence rates, and enhanced robustness in both static and dynamic scenarios.


Introduction
Mobile robots play an increasingly crucial role in various aspects of our daily lives and work, including agricultural operations, catering services, medical assistance, military actions, and autonomous driving. The successful execution of tasks in these applications hinges on the mobile robots' adept navigation through static or dynamic environments, coupled with collision avoidance [1,2]. Effective navigation requires mobile robots to accurately sense and respond to their surroundings, especially in dynamic settings, where robots confront evolving terrains, dynamic obstacles, and unforeseen scenarios, necessitating the capability to explore and learn in order to adapt proficiently to changing environments.
Simultaneous localization and mapping (SLAM) has emerged as a solution, integrating the mobile robot's position and environmental state data from sensors to construct precise local and global maps for autonomous navigation [3]. However, this process relies on costly sensors and manual examinations to guarantee that the produced maps satisfy the required standards for path planning and localization, which may hamper its efficiency by consuming time, resources, and computational power [4,5]. Recently, visual SLAM has been used for robot navigation, relying on a visual camera to capture environmental images and calculate the robot's position and direction. Nevertheless, its susceptibility to external factors, such as light, humidity, temperature variations, and dust, coupled with computational demands in complex environments, raises concerns about accuracy and performance [6].
In response, deep reinforcement learning (DRL) was proposed and has gained prominence in autonomous navigation [7]. This paradigm introduces an innovative approach to decision-making from raw sensor inputs in diverse environments through the fusion of deep learning (DL) and reinforcement learning (RL). On the one hand, DRL leverages DL's success in high-dimensional information analysis, especially in image analysis, speech recognition, and natural language processing, which has yielded a spectrum of supervised and unsupervised DL algorithms tailored for the movement of mobile robots [8]. Such algorithms have proven effective in tasks like path planning and navigation [9]. On the other hand, DRL harnesses the efficacy of RL in addressing challenges in sequential decision-making [10]. This empowers the agent to gradually learn optimal decision strategies to maximize cumulative rewards or achieve specific goals through interactions with its environment [11,12].
As an end-to-end DRL algorithm, the deep Q-network (DQN) was introduced [13], and its success in learning policies from high-dimensional inputs was demonstrated. To enhance learning efficiency and stability, the algorithm utilizes an experience replay buffer to store and reuse previous experiences. Additionally, a secondary target network is introduced to address issues of oscillation and divergence during the learning process. Despite the DQN's success in diverse applications and its widespread use in autonomous navigation [14,15], it has some drawbacks, notably action value overestimation derived from Q-learning updates. This overestimation arises because the action with the highest value in the Q-network is usually selected in the next state, and the same Q-network is used both to select actions and to calculate action values, so any estimation error is carried into the target. To address this issue, the double deep Q-network (DDQN) was proposed [16], utilizing two distinct Q-networks for action selection and value estimation to prevent overestimation. The DDQN has also been applied to training mobile robots for tasks like optimal navigation and obstacle avoidance [17,18]. This research initially used the DQN algorithm to solve autonomous navigation for mobile robots. Although the accuracy can reach over 95% in open and enclosed simulation environments, the accuracy of the original DQN algorithm becomes very low as the number of obstacles increases. The DDQN algorithm was then utilized to enhance the accuracy, but the performance improvement was still minor. On the other hand, the rainbow DDQN, which combines multiple improvement techniques, has greatly improved the performance of the DDQN in Atari games [19] with high-dimensional, complex state and action spaces. Autonomous navigation for mobile robots, by contrast, may have relatively low-dimensional state and action spaces, but it requires a fast response with a low computational load. The rainbow DDQN may be overly complex and inefficient for our purpose and may also increase the difficulty of implementation and parameter adjustment. More importantly, the rainbow DDQN may require more computational resources and training time, which might be challenging in resource-constrained environments or scenarios requiring rapid iteration.
In autonomous navigation problems for mobile robots, there is usually a high correlation between states. However, the traditional experience replay method in the DDQN samples randomly with equal probability, ignoring the importance of different samples, which leads to long training times and makes convergence difficult. Additionally, in the DDQN algorithm, the agent adopts an ϵ-greedy algorithm for exploration, which conducts random exploration with a probability of ϵ and selects the current optimal action with a probability of 1 − ϵ, so the performance relies heavily on the size of ϵ. Furthermore, when it does not explore, the ϵ-greedy algorithm selects the action with the current maximum Q-value at each time step, ignoring opportunities to explore other unknown actions, meaning that it may fall into a local optimal solution and fail to discover the global optimal solution. Finally, although the DDQN can alleviate the overestimation problem to a certain extent, states in mobile robot autonomous navigation problems typically exhibit a strong correlation. There still exists the potential for overestimation in certain actions because the Q-values of different actions may be highly similar, making it difficult to accurately estimate their differences. In this case, the network may overestimate the values of certain actions, leading to unstable learning.
To solve the aforementioned problems and propose an efficient mapless autonomous navigation algorithm for mobile robots with high accuracy in both static and dynamic environments, this paper introduces a novel variant of the DDQN named prioritized experience replay noisy n-step dueling DDQN (PER-n²D³QN). As implied by its name, the proposed method replaces traditional experience replay with prioritized experience replay (PER), where each experience is assigned a priority based on its significance or informativeness. This prioritization mechanism allows the agent to focus on the experiences that most significantly contribute to its learning objectives, enhancing the sample efficiency [20,21]. Furthermore, a noisy network [22] with factorized Gaussian noise injected into the parameters of the online network is introduced to solve the exploration limitation in the DDQN; it can automatically adjust the randomness of action selection to find a better balance between exploration and exploitation. Additionally, an n-step temporal difference learning mechanism is adopted to reduce the estimation error of the target Q-value, moving it closer to the real target Q-value [23]. This is coupled with a dueling network structure, which further alleviates overestimation and improves the stability and accuracy of learning by decomposing the Q-value into a state-value function and an advantage function, allowing the network to better learn the relationship between the state value and the action advantage [24,25].
The main contributions of this paper are outlined as follows:
(1) The PER-n²D³QN algorithm, integrating PER, a noisy network, n-step learning, and a dueling architecture, is employed in autonomous navigation for mobile robots, significantly boosting navigation performance and accuracy.
(2) To enhance the stability and robustness of the algorithm, a target "soft" update and a gradient clipping mechanism are also applied.
(3) A novel, effective target-oriented reshaping reward function is employed to guide the mobile robot toward the designated point while avoiding obstacles, thereby expediting convergence.
(4) To reflect the complexity of the simulation environment more specifically, this paper presents a detailed quantitative analysis of the simulation environment for the first time.
(5) The performance of PER-n²D³QN is validated through numerical simulation experiments with the ROS and Gazebo. The results demonstrate that PER-n²D³QN exhibits superior performance in autonomous navigation for mobile robots compared to the DQN and DDQN, and the whole path produced by the PER-n²D³QN algorithm within an episode is also shown.
The rest of this paper is structured as follows: Section 2 presents the problem statement and related research. Section 3 details the methodology of the proposed PER-n²D³QN. Section 4 provides implementation details for PER-n²D³QN. Section 5 presents the experimental setting and result analysis of the DQN, DDQN, and PER-n²D³QN algorithms. Finally, Section 6 concludes the paper with insights and future research directions.

Problem Statement
The increasing deployment of mobile robots across various sectors highlights the critical need to enable them to navigate autonomously while avoiding collisions. Traditional mobile robot navigation faces challenges, including difficulties in environment perception, path planning, and real-time decision-making. To solve these problems, DRL algorithms, particularly those focused on goal-searching mapless navigation, have gained widespread use in autonomous robot navigation [26][27][28].
Among them, DDQN technology has demonstrated significant potential. However, a notable challenge remains in achieving real-time and high-accuracy mobile robot navigation in both static and dynamic environments using the original DDQN algorithms. Hence, this paper aims to address this challenge by advancing research in the field of mobile robot autonomous navigation, striving to offer more efficient, accurate, and safe solutions for mobile robot navigation tasks.

Reinforcement Learning Basics
In reinforcement learning (RL), the Markov Decision Process (MDP) is often employed to model stochastic state transitions of the environment and subsequent decision-making [29]. An MDP comprises four essential elements: $S$, $A$, $R$, and $P$. At each time step $t$, the agent observes the environment's state $s_t \in S$, where $S$ is the state space. Based on this observation, the agent selects an action $a_t$ from the action space $A$. After executing action $a_t$, the agent receives a reward $r_t \in R$, where $R$ represents the rewards obtained by the agent from interacting with the environment. Subsequently, the state transitions from $s_t$ to $s_{t+1}$ with a certain probability $p_t \in P$. In RL, the tuple $(s_t, a_t, r_t, s_{t+1})$ is referred to as an experience sequence. The relationship between $S$, $A$, $R$, and $P$ is depicted in Figure 1.
Figure 1. The structure of an MDP, where $s_t$ is the environment state at time step $t$, $a_t$ denotes the action taken by the agent in state $s_t$, and $r_t$ is the reward received from the environment at time step $t$.
Although the agent receives a reward for each step in the process of interacting with the environment, the ultimate goal is to maximize the sum of rewards, defined as the expected return in RL. By introducing the discount factor $\gamma \in [0, 1]$, the expected return is defined as the discounted accumulated rewards obtained since the current time step $t$ as follows:
$$G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}, \tag{1}$$
where the value of the discount factor $\gamma$ influences how much importance the agent assigns to immediate and future rewards. A higher $\gamma$ implies a more forward-looking agent, while a lower $\gamma$ indicates a more immediate-reward-focused agent. Policy $\pi$ determines the action the agent should take in different environments and is a mapping from a state to the probability of choosing each action [30]. The state-value function $V^{\pi}(s)$ under a certain policy $\pi$ can be defined as follows:
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[G_t \mid s_t = s\right], \tag{2}$$
where $\mathbb{E}[\cdot]$ is an expectation function. Similarly, under a certain policy $\pi$, the action-value function $Q^{\pi}(s, a)$ is the expected return after executing action $a$ starting from state $s$:
$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[G_t \mid s_t = s, a_t = a\right]. \tag{3}$$
According to Equation (2), different policies $\pi$ may correspond to different values in the state-value function. Among all policies, there is at least one policy, $\pi^{*}$, that maximizes the value of the state-value function in state $s$, and $\pi^{*}$ is called the optimal policy. The state-value function under the optimal policy $\pi^{*}$ is called the optimal state-value function $V^{*}(s)$:
$$V^{*}(s) = \max_{\pi} V^{\pi}(s). \tag{4}$$
Similarly, the action-value function under the optimal policy $\pi^{*}$ is called the optimal action-value function $Q^{*}(s, a)$ and is defined as follows:
$$Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a). \tag{5}$$
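As a small numerical illustration of the discounted return defined above, the following sketch evaluates the sum for a finite reward sequence. The function name `discounted_return` and the reward values are hypothetical and chosen only for illustration; truncating the infinite sum at the end of the list is an assumption.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence."""
    g = 0.0
    # Accumulate backwards so that each step multiplies by gamma exactly once.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical rewards r_t, r_{t+1}, r_{t+2}.
rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, 0.9))  # forward-looking: later rewards still count
print(discounted_return(rewards, 0.0))  # myopic: only the immediate reward counts
```

With a discount factor of 0 the return collapses to the immediate reward, which matches the description of a lower discount factor producing a more immediate-reward-focused agent.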

One-Step and N-Step TD Methods
For the purpose of estimating the action-value function, an off-policy temporal difference (TD) algorithm called Q-learning was proposed [14]. Q-learning uses Q-tables to store Q-values for each state-action pair and updates $Q(s_t, a_t)$ as follows:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \eta \left[ y_t - Q(s_t, a_t) \right], \tag{6}$$
where $\eta \in [0, 1]$ is the learning rate, controlling the magnitude of the Q-value updates.
In Equation (6), $y_t$ is called the TD target value, which is used to approximate the discounted accumulated reward $G_t$. For a one-step TD algorithm, the time-step gap between the estimated state-action pairs is only one time step, and just the one-step reward $r_t$ is utilized in each update [31]. The one-step TD target value $y_t^{(1)}$ can be expressed as follows:
$$y_t^{(1)} = r_t + \gamma \max_{a'} Q(s_{t+1}, a'). \tag{7}$$
In the one-step TD method, if $y_t^{(1)}$ is not equal to $G_t$, then the TD target value is slightly biased. To solve this problem and incorporate longer trajectories containing more information about future rewards, the n-step TD algorithm was proposed [23]. This approach enhances the learning efficiency of the one-step TD algorithm by bootstrapping over more time steps. N-step TD methods, such as n-step Q-learning [14] and n-step Sarsa [23], consider the accumulated rewards in the next $n$ steps and the state estimation after $n$ time steps to update the action-value function. This allows the algorithm to exploit long-term, high-reward strategies from past experiences and further improve the efficiency of the reward signal. The n-step TD target value $y_t^{(n)}$ can be written as follows:
$$y_t^{(n)} = \sum_{k=0}^{n-1} \gamma^{k} r_{t+k} + \gamma^{n} \max_{a'} Q(s_{t+n}, a'). \tag{8}$$
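The tabular update and the one-step and n-step TD targets described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation; the helper names `n_step_target` and `q_update`, the state labels, and the numeric values are all hypothetical.

```python
from collections import defaultdict


def n_step_target(rewards, gamma, q_next_values):
    """n-step TD target: discounted sum of n rewards plus a bootstrapped tail.

    rewards holds r_t ... r_{t+n-1}; q_next_values holds Q(s_{t+n}, a') for all a'.
    With a single reward this reduces to the one-step TD target.
    """
    n = len(rewards)
    accumulated = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return accumulated + (gamma ** n) * max(q_next_values)


def q_update(Q, state, action, target, eta):
    """Tabular Q-learning step: move Q(s, a) toward the TD target by eta."""
    Q[(state, action)] += eta * (target - Q[(state, action)])
    return Q[(state, action)]


Q = defaultdict(float)                      # Q-table, zero-initialised
Q[("s1", 0)], Q[("s1", 1)] = 0.0, 2.0       # hypothetical values at the next state
y1 = n_step_target([1.0], 0.9, [Q[("s1", 0)], Q[("s1", 1)]])   # one-step target
print(q_update(Q, "s0", 0, y1, eta=0.5))
```

Passing a longer reward list to `n_step_target` gives the n-step variant without changing the update rule itself.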

Deep Q-Network
In the case of a high state dimension, the Q-table in Q-learning algorithms becomes impractically large, consuming significant time, memory, and computing resources for storage, searches, and updates. To solve these problems, the DQN algorithm incorporates deep approximators into RL using a multi-layered neural network $Q(s, a; \theta)$ instead of the Q-table in Q-learning, where $\theta$ represents the network parameters.
The DQN introduces two deep neural networks: one is the online network $Q(s, a; \theta)$, which is used to select the optimal action during agent-environment interactions, and the other is the target network $Q'(s, a; \theta')$, which shares the same structure as the online network and updates its parameters $\theta'$ by copying the parameters $\theta$ from the online network every $T$ steps, referred to as the target "hard" update. The combination of the online network and target network helps alleviate the training instability. Moreover, the DQN improves the utilization of previous experiences by introducing experience replay technology, which reduces sample correlation and enhances learning efficiency.
Training the online Q-network $Q(s, a; \theta)$ is reformulated as an optimization problem, aligning with the approach to solving the Bellman equation [32]. Gradient descent is employed to minimize the loss function. In the DQN, the one-step TD target value $y_t^{DQN}$ estimates the accumulated discounted returns at time step $t$. The TD error $\delta_t^{DQN}$ represents the difference between the TD target value $y_t^{DQN}$ and $Q(s_t, a_t; \theta)$. The mean square error (MSE) method is typically used to formulate the loss function. When the agent takes action $a_t$ in state $s_t$ and obtains the immediate reward $r_t$, transitioning the environment to state $s_{t+1}$, then $y_t^{DQN}$, $\delta_t^{DQN}$, and $L(\theta)$ can be expressed as follows:
$$y_t^{DQN} = r_t + \gamma \max_{a'} Q'(s_{t+1}, a'; \theta'), \tag{9}$$
$$\delta_t^{DQN} = y_t^{DQN} - Q(s_t, a_t; \theta), \tag{10}$$
$$L(\theta) = \mathbb{E}\left[\left(\delta_t^{DQN}\right)^{2}\right]. \tag{11}$$
The gradient descent algorithm serves as the core optimization method to minimize the loss function for training deep neural networks. It determines the gradient direction at the current position by calculating the partial derivative of the loss function with respect to each parameter and updates the parameters in the direction opposite to the gradient, iteratively decreasing the loss. However, during neural network training, gradients may sometimes become too large, leading to instabilities. To address this concern, gradient clipping is implemented as a mitigating measure. Treating the target $y_t^{DQN}$ as a constant, the gradient of the loss function $\nabla_{\theta_t} L(\theta_t)$ is expressed as follows:
$$\nabla_{\theta_t} L(\theta_t) = -2\, \mathbb{E}\left[\delta_t^{DQN}\, \nabla_{\theta_t} Q(s_t, a_t; \theta_t)\right]. \tag{12}$$
The parameters of the online neural network are updated using gradient descent as follows:
$$\theta_{t+1} = \theta_t - \eta\, \nabla_{\theta_t} L(\theta_t), \tag{13}$$
where $\eta$ here denotes the learning rate of the network update.
The DQN algorithm uses the greedy max operator both to select the optimal action and to estimate the target Q-value, leading to overestimation issues. The DDQN algorithm addresses this problem by using two separate Q-networks. The online Q-network $Q(s, a; \theta)$ selects the optimal action, and the target Q-network $Q'(s, a; \theta')$ evaluates the target value. Testing the DDQN on various games has demonstrated more accurate Q-value estimations and higher rewards [16]. In the DDQN, the traditional one-step TD target value $y_t^{DDQN\_1}$ can be given as follows:
$$y_t^{DDQN\_1} = r_t + \gamma\, Q'\!\left(s_{t+1}, \arg\max_{a'} Q(s_{t+1}, a'; \theta); \theta'\right). \tag{14}$$
To further increase the learning efficiency, this paper introduces the n-step TD method into the DDQN and forms the following n-step TD target $y_t^{DDQN\_n}$ in the proposed PER-n²D³QN method:
$$y_t^{DDQN\_n} = \sum_{k=0}^{n-1} \gamma^{k} r_{t+k} + \gamma^{n}\, Q'\!\left(s_{t+n}, \arg\max_{a'} Q(s_{t+n}, a'; \theta); \theta'\right). \tag{15}$$
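To make the difference between the DQN and DDQN targets concrete, the sketch below computes both from lists of Q-values at the bootstrap state. It is a hedged illustration: the function names, the `done` flag for terminal states (which the text does not discuss), and the numeric values are assumptions.

```python
def argmax(values):
    """Index of the largest entry (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def dqn_target(reward, gamma, q_target_next, done=False):
    """DQN: the target network both selects and evaluates the next action."""
    if done:  # assumption: no bootstrapping from a terminal state
        return reward
    return reward + gamma * max(q_target_next)


def ddqn_target(n_step_reward, gamma, n, q_online_next, q_target_next, done=False):
    """n-step DDQN: the online network selects the action, the target network evaluates it."""
    if done:
        return n_step_reward
    a_star = argmax(q_online_next)
    return n_step_reward + (gamma ** n) * q_target_next[a_star]


# Hypothetical Q-values at the bootstrap state for three actions.
q_online = [1.0, 3.0, 2.0]
q_target = [5.0, 0.5, 4.0]
print(dqn_target(1.0, 0.5, q_target))                 # uses max of the target network
print(ddqn_target(1.0, 0.5, 1, q_online, q_target))   # evaluates the online argmax
```

In this toy example the target network's largest value belongs to an action the online network does not prefer, so the DDQN target is smaller than the DQN target, which is the overestimation-reducing effect described above.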

DDQN with Dueling Network Structure
Although the DDQN can alleviate the overestimation problem to a certain extent, in mobile robot autonomous navigation problems, states typically exhibit a strong correlation. There still exists the potential for overestimation in certain actions because the Q-values of different actions may be highly similar, making it difficult to accurately estimate their differences. In this case, the network may overestimate the value of certain actions, leading to unstable learning. The dueling network structure is different from double Q-learning, as it distinguishes valuable states without requiring the estimation of values for all actions, which is particularly useful when certain actions have no impact on the environment.
The dueling network consists of two separate evaluation streams: one evaluates the state value $V(s)$, and the other estimates the action advantage function $A(s, a)$. The final output integrates both streams to yield the action value $Q(s, a)$ as follows:
$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + A(s, a; \theta, \alpha), \tag{16}$$
where $\alpha$ and $\beta$ are top-layer parameters connected to the two separate streams, and $\theta$ denotes the remaining parameters of the fully connected layers [24]. One stream of the dueling network outputs a scalar $V(s; \theta, \beta)$, while the other stream outputs a vector $A(s, a; \theta, \alpha)$ with the same dimension as the action space.
In the implementation of the dueling structure, there is an issue of non-uniqueness for $A$ and $V$ given a particular Q-value, which may lead to instability and exploding gradients in the training. To resolve this issue, the maximum value of the action advantage function is subtracted, and $Q(s, a; \theta, \alpha, \beta)$ is expressed as follows:
$$Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left( A(s, a; \theta, \alpha) - \max_{a'} A(s, a'; \theta, \alpha) \right). \tag{17}$$
This dueling network can be trained with the DDQN and can leverage additional advancements, such as improved exploration strategies, to facilitate the learning process. While the DDQN improves experience utilization through experience replay, it treats all experiences uniformly for learning and assumes that each experience has the same probability of being chosen. However, our preliminary tests indicate that experiences with larger TD errors may play more important roles in improving the learning efficiency of the agent, a notion also supported by other research [20]. Recognizing the varying importance of experiences based on TD error, this paper incorporates a prioritized experience replay (PER) strategy, which prioritizes experiences with higher TD error, thereby enhancing the agent's learning efficiency.
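The max-subtracted aggregation of the two dueling streams can be illustrated with a few lines of plain Python. The function name `dueling_q` and the numbers are hypothetical; in the actual network, the value and advantages would be the outputs of the two streams.

```python
def dueling_q(value, advantages):
    """Combine a scalar state value and per-action advantages into Q-values.

    Subtracting the maximum advantage pins the best action's Q-value to V(s),
    which removes the ambiguity between the two streams.
    """
    a_max = max(advantages)
    return [value + (a - a_max) for a in advantages]


# Hypothetical stream outputs for a three-action problem.
print(dueling_q(2.0, [0.5, 1.5, -1.0]))
# Shifting every advantage by the same constant leaves the Q-values unchanged.
print(dueling_q(2.0, [10.5, 11.5, 9.0]))
```

Both calls yield the same Q-values, which shows why the subtraction makes the decomposition identifiable: a constant added to the advantage stream can no longer be traded off against the value stream.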

Prioritized Experience Replay
In the training process, PER samples experiences based on their priorities, which are estimated by their TD errors [33]. Experiences with higher TD errors are given higher priorities and are more likely to be sampled. However, this approach, if it relies solely on greedy TD-error prioritization, may lead to a loss of sample diversity by focusing only on a small subset of experiences. To address this problem, a stochastic sampling algorithm is introduced, combining fully greedy prioritization and uniform random sampling [20]. The sampling probability $P(i)$ for experience $i$ can be defined as follows:
$$P(i) = \frac{p_i^{\iota}}{\sum_{k=1}^{N} p_k^{\iota}}, \tag{18}$$
where $p_i$ is the priority of experience $i$, $N$ is the number of experiences in the prioritized experience replay buffer, and the hyper-parameter $\iota$ determines the extent of greedy prioritization. When $\iota = 0$, it degenerates into purely uniform sampling. The priority $p_i$ can be evaluated using two methods: proportional prioritization and rank-based prioritization [20]. Proportional priority is defined as $p_i = |\delta_i| + \rho$, where $\delta_i$ is the TD error of experience $i$, and $\rho$ is a very small positive constant that prevents the priority from becoming zero when the TD error is zero. Rank-based priority is expressed as $p_i = \frac{1}{\mathrm{rank}(i)}$, where $\mathrm{rank}(i)$ is the rank number of experience $i$ when the experience replay buffer is sorted according to $|\delta_i|$.
PER introduces bias by altering the original distribution of experience transitions. Importance sampling (IS) is employed to correct this bias [34]. The weight of IS, $W_i$, is expressed as $W_i = \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^{\varrho}$, where $N$ is the number of experiences in the prioritized experience replay buffer, and $\varrho$ is a hyper-parameter determining the degree to which the effect of PER is offset in the convergence results. $\varrho$ increases in steps of 0.001 until it reaches a maximum value of 1. If $\varrho = 1$, the non-uniform sampling probabilities are fully compensated, so the expected update no longer differs from that of conventional experience replay in the DDQN. To increase stability, $W_i$ is usually normalized to generate the more efficient weight of IS, $w_i$, expressed as follows:
$$w_i = \frac{W_i}{\max_{j} W_j}. \tag{19}$$
In our experiments, we set $\iota = 0.6$ and an initial value of $\varrho = 0.4$ based on suggestions from prior research [35]. This paper integrates a fully connected layer dueling structure and PER into the DDQN algorithm. During the learning phase, a batch of experience sequences is selected from the prioritized experience replay memory using Equation (18). Unlike the DDQN, the gradient in the proposed method is multiplied by the importance sampling weight $w_i$ in Equation (19), defined as
$$\nabla_{\theta} L(\theta) = -2\, w_i\, \delta_i\, \nabla_{\theta} Q(s_i, a_i; \theta). \tag{20}$$
Following this, the parameter $\theta$ of the online network $Q$ is updated using stochastic gradient descent:
$$\theta \leftarrow \theta - \eta\, \nabla_{\theta} L(\theta). \tag{21}$$
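The proportional prioritization and importance-sampling correction described above can be sketched with the standard library alone. This is an illustrative list-based version, not the authors' buffer (practical implementations typically use a sum-tree for efficiency); the function names, the value of `rho`, and the TD errors are assumptions.

```python
import random


def per_probabilities(td_errors, iota=0.6, rho=1e-6):
    """Proportional prioritization: P(i) is proportional to (|delta_i| + rho) ** iota."""
    priorities = [(abs(d) + rho) ** iota for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def is_weights(probs, varrho=0.4):
    """Importance-sampling weights (1 / (N * P(i))) ** varrho, normalised by their maximum."""
    n = len(probs)
    raw = [(1.0 / (n * p)) ** varrho for p in probs]
    w_max = max(raw)
    return [w / w_max for w in raw]


def sample_batch(probs, batch_size, rng=random):
    """Draw buffer indices with replacement according to P(i)."""
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)


td_errors = [0.1, 1.0, 2.0]                 # hypothetical TD errors in the buffer
probs = per_probabilities(td_errors)
weights = is_weights(probs)
batch = sample_batch(probs, batch_size=4, rng=random.Random(0))
print(probs, weights, batch)
```

The experience with the largest TD error is sampled most often but receives the smallest weight, so its more frequent replay does not dominate the gradient.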

Target "Soft" Update
The traditional DDQN algorithm employs a "hard" update policy, which directly copies the parameters of the online network to the target network every $T$ steps. However, this approach is prone to instability and convergence issues. Therefore, an alternative strategy known as a "soft" update has gained prominence in DRL, in which the target network is updated gradually by blending it with the parameters of the online network at a constant ratio [36]. In the "soft" update, the parameter $\theta'$ is smoothly adjusted using the equation $\theta' = (1 - \tau)\theta' + \tau\theta$. Here, $\tau$ determines the smoothness of the update, and setting $\tau$ to 1 reduces the update rule to a pure "hard" update.
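The "soft" update rule is simple enough to show directly. In the sketch below the network parameters are represented as flat lists of floats, which is an assumption made for illustration; the function name `soft_update` and the value of `tau` are likewise hypothetical.

```python
def soft_update(target_params, online_params, tau):
    """Blend the target parameters toward the online ones: (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * tp + tau * op
            for tp, op in zip(target_params, online_params)]


target = [0.0, 2.0]      # hypothetical target-network parameters
online = [1.0, 4.0]      # hypothetical online-network parameters

print(soft_update(target, online, tau=0.5))   # halfway between the two networks
print(soft_update(target, online, tau=1.0))   # identical to a "hard" copy
```

A small `tau` makes the target network trail the online network slowly, which is what smooths the TD targets compared with copying every $T$ steps.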
Up to this point, we have presented the proposed PER-n²D³QN method. The comprehensive structure of the proposed algorithm, together with its processing flow, is depicted in Figure 2. The online network takes the environment state $s_t$ observed by the agent at time step $t$ as input and separates into two streams at the last fully connected layer. These streams generate a scalar state value $V(s_t)$ and a vector of action advantages $A(s_t, a)$. Both streams are integrated to obtain the state-action value $Q(s_t, a)$, and the action with the maximum Q-value is selected as the optimal action. After taking action $a_t$, the agent receives a reward, $r_t$, from the environment, and the environment transitions to state $s_{t+1}$. The experience $(s_t, a_t, r_t, s_{t+1})$ is stored in the one-step experience deque. During training, a batch of n-step prioritized experiences is sampled from the prioritized experience buffer, and the loss function is calculated to update the parameters of the online network. Subsequently, the TD error is calculated to update the priority of each sampled experience in the buffer. The target network is updated from the online network using the "soft" update method.
When the agent observes its state $s_t$ in the environment at time step $t$, the online network selects and executes the optimal action $a_t = \arg\max_{a'} Q(s_t, a'; \theta_t)$. Subsequently, the agent receives an immediate reward, $r_t$, and the state transitions to a new one, $s_{t+1}$. Following this interaction, the one-step experience $(s_t, a_t, r_t, s_{t+1})$ is stored in the one-step experience deque, with a maximum capacity of $n$. Once this deque is filled with $n$ experiences, these $n$ consecutive experiences are used to generate an n-step experience $(s_t, a_t, r_t^{\langle n \rangle}, s_{t+n})$. The notation $r_t^{\langle n \rangle}$ represents the discounted accumulated rewards of the subsequent $n$ steps starting from time step $t$, calculated as $r_t^{\langle n \rangle} = \sum_{k=0}^{n-1} \gamma^{k} r_{t+k}$, and $s_{t+n}$ is the state at time step $t + n$. The n-step experience $(s_t, a_t, r_t^{\langle n \rangle}, s_{t+n})$ is then stored in the prioritized experience buffer. To maintain a length of $n$ consecutive experiences in the one-step experience deque, the earliest experience is removed from the end of the deque when a new experience is added at the top.
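The one-step deque that produces n-step experiences can be sketched with `collections.deque`. This is a minimal illustration under stated assumptions: the class name `NStepBuffer` is hypothetical, states and actions are arbitrary Python objects, and episode termination (which would require flushing or truncating the deque) is deliberately left out.

```python
from collections import deque


class NStepBuffer:
    """Holds the last n one-step transitions and emits n-step experiences."""

    def __init__(self, n, gamma):
        self.n = n
        self.gamma = gamma
        # maxlen=n drops the earliest transition automatically on overflow.
        self.steps = deque(maxlen=n)

    def push(self, s, a, r, s_next):
        """Add one transition; return (s_t, a_t, r_n, s_{t+n}) once n are held, else None."""
        self.steps.append((s, a, r, s_next))
        if len(self.steps) < self.n:
            return None
        s0, a0, _, _ = self.steps[0]
        # Discounted sum of the n stored rewards, oldest first.
        r_n = sum((self.gamma ** k) * step[2] for k, step in enumerate(self.steps))
        s_n = self.steps[-1][3]
        return (s0, a0, r_n, s_n)


buf = NStepBuffer(n=3, gamma=0.5)
print(buf.push(0, "a", 1.0, 1))   # None: deque not yet full
print(buf.push(1, "b", 1.0, 2))   # None
print(buf.push(2, "c", 4.0, 3))   # first 3-step experience, starting at state 0
```

Each returned tuple is what would be written into the prioritized experience buffer; subsequent pushes slide the window forward by one step.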

Exploration Policy
To facilitate learning, the agent should prioritize exploration during the early stages of training to gather more information and identify relationships between different states and actions. As training progresses, the agent should transition to exploiting learned knowledge, taking the action $\arg\max_{a'} Q(s_t, a'; \theta)$. In both the original DQN and DDQN algorithms, the agent utilizes the $\epsilon$-greedy policy [11], choosing the action $\arg\max_{a'} Q(s_t, a'; \theta)$ with a probability of $1 - \epsilon$ and opting for a random action with a probability of $\epsilon$ [37,38]. The initial value of $\epsilon$ is set to 1, and it decays at a rate of 0.99 per episode until reaching the minimum value $\epsilon_{min}$, which is set to 0.01 to mitigate the impact of exploration on the results during the experiment [39]. The $\epsilon$-greedy policy can be expressed as follows:
$$a_t = \begin{cases} \arg\max_{a'} Q(s_t, a'; \theta), & c > \epsilon, \\ \text{a random action from } A, & c \le \epsilon, \end{cases} \tag{22}$$
where $c$ is a random value from a uniform distribution in $[0, 1]$. Because the $\epsilon$-greedy strategy continues exploration during the entire learning process, it may lead to resource wastage, especially when a good estimation of certain actions is already available and further exploration is unnecessary. To address this limitation, this paper introduces a noisy network. It is a neural network whose weights and biases are perturbed by a parametric noise function to enhance the agent's exploration efficiency, imposing an exploration noise onto the output Q-values. These parameters are updated using gradient descent [22], making agents more efficient in exploring environments and aiding in discovering meaningful exploration strategies to solve complex tasks.
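For comparison with the noisy network, the baseline ϵ-greedy policy and its per-episode decay (start at 1, multiply by 0.99, floor at 0.01, as stated above) can be sketched as follows. The function names and the Q-values are hypothetical.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy action."""
    c = rng.random()                      # uniform sample in [0, 1)
    if c < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def decay_epsilon(epsilon, rate=0.99, eps_min=0.01):
    """Multiplicative per-episode decay with a lower bound."""
    return max(eps_min, epsilon * rate)


epsilon = 1.0
for _ in range(1000):                     # hypothetical number of episodes
    epsilon = decay_epsilon(epsilon)
print(epsilon)                            # has reached the floor eps_min

q_values = [0.1, 0.9, 0.3]                # hypothetical Q-values for three actions
print(epsilon_greedy(q_values, epsilon, random.Random(0)))
```

Note that exploration never stops entirely with this schedule: even at the floor, one action in a hundred is random on average, which is the residual cost the noisy network is meant to avoid.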
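In a neural network with $p$ inputs and $q$ outputs, the linear layer can be expressed as follows:
$$y = \omega x + b, \tag{23}$$
where $x \in \mathbb{R}^{p}$ is the input of the layer, $y \in \mathbb{R}^{q}$ is the output of the layer, and the weight matrix $\omega \in \mathbb{R}^{q \times p}$ and the bias $b \in \mathbb{R}^{q}$ need to be learned and updated. In the noisy network, $\omega$ is replaced with $\mu_{\omega} + \sigma_{\omega} \odot \varepsilon_{\omega}$, and $b$ is replaced with $\mu_{b} + \sigma_{b} \odot \varepsilon_{b}$, so the noisy linear layer becomes
$$y = \left(\mu_{\omega} + \sigma_{\omega} \odot \varepsilon_{\omega}\right) x + \mu_{b} + \sigma_{b} \odot \varepsilon_{b}, \tag{24}$$
where $\mu_{\omega}, \sigma_{\omega} \in \mathbb{R}^{q \times p}$ and $\mu_{b}, \sigma_{b} \in \mathbb{R}^{q}$ are the learnable parameters, which are updated using gradient descent; $\varepsilon_{\omega} \in \mathbb{R}^{q \times p}$ and $\varepsilon_{b} \in \mathbb{R}^{q}$ are random noise variables; and $\odot$ represents element-wise multiplication. In the noisy network, the linear layer with $p$ inputs and $q$ outputs is depicted in Figure 3.
Figure 3. The noisy linear layer with $p$ inputs and $q$ outputs, showing the learnable parameters $\mu_{\omega}$, $\sigma_{\omega}$, $\mu_{b}$, and $\sigma_{b}$, the noise variables $\varepsilon_{\omega}$ and $\varepsilon_{b}$, and the layer output $y$.
Factorized Gaussian noise is introduced to generate the random noise variables $\varepsilon_{\omega}$ and $\varepsilon_{b}$. To produce the matrix $\varepsilon_{\omega} \in \mathbb{R}^{q \times p}$, one vector $\varepsilon_{q}$ with $q$ Gaussian random variables is multiplied by another vector $\varepsilon_{p}$ with $p$ Gaussian random variables, so a total of $q + p$ Gaussian variables are needed. The elements $\varepsilon_{\omega(i,j)}$ in $\varepsilon_{\omega}$ and $\varepsilon_{b(i)}$ in $\varepsilon_{b}$ can be generated as follows:
$$\varepsilon_{\omega(i,j)} = f\left(\varepsilon_{q(i)}\right) f\left(\varepsilon_{p(j)}\right), \qquad \varepsilon_{b(i)} = f\left(\varepsilon_{q(i)}\right), \tag{25}$$
where the function $f(m) = \mathrm{sgn}(m)\sqrt{|m|}$. Although $\varepsilon_{\omega}$ and $\varepsilon_{b}$ are random variables, their values are fixed in the neural network, that is, they are not learned. All parameters $\varepsilon_{q(i)}$ and $\varepsilon_{p(j)}$ conform to a Gaussian distribution, $\varepsilon_{q(i)}, \varepsilon_{p(j)} \sim N(0, 1)$. For factorized noisy networks, $\mu_{\omega}$ and $\mu_{b}$ are initialized by random samples from independent uniform distributions, i.e., $\mu_{\omega}, \mu_{b} \sim U\left[-\frac{1}{\sqrt{p}}, \frac{1}{\sqrt{p}}\right]$. All the elements in $\sigma_{\omega}$ and $\sigma_{b}$ are set to $\frac{0.5}{\sqrt{p}}$ based on suggestions from prior research [22].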
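The factorized noise construction and the noisy linear layer described above can be sketched in plain Python with nested lists standing in for matrices. This is an illustration only, not the authors' network code; all function names are hypothetical, and a practical implementation would use a tensor library so that the parameters can be trained by gradient descent.

```python
import math
import random


def f(m):
    """Signed square root: sgn(m) * sqrt(|m|)."""
    return math.copysign(math.sqrt(abs(m)), m)


def factorized_noise(p, q, rng=random):
    """Draw p + q unit Gaussians and build eps_w (q x p) and eps_b (length q)."""
    eps_p = [f(rng.gauss(0.0, 1.0)) for _ in range(p)]
    eps_q = [f(rng.gauss(0.0, 1.0)) for _ in range(q)]
    eps_w = [[eps_q[i] * eps_p[j] for j in range(p)] for i in range(q)]
    return eps_w, list(eps_q)


def init_noisy_params(p, q, rng=random, sigma0=0.5):
    """mu ~ U[-1/sqrt(p), 1/sqrt(p)]; every sigma entry is sigma0 / sqrt(p)."""
    bound = 1.0 / math.sqrt(p)
    mu_w = [[rng.uniform(-bound, bound) for _ in range(p)] for _ in range(q)]
    mu_b = [rng.uniform(-bound, bound) for _ in range(q)]
    sigma_w = [[sigma0 / math.sqrt(p)] * p for _ in range(q)]
    sigma_b = [sigma0 / math.sqrt(p)] * q
    return mu_w, sigma_w, mu_b, sigma_b


def noisy_linear(x, mu_w, sigma_w, mu_b, sigma_b, eps_w, eps_b):
    """y = (mu_w + sigma_w * eps_w) x + mu_b + sigma_b * eps_b (element-wise products)."""
    q, p = len(mu_w), len(x)
    return [
        sum((mu_w[i][j] + sigma_w[i][j] * eps_w[i][j]) * x[j] for j in range(p))
        + mu_b[i] + sigma_b[i] * eps_b[i]
        for i in range(q)
    ]


rng = random.Random(0)
mu_w, sigma_w, mu_b, sigma_b = init_noisy_params(p=4, q=2, rng=rng)
eps_w, eps_b = factorized_noise(p=4, q=2, rng=rng)
print(noisy_linear([1.0, 0.5, -0.5, 2.0], mu_w, sigma_w, mu_b, sigma_b, eps_w, eps_b))
```

Only $p + q$ Gaussian samples are drawn per layer instead of $pq + q$, which is the computational saving that motivates the factorized form; setting every sigma entry to zero recovers an ordinary linear layer.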
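When replacing the traditional linear layers with noisy layers in the n-step DDQN, the online network is described as $Q(s, a, \varepsilon; \zeta)$, where $\varepsilon$ is the random noise variable, and $\zeta$ denotes the learnable parameters of the online network. Similarly, the target network is expressed as $Q'(s, a, \varepsilon'; \zeta')$. The loss function of the n-step DDQN with noise, $L(\zeta)^{Noisy}$, can be expressed as follows:
$$L(\zeta)^{Noisy} = \mathbb{E}\left[\left( r_t^{\langle n \rangle} + \gamma^{n}\, Q'\!\left(s_{t+n}, \arg\max_{a'} Q(s_{t+n}, a', \varepsilon; \zeta), \varepsilon'; \zeta'\right) - Q(s_t, a_t, \varepsilon; \zeta) \right)^{2}\right]. \tag{26}$$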

Action Space
The DDQN algorithm is a value-based RL algorithm that can only be utilized in discrete action spaces. Therefore, the movements of mobile robots must be discretized. Based on prior research [40,41] and our experimental attempts, the action space is discretized into five actions: a large left turn, a small left turn, a forward motion, a small right turn, and a large right turn. Each action consists of two dimensions: linear velocity and angular velocity. The linear velocity of the mobile robot remains constant at $v_0 = 0.15$ m/s, while the angular velocity varies for each action, denoted by $(2w_0, w_0, 0, -w_0, -2w_0)$, where the base angular velocity is $w_0 = 0.75$ rad/s. Therefore, the action space can be represented as follows:
$$A = \{(v_0, 2w_0), (v_0, w_0), (v_0, 0), (v_0, -w_0), (v_0, -2w_0)\}. \tag{27}$$
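The discrete action space above maps naturally onto a small lookup table. The sketch below uses the stated values of $v_0$ and $w_0$; the names `ACTIONS`, `ACTION_NAMES`, and `action_to_command` are hypothetical, and the convention that a positive angular velocity corresponds to a left turn is assumed from the ordering given in the text.

```python
V0 = 0.15   # constant linear velocity in m/s
W0 = 0.75   # base angular velocity in rad/s

# (linear velocity, angular velocity) for each discrete action index.
ACTIONS = [
    (V0, 2 * W0),    # 0: large left turn
    (V0, W0),        # 1: small left turn
    (V0, 0.0),       # 2: forward motion
    (V0, -W0),       # 3: small right turn
    (V0, -2 * W0),   # 4: large right turn
]
ACTION_NAMES = ["large left", "small left", "forward", "small right", "large right"]


def action_to_command(index):
    """Translate a discrete action index into a (v, w) velocity command."""
    return ACTIONS[index]


for i, name in enumerate(ACTION_NAMES):
    v, w = action_to_command(i)
    print(f"{i}: {name:12s} v={v:.2f} m/s, w={w:+.2f} rad/s")
```

The index returned by the Q-network's argmax would be passed to `action_to_command`, and the resulting pair would then be sent to the robot as a velocity command.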

State Space
The state represents the environmental information perceived by the agent and directly affects its decisions during learning. Designing the state space therefore plays an important role in achieving efficient and collision-free navigation for the mobile robot in complex environments. The DDQN is typically more suitable for discrete state spaces, leveraging Q values within a discrete action space to optimize the policy [42].
Our experiments use the Waffle model of the TurtleBot3 mobile robot, a popular open-source platform designed for education, research, and hobbyist use and specifically tailored to work with the robot operating system (ROS). The Waffle model employs a laser beam to detect obstacles over a full 360° range, calculating distances from the return time and providing a comprehensive view of the surroundings. During robot navigation, the real-time state information is perceived through a LIDAR sensor and an odometer [43]. The LIDAR sensor captures distance and heading information between the robot, goal, and obstacles, while the odometer supplies the robot's position. The efficiency of the algorithm is greatly influenced by the size of the state space. If the state space is too small, the algorithm will have low accuracy; if it is too large, training and convergence will take much longer.
After multiple attempts and with reference to a related paper [43], our implementation uses only the distances from the LIDAR sensor at 24 equally spaced directions, reducing the state dimension to 28. The state is defined as
$$s = [d_L^{T}, \phi, d_g, d_o, n]^{T},$$
where $d_L$ contains the 24 distances to the obstacles uniformly scanned by the LIDAR sensor and is expressed as $d_L = [d_1, d_2, \ldots, d_{24}]^{T}$. $\phi$ is the heading angle between the mobile robot and the target, and $d_g$ is the distance to the goal. $d_o$ is the distance to the nearest obstacle detected by the LIDAR sensor, and $n$ is the index corresponding to this minimum distance, which is used to calculate the angle of the nearest obstacle relative to the robot. The acquisition of the mobile robot state is illustrated in Figure 4.
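As an illustration of how such a 28-dimensional state could be assembled, the sketch below downsamples a full LIDAR scan to 24 readings and appends the four scalar features. The function name `build_state`, the clamping of invalid readings to a 3.5 m maximum range, and taking the minimum over the 24 retained beams (rather than the full scan) are all assumptions made for this example.

```python
import math


def build_state(scan, heading, goal_distance, n_beams=24, max_range=3.5):
    """Return [d_1..d_24, phi, d_g, d_o, n] as a flat list of 28 floats.

    scan          -- sequence of range readings covering 360 degrees
    heading       -- heading angle phi between robot and target (rad)
    goal_distance -- distance d_g from robot to target (m)
    """
    step = len(scan) / n_beams
    d_l = []
    for i in range(n_beams):
        reading = scan[int(i * step)]
        # Assumed handling: replace inf/NaN readings with the maximum range.
        if math.isinf(reading) or math.isnan(reading):
            reading = max_range
        d_l.append(min(reading, max_range))

    d_o = min(d_l)        # distance to the nearest obstacle
    n = d_l.index(d_o)    # beam index of that nearest obstacle
    return d_l + [heading, goal_distance, d_o, float(n)]
```

For a 360-sample scan this keeps every 15th reading, so beam index `n` corresponds to an obstacle bearing of roughly `n * 15` degrees from the first beam.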

Target-Oriented Reward Function
In RL, the reward function guides the agent in learning the value of different actions in a given state [44]. In autonomous navigation for mobile robots, a straightforward method is to use sparse rewards, where the agent receives a positive reward only when it reaches the goal and a negative reward if it collides with an obstacle. This can result in slow or erratic learning. In complicated environments with dense or dynamic obstacles, navigating to the target becomes exceedingly challenging, and operating with sparse rewards often leads to the mobile robot becoming disoriented or immobilized. Reward shaping instead provides the agent with a denser and smoother reward, enabling it to receive more intermediate reward signals during task execution. This offers more guidance and expedites the learning process [45].
In this paper, a novel target-oriented reshaping reward function is proposed, which adjusts rewards based on the distance and direction between the robot and the target and the distance between the robot and the obstacles. For instance, if the mobile robot is closer to the target, a higher reward is given to incentivize it to reach the goal more quickly. If the mobile robot is closer to an obstacle, it receives a lower reward as a penalty to discourage collisions. This target-oriented reshaping reward function encompasses three cases. First, if the minimum distance between the mobile robot and the obstacle is less than $l_a$, it is deemed a collision, and the agent receives a reward of −500. Second, when the current distance between the mobile robot and the target is less than $l_b$, the robot is considered to have successfully reached the goal, and the agent is rewarded with 1000 [46]. Otherwise, the reward is the sum of $r_1$ and $r_2$, where $r_1$ is calculated based on the angle of the current action and the current distance to the target, and $r_2$ is set to −5 when the minimum distance between the mobile robot and the obstacle is less than 0.5 and to 1 otherwise. The reward function can be expressed as follows:
$$r = \begin{cases} -500, & d_o < l_a, \\ 1000, & d_g < l_b, \\ r_1 + r_2, & \text{otherwise}, \end{cases} \qquad r_2 = \begin{cases} -5, & d_o < 0.5, \\ 1, & \text{otherwise}. \end{cases}$$
To facilitate result evaluation and comparison, the DQN, DDQN, and PER-n²D³QN algorithms share this same reward function [41]. In summary, the pseudocode of the proposed PER-n²D³QN is outlined in Algorithm 1.
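The three-case structure of the reward can be sketched as a small function. Because the text describes $r_1$ only qualitatively, the sketch takes `r1` as an argument rather than computing it, and the thresholds `l_a` and `l_b` must be supplied by the caller; the function name `shaped_reward` and its signature are hypothetical.

```python
def shaped_reward(d_o: float, d_g: float, r1: float,
                  l_a: float, l_b: float,
                  caution_distance: float = 0.5) -> float:
    """Target-oriented shaped reward.

    d_o -- minimum distance from the robot to an obstacle
    d_g -- current distance from the robot to the target
    r1  -- heading/distance term, computed elsewhere
    l_a -- collision threshold
    l_b -- goal-reached threshold
    """
    if d_o < l_a:
        return -500.0   # collision
    if d_g < l_b:
        return 1000.0   # goal reached
    # Intermediate step: penalize proximity to obstacles, otherwise a small bonus.
    r2 = -5.0 if d_o < caution_distance else 1.0
    return r1 + r2
```

Checking the collision case first mirrors the order in which the cases are listed; the collision and goal cases would typically also trigger the episode bookkeeping described in the experiments.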

Algorithm 1 Prioritized Experience Replay Noisy n-step Dueling DDQN algorithm
1: Set the state dimension to 28, the size of the action space to 5, the capacity of the deque D to n = 5, and the parameters mini-batch size k, learning rate η, prioritization parameters ι and ϱ, noisy network parameters $\mu_\omega, \mu_b \sim \mathcal{U}[-1/\sqrt{p}, 1/\sqrt{p}]$, and all elements in $\sigma_\omega$, $\sigma_b$ to $0.5/\sqrt{p}$
…
14: Sample experience j based on the priority P(j) shown in Equation (18)
15: Sample noisy variables ε and ε′ for Q and Q′
16: Compute the TD target error $\delta_j$
…
18: Compute the IS weight $w_j = W_j / \max_i W_i$ by Equation (19)
19: Update parameters with $\theta_{j+1} = \theta_j - \eta \cdot w_j \cdot \delta_j \nabla_{\theta_j} Q(s_j, a_j; \theta_j)$
…
end for
25: end for
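Steps 14 and 18 of Algorithm 1 can be illustrated with the standard proportional form of prioritized experience replay. Equations (18) and (19) are not reproduced here, so the sketch assumes the usual definitions $P(j) = p_j^{\iota} / \sum_k p_k^{\iota}$ and $W_j = (N \cdot P(j))^{-\varrho}$; treating ι as the prioritization exponent and ϱ as the importance-sampling exponent is likewise an assumption, and the function names are hypothetical.

```python
import random


def sampling_probabilities(priorities, iota):
    """P(j) = p_j^iota / sum_k p_k^iota (assumed form of Equation (18))."""
    scaled = [p ** iota for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def importance_weights(probs, rho):
    """w_j = W_j / max_i W_i with W_j = (N * P(j))^(-rho)
    (assumed form of Equation (19))."""
    n = len(probs)
    raw = [(n * p) ** (-rho) for p in probs]
    largest = max(raw)
    return [w / largest for w in raw]


def sample_indices(probs, k, rng):
    """Draw a mini-batch of k buffer indices according to P(j)."""
    return rng.choices(range(len(probs)), weights=probs, k=k)
```

Normalizing by the largest weight keeps every $w_j$ in (0, 1], so the importance-sampling correction can only scale updates down, never up.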

Experimental Settings
To evaluate the performance of the proposed PER-n²D³QN algorithm, we conducted experiments simulating mobile robot navigation tasks across three scenarios using the DQN, DDQN, and PER-n²D³QN methods (the code is available at https://github.com/iAerialRobo (accessed on 1 January 2024)). The experiments were conducted on a laptop with an NVIDIA RTX 4060 GPU (8 GB) and 16 GB of RAM, using the open-source robot operating system (ROS) [47,48] and the open-source 3D simulator Gazebo [49]. The model is implemented in Python 3.8.10 using PyTorch 1.11, with experiment-specific parameters detailed in Table 1.
The agent is the Waffle model of the TurtleBot3 mobile robot, which navigates the three distinct scenarios depicted in Figure 5. Scenario 1, shown in Figure 5a, is a static environment with simple obstacles, featuring four cylindrical obstacles around the mobile robot. The enclosed area measures 4 × 4 m², and the radius of each cylindrical obstacle is 0.15 m. Scenario 2, shown in Figure 5b, is also static but introduces a more complex environment with numerous rectangular obstacles. The enclosed area measures 5 × 5 m², and each rectangular obstacle is 1 m long, with the same width and height as the outer wall. Lastly, scenario 3, shown in Figure 5c, incorporates two dynamic cylindrical obstacles with a radius of 0.12 m, moving with a linear velocity of 0.5 m/s. In all three scenarios, the target positions are generated randomly.

Quantitative Analysis of Simulation Environment
In scientific research and engineering design, quantitative analysis provides comprehensive and robust support. To better illustrate the differences in complexity between the scenarios in the simulation environment, we projected the obstacles onto the x- and y-axes and quantitatively analyzed environmental complexity based on the proportion of each axis covered by obstacle projections, excluding the width of the outermost wall. The analysis results are also depicted in Figure 5.
In scenario 1, the projection distribution of obstacles on the x- and y-axes is relatively sparse, with only two regions occupied by obstacles, and the maximum projection proportion is only 0.0811 on both the x-axis and the y-axis. Additionally, the area of the obstacles accounts for 0.516% of the entire simulation environment area. In scenario 2, the projection distribution of obstacles on the x- and y-axes is significantly denser, with maximum projection proportions of 0.2766 and 0.4255 on the x- and y-axes, respectively. In addition, the obstacles cover 5.432% of the total simulation environment area. Scenario 3 adds dynamically moving obstacles to scenario 2, and we only show the obstacle projection distribution in the initial state. As the positions of the dynamic obstacles change, their projected positions on the x- and y-axes change in real time. When calculating the proportion of the obstacle area to the total area, to distinguish dynamic obstacles from static obstacles, we use the ratio of the linear speed of the dynamic obstacle to that of the mobile robot as a weight, and the product of this weight and the area of the dynamic obstacles is taken as their effective area. Thus, the obstacles cover 19.47% of the total simulation environment area. The quantitative analysis of simulation environment complexity allows for a more accurate description of environmental complexity and provides deeper insights into the performance of the three algorithms.
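The projection-based measure can be sketched as follows: each obstacle is projected onto an axis as an interval, overlapping intervals are merged, and the length of each merged interval is divided by the usable axis length. The obstacle centre positions (±0.6 m) and the usable axis length of 3.7 m in the usage example are assumptions chosen for illustration; with a cylinder diameter of 0.3 m they happen to reproduce the 0.0811 figure quoted for scenario 1, but the paper does not state these values.

```python
def merge_intervals(intervals):
    """Merge overlapping (lo, hi) intervals into disjoint ones."""
    merged = []
    for lo, hi in sorted(intervals):
        if merged and lo <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], hi)
        else:
            merged.append([lo, hi])
    return [(lo, hi) for lo, hi in merged]


def projection_proportions(intervals, axis_length):
    """Fraction of the axis covered by each merged obstacle projection."""
    return [(hi - lo) / axis_length for lo, hi in merge_intervals(intervals)]


def cylinder_projection(center, radius):
    """Projection of a cylinder onto one axis as a (lo, hi) interval."""
    return (center - radius, center + radius)


# Hypothetical layout: four cylinders of radius 0.15 m at (+/-0.6, +/-0.6).
x_centers = [-0.6, -0.6, 0.6, 0.6]
x_intervals = [cylinder_projection(c, 0.15) for c in x_centers]
x_props = projection_proportions(x_intervals, axis_length=3.7)
```

Here the four cylinders collapse into two occupied regions on the x-axis, each covering about 8.1% of the usable axis length.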

Results and Discussion
To assess the performance of the DQN, DDQN, and proposed PER-n²D³QN methods across the three scenarios, we analyze the average score (AS) and success rate (SR). Figure 6 depicts the performance comparison of the three algorithms over 10 training trials in scenario 1, with each trial comprising 1100 episodes. Table 2 details the average score and success rate statistics. The DQN achieves an average score of 828.18 with a standard deviation (SD) of 248.05, which increases to 1049.51 with a standard deviation of 198.96 for the DDQN. PER-n²D³QN reaches a substantially higher average score of 3384.88 with a standard deviation of 224.69. The average success rate follows a similar trend, improving from 62.74% with a standard deviation of 5.88% for the DQN to 63.92% with a standard deviation of 4.32% for the DDQN and, finally, to 98.91% with a standard deviation of 3.41% for PER-n²D³QN. In this simpler static scenario, the DDQN improves only marginally on the DQN in average score, with little difference in success rate, whereas PER-n²D³QN significantly outperforms both in terms of average score and success rate.
Next, we evaluated the performance of the three algorithms in scenario 2, with 1600 episodes per trial and 10 repeated trials. Figure 7a and Figure 7b illustrate the results for the average score and success rate, respectively, while Table 3 provides specific statistics. The DQN obtains an average score of −309.56 with a standard deviation of 199.66, which improves to −209.01 with a standard deviation of 244.97 for the DDQN. In contrast, PER-n²D³QN achieves a substantially higher average score of 2862.23 with a standard deviation of 236.93. Additionally, the success rate of the DQN is a mere 2.97% with a standard deviation of 2.47%, improving to 15.08% with a standard deviation of 11.21% for the DDQN and eventually reaching 98.32% with a standard deviation of 3.83% for PER-n²D³QN. As in scenario 1, the DDQN performs better than the DQN in the more complex static scenario 2: the average score increases by 100.55, and the average success rate improves by 12.11 percentage points. However, the DDQN's accuracy and stability remain relatively low. Conversely, PER-n²D³QN excels in scenario 2 with high accuracy and stability.
Finally, we evaluated the three methods in dynamic scenario 3, also with 10 trials, each comprising 2000 episodes. Figure 8a and Figure 8b show the results for the average score and success rate, respectively, with detailed statistics in Table 4. The DQN and DDQN perform poorly in scenario 3. Specifically, the DQN achieves a success rate of only 1.59% with a standard deviation of 0.85%, and the DDQN's success rate rises to a mere 8.95% with a standard deviation of 4.52%. In contrast, PER-n²D³QN achieves a high average score of 2571.28 with a standard deviation of 108.20 and a success rate of 92.63% with a standard deviation of 0.93%. Although the DDQN outperforms the DQN in dynamic scenario 3, with an improvement in the average score of 58.68 and an increase of 7.36 percentage points in the average success rate, its accuracy and stability still lag behind those in scenario 2. PER-n²D³QN, on the other hand, continues to excel in scenario 3, demonstrating high accuracy and stability.
Comparing PER-n²D³QN with the DDQN, the average score increases by 222.52% and the success rate improves by 54.74% in the simpler static scenario 1. In the more complex static scenario 2, the average score increases by 1469.42% (computed relative to the magnitude of the DDQN's negative average score), and the success rate increases by 551.99%. In dynamic scenario 3, the average score rises by 1068.10%, and the success rate increases by 934.97%. This detailed comparison of the DQN, DDQN, and PER-n²D³QN across the three scenarios shows that PER-n²D³QN exhibits superior and stable performance in mobile robot navigation, with the largest gains in average score and success rate over the DQN and DDQN occurring in the complex static and dynamic scenarios.
Additionally, the path followed within one episode using the policy trained by PER-n²D³QN in scenario 1 is depicted in Figure 9. Similarly, the paths in scenarios 2 and 3 are shown in Figures 10 and 11, respectively. In each episode, the mobile robot can take up to 300 steps. If a collision occurs, the current episode terminates. Upon reaching the current target, the mobile robot continues to search for the next new target starting from its current position until it has taken 300 steps.
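The relative improvements quoted in this section can be checked with a one-line calculation. The sketch below divides by the absolute value of the baseline, which is an inference from the reported figures (it is what reproduces the 1469.42% value when the DDQN score is negative) rather than something the text states; the function name `relative_improvement` is hypothetical.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Percentage change from baseline to improved, relative to |baseline|."""
    return (improved - baseline) / abs(baseline) * 100.0


# DDQN -> PER-n2D3QN, using the values reported in Tables 2 to 4
scenario1_score = relative_improvement(1049.51, 3384.88)
scenario1_success = relative_improvement(63.92, 98.91)
scenario2_score = relative_improvement(-209.01, 2862.23)
scenario2_success = relative_improvement(15.08, 98.32)
scenario3_success = relative_improvement(8.95, 92.63)
```

These are relative changes; the absolute differences between success rates (for example, 98.91% versus 63.92%) are much smaller in percentage-point terms, so the two kinds of figures should not be compared directly.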

Conclusions
This paper presents an advanced approach to autonomous navigation for mobile robots, termed PER-n²D³QN, which is an extension of the DDQN. It significantly enhances the performance of autonomous navigation, including the average score, path efficiency, and accuracy, by incorporating prioritized experience replay, a noisy network, n-step learning, and a dueling network structure into the DDQN. Furthermore, an effective target-oriented reshaping reward function is employed to expedite learning, coupled with useful techniques, including a target "soft" update and a gradient clipping mechanism, to improve the stability and robustness of learning. These improvements boost the efficiency of experience replay, promote exploration, and deliver more precise Q-value estimates, thereby significantly enhancing the performance of autonomous navigation for mobile robots. In addition, to capture the complexity of the simulation environment, this paper presents a comprehensive quantitative analysis of the simulation environment for the first time. To validate the proposed algorithm, we conducted extensive experiments using the DQN, DDQN, and PER-n²D³QN on the ROS and Gazebo platforms for mobile robot autonomous navigation tasks.
The trained models were assessed across three distinct scenarios: two static simulation settings and one dynamic simulation setting. A comparative analysis of performance metrics, including the average score and success rate, reveals that PER-n²D³QN exhibits remarkable improvements and enhanced robustness across all three scenarios. Although the experiments focus on wheeled robots, this approach can extend to various types of mobile robots, including legged robots and aerial robots, as it learns directly from sensor inputs and does not rely on the type or model of the robot. Future research will address autonomous navigation for mobile robots in more complex tasks involving coordination and collaboration among multiple agents to improve traffic flow efficiency and safety and to reduce training time.

Figure 1. The structure of an MDP, where $s_t$ is the environment state at time step $t$, $a_t$ denotes the action taken by the agent in state $s_t$, and $r_t$ is the reward received from the environment at time step $t$.

Figure 2. The diagram of the proposed n-step dueling DDQN with prioritized experience replay (PER-n²D³QN). The online network takes the environment state $s_t$ observed by the agent at time step $t$ as input and separates into two streams at the last fully connected layer. These streams produce a scalar state value $V(s_t)$ and a vector of action advantages $A(s_t, a)$. Both streams are combined to obtain the state-action value $Q(s_t, a)$, and the action with the maximum Q value is selected as the optimal action. After taking action $a_t$, the agent receives a reward, $r_t$, from the environment, and the environment transitions to state $s_{t+1}$. The experience $(s_t, a_t, r_t, s_{t+1})$ is stored in the one-step experience deque. During training, a batch of n-step prioritized experiences is sampled from the prioritized experience buffer, and the loss function is calculated to update the parameters of the online network. Subsequently, the TD target error is calculated to update the priority of the sampled experiences in the buffer. The target network is updated from the online network using the "soft" update method.

Figure 3. The linear layer structure of the noisy network, where $\mu_\omega \in \mathbb{R}^{q \times p}$, $\mu_b \in \mathbb{R}^{q}$, $\sigma_\omega \in \mathbb{R}^{q \times p}$, and $\sigma_b \in \mathbb{R}^{q}$ are the learnable parameters of the noisy network. $\varepsilon_\omega \in \mathbb{R}^{q \times p}$ and $\varepsilon_b \in \mathbb{R}^{q}$ are random noise variables. $\odot$ denotes element-wise multiplication. The output of the noisy layer is $y = (\mu_\omega + \sigma_\omega \odot \varepsilon_\omega)\, x + (\mu_b + \sigma_b \odot \varepsilon_b)$.

Figure 4. The acquisition of the mobile robot state, where $\phi$ and $d_g$ represent the heading and distance between the robot and the goal.

Figure 5. Static and dynamic simulation scenarios, where the black point represents the mobile robot, and the red square represents the target. (a) Scenario 1 with static obstacles. (b) Scenario 2 with static obstacles. (c) Scenario 3 with dynamic obstacles.

Figure 6. Experimental results in scenario 1 when trained for 1100 episodes (shaded area: standard error). (a) Average score over 10 trials. (b) Average success rate over 10 trials.

Figure 7. Experimental results in scenario 2 when trained for 1600 episodes (shaded area: standard error). (a) Average score over 10 trials. (b) Average success rate over 10 trials.

Figure 8. Experimental results in scenario 3 when trained for 2000 episodes (shaded area: standard error). (a) Average score over 10 trials. (b) Average success rate over 10 trials.

Figure 9. The complete path followed under the PER-n²D³QN algorithm in scenario 1 within one episode. The red squares represent the obstacles detected by the LIDAR sensor, the purple sphere represents the target point, and the green curve represents the path taken by the robot within an episode. (a) Case 1. (b) Case 2. (c) Case 3.

Figure 10. The complete path followed under the PER-n²D³QN algorithm in scenario 2 within one episode. The red squares represent the obstacles detected by the LIDAR sensor, the purple sphere represents the target point, and the green curve represents the path taken by the robot within an episode. (a) Case 1. (b) Case 2. (c) Case 3.

Figure 11. The complete path followed under the PER-n²D³QN algorithm in scenario 3 within one episode. The red squares represent the obstacles detected by the LIDAR sensor, the purple sphere represents the target point, and the green curve represents the path taken by the robot within an episode. (a) Case 1. (b) Case 2. (c) Case 3.

Algorithm 1 (excerpt).
2: Initialize priority $p_1 = 1$, learning rate η, and the dueling noisy online network Q and target network Q′ with parameters ξ and ξ′.
3: for episode = 1 → M do
…
Store experience $(s_t, a_t, r_t, s_{t+1})$ in D

Table 2. Experimental results when trained for 1100 episodes in scenario 1.

Table 3. Experimental results when trained for 1600 episodes in scenario 2.

Table 4. Experimental results when trained for 2000 episodes in scenario 3.