End-to-End Autonomous Navigation Based on Deep Reinforcement Learning with a Survival Penalty Function

An end-to-end approach to autonomous navigation that is based on deep reinforcement learning (DRL) with a survival penalty function is proposed in this paper. Two actor–critic (AC) frameworks, namely, deep deterministic policy gradient (DDPG) and twin-delayed DDPG (TD3), are employed to enable a nonholonomic wheeled mobile robot (WMR) to perform navigation in dynamic environments containing obstacles and for which no maps are available. A comprehensive reward based on the survival penalty function is introduced; this approach effectively solves the sparse reward problem and enables the WMR to move toward its target. Consecutive episodes are connected to increase the cumulative penalty for scenarios involving obstacles; this method prevents training failure and enables the WMR to plan a collision-free path. Simulations are conducted for four scenarios—movement in an obstacle-free space, in a parking lot, at an intersection without and with a central obstacle, and in a multiple obstacle space—to demonstrate the efficiency and operational safety of our method. For the same navigation environment, compared with the DDPG algorithm, the TD3 algorithm exhibits faster numerical convergence and higher stability in the training phase, as well as a higher task execution success rate in the evaluation phase.


Introduction
Autonomous intelligent mobile robots [1] are suitable for completing various tasks in dangerous and complex environments; such tasks include urban rescue, public security patrol, and epidemic prevention. Motion planning, in which information obtained from the external environment by sensors is used to evaluate the optimal collision-free path between a starting point and an ending point, is an essential technique in autonomous navigation. Traditional motion planning [2] is a core component of the pipeline framework and must be integrated with other subtasks, such as perception, localization, and control, to accomplish autonomous driving tasks; these tasks [3] are often inflexible and require substantial computational resources and numerous manual heuristic functions because they involve real-time sensory information. The pipeline framework for autonomous driving consists of many interconnected modules, whereas in the end-to-end method, the entire framework is treated as a single learning task. In end-to-end autonomous motion [2,4], raw sensor data are directly used as the input to a neural network, which outputs low-level control commands. This method is attractive because complex modules do not need to be designed; instead, a network with a simple structure is constructed, and the entire motion process is optimized as a single machine-learning task.
On the basis of their principles and era of development, robot planning methods can be divided into traditional and learning-based methods. Traditional methods include graph search algorithms, sample-based algorithms, and interpolating curve algorithms. In RL-based methods, reward shaping is often used to provide dense reward signals to the agent. The reward function is difficult to configure for some specific and complex planning tasks. In inverse RL (IRL) [27], expert trajectories are utilized for inversely learning the reward function, and policies are then optimized. IRL is an imitation-learning paradigm [28], as is behavior cloning, which relies on a supervised learning process and suffers from a mismatch between the participant's strategy and the expert's strategy.
AC-based deep RL (DRL) approaches, including the deep deterministic policy gradient (DDPG) [29], twin-delayed DDPG (TD3) [30], and soft AC (SAC) [31] algorithms, are often used to optimize action sequences for robots. Vecerik et al. [32] proposed a general and model-free DDPG-based approach for solving high-dimensional robotics problems. Other studies [33,34] have proposed DDPG-based path-planning methods in which hindsight experience replay (HER) is used to overcome the performance degradation caused by sparse rewards. TD3 with traditional global path planning was employed in [35] to improve the generalization of the developed model. In [30] and [36], the TD3 and hierarchical RL (HRL) methods were combined, the state-action space was divided in accordance with the information maximization criterion, and a deterministic option policy was then learned for each region of the state-action space. The SAC algorithm [31,37] uses a stochastic policy and obtains the optimal policy by optimizing the entropy. HER is used in the SAC algorithm to improve sampling efficiency and results in favorable exploration performance.
In an environment containing multiple obstacles, the agent learns how to avoid obstacles and reach its destination without collision. The DWA [15] is a traditional method of static obstacle avoidance. Due to complex conditions and equations, dynamic obstacles require a large amount of computing time for predicting the next movement of each obstacle. CADRL [38] and MRCA [39] are representative obstacle avoidance methods based on reinforcement learning. CADRL aims to avoid pedestrians and needs to obtain pedestrian position, speed, and body shape information; its disadvantage is that pedestrians who are not detected cannot be avoided. MRCA can avoid various dynamic obstacles through the distance information of LiDAR without detecting dynamic objects. In a multi-agent learning environment, it is relatively difficult for MRCA to comprehensively avoid dynamic obstacles that follow multiple different avoidance strategies or no avoidance strategy. Liang et al. [40] used multiple perception sensors, including 2-D LiDAR and depth cameras, for smooth collision avoidance. Choi et al. [41] proposed a framework for decentralized collision avoidance in which each agent independently makes its decision without communication with others. A collision avoidance/mitigation system [42] was proposed to rapidly evaluate potential risks associated with all surrounding vehicles and to maneuver the vehicle into a safer region when faced with critically dangerous situations through a predictive occupancy map. A two-stage RL-based collision avoidance approach [43] was also proposed; the first stage is a supervised training method with a loss function that encourages the agent to follow the well-known reciprocal collision avoidance strategy, and the second stage uses a traditional RL training method to maximize the reward function and refine the policy. A hybrid algorithm [44] of RL and force-based motion planning was presented to solve the distributed motion-planning problem in dense and dynamic environments; in that study, a new reward function provides conservative behavior by changing the collision punishment to decrease the chance of collision in crowded environments. Many studies [26,40,41,45] adopt the method of imposing a large penalty on the reward when the robot collides with an obstacle or approaches an obstacle. The survival penalty instead imposes a negative reward at each time step, which encourages the robot to reach the goal as quickly as possible. Obstacle avoidance based solely on reinforcement learning may lead to path-finding problems in some cases. To solve this problem and improve navigation efficiency, the path planner is integrated with reinforcement learning-based multi-obstacle avoidance.
To address the severe problem of reward sparsity in complex environments, survival penalties can be applied as rewards to ensure successful learning. A survival penalty-based approach is used within an AC framework to solve a motion-planning problem in which the target position and orientation are aligned simultaneously. We design a comprehensive reward function and evaluate it in several navigation scenarios to ensure that the goal can be approached while obstacles are avoided dynamically.
The remainder of this paper is organized as follows. Section 2 details the navigation problem and the requirements for a WMR. Section 3 briefly describes two AC frameworks, namely, DDPG and TD3. In the aforementioned section, we elaborate on the observation state, action space, and reward function and then construct the environment and agent for DRL. Section 4 presents details on the proposed model's training and evaluation and on simulation results for four scenarios. Section 5 discusses the rationality of the survival penalty function and the potential applications of this study. Finally, Section 6 provides the conclusions of this study.

Problem Statement and Preliminaries
A path is a route from one point in space to another. The path-planning problem is closely related to the collision avoidance problem. Although most studies have aimed to solve the path-planning problem in a flexible and efficient manner, they have considered only motion planning involving position transitions, without considering orientation-related tasks. In this paper, we propose a planning approach in which orientation and localization tasks are tackled simultaneously to further investigate the potential of WMRs. We address attitude adjustment in the final goal of the WMR's task. As illustrated in Figure 1, our goal is for the WMR to move in a timesaving manner and without colliding with obstacles while attaining the required orientation. The proposed method can be applied to many situations. For example, the operation of an automated parking system is a position and orientation colocalization problem. The system must locate the target parking space, define a path that avoids obstacles, and park in the space with the correct orientation and high accuracy [46]. The proposed method involves the planning of collision-free paths during precise directional navigation. To meet requirements for position and orientation transformation, kinematic constraints as well as position and attitude constraints are considered.
As depicted in Figure 1, a differential-drive WMR consists of two rear driving wheels parallel to the back-to-front axis of the body and a caster wheel that supports the robotic platform. The WMR is moved by two direct motors driving the wheels. The global inertial frame is fixed in the plane of motion, and the moving WMR has a local body frame. The global inertial frame and local body frame are denoted by X-Y and x-y, respectively. The origin of the local frame's coordinate system is the center of gravity of the WMR. The x-direction is the direction that the WMR is facing, whereas the y-direction points to the left if the WMR is facing forward. The WMR is assumed to be in contact with a nondeformable horizontal plane on which it purely rolls and never slips during motion.
The attitude of the WMR is described by the following generalized coordinate vector with five components: q = [X_c, Y_c, ψ, θ_R, θ_L]^T ∈ R^5, where (X_c, Y_c) are the inertial coordinates of the WMR's center of mass, ψ is the yaw angle of the WMR relative to the horizontal inertial axis X, and θ_R and θ_L are the rotation angles of the right and left driving wheels, respectively. The width of the WMR is 2b, the distance between the center of mass and geometric center of the WMR platform is d, and the driving wheel radius is r. As shown in Figure 1, the centroid P_C of the WMR does not coincide with its geometric center P_O. The WMR is subject to three nonholonomic constraints when its wheels are rolling and not slipping.
When the vector of wheel angular velocities ω(t) = [θ̇_R, θ̇_L]^T, where θ̇_R and θ̇_L represent the angular velocities of the right and left driving wheels, is considered a control input, the kinematic model of the WMR [47] can be expressed as follows:

q̇ = S(q)ω(t) (2)

where S(q) is the kinematic transformation matrix. The line velocity and angular velocity of the WMR at point P_C are denoted by v and ω, respectively. The relationship between v, ω, θ̇_R, and θ̇_L is as follows:

v = r(θ̇_R + θ̇_L)/2, ω = r(θ̇_R − θ̇_L)/(2b) (3)

Equation (3) is substituted into (2) to obtain the ordinary form of a WMR with two actuated wheels.
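As a concrete check of relation (3), the following sketch (plain Python; the wheel radius r and half-width b are illustrative values, not the paper's) maps wheel angular velocities to body velocities and integrates the nonholonomic kinematics with an Euler step:

```python
import math

def wheel_to_body(w_right, w_left, r=0.1, b=0.2):
    # Relation (3): v = r(wR + wL)/2, omega = r(wR - wL)/(2b).
    v = r * (w_right + w_left) / 2.0
    omega = r * (w_right - w_left) / (2.0 * b)
    return v, omega

def step_pose(x, y, psi, v, omega, dt=0.01):
    # Euler step of the unicycle-form kinematics at point P_C:
    # x_dot = v*cos(psi), y_dot = v*sin(psi), psi_dot = omega.
    return (x + v * math.cos(psi) * dt,
            y + v * math.sin(psi) * dt,
            psi + omega * dt)
```

Equal wheel speeds yield pure translation (omega = 0), and opposite speeds yield rotation in place (v = 0), as expected for a differential drive.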

Reinforcement Learning 3.1. DDPG Algorithm
As shown in Figure 2, the DDPG algorithm [29] combines a replay buffer, a deterministic policy, and an AC architecture to solve decision problems in a continuous task space. The term θ^Q represents the parameter of the critic network, which approximates the state-action value Q(s, a|θ^Q), whereas θ^μ is the parameter of the actor network, which approximates the deterministic policy μ(s|θ^μ). The actor outputs a deterministic action, whereas the critic uses the state-action value Q(s, a|θ^Q) to evaluate the performance of the actor and guide the actor's next action. For the TD method, the objective function of the critic network, which minimizes the mean square error (MSE) between the target value y_t and the estimated value Q(s_t, μ(s_t|θ^μ)|θ^Q), can be expressed as follows:

L(θ^Q) = E[(y_t − Q(s_t, a_t|θ^Q))^2]

The parameter θ^Q of the critic network is optimized using the gradient descent algorithm. Conversely, the parameter θ^μ of the actor network is optimized using the gradient ascent algorithm. The objective function in the DDPG algorithm is maximization of the expected cumulative reward R(τ) in episode τ. In the DDPG approach, Monte Carlo-based updating is converted into a temporal difference (TD) method. The actual reward is compared with the critic's estimate to obtain the TD error, which is used to adjust the parameters of the critic network and thus obtain a more accurate estimate. When Monte Carlo methods are used, the actual reward cannot be determined until the end of the RL episode. In the TD method, only the next interaction state s_{t+1} must be reached before the TD target can be formed and Q(s_t, a_t|θ^Q) can be updated. The TD target value y_t at time step t is expressed as follows:

y_t = r_t + γQ′(s_{t+1}, μ′(s_{t+1}|θ^{μ′})|θ^{Q′})

The parameters θ^Q and θ^μ of the critic and actor networks are updated with learning rates α_Q and α_μ, respectively. The target networks of the critic and actor are updated at low frequency toward the main networks as follows:

θ^{Q′} ← τθ^Q + (1 − τ)θ^{Q′}, θ^{μ′} ← τθ^μ + (1 − τ)θ^{μ′}

where τ is a smoothing factor. These target networks are obtained using a soft update method that improves the stability of network convergence.
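The TD target and the soft target-network update can be expressed compactly; the sketch below (plain Python with scalar stand-ins for the network outputs, an illustration rather than the paper's implementation) shows both rules:

```python
def td_target(reward, q_target_next, gamma=0.99, done=False):
    # y_t = r_t + gamma * Q'(s_{t+1}, mu'(s_{t+1}));
    # the bootstrap term is dropped at terminal states.
    return reward if done else reward + gamma * q_target_next

def soft_update(target_params, main_params, tau=0.01):
    # theta' <- tau * theta + (1 - tau) * theta'  (applied element-wise).
    return [tau * m + (1.0 - tau) * t
            for t, m in zip(target_params, main_params)]
```

With tau = 0.01, the target network tracks the main network slowly, which produces the stabilizing effect described above.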

TD3 Algorithm
A common failure mode of the DDPG algorithm involves the learned Q function beginning to substantially overestimate Q values. Errors of the Q function are introduced into the training of the actor network, which leads to poor policy updates. The TD3 algorithm [30], which is an improvement of the DDPG algorithm, can reduce overestimation of the value function. In the TD3 algorithm, delayed actor updates, dual critics, and additional clipped noise are employed to control actions. As shown in Figure 3, two critic functions, namely, Q_1(s, a|θ^{Q_1}) and Q_2(s, a|θ^{Q_2}), are learned concurrently through MSE minimization in almost the same way that the single critic function is learned in the DDPG algorithm. The predicted value of y_t that is selected is the smaller of the predictions of the two target critics:

y_t = r_t + γ min_{j=1,2} Q_j(s_{t+1}, clip(μ′(s_{t+1}|θ^{μ′}) + ε, a_low, a_high)|θ^{Q′_j})

where j represents the two target critics, and the clip function constrains future actions within the lower and upper bounds specified for the controls. As an output, the clip function returns its argument unless this argument violates the bound, in which case the argument is set to be equal to the bound. Each critic computes the MSE of the TD error e within the batch of N samples drawn from the replay buffer R. The lowest prediction from the pair of critics is employed to prevent overestimation of the value function. The parameters of the critic functions θ^{Q_1} and θ^{Q_2} are updated through one step of gradient descent. Using the smaller critic value as the target and regressing to that value helps to prevent overestimation of the Q function. The parameters of the target actor and target critic are updated less frequently in the TD3 algorithm than in the DDPG algorithm.
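The clipped double-Q target and target-policy smoothing can be sketched as follows (plain Python with scalar stand-ins; the bounds and noise scale are illustrative defaults, not the paper's settings):

```python
def clip(x, lo, hi):
    # Return x limited to [lo, hi], as used for both noise and actions.
    return max(lo, min(hi, x))

def smoothed_target_action(mu_next, noise, noise_clip=0.5, a_low=-1.0, a_high=1.0):
    # Add clipped noise to the target actor's action, then clip to control bounds.
    return clip(mu_next + clip(noise, -noise_clip, noise_clip), a_low, a_high)

def td3_target(reward, q1_next, q2_next, gamma=0.99, done=False):
    # y_t uses the smaller of the two target-critic predictions,
    # which curbs overestimation of the value function.
    q_min = min(q1_next, q2_next)
    return reward if done else reward + gamma * q_min
```

Because y_t bootstraps from min(Q_1, Q_2), a single critic's optimistic error no longer propagates into the target.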


Setting of the State and Action
The motion planning required to perform a given WMR task can be fundamentally divided into navigation and obstacle avoidance. The navigation module obtains information on the relative attitude of the WMR with respect to the target, whereas the anti-collision planning module receives raw LiDAR sensor data that contain information on the unknown and dynamic environment in which the WMR is moving [48]. The WMR's system uses the aforementioned information for agent training and to determine the WMR's next action.
The current pose and target pose are denoted as (X_c, Y_c, ψ) and (X_{c,goal}, Y_{c,goal}, ψ_goal), respectively (Figure 4a). To speed up the training and improve the convergence stability of the algorithm, the pose state vector s_pos is set to be a six-dimensional vector normalized in the range [−1, 1], where x_rel and y_rel are the coordinates of the WMR's current position in the body coordinate system based on the WMR's target position, d_rel is the distance between the current position and the target position, d_rel,max is the maximum d_rel value, ψ_2,rel is the azimuth angle of viewing the current WMR from the tail of the target, ψ_3,rel is the angle of viewing the target from the first-person perspective of the WMR, and ψ_4,rel is the relative yaw angle with respect to the target.

As the WMR successfully navigates itself to the target, all elements in the pose state vector s_pos approach 0.
To plan an obstacle-free path, the robot's system must extract information from the environment by using sensor data and thereby generate commands that ensure obstacle avoidance. Figure 4b shows an example of LiDAR data obtained in the horizontal plane. Laser scanning is performed between 120° and −120°, and sensors are established to detect the reflectance of laser beams at intervals of 12°. If a sensor does not detect any object within a certain distance, the distance of the nearest obstacle is considered to be equal to the maximum detectable distance d_max,LiDAR; otherwise, the length is the distance between the WMR and the detected object, where d_i,LiDAR is the length detected by the ith laser sensor. The input state information s_t at time step t includes the pose information s_pos relative to the target and the obstacle information s_LiDAR relative to the obstacle; this information comprises the set s_t = {s_pos, s_LiDAR}. The action to be taken by the WMR is the set of the linear forward velocity v and the angular velocity ω around the centroid of the WMR.
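The LiDAR preprocessing described above reduces to capping each beam at the maximum detectable distance and normalizing; a small sketch (the [0, 1] normalization and the None convention for missed returns are assumptions for illustration):

```python
def lidar_state(ranges, d_max_lidar=100.0):
    # Cap each beam at d_max_lidar (beams with no return report None here)
    # and normalize so each element of s_LiDAR lies in [0, 1].
    state = []
    for d in ranges:
        if d is None or d > d_max_lidar:
            d = d_max_lidar
        state.append(d / d_max_lidar)
    return state

def build_state(s_pos, s_lidar):
    # s_t combines the 6-dim pose state and the 21-dim LiDAR state.
    return list(s_pos) + list(s_lidar)
```

With 21 beams, the combined state has the 27 dimensions used as the network input later in the paper.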

Survival Penalty Function
When the reward function is used as a training metric to encourage or discourage actions, this function has a strong influence on learning performance. In many studies [19,26,49], the distance d_rel between the current position and the target position has been employed as the main parameter of the reward function. When a positive reward design is used, the closer the WMR is to the target, the higher the reward. Once the agent successfully reaches the target area, it obtains the highest reward. In the illustration presented in Figure 5a, the WMR is initially at the left side and reaches the final position on the right side. The maximum distance d_max for a straight track is set as 10 intervals. The positive reward r_pos is proportional to the distance from the starting point. Path planning is performed at regular intervals. When the WMR travels at a forward velocity of 1 or 2 intervals per time step, the agent receives a cumulative reward of 55 or 30, respectively, after completing the episodic task. The goal in DRL is to guide an agent that is attempting to maximize the cumulative reward from the initial state distribution. A potential drawback of implementing only positive rewards is that the agent may employ a slow movement policy to perform the situational task and thereby obtain a relatively high cumulative reward. In [18], auxiliary negative rewards were therefore also assigned to penalize near-standstill action when the WMR travels below a certain speed.
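The cumulative-reward figures quoted above (and the corresponding survival-penalty totals discussed next) can be reproduced with a short calculation; this sketch assumes one reward per time step on a 10-interval straight track, with the positive design paying the distance already covered and the survival design paying the negative remaining distance at each step:

```python
def cumulative_reward(d_max=10, speed=1, positive=True):
    # Sum per-step rewards until the WMR covers the whole track.
    total, remaining = 0, d_max
    while remaining > 0:
        if positive:
            # Positive design: reward grows with distance from the start.
            total += d_max - max(remaining - speed, 0)
        else:
            # Survival penalty: reward is -d_rel at the current state.
            total += -remaining
        remaining -= speed
    return total
```

Moving 1 interval per step yields 55 under the positive design but −55 under the survival penalty, while moving 2 intervals per step yields 30 and −30; only the penalty design favors the faster policy.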
The survival penalty function, in which a negative relative distance d_rel is used as the reward for straight-line forward motion, is defined as follows: As displayed in Figure 5b, the highest penalty is awarded at the starting point, which is the farthest point from the target, whereas the lowest penalty is awarded at the end point. For forward velocities of 1 and 2 intervals per time step, the cumulative rewards are −55 and −30, respectively. The agent essentially moves the WMR from the starting point to the destination and thus completes the linear motion task as quickly as possible. The linear motion condition can be implemented by adopting the maximum driving speed strategy. During turning maneuvers, a smaller turning radius can be produced by reducing the driving speed and altering the rotation angle. Survival penalty functions have the intrinsic advantage of guiding the agent to complete the motion task in the fewest number of steps; thus, the decisions made differ from those made under a positive reward design. To guide the WMR precisely into the target area and simultaneously achieve the correct position and orientation, the intersecting x_rel-y_rel space of the coordinate plane is roughly divided into four quadrants, as illustrated in Figure 6. In human driving behavior, a vehicle is driven into the correct position by approaching the target from the rear. We first consider the case in which the WMR is behind the target, that is, x_rel < 0.
We define a locking angle ψ_lock, where ψ_2,rel, ψ_3,rel, and ψ_4,rel are calculated using (18c), (18f), and (18g), respectively. When the locking angle ψ_lock is less than 36°, the agent might force the robot to head toward the locking angle and then accurately approach the desired orientation and position (Figure 6a). The smaller the locking angle, the smaller the penalty awarded. When the locking angle is greater than 36° but less than 90°, the main goal of the agent is to keep the WMR away from the y_ref-axis by enforcing a clearance distance |x_lock1| and to move the WMR progressively closer to the x_ref-axis. When the locking angle is greater than 90°, x_rel remains in the range [x_lock2, x_lock1], and a change-of-orientation motion is performed to turn the vehicle toward the destination. To prevent the clearance distance from being too small or too large for large orientation adjustments, the two boundary distances x_lock1 and x_lock2 are set as −60 and −90, respectively. The negative reward when the WMR is in the second or third quadrant is as follows:

The constant negative bias in the first term of each penalty equation results in the WMR being primarily steered in the correct direction, with the locking angle being reduced and the punishment lowered as the WMR approaches the target.
The second case we consider is that in which the WMR is located in the first or fourth quadrant. The objective of the reward function in this case is to guide the WMR away from the dangerous passageway adjacent to the target area and into the rear of the target. When the alignment angle ψ_4,rel is greater than 144°, the front of the WMR is pointed away from the positive x_ref-axis direction and slightly toward the rear of the target. The size of the main penalty is determined by the values of ψ_4,rel and x_rel, and this penalty guides the WMR to move to the second or third quadrant. An auxiliary penalty is added to prevent the WMR from passing through the dangerous passageway close to the x_ref-axis. The clear distance y_way is set as 60. When the WMR is facing the positive x_ref-axis, that is, ψ_4,rel ≤ 144°, the WMR performs a U-turn in the first or the fourth quadrant, which causes ψ_4,rel to increase. The negative reward when the WMR is in the first or the fourth quadrant is expressed as follows:

The survival penalty functions expressed in (25) and (26) indicate that negative rewards (punishments) are always awarded as the WMR transitions from a high-penalty state to a low-penalty state. An additional reward is provided at the end of each episode: 5 for successful completion of the navigation task and −10 for a failed mission.
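The quadrant logic above can be summarized as a decision routine; the following sketch illustrates the regime selection only (the actual penalty magnitudes come from the paper's equations (25) and (26), which are not reproduced here), with the thresholds 36°, 90°, and 144° and the distances x_lock1 = −60, x_lock2 = −90 taken from the text:

```python
def penalty_regime(x_rel, psi_lock_deg, psi4_rel_deg):
    # Behind the target (second/third quadrant): steer by the locking angle.
    if x_rel < 0:
        if psi_lock_deg < 36:
            return "head toward the locking angle"
        if psi_lock_deg < 90:
            return "hold clearance |x_lock1| = 60 from the y_ref-axis"
        return "reorient within the band [-90, -60]"
    # In front of the target (first/fourth quadrant): move to the rear.
    if psi4_rel_deg > 144:
        return "move toward the second or third quadrant"
    return "perform a U-turn to increase psi_4,rel"
```

The terminal bonus (+5 on success, −10 on failure) is added separately at the end of each episode.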

Collision Avoidance Constraints
During the training phase, when navigation paths are being explored, scenarios involving collisions between the WMR and the environment or obstacles are easily generated. In many studies [26,49], scholars have penalized collision situations by terminating the exploration episode and introducing additional negative rewards. The closer the detected obstacle is, the greater the punishment. By instead applying positive rewards, the agent can avoid unrewarding collision paths as much as possible and continue to explore paths that correctly achieve the navigation goal while accumulating large rewards. However, this strategy is not possible when applying the survival penalty method. One reason for this restriction is that a sudden large penalty might cause the numerical calculation to become unstable and might make it difficult for convergence to be achieved. Another reason is that the agent can easily conclude that a path involving a collision but with few generation steps and a small cumulative penalty is the optimal path, thereby resulting in poor action generation and training failure.
We introduce a simple method that involves exploiting the survival penalty function to manage obstacle avoidance. The TD3 and DDPG algorithms are off-policy algorithms that create experience replay buffers to store historical experiences, randomly sample transitions from these experiences, and then employ these sample data to update the actor and critic networks. The existence of the experience replay buffer helps the agent to learn from previous experiences and improves the efficiency of sample utilization. Figure 7 displays two historical experiences: two consecutive episodes i and i + 1, where episode i involves a collision. Episodes i and i + 1 consist of n and m steps, respectively, each of which stores an experience tuple [s_t, a_t, r_t, s_{t+1}]. To increase the cumulative penalty for the collision episode, we can concatenate these two episodes into a new episode. At the end step n of collision episode i, the original next state s_end is linked to the state of the first step of episode i + 1. The new episode thus consists of n + m steps. The TD method can be used to increase the cumulative penalty of the original collision episode i by the cumulative penalty of episode i + 1, whereas the cumulative penalty of episode i + 1 is maintained the same.
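A minimal sketch of this episode concatenation is given below (episodes represented as lists of (s, a, r, s_next) tuples; the tuple layout is an assumption for illustration):

```python
def concatenate_episodes(episode_i, episode_i1):
    # Relink the terminal transition of collision episode i so that its
    # next state is the first state of episode i+1; the TD backup then
    # propagates episode i+1's cumulative penalty into episode i.
    linked = list(episode_i)
    s, a, r, _ = linked[-1]
    linked[-1] = (s, a, r, episode_i1[0][0])
    return linked + list(episode_i1)
```

The result has n + m transitions, and the transitions of episode i + 1 themselves are unchanged, so its own cumulative penalty is preserved.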


Network Parameter Settings
Table 1 provides details on the architectures of the actor and critic networks used in the simulations performed in this study. The six-dimensional relative target information and the 21-dimensional laser range findings are merged into a 27-dimensional input vector. The input layer of the actor network is followed by three fully connected layers, each of which contains 512 nodes and uses the rectified linear unit (ReLU) activation function. The output layer produces the linear and angular velocities through the hyperbolic tangent function, so both outputs lie in the range [−1, 1]. The output actions are multiplied by two scale parameters to obtain the final linear and angular velocities that are directly executed by the WMR. The critic network consists of an input layer, three fully connected layers, and an output layer; each fully connected layer contains 512 nodes and uses the ReLU activation function. The critic predicts the Q value of a state–action pair. Its input layer is the same as that of the actor network, and the output of its first fully connected layer is concatenated with the action produced by the actor network. The critic network finally generates a continuous Q value through a linear activation function.
Table 2 lists the hyperparameter settings employed during the simulation. The discount factor γ is set as 0.99, the learning rate of the actor and critic networks is set as 0.00001, and τ = 0.01 for the soft updating of the target networks. The size of the replay buffer is 40,000, and the batch size for updating is 128. Moreover, the maximum number of steps in each episode is set as 300. Simulations are performed on a computer with an Intel i7-11 CPU, 32 GB of memory, and an Nvidia RTX 4090 GPU. Python 3.10 is used as the project interpreter, and the deep-learning framework PyTorch 2.0 is employed for training the networks on a Windows system.
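The architecture in Table 1 can be sketched in PyTorch roughly as follows. Layer widths follow the text; the exact point at which the action is concatenated into the critic is our reading of the description, and the class and variable names are illustrative:

```python
import torch
import torch.nn as nn

STATE_DIM = 27   # 6-D relative target information + 21-D laser range findings
ACTION_DIM = 2   # normalized linear and angular velocity, each in [-1, 1]

class Actor(nn.Module):
    """Maps the 27-D state to two normalized velocity commands."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, ACTION_DIM), nn.Tanh(),  # tanh bounds outputs to [-1, 1]
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Scores a state-action pair; the action is concatenated with the
    output of the first fully connected layer, as described in the text."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(STATE_DIM, 512)
        self.fc2 = nn.Linear(512 + ACTION_DIM, 512)
        self.fc3 = nn.Linear(512, 512)
        self.q = nn.Linear(512, 1)  # linear activation -> continuous Q value

    def forward(self, state, action):
        h = torch.relu(self.fc1(state))
        h = torch.relu(self.fc2(torch.cat([h, action], dim=-1)))
        h = torch.relu(self.fc3(h))
        return self.q(h)
```

A forward pass with a batch of 27-dimensional states yields bounded two-dimensional actions from the actor and a scalar Q value per sample from the critic.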

Virtual Environments
To prove the effectiveness and superiority of the proposed survival penalty function, simulations are performed in which the DDPG and TD3 algorithms use the same reward functions and hyperparameters. Navigation tasks in four virtual environments are investigated. Table 3 presents the physical parameters of and constraints on the WMR in the learning environment.

Obstacle-Free Scenario
In this scenario, we construct a virtual navigation environment in which the initial and target yaw angles have the same magnitude but opposite signs. The distance between the initial and target points is set as 300. The navigation task is accomplished using a survival penalty function. Figure 8 shows the cumulative reward curves when the yaw angle is set as 90° and the DDPG and TD3 algorithms are employed. The red and blue curves, generated by the DDPG and TD3 algorithms, respectively, reveal large fluctuations in the estimated cumulative rewards at the beginning of training. The TD3 algorithm converges faster than the DDPG algorithm does; the reward curves of the TD3 and DDPG algorithms become consistent after approximately 400 and 600 episodes of training, respectively. In the stable region (i.e., after the aforementioned numbers of episodes), the mean of the blue curve is slightly lower than that of the red curve, and its variance is smaller.
The estimated cumulative reward is considerably larger than the ground-truth cumulative reward, which indicates that the AC framework always overestimates the cumulative reward. Because the TD3 algorithm employs two critic networks, it exhibits a smaller difference between the ground-truth reward and the estimated reward than the DDPG algorithm does.
Table 4 lists the simulation results obtained for autonomous navigation in virtual environments for various yaw angle pairs, including the estimated cumulative reward, ground-truth cumulative reward, ∆ reward, steps per episode, and success rate. The ∆ reward column is calculated by subtracting the ground-truth cumulative reward from the estimated cumulative reward; the estimated cumulative reward is always greater than the ground-truth cumulative reward. The autonomous path created by the TD3 algorithm has fewer execution steps than that created by the DDPG algorithm and thus leads to the navigation task being completed more quickly in the same operating environment. Similarly, the navigation success rate achieved using the TD3 algorithm is considerably higher than that achieved using the DDPG algorithm.
Figure 10 shows the trajectories of the normalized forward linear velocity v and angular velocity ω at a yaw angle of 80°. A maximum normalized linear speed of 1 represents the maximum forward speed, and a minimum normalized linear speed of −1 represents zero forward speed. Positive and negative normalized angular velocities correspond to leftward and rightward angular rates of the WMR, respectively. The thick black dashed line in Figure 10 shows the evaluation results obtained using the TD3 algorithm when the greedy noise ε is set to zero. The WMR starts at the maximum forward linear speed and performs a right-turn maneuver. After approximately 16 steps, the WMR adjusts to linear motion, and the normalized angular velocity ω approaches zero. After 36 steps, the WMR gradually turns right again and enters the target area. In the final stage of the navigation task, the WMR reduces its linear velocity and varies its angular velocity at high frequency to align with the target yaw angle. The WMR navigation path shown in Figure 9b uses an epsilon greedy noise of 0.5 to increase the path exploration capability. As shown in Figure 10, the fluctuation of the blue line produced by the TD3 algorithm with an epsilon greedy noise of 0.5 is significantly smaller than that of the red line generated by the DDPG algorithm. The total numbers of steps for the navigation simulation without and with epsilon greedy noise using the TD3 algorithm are 64 and 69, respectively. Thus, the TD3 algorithm has better noise immunity than the DDPG algorithm. In this study, the input state information s_t and the reward function r_t do not refer to the speed information of the WMR, so the trajectories of the forward linear velocity v and angular velocity ω are not very smooth. Liang [40] imposed a reward penalty when the angular velocity exceeds a threshold to reduce oscillatory behavior. Chai [26] proposed a hierarchical control framework consisting of an upper motion-planning layer and a lower tracking layer to form a collision-free algorithm that achieves autonomous exploration with high motion performance. The proposed navigation approach is suitable for low-speed movements.
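The mapping from normalized actor outputs to executable velocities can be sketched as follows. The convention that a normalized linear speed of −1 means zero forward speed follows the text; the scale parameters v_max and w_max are placeholders for the two scale parameters mentioned in the network description:

```python
def scale_action(a_lin, a_ang, v_max=1.0, w_max=1.0):
    """Map normalized actor outputs in [-1, 1] to executable WMR velocities.

    a_lin = -1 -> zero forward speed, a_lin = 1 -> maximum forward speed
    (the WMR never reverses); a_ang > 0 turns left, a_ang < 0 turns right.
    v_max and w_max are illustrative scale parameters.
    """
    v = v_max * (a_lin + 1.0) / 2.0   # [-1, 1] -> [0, v_max]
    w = w_max * a_ang                  # [-1, 1] -> [-w_max, w_max]
    return v, w
```

With this convention the actor's tanh output never commands reverse motion, which matches the forward-only linear velocity described for the WMR.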

Parking Scenario
To demonstrate the general ability of the proposed method to adapt to other environments, we consider a parking scenario. The simulated virtual environment consists of two parallel rows of parking spaces in a space with upper and lower boundaries (Figure 11), and partitions exist on both sides of each parking space. The lower-left parking space is considered the starting point, and the remaining seven parking spaces are the target positions. At the origin and destination, the front of the WMR points inside and outside the parking lot, respectively. Figure 11a,b display the autonomous paths planned by the DDPG and TD3 algorithms, respectively, for movement from the fixed initial configuration to the seven parking spaces. In none of the paths does the WMR collide with the partitions on either side of the parking spaces. The WMR can move in a straight line, perform a straight offset, or execute a U-turn and successfully enter the desired parking space. Table 5 summarizes the statistics for each of the seven autonomous parking scenarios, with 100 evaluations performed using an epsilon greedy noise of 0.5. The DDPG and TD3 algorithms produce almost the same ground-truth cumulative reward for each case. For all episodes, the TD3 algorithm exhibits considerably lower overestimation of the cumulative reward than the DDPG algorithm does. The TD3 algorithm can accomplish path planning with few steps in each navigation round and exhibits a high success rate.

Intersection Scenario
A virtual environment is designed that simulates a general intersection, as illustrated in Figure 12. The WMR starts from entrance 4 of the intersection (on the left in Figure 12), and the four possible destinations are the four exits, namely, the exits that can be reached by turning left, going straight, turning right, or making a U-turn. Figure 12a displays a general two-way intersection, whereas Figure 12b illustrates the same intersection but with an impassable circle in its center. The WMR should not collide with the lane separators or the central part of the roundabout. The red and blue curves displayed in Figure 12 indicate the autonomous paths generated by the DDPG and TD3 algorithms, respectively. In the case of turning left or going straight, the generated paths differ considerably between the two considered types of intersection (i.e., with and without an obstacle). When the agent encounters an obstacle, it passes parallel to the boundary of the obstacle and maintains a certain distance from it. For the intersection containing no obstacle, the autonomous paths generated by the DDPG and TD3 algorithms are almost coincident (Figure 12a), whereas in the environment containing an obstacle, the paths are slightly different (Figure 12b).
Table 6 presents the numerical statistics for the DDPG and TD3 algorithms in the intersection scenario. These algorithms achieve almost the same ground-truth cumulative rewards in each case. For the environment containing an obstacle, the cumulative reward computed using the DDPG algorithm exhibits a large overestimation compared with the ground-truth cumulative reward, which leads to many steps being required to accomplish the navigation task and a low success rate per episode. The TD3 algorithm, which uses the survival penalty function, not only manages complex and varied environments but also exhibits higher stability and efficiency than the DDPG algorithm does.

Multi-Obstacle Scenario
This virtual environment is designed to demonstrate the adaptability of our approach to scenarios with multiple obstacles. The scenario shown in Figure 13 has three obstacles. The WMR starts from the left side of the figure with a starting yaw angle of 0°. The destination is on the right, but the four possible target yaw angles are set to 0°, 90°, 180°, and 270°. For the first navigation situation, where the target yaw angle is 0°, the paths generated by the DDPG and TD3 algorithms pass below and above obstacle 1, respectively, and then enter the target location through the corridor between obstacles 2 and 3. When the target yaw angle is 90° or 270°, the paths established by the two algorithms almost overlap; the WMR passes through the upper or lower side of obstacles 2 and 3, respectively, and enters the target area. When the target yaw angle is 180°, the red path generated by the DDPG algorithm and the blue path generated by the TD3 algorithm enter the tail of the target area from above obstacle 2 and below obstacle 3, respectively, perform a U-turn, and successfully complete the navigation task. Although the target position is the same in each navigation situation, different collision-free autonomous paths are generated for different target yaw angles. Table 7 summarizes the simulation results of the four navigation situations in the multi-obstacle scenario. Owing to the vertical symmetry of the multi-obstacle design, the average number of steps for the 270° target yaw angle calculated by the TD3 algorithm is almost the same as that for the 90° target yaw angle. The TD3 algorithm shows higher stability than the DDPG algorithm.
To further verify the actual performance of our method, we conducted comparative simulations using these two algorithms in the four virtual environments. A common failure mode of the DDPG algorithm [30] is that the learned Q-function starts to dramatically overestimate Q-values, which then leads to policy failure because the policy exploits errors in the Q-function. The TD3 algorithm is designed to resolve the overestimation bias of the value function by learning two Q-functions instead of one. The TD3 algorithm chooses the smaller of the two Q-functions to form the target in the error loss function; using the smaller Q value as the target and regressing toward it helps avoid overestimation of the Q-function. To obtain good policy exploration, we add epsilon noise to the action. Designing a reward function that is consistent with an application is challenging. For example, an application of RL in electric vehicle battery-charging management [50] uses positive rewards to allow the manager to play a major role in coordinating the charging and discharging mechanism and effectively achieve a safe, efficient, and reliable power system. In autonomous navigation applications, as shown in Figure 5, negative rewards can encourage the agent to move to the destination as quickly as possible. By contrast, positive rewards may lead the agent to increase the navigation time (number of steps) to obtain the maximum reward. In our study, because the navigation time varies greatly among navigation situations, it is not directly designed as a penalty parameter.
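The clipped double-Q target described above can be sketched as follows. The discount factor matches Table 2, but the target-policy-smoothing noise parameters are illustrative assumptions, not values reported in the paper:

```python
import torch

def td3_target(r, s_next, done, actor_t, critic1_t, critic2_t,
               gamma=0.99, noise_std=0.2, noise_clip=0.5):
    """Clipped double-Q target: both critics regress toward the smaller of
    the two target-critic estimates, which curbs Q-value overestimation.
    noise_std/noise_clip implement target policy smoothing (illustrative)."""
    with torch.no_grad():
        a_next = actor_t(s_next)
        # perturb the target action with clipped Gaussian noise
        noise = (torch.randn_like(a_next) * noise_std).clamp(-noise_clip, noise_clip)
        a_next = (a_next + noise).clamp(-1.0, 1.0)
        q1 = critic1_t(s_next, a_next)
        q2 = critic2_t(s_next, a_next)
        q_min = torch.min(q1, q2)          # the pessimistic estimate
        return r + gamma * (1.0 - done) * q_min
```

Both critics are then trained against this single target, and the actor is updated (with a delay) against only one of them, which is the structure that gives TD3 its lower overestimation bias relative to DDPG.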
TD3 and DDPG are model-free, off-policy RL algorithms. Model-free algorithms learn directly from experience and can operate without complete knowledge of the environmental dynamics or transition model. Simulated environments are the most common means of training agents. The kinematic model of the WMR is based on a two-dimensional framework in the x- and y-directions. When a vehicle travels on uneven terrain, it generates significant vertical vibrations; movements in the z-direction may affect the precise positioning of the vehicle, which is a key requirement for autonomous navigation. Bikes, tripod cars, and four-wheel-drive cars are real vehicles with different motion structures. The simulation environment could use an accurate vehicle model to generate relevant experience, but this would require considerable computing time to solve the nonlinear vehicle equations. The agent relies on sampling to estimate the value function, which results in noisy estimates and slower convergence, but model-free algorithms can be used in large, complex environments. Our approach can be extended to complex 3D environment applications such as UAVs or robotic manipulators. The goal of path-planning techniques is not only to discover optimal, collision-free paths but also to minimize path length, travel time, and energy consumption, which requires further research. Using multiple robots to conduct collaborative multi-observation of multiple moving targets is a future research direction.
The main contributions of this study are summarized as follows:
(1) Appropriate reward functions are designed by considering human driving experience and using a survival penalty method so that a wheeled mobile robot (WMR) receives negative reward feedback at every step. The reward function varies continually on the basis of observation and action. Dense reward signals are found to improve convergence during training, and this finding verifies the feasibility of solving the sparse reward problem encountered in autonomous driving tasks for WMRs.
(2) Two consecutive RL episodes are connected to increase the cumulative penalty. This process is performed when the WMR collides with an obstacle. The agent is prevented from selecting the wrong path, learns a path in which obstacles are avoided, and successfully and safely reaches the target.
(3) The effectiveness and robustness of the proposed method are evaluated for four navigation scenarios. The simulation results indicate that the proposed method can effectively solve the problem of planning a path along which a WMR can drive to its destination in a complex environment. The results also indicate that the proposed method is robust and self-adaptive.
(4) By using the DDPG and TD3 algorithms with the same AC framework, the WMR can safely and quickly reach its target under the simultaneous alignment of position and orientation. For the same environment and standard, compared with the DDPG algorithm, the TD3 algorithm requires considerably fewer computations and has higher real-time performance.
(5) The proposed path-planning method can handle orientation and position simultaneously. In an environment without map information, the intermediate waypoints between the departure point and the destination do not need to be predetermined segment by segment, and an end-to-end collision-free autonomous U-turn path can be realized. End-to-end vehicle navigation has the potential to be used in high-end industrial robot applications such as electric vehicle parking and forklift picking and placing.

Conclusions
For the autonomous navigation of a WMR, the target orientation as well as the target location are considered task constraints. Traditional random-action exploration leads, with high probability, to sparse rewards. In our method, after each action, the agent acquires information regarding the relative position and orientation of the target state, and this information is suitable for providing a dense reward signal; providing a certain immediate reward is a considerably better option than providing no reward at all. The survival penalty function uses negative rewards to guide the agent to achieve the goal as quickly as possible. The concept underlying the design of the survival penalty function is similar to that underlying human driving behavior, and the negative reward provides a useful means of exploration for the training of agents. Training paradigms based on different task goals enable the state-update policy to be iteratively improved on the basis of the dense feedback signal provided by the environment during the interaction process. Every action taken by the WMR should be rewarded so that it can most effectively assess the quality of its actions, thereby alleviating training difficulty.
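A minimal sketch of a dense reward in the spirit of the survival penalty function is given below. The terminal values, weights, and exact terms are placeholders for illustration, not the paper's actual reward:

```python
import math

def survival_penalty_reward(d_rel, yaw_err, collided, reached,
                            w_d=1.0, w_yaw=0.5, step_cost=0.1):
    """Illustrative dense reward in the spirit of the survival penalty:
    every surviving step costs something, so the agent is pushed to reach
    the goal quickly.  All weights and terminal values are placeholders.

    d_rel   -- normalized distance to the target
    yaw_err -- orientation error to the target yaw, in radians
    """
    if collided:
        return -100.0   # large terminal penalty on collision
    if reached:
        return 100.0    # terminal bonus on success
    # dense negative feedback at every step: states that are farther away
    # or worse aligned cost more, so the reward varies with observation
    return -(step_cost + w_d * d_rel + w_yaw * abs(yaw_err) / math.pi)
```

Because every non-terminal step returns a negative value, lingering accumulates penalty, which is the mechanism the text credits with discouraging the agent from maximizing episode length.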
The DDPG and TD3 algorithms are deterministic policy gradient algorithms based on the AC framework. In the proposed method, these two algorithms are incorporated with high-precision LiDAR range findings. Two consecutive RL episodes in the experience pool are connected; this process helps the WMR to avoid collisions and enables paths to be planned for environments containing obstacles. The optimal policy can be regarded as a mapping from state to action, which enables the WMR to perform the optimally rewarded actions given the current state of the environment. We conduct simulations for four scenarios to verify the performance of the proposed approach: an obstacle-free space, a parking lot, intersections without and with an obstacle, and a multi-obstacle space. The actor network relies heavily on the critic network; thus, the performance of the DDPG algorithm is highly sensitive to critic learning. The DDPG algorithm is unstable because of its sensitivity to its hyperparameters and tends to converge to very poor solutions or even diverge. The TD3 algorithm, which has an auxiliary critic network, estimates cumulative rewards more accurately than the DDPG algorithm does and prevents high levels of cumulative reward overestimation. The simulations conducted in the four scenarios reveal that, compared with the DDPG algorithm, the TD3 algorithm converges considerably faster, has a superior convergence effect, and exhibits higher efficiency and adaptability in complex dynamic environments. The TD3 algorithm results in the effective avoidance of obstacles and a high rate of successful task completion.

Figure 4 .
Figure 4. State space of the WMR: (a) WMR pose information, (b) environmental information provided by a LiDAR sensor scan.

Figure 5 .
Figure 5. Cumulative rewards of the linear motion using positive and negative rewards: (a) positive reward, (b) negative reward.

Figure 6.
Figure 6. Survival penalty function that guides the WMR's movement.

Figure 7.
Figure 7. Preprocessing the replay buffer for obstacle avoidance. The red circle indicates that the original next state s_end is linked to the state of the first step of episode i + 1.
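The preprocessing shown in Figure 7 can be sketched as follows. The (s, a, r, s_next, done) transition layout and the collision test (a terminal step with negative reward) are assumptions for illustration:

```python
def link_collision_episodes(episodes):
    """Connect consecutive episodes in the replay buffer: when episode i
    ends in a collision, rewrite the final transition's next state to the
    first state of episode i+1, so the collision penalty propagates across
    the episode boundary instead of terminating the return there.

    Each episode is a list of (s, a, r, s_next, done) tuples; this tuple
    layout is an assumption for illustration.
    """
    for i in range(len(episodes) - 1):
        ep, nxt = episodes[i], episodes[i + 1]
        s, a, r, _s_end, done = ep[-1]
        if done and r < 0:                        # terminal collision step
            ep[-1] = (s, a, r, nxt[0][0], False)  # link to next episode's start
    return episodes
```

Clearing the done flag on the linked transition lets the value estimate bootstrap across the boundary, which is how the accumulated penalty is increased for colliding paths.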

Figure 8.
Figure 8. Comparison of cumulative rewards between the DDPG and TD3 algorithms at a yaw angle of 90°. Note: The vertical blue and red lines indicate that the estimated rewards calculated by the TD3 and DDPG algorithms gradually become stable and consistent after approximately 400 and 600 episodes of training, respectively.

Figure 9.
Figure 9 displays the autonomous paths planned by the system for initial yaw angles of 0°, 40°, 80°, 120°, 140°, 160°, and 180°. The red and blue curves represent the autonomous paths created by the DDPG and TD3 algorithms, respectively. The green lines radiating from the WMR are the LiDAR scan lines. The paths generated by the TD3 algorithm have small turning radii and involve considerable straight-line movement. The shapes of these paths approximate a section of a rounded rectangle.

Figure 10.
Figure 10. The trajectories of linear and angular velocities at a yaw angle of 80°: (a) linear forward velocity and (b) angular velocity.

Figure 11.
Figure 11. Path planning in a parking scenario when the DDPG and TD3 algorithms are used: (a) DDPG and (b) TD3. Note: Numbers indicate parking locations.

Figure 12.
Figure 12. Routing plan for an intersection without and with an obstacle: (a) obstacle-free scenario and (b) scenario with an obstacle. Note: Numbers represent the exit numbers.

The LiDAR range information, which measures the distance between the WMR and a detected obstacle, is normalized into the range [0, 1].

Table 1 .
Architectures of the actor and critic networks.

Table 2 .
Hyperparameter settings in the simulation.

Table 3 .
Geometrical parameters of and constraints on the WMR.

Table 4 .
Simulation results obtained for autonomous navigation under various yaw angle configurations.

Table 5 .
Simulation results obtained in the parking lot scenario.

Table 6 .
Simulation results obtained in the intersection scenario.

Table 7 .
Simulation results obtained in the multi-obstacle scenario.