Path-Following and Obstacle Avoidance Control of Nonholonomic Wheeled Mobile Robot Based on Deep Reinforcement Learning

Abstract: In this paper, a novel path-following and obstacle avoidance control method is presented for nonholonomic wheeled mobile robots (NWMRs), based on deep reinforcement learning. The model for path-following is investigated first, and then applied to the proposed reinforcement learning control strategy. The proposed control method can achieve path-following control through interacting with the environment of the set path. The path-following control method rests mainly on the design of the state and reward function in the training of the reinforcement learning. To additionally handle obstacle avoidance during following, the state and reward function are redesigned by utilizing both distance and direction perspectives, and a minimum representative value is proposed to deal with the occurrence of multiple obstacles in the path-following environment. Through the reinforcement learning algorithm deep deterministic policy gradient (DDPG), the NWMR gradually learns to follow the required path and avoid the obstacles in simulation experiments, and the effectiveness of the proposed algorithm is verified.


Introduction
Path-following has been considered as an alternative problem formulation for trajectory tracking problems [1]. The main task of path-following is to develop control laws for following a predefined path with minimum position error. In contrast to the trajectory tracking problem, path-following research focuses on the fact that the path is specified by a relatively independent timing control law, making it more flexible in terms of the control of the tracked object. Therefore, the path-following problem has been extensively studied in the field of control, for applications such as wheeled mobile robots [2,3], autonomous underwater vehicles [4,5], and quadrotors [6,7].
Currently, numerous control methods have been applied to the study of path-following problems, such as guiding vector field (GVF) [2,8], model predictive control (MPC) [7,9], sliding mode control (SMC) [2], etc. The GVF approach has been proposed to achieve path-following for a nonholonomic mobile robot, and global convergence conditions were established to validate the proposed algorithm [8]. Linear constrained MPC has been proposed to solve the path-following problem for quadrotor unmanned aerial vehicles [7]. There are also studies combining model predictive control with other control strategies; for instance, an information-aware Lyapunov-based MPC strategy was utilized to achieve classic robot control tasks in a feedback-feedforward control scheme [9]. A nonsingular terminal sliding mode control scheme was constructed to solve the control problem of the omnidirectional mobile robot with mecanum wheels [2]. Numerous intelligent computing methods have also been widely used in this research field [10][11][12]. Moreover, with the boom in artificial intelligence technology in recent years, investigations based on machine learning are emerging in the control field, especially for path-following problems [13].
Reinforcement learning (RL) is one of the classic types of machine learning. It is a learning paradigm concerned with learning to control a system in order to maximize a cumulative expected reward that expresses a long-term objective [14], and it can determine the optimal policy for decisions in a real environment. Recently, research into RL methods has been extended into multiple control fields such as trajectory tracking [15][16][17][18], path-following [19,20], etc. [21]. It is noted that RL is capable of coping with a control problem without prior information about the plant dynamics and delivers good control performance under the influence of external disturbances [15,16,20]. Reinforcement learning can also be combined with classical control methods to solve tracking problems [17]. Building on the PID method, a Q-learning-PID control approach has been proposed to solve the trajectory tracking control problem for mobile robots, with better results than either approach alone [18]. For path-following problems for unmanned surface vessels, a smoothly convergent deep reinforcement learning (SCDRL) approach has been investigated, utilizing a deep Q-network (DQN) structure and RL [22]. RL has also been used in research into path-following control for quadrotor vehicles, and has obtained outstanding results in physical verification [20]. These studies demonstrate the highly robust nature of RL control methods when handling dynamic model errors and confronting environmental disturbances [23].
Various RL-based control algorithms have been studied in the context of path-following control problems for mobile robots, such as the path-integral-based RL algorithm [24], the adaptive hybrid approach [25], etc. Following up with further complex research, the obstacle avoidance problem has been widely addressed in the study of the scalability problem of path-following [26]. Considering both path-following and obstacle avoidance based on the characteristics of reinforcement learning means specifically considering the environment and the reward. It is noted that a compromise must be reached, ensuring a sufficiently low-dimensional observation vector while still providing a sufficiently rich observation of the current environment [27]. Furthermore, it is not limited to a single agent that can be set up in a reinforcement learning environment, resulting in problem solutions with different dimensions. Two independent agents have been considered to solve the tracking and obstacle avoidance problems separately, and finally to realize the coordinated control of both [28].
The focus of this paper is to further explore how recent advances in RL can be applied to both the path-following and obstacle avoidance problems of nonholonomic wheeled mobile robots (NWMRs). The main contributions of the proposed method are as follows:
• The path-following control method is designed and implemented using the deep reinforcement learning algorithm DDPG, and reveals excellent performance regarding the efficiency and accuracy of the following control.
• A new path-following and obstacle avoidance control strategy for NWMRs is proposed based on the RL algorithm, specifically through the design of a new mechanism for the state and reward of both tasks in the environment, which simplifies the dimensionality of the environment state and ensures that the mobile robot can achieve the optimal trade-off between path selection and obstacle avoidance actions. Moreover, the minimum representative value approach for avoiding collisions is proposed to handle multiple obstacles alongside path-following control.
The rest of this paper is organized as follows. In Section 2, a kinematics model for NWMRs is established and the basics of the path-following problem are briefly introduced. In Section 3, the basics of reinforcement learning and DDPG are briefly introduced, and the path-following and obstacle avoidance control strategy incorporating RL is developed. In Section 4, the simulation results are presented and the experiments are discussed. The conclusion is presented in Section 5.

Kinematics Model for NWMRs
The aim of this paper is to solve the path-following and obstacle avoidance problems for NWMRs by utilizing the RL algorithm. The classical two-wheel differential-drive mobile robot is studied, as presented in Figure 1. When the robot moves, its state is given by the three-dimensional vector q = [x y θ]^T, the current pose in the two-dimensional coordinate plane. The parameters associated with the motion of the mobile robot are the linear velocity v and the angular velocity ω, which together form the control input u = [v ω]^T. The kinematics model of the mobile robot can be described as [29]:

ẋ = v cos θ, ẏ = v sin θ, θ̇ = ω. (1)

Figure 1. Mobile robot path-following schematic.
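A minimal sketch of simulating this kinematics model with a forward-Euler step (the function name and step size are illustrative, not from the paper):

```python
import math

def unicycle_step(x, y, theta, v, omega, dt):
    """One Euler-integration step of the classical differential-drive
    (unicycle) kinematics: x_dot = v*cos(theta), y_dot = v*sin(theta),
    theta_dot = omega."""
    return (x + v * math.cos(theta) * dt,
            y + v * math.sin(theta) * dt,
            theta + omega * dt)

# Driving straight along the x-axis for one 0.5 s step at 1 m/s:
pose = unicycle_step(0.0, 0.0, 0.0, 1.0, 0.0, 0.5)
```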

Path-Following
As distinguished from the trajectory tracking problem, the path-following problem aims at moving the system along a geometric reference without any pre-specified timing information. It is assumed that the parametrized regular curve in the two-dimensional coordinate space [30] is as given in Equation (2).
Here, the scalar variable θ_p is called the path parameter, and p: R → R^n is a parametrization of the path. The geometric curve p(θ_p) satisfies the characteristic of local bijectivity, and the map p is assumed to be sufficiently often continuously differentiable. As shown in Figure 1, the path can be considered as a sequence of coordinate points p_k(θ_p(t)), k ∈ N, where the timing t of θ_p is set arbitrarily. The direction line can be virtually set as the tangent line of the path at the point. In this paper, the discrete coordinate point of the path is [θ_p ρ(θ_p)]^T, and the mobile robot's desired position p = [θ_p ρ(θ_p) θ_path_r]^T can be considered as the path with a time law [1].
The position error of the mobile robot for path-following can be expressed by the tracking error expression [29] in Equation (3).
The goal of path-following is to guarantee that the position error converges to zero, i.e., lim_{t→∞} q_e(t) = 0.
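Equation (3) is the standard body-frame tracking error for differential-drive robots [29]: the world-frame position error is rotated into the robot frame. A sketch under that assumption (function and variable names are ours):

```python
import math

def tracking_error(pose, ref):
    """Tracking error of Equation (3): the world-frame error (xr - x, yr - y)
    rotated into the robot's body frame, plus the heading error."""
    x, y, theta = pose
    xr, yr, theta_r = ref
    dx, dy = xr - x, yr - y
    x_e = math.cos(theta) * dx + math.sin(theta) * dy    # longitudinal error
    y_e = -math.sin(theta) * dx + math.cos(theta) * dy   # cross-track error
    theta_e = theta_r - theta                            # angular error
    return x_e, y_e, theta_e
```

When the robot sits exactly on the desired pose, all three components vanish, which is the convergence goal stated above.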

Path-Following and Obstacle Avoidance Control Strategy Incorporating Reinforcement Learning
In this section, a reinforcement learning method is used to investigate path-following and obstacle avoidance for nonholonomic wheeled mobile robots, based on the kinematics model and the path-following model in the sections above.

Reinforcement Learning Control Method
RL can directly interact with the environment without having any information in advance [19]. Classical reinforcement learning approaches are based on the Markov decision process (MDP), consisting of the set of states S, the set of actions A, the rewards R, and the transition probabilities T that capture the dynamics of a system [31]. According to the Markov property, the next state s_{t+1} is obtained from the state s_t and action a_t through the transition probability model p(s_{t+1}|s_t, a_t), and a reward r(s_t, a_t) is obtained after the state transition is evaluated. The whole process from s_t to s_{t+1} can be considered as one training step of the reinforcement learning. The aim is to find the optimal strategy π*, i.e., the stochastic policy π(a|s) or the deterministic policy π(s), which can be evaluated using the state-value function V^π(s) or the action-value function Q^π(s_t, a_t); both value functions are shown in Equations (5) and (6), respectively, and the accumulated discounted reward R_t is given in Equation (7).
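The accumulated discounted reward R_t of Equation (7) can be computed with a simple backward recursion; a minimal sketch:

```python
def discounted_return(rewards, gamma=0.9):
    """Accumulated discounted reward R_t = sum_k gamma^k * r_{t+k},
    evaluated backwards so each reward is discounted exactly once."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Three unit rewards with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
value = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```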
Recently, researchers have used the techniques of experience replay and a separate target network to eliminate instability by establishing the large-scale neural network called DQN in the RL problem, which has already shown excellent performance [32]. However, DQN is limited by the discrete nature of its action space and is not capable of dealing with continuous control problems [19]. To overcome this difficulty, the deterministic policy gradient (DPG) was proposed for handling continuous action spaces [33]. The deterministic policy a = µ_θ(s): S → A, with parameter vector θ ∈ R^n, is considered instead of the stochastically selected policy π_θ(a|s). If the target policy is deterministic, the value function Q^π(s_t, a_t) can be expressed as Q^µ(s_t, a_t), and the expectation over actions can be avoided [33], as in Equation (8). If there is an approximator parametrized by θ^Q, it can be optimized by minimizing the loss in Equation (9), where the target y_t depends on θ^Q as given in Equation (10). Q(s, a) can be considered as the critic, which is learned by Q-learning using the Bellman equation, while the actor is updated using the expected return via the DPG in Equation (11) [33]. Following the approach used to handle large networks in DQN, the DDPG algorithm uses the experience replay and separate target network techniques to deal with large-scale neural network approximators in deep reinforcement learning. DDPG has two basic networks, called the critic and the actor, respectively. The whole structure of the algorithm is presented in Figure 2 and will be used in the following study of the control strategy. In the first step of the algorithm, the actor network selects an action (the control values) according to the current state of the training environment; executing this action yields the corresponding reward and the next state, and the resulting transition is stored in the replay buffer together with the action and state.
Then, a set consisting of state, action, reward, and next state is sampled from the replay buffer. The target critic network evaluates the action selected by the target actor network to produce the target value, the critic network produces the current value, and the critic network is then updated using the gradient of the TD (temporal difference) error between them. Finally, the critic network evaluates the action chosen by the actor network at the current state, realizing the forward and backward propagation that updates the actor network by the policy gradient. For the updating of the networks, the critic network is updated using the gradient of the loss function L(θ^Q) in (9), whereas the actor network uses the deterministic policy gradient in (11).
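A minimal PyTorch sketch of one critic/actor update and the soft target-network update; the network sizes, the random mini-batch (standing in for replay-buffer samples), and the hyperparameter values here are illustrative, not the paper's exact settings:

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
state_dim, action_dim, gamma, tau = 5, 2, 0.9, 0.01

actor = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(),
                      nn.Linear(32, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 32), nn.ReLU(),
                       nn.Linear(32, 1))
actor_t, critic_t = copy.deepcopy(actor), copy.deepcopy(critic)  # targets
opt_a = torch.optim.Adam(actor.parameters(), lr=1e-3)
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-2)

# Fake mini-batch standing in for samples (s, a, r, s') from the replay buffer.
s = torch.randn(64, state_dim)
a = torch.randn(64, action_dim)
r = torch.randn(64, 1)
s2 = torch.randn(64, state_dim)

# Critic update: minimize the TD loss of (9) against the target networks.
with torch.no_grad():
    y = r + gamma * critic_t(torch.cat([s2, actor_t(s2)], dim=1))
critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), y)
opt_c.zero_grad()
critic_loss.backward()
opt_c.step()

# Actor update: deterministic policy gradient (11), ascend Q(s, mu(s)).
actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
opt_a.zero_grad()
actor_loss.backward()
opt_a.step()

# Soft target-network update with transition gain tau.
for net, tgt in ((actor, actor_t), (critic, critic_t)):
    for p, pt in zip(net.parameters(), tgt.parameters()):
        pt.data.mul_(1 - tau).add_(tau * p.data)
```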

Path-Following and Obstacle Avoidance Controller Based on DDPG
The algorithm performance and convergence speed of reinforcement learning depend strongly on the design of the state space, action space, and reward. In the process of path-following and obstacle avoidance control, using two agents to achieve the control tasks is complicated and inconvenient [19]. To address these issues, this paper unifies the two types of control by designing the state space and reward around the specific requirements of both. This ensures that the wheeled mobile robot achieves effective obstacle avoidance in the process of path-following.
In this paper, the primary goal is to minimize the errors expressed in the above goal of path-following, and the state space S is expressed as S = [x_e ỹ_e θ_e]. Considering obstacle avoidance control, the state space S can be redesigned as S = [x_e ỹ_e θ_e d_obs θ_obs_e], where d_obs and θ_obs_e are the state parameters for avoiding collisions: d_obs represents the distance between the obstacle center and the center O_t of the robot, and t is the current time step, as shown in Figure 3. When the robot is far from the obstacle during path-following, the current control of the robot is considered to be relatively safe, so these parameters are only considered once the robot enters the region set for avoiding obstacles. Therefore, this paper divides the space around each obstacle into avoidance regions, as shown in Figures 3 and 4, and the parameters d_obs and θ_obs_e are defined in terms of r_e, the radius of the minimum obstacle avoidance control area. The above is considered for the case of a single obstacle; it is not applicable to an environment containing multiple obstacles. In the case of multiple obstacles, the minimum representative value technique is proposed: considering the several values of d_obs and θ_obs_e, the minimum value of each parameter is chosen as an element of the state for avoiding further collisions. The minimum representative value can be expressed as d_obs = min_{i ∈ [1,k]} d_obs,i and θ_obs_e = min_{i ∈ [1,k]} θ_obs_e,i,

where k is the number of obstacles. Since there are multiple obstacles corresponding to different states, for reward setting considering all obstacles, penalty rewards are considered for them all. The minimum representative approach is then able to ensure that the agent maintains the behavior of obstacle avoidance during training and that it is feasible to achieve path-following and obstacle avoidance control for NWMRs. In RL, the agent can learn to adjust its strategy according to the reward so that it can avoid multiple obstacles in path-following. Compared with the environmental state set according to the number of obstacles [28], this method can reduce the dimensionality of the state, thus achieving the effect of obstacle avoidance while reducing the computational burden and saving computing resources.
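A plausible implementation of the minimum representative value for k obstacles. This is our reading of the text: the nearest obstacle's distance and its heading error are kept as the two state elements, and the default values returned outside every avoidance region are an assumption:

```python
import math

def min_representative(robot_pose, obstacles, r_e):
    """Reduce k obstacles to a single (d_obs, theta_obs_e) pair by keeping
    the smallest obstacle distance and its corresponding heading error.
    Obstacles farther than r_e (the avoidance-region radius) are ignored."""
    x, y, theta = robot_pose
    d_min, th_min = r_e, 0.0   # assumed defaults outside every region
    for ox, oy in obstacles:
        d = math.hypot(ox - x, oy - y)
        if d < d_min:
            d_min = d
            th_min = math.atan2(oy - y, ox - x) - theta
    return d_min, th_min
```

Only two state entries are needed regardless of k, which is the dimensionality reduction claimed above relative to a per-obstacle state [28].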
Considering the path-following issue only, based on the current evaluation of the robot status, the basic reward function is designed as follows. When the robot moves into the obstacle avoidance regions, the reward function is redesigned by adding extra punishments or rewards to the original tracking-control reward, making the robot capable of bypassing the obstacle without collisions. Based on the division into different regions, the reward function for the obstacle avoidance regions is expressed in terms of φ_rz and φ_pz, parameters that limit the reward and penalty, respectively; both prevent large, abrupt changes in the single-step reward, which can cause instability during training. More specifically, when the NWMR moves into the Punishment Zone, not only is the distance to the circle of the obstacle considered in the reward function, but the error between the robot navigation angle and the obstacle direction is also taken into account, moderated by the parameter ξ according to the error θ_obs_e. If the NWMR collides with an obstacle during movement, the task is considered a failure, a severe penalty ψ is imposed directly as the reward in this step, and the training environment switches to a new episode. Owing to this severe negative reward, the robot is able to retain the basic path-tracking control strategy while learning obstacle avoidance, and eventually it is able to complete the motion control for the whole set path while avoiding the obstacles.
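The numbered reward equations are given in the paper but not reproduced here, so the following schematic only mirrors their structure: a base tracking term, capped zone rewards/penalties using φ_rz and φ_pz, a heading term weighted by ξ in the Punishment Zone, and the severe collision penalty ψ. All functional forms and weights are illustrative assumptions, not the paper's equations:

```python
def step_reward(err, d_obs, theta_obs_e, *,
                r_obs=3.5, r_c=5.5, r_e=7.5,
                phi_rz=100.0, phi_pz=10.0, xi=1.0, psi=100.0):
    """Schematic zone-based reward. err = (x_e, y_e, theta_e); zone radii
    and caps mirror the paper's parameter names, forms are illustrative."""
    reward = -(err[0] ** 2 + err[1] ** 2 + err[2] ** 2)  # base tracking term
    if d_obs <= r_obs:
        return -psi                    # collision: severe penalty, episode ends
    if d_obs <= r_c:                   # Punishment Zone: distance + heading terms
        reward -= min(phi_pz, (r_c - d_obs) + xi * abs(theta_obs_e))
    elif d_obs <= r_e:                 # outer zone: mild capped shaping reward
        reward += min(phi_rz, r_e - d_obs)
    return reward
```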
According to the above path-following and obstacle avoidance control strategy, the control process based on deep reinforcement learning is shown in Algorithm 1:

Algorithm 1 Path-Following and Obstacle Avoidance Control Strategy for NWMRs
Require: robot random initial pose q_0, path p, training Max_episode, time step ∆t, learning rate α for the actor network and β for the critic network, parameter τ ∈ (0, 1) for stability of training, discount factor γ, experience replay buffer size N, the number k of obstacles, obstacle positions o_obs_i, i ∈ [1, k], parameters r_obs, r_e, r_c, θ_s, and parameters related to obstacle avoidance ψ, ξ, φ_rz, φ_pz;
Initialize: critic network Q(s, a|θ^Q) and actor network µ(s|θ^µ) randomly, target networks Q'(s, a|θ^Q') and µ'(s|θ^µ');
1: for each episode ∈ [1, Max_episode] do
2:   Obtain an observation of the random initial pose of the NWMR in the environment, output the position error q_e(t) through the path parameters, and obtain the initial state s_1 = [x_e ỹ_e θ_e d_obs θ_obs_e];
3:   Initialize a random noise N_t for the deterministic strategy;
4:   for each time step t do
5:     Select an action as a control input based on the current strategy and exploration noise: u_t = µ(s_t|θ^µ) + N_t;
6:     Execute u_t, then obtain reward re_t and new state s_{t+1};
7:     Put the transition (s_t, u_t, re_t, s_{t+1}) into the experience replay buffer D;
8:     if number of transitions > Memory then
9:       Randomly extract a batch of transitions from D;
10:      Update the actor network and critic network by (9) and (11);
11:      Update the target networks for stable training as θ' ← τθ + (1 − τ)θ';
12:     end if
13:   end for
14: end for

Results and Discussions
In order to verify the path-following and obstacle avoidance control strategy proposed in this paper, several sets of simulation experiments were conducted. Firstly, path-following alone was investigated and compared with the model predictive control (MPC) method. Secondly, path-following and obstacle avoidance simulation experiments were conducted, and the results validate the effectiveness of the proposed controller for multiple obstacle avoidance in the training environments.

Training Setting
In the simulation environment, the initial position [x_init ỹ_init θ_init] of the NWMR was randomly selected around the end point of the path, where ∆e_1, ∆e_2, ∆e_3 ∈ [−1, 1] were used to generate different initial values at the start of each episode, and the maximum linear and angular velocities were set as 3 m/s and π/2 rad/s, respectively, in the training. In the training simulation, the time step ∆t was set to 0.5 s (2 Hz) and the size of the mini-batch to 64. To establish the training networks, the Adam optimizer was used to train both the actor and critic networks. The hyperparameters are shown in Table 1, and the networks were built using the machine learning library PyTorch. The learning rate α was set to 0.001 for the actor network, and β was set to 0.01 for the critic network. The target network transition gain τ was selected as 0.01, and the discount factor γ was selected as 0.9. For exploration during training, the Ornstein-Uhlenbeck method discussed in [34] was used. The sinusoidal path was parametrized as investigated in [22].
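The Ornstein-Uhlenbeck exploration noise of [34] can be sketched as follows; the class name and the θ and σ defaults are common illustrative choices rather than values reported in the paper, while dt matches the 0.5 s time step:

```python
import random

class OUNoise:
    """Temporally correlated Ornstein-Uhlenbeck noise, added to the
    deterministic action mu(s_t) in DDPG for exploration."""
    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2, dt=0.5):
        self.dim, self.mu = dim, mu
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.x = [mu] * dim

    def sample(self):
        # x <- x + theta*(mu - x)*dt + sigma*sqrt(dt)*N(0, 1)
        self.x = [xi + self.theta * (self.mu - xi) * self.dt
                  + self.sigma * (self.dt ** 0.5) * random.gauss(0.0, 1.0)
                  for xi in self.x]
        return list(self.x)

noise = OUNoise(dim=2)        # one channel per control input (v, omega)
perturbation = noise.sample() # added to the actor output at each step
```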

Comparison of Path-Following between the Proposed Method and MPC
In this experiment, the agent was trained for 400 episodes, with a total of 240,000 training steps in the simulations. To validate the path-following capabilities of the proposed method, the MPC algorithm [35] was introduced as a baseline, and a comparison was made between the two methods. Moreover, the effectiveness of the proposed algorithm is further illustrated by showing the following effects at different training stages. Figure 5a shows the path-following effect of the proposed algorithm together with the MPC algorithm for comparison. As shown in Figure 5b, the errors at each waypoint of the path are presented. The convergence performance of the proposed algorithm is better than that of the MPC algorithm at the turn; the longitudinal error using the proposed algorithm is also smaller than that using MPC, and the other errors show the same or better performance. The comparison of inputs between the two algorithms is given in Figure 5c. Figure 5d presents the reward changes at every step of the final episode in the training, and Figure 5e shows the average score per 100 episodes over the whole training process. The trend of the reward score is consistent with the path-following effect, as confirmed by the results in Table 2, which represents four stages of the training process, chosen as 100, 200, 300, and 400 episodes. The agent tended to perform poorly in following the path at the initial stage of the training, with behavior largely unrelated to the purpose of path-following. As the training continued, the path-following performance improved, resulting in a decrease in the cross-track error and the angular error, with scores climbing steadily. Finally, the proposed algorithm achieves better performance than MPC in the comparison of the longitudinal and cross-track errors.
According to the results of the path-following simulation experiments, the proposed control strategy performs more robustly and more accurately. It is capable of moving close to the waypoints of the path from different starting points, whereas the comparison algorithm needs its parameters adjusted to meet the requirements of each initial point.

The Performance of Path-Following with Collision Avoidance
In this experiment, the agent was trained for 1000 episodes, with a total of 600,000 training steps in the simulations. Two obstacles were chosen in order to perform a validation study of the proposed algorithm for path-following and obstacle avoidance. The learning rate β was reduced to 0.002 for the critic network, and the discount factor γ was changed to 0.98. The number of obstacles k was set at two, with centers at [10, 0] and [47, −10] around the waypoints of the reference path, and the common parameters r_obs, r_c, r_e were set to 3.5 m, 5.5 m, and 7.5 m, respectively. The parameters φ_rz and φ_pz were set to 100 and 10, as the limits on the reward and penalty, respectively. For each obstacle, the severe penalty ψ was 100 in every episode, and the parameter θ_s was set at π/4. Figure 6a shows the results of the proposed algorithm when considering path-following and obstacle avoidance control simultaneously. In Figure 6b, the errors at each waypoint of the path are presented. The inputs of the proposed algorithm are given in Figure 6c. Figure 6d presents the reward changes at every step of the final episode of the training; a certain penalty score can be seen when the robot moves into the obstacle avoidance area, which shows that the robot is able to achieve the obstacle avoidance operation within the divided area while path-following. Figure 6e shows the reward changes in the final episode and across all episodes in the training, and the trend of the reward score is consistent with the path-following effect, as confirmed by the results in Table 3, which represents four stages of the training process, chosen as 100, 200, 400, and 1000 episodes.

Conclusions
In this paper, a deep-reinforcement-learning-based controller was proposed for path-following for nonholonomic wheeled mobile robots (NWMRs). The deep deterministic policy gradient (DDPG) algorithm was utilized to establish a control law for the linear and steering velocities, and the learning-based control policy was trained through repeated path-following simulations. The path-following results demonstrated the effectiveness of the proposed method, and the comparisons showed that our method had better efficiency and more robust performance than the MPC method for path-following control without collisions. For path-following and obstacle avoidance control, a new approach was proposed to redesign the state and reward in RL. Moreover, the minimum representative value technique for the state was given for the path-following and obstacle avoidance controller, and the results showed the feasibility of solving the multiple-obstacle environment problem during path-following control of NWMRs.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

NWMR	Nonholonomic wheeled mobile robot
RL	Reinforcement learning
DDPG	Deep deterministic policy gradient
DQN	Deep Q-network
DPG	Deterministic policy gradient
MPC	Model predictive control
GVF	Guiding vector field
SMC	Sliding mode control
MDP	Markov decision process
SCDRL	Smoothly convergent deep reinforcement learning
TD	Temporal difference