Path Following for Autonomous Ground Vehicle Using DDPG Algorithm: A Reinforcement Learning Approach

Abstract: The potential of autonomous driving technology to revolutionize the transportation industry has attracted significant attention. Path following, a fundamental task in autonomous driving, involves accurately and safely guiding a vehicle along a specified path. Conventional path-following methods often rely on rule-based design or parameter tuning, which may not adapt well to complex and dynamic scenarios. Reinforcement learning (RL) has emerged as a promising approach that can learn effective control policies from experience without prior knowledge of system dynamics. This paper investigates the effectiveness of the Deep Deterministic Policy Gradient (DDPG) algorithm for steering control in ground vehicle path following. The algorithm converges quickly, and the trained agent achieves stable and fast path following, outperforming three baseline methods. Additionally, the agent achieves smooth control without excessive actions. These results validate the effectiveness of the proposed approach, which could contribute to the development of autonomous driving technology.


Introduction
Autonomous driving is increasingly valued for its potential to reshape mobility and to enhance both safety and efficiency [1]. In urban environments, self-driving vehicles are less prone to traffic accidents, allowing the traffic network to flow more smoothly. In agricultural areas, reducing the required labor helps compensate for the declining agricultural labor force and makes it possible to complete agricultural tasks during hours that are unsuitable for human work. Although various technologies such as localization, perception and planning are required to realize the full potential of autonomous driving, ensuring that the vehicle can stabilize at a set point or dynamically track reference signals and trajectories is a fundamental technology requirement [1][2][3].
Path following has been widely studied as an alternative to trajectory tracking in various types of vehicles due to its looseness with respect to time constraints [2]. In trajectory-tracking problems, the reference trajectory defines when and where the vehicle is supposed to be in the state space, whereas the objective of path following is to keep the vehicle as close as possible to, and moving along, a predefined geometric path with no preassigned time information [2,4,5]. Pure pursuit is one of the earliest proposed strategies for following a path [6,7]. The straightforward nature of this strategy has made it a popular choice in many applications where real-time control is essential. Assuming that the reference path has no curvature and the vehicle is moving at a constant speed, the pure pursuit controller fits a circle through the vehicle's current configuration, which in the case of a car is the rear wheel position, and a point on the path ahead of the vehicle at a so-called look-ahead distance L [1,8]. However, the controller does not consider the case where the distance from the current configuration to the reference path is greater than L. Among the feedback control-based approaches, [9] proposed parameterizing the reference path as a continuous function and using the rear wheel position as the regulated variable to minimize the cross-track error between the rear wheel and the path, while also ensuring the stability of the vehicle heading. Notwithstanding the local exponential convergence ensured by the control law, knowledge of the reference path's curvature is required, so the parameterized function must be twice continuously differentiable. Another feedback-based control law, proposed in [10], minimizes the cross-track error of the front wheel position relative to a reference path that is discretized from the base trajectory into smooth waypoints.
This approach, which utilizes a nonlinear feedback function of the cross-track error to ensure local exponential convergence to the reference path, provides effective control for lower speeds, but necessitates some variations to be used for reverse driving. The vehicle robot using this controller for steering, namely Stanley, won the 2005 DARPA Grand Challenge. The aforementioned approaches [6,9,10] are suitable as baselines for reference and comparison, since they achieve satisfactory performance with a small number of adjustable parameters, moderate model accuracy and uncertainty requirements.
In recent years, RL has achieved significant accomplishments in many fields, such as gaming and robotics, which have brought it great attention and widespread recognition [11,12]. RL is a machine learning approach that enables an agent to learn optimal decisions through interactions with the environment. Since the use of deep neural networks to approximate the action-value function [13], also known as the Q-function, RL has been capable of dealing with high-dimensional state spaces. The Deterministic Policy Gradient (DPG) algorithm [14] replaces the representation of the policy as a probability distribution with a deterministic function, making control in continuous action spaces more practical. The DDPG algorithm [15], which combines the two approaches, has thus found widespread application in fields such as robotics control. Over the last few years, numerous researchers have attempted to apply deep reinforcement learning (DRL) methods to the path-following problem. Rubí et al. [5] proposed three sequentially improved methods based on DDPG, which realized path tracking and adaptive velocity control of a quadrotor. It is difficult to fine-tune an agent trained in a simulation environment through experimentation; thus, a correction constant is employed in their experiment to stabilize the agent's output. Cheng et al. [16] accomplished path following and collision avoidance for a nonholonomic wheeled mobile robot in simulation, but the trained agent exerted excessive control effort, resulting in high jerks in the robot velocities. More recently, Zheng et al. [17] proposed a 3D path-following control method for powered parafoils, combining linear active disturbance rejection control and DDPG to effectively control the parafoils' flight trajectory and counter wind disturbances. Ma et al. [18] presented a path-following control scheme based on Soft Actor-Critic for an underwater vehicle, demonstrating successful path tracking.
Our work focuses on exploring the application of RL to the path-following problem for ground car-like vehicles that move at medium speeds, with the primary objective of minimizing cross-track error. We use the DDPG algorithm, which combines the advantages of Deep Q-Network (DQN) and DPG, to solely address the steering control of the vehicle, separating lateral and longitudinal control to simplify the problem and reduce the action dimension. We also use a simple reward function to achieve smooth steering and avoid jitters during vehicle operation. This approach has been demonstrated to be easy to train and effective, with better performance on the trained path compared to three baseline methods and comparable performance on untrained paths. The trained agent's control strategy has been shown to respond quickly to lateral offset from the target path with an acceptable overshoot. Given the advantages of this method, it has great potential for practical application.
The remainder of this paper is structured as follows. In Section 2, the path-following problem is described using a general approach, allowing for broad applicability. The kinematics model of the car-like vehicle, which relates to the reference path, is then introduced in detail. Section 3 provides an overview of the prerequisite knowledge of reinforcement learning, which forms the foundation of the actor-critic algorithm known as DDPG. This is followed by a description of the design and algorithm of the path-following controller based on DDPG, including implementation details. In Section 4, the training process and corresponding agent performances on three different paths are presented. Each case is compared and numerically analyzed against three baseline methods, with additional discussion on corresponding steering actions. The conclusion is presented in Section 5.

Problem Formulation and Modeling
In this section, we present the problem of path-following for ground car-like vehicles in a planar environment. The problem is described in a generalized manner, making it applicable to a variety of vehicle models. The ground vehicles are modeled as simplified two-wheeled vehicles, where it is assumed that the left and right wheels at the front and rear of the vehicle are consolidated at the center position of the axle. Additionally, the model takes into account the inertia effects.

Path Following
For a controlled nonlinear system of the form below, where x ∈ X ⊆ R n and u ∈ U ⊆ R m define state and input constraints, the objective of path following is to design a controller, such that the system follows a parametrized reference path [2]. The reference path can be given by Equation (2), where only the movement on the plane is considered.
For any given ω, a local tangential path reference frame {T} centered at p T (ω) can be defined, which is indicated by the subscript T. The relative angle σ between the fixed ground frame {O} and the local reference frame {T} can be calculated with Equation (3).
where the function atan2 is the four-quadrant version of arctan that returns the angle between the positive x-axis and a point [x_T, y_T]^T in the Cartesian plane and is positive counterclockwise. It follows that the parametrized reference path must be continuously differentiable. Given the vehicle's posture [x(t), y(t), φ(t)]^T at time t, the path-following error is given by Equation (4), which is a cross product between two vectors. The control objective is to guarantee that the path-following error converges such that lim_{t→∞} e_p(t) = 0.
where d = (d_x, d_y) is the tracking error vector and t = (t_x, t_y) is the unit tangent vector to the reference path at ω(t), as defined in Equations (5) and (6), respectively:

d = (x(t), y(t)) − (x_T(ω(t)), y_T(ω(t)))  (5)

t = (x_T'(ω(t)), y_T'(ω(t))) / ‖(x_T'(ω(t)), y_T'(ω(t)))‖  (6)

The relative orientation ψ(t) between the vehicle and the path at time t is given by Equation (7). It indicates whether the vehicle is moving towards the direction of the path or away from it. In practice, we normalize ψ(t) to restrict its value to the range of −π to π. Normalizing the angle difference is a common technique: although the trigonometric operations remain the same, constraining the angle to this interval reduces the observation space.
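As a concrete sketch, the error quantities above can be computed from a parameterized path and its derivative; the function and argument names below are ours, and the straight-line path is used only for illustration.

```python
import numpy as np

def path_errors(pos, heading, p_T, dp_T, w):
    """Cross-track error e_p (Eq. (4)) and relative orientation psi (Eq. (7)).

    pos      : vehicle position [x, y] in frame {O}
    heading  : vehicle heading phi in radians
    p_T, dp_T: reference path and its derivative, callables of the
               path parameter w (names are ours, not from the paper)
    """
    d = np.asarray(pos) - p_T(w)                 # tracking error vector (Eq. (5))
    t = dp_T(w) / np.linalg.norm(dp_T(w))        # unit tangent vector (Eq. (6))
    e_p = t[0] * d[1] - t[1] * d[0]              # 2D cross product (Eq. (4))
    sigma = np.arctan2(t[1], t[0])               # path frame angle (Eq. (3))
    psi = (heading - sigma + np.pi) % (2 * np.pi) - np.pi  # wrap to [-pi, pi)
    return e_p, psi

# Straight path along the x-axis: a vehicle 0.5 m to the left has e_p > 0.
e, psi = path_errors([3.0, 0.5], 0.0,
                     lambda w: np.array([w, 0.0]),
                     lambda w: np.array([1.0, 0.0]), 3.0)
```

With the sign convention of Figure 2, the cross product is positive when the vehicle lies to the left of the path tangent.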
To find the point along the reference path from which to calculate the cross-track error, the point nearest to the vehicle is selected [9,19,20]. This gives rise to an optimization problem of finding the parameter ω that minimizes the distance between the vehicle position and the reference path. We use the squared Euclidean distance, as it is equivalent to the original optimization problem and computationally more convenient. The optimization problem can be expressed as follows: One natural approach for updating the path variable ω is to iteratively compute the value that yields the nearest distance between the vehicle and the reference path using Newton's method [19]. Because Newton's method only guarantees a local optimum, initializing it with the previous path variable value prevents sudden jumps in the path variable and promotes stability in the optimization process. One can refer to [21] for details.
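A minimal sketch of this Newton update, assuming the path p(ω) and its first and second derivatives are available as callables (names and iteration count are ours):

```python
import numpy as np

def update_path_parameter(pos, p, dp, ddp, w0, iters=5):
    """Newton iteration on f(w) = ||p(w) - pos||^2, starting from the
    previous path parameter w0 so only the nearby local optimum is found."""
    w = w0
    for _ in range(iters):
        d = p(w) - np.asarray(pos)
        grad = 2.0 * d @ dp(w)                      # f'(w)
        hess = 2.0 * (dp(w) @ dp(w) + d @ ddp(w))   # f''(w)
        if abs(hess) < 1e-12:
            break
        w -= grad / hess
    return w

# Circle of radius 10: the nearest point to (10.5, 0) is at angle w = 0.
p   = lambda w: 10.0 * np.array([np.cos(w), np.sin(w)])
dp  = lambda w: 10.0 * np.array([-np.sin(w), np.cos(w)])
ddp = lambda w: 10.0 * np.array([-np.cos(w), -np.sin(w)])
w_star = update_path_parameter([10.5, 0.0], p, dp, ddp, w0=0.3)
```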

Kinematics Model for Car-like Vehicles
Car-like vehicles are a class of vehicles that are capable of independently controlling their forward speed and steering, with the most commonly used model being the equivalent kinematics bicycle model [1,22,23]. The bicycle model as illustrated in Figure 1 consists of two wheels connected by a rigid link and is restricted to movement in a plane, where the front wheel is allowed to rotate about the axis vertical to the plane and the rear wheel fixed to the body provides forward momentum. More generally, vehicles with these constraints on maneuverability are referred to as nonholonomic vehicles. The parameters and notations of the vehicle model used in Figure 1 are presented in Tables 1 and 2, respectively.
The motion in the lateral and yaw directions of the vehicle's center of gravity (C.G.) is given by the following equation [22,24,25].
where cornering forces generated in the front and rear tires, denoted as F f and F r , respectively, are expressed as follows.
The motion equation for the vehicle model is derived by substituting Equation (10) into Equation (9), resulting in the following expression.
where δ is an additional degree of freedom of the front tire, which rotates to steer [26]. Due to vehicle mechanics, the tire angle δ is usually limited to a range δ ∈ [δ_min, δ_max]. Let x = [x, y, φ]^T ∈ R^3 denote the posture of the vehicle's C.G. in the fixed ground frame {O}. Here, the heading angle φ corresponds to the orientation of the vehicle, measured as the angle between the x-axis of frame {O} and the direction in which the vehicle is moving; positive values indicate a counterclockwise rotation. The kinematics model that describes the differential constraints of the vehicle is given by Equation (12). The path-following problem based on this model is depicted in Figure 2. In the figure, e_p denotes the cross-track error, which represents the deviation between the tangent at the nearest point on the path and the C.G. of the vehicle. When the vehicle is on the left side of the path, e_p > 0; conversely, when the vehicle is on the right side of the path, e_p < 0. ψ signifies the relative orientation between the C.G. of the vehicle and the path. When ψ ranges from −π/2 to π/2, the vehicle's travel direction aligns with the path; when |ψ| exceeds π/2, the vehicle is traveling in the opposite direction to the path; a value of π/2 represents the vehicle moving perpendicular to the path. In this study, we address the problem of path following for a car-like vehicle with the structure illustrated in Figure 3, which separates longitudinal and lateral control. Our focus is on the steering control, determined by a DDPG agent [15], with the aim of keeping the vehicle as close as possible to the reference path p_T(s) while maintaining a certain velocity V*.
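For illustration, a single Euler integration step of the widely used kinematic bicycle model (one common form of the differential constraints in Equation (12); the wheelbase and speed values below are placeholders, not the paper's parameters) can be sketched as:

```python
import math

def bicycle_step(x, y, phi, delta, v, wheelbase, dt):
    """One Euler step of the standard kinematic bicycle model.

    x, y  : position of the reference point in frame {O}
    phi   : heading angle, positive counterclockwise
    delta : front-tire steering angle
    """
    x   += v * math.cos(phi) * dt
    y   += v * math.sin(phi) * dt
    phi += v / wheelbase * math.tan(delta) * dt
    return x, y, phi

# Driving straight (delta = 0) only advances the position along the heading.
x, y, phi = bicycle_step(0.0, 0.0, 0.0, 0.0, v=5.0, wheelbase=2.5, dt=0.1)
```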
The three baseline methods [6,9,10] also rely on this separated control structure, where the path-following algorithm is responsible for providing the commanded steering angle value. Specifically, in this study, the controllers are assumed to be employing identity transformations. We focus on the path-following algorithm, which is solely used to determine the steering angle command δ * .

Path-Following Control Strategy with Deep Deterministic Policy Gradient
In this section, we employ a DRL approach to investigate path following for car-like vehicles in a customized simulation environment. This approach is based on the kinematic bicycle model and reference path definition introduced in the preceding section. Prior knowledge about DDPG is first discussed, followed by its implementation in the context of path following.

Preliminaries of Reinforcement Learning
RL involves an agent interacting with the environment to learn the optimal actions that maximize cumulative reward. Here, we consider a standard RL architecture using a Markov Decision Process (MDP), where based on a given state s t ∈ S at time t the agent takes an action a t ∈ A, receives a corresponding reward r t and transitions to the next state s t+1 with a probability p(s t+1 |s t , a t ) [27][28][29]. The MDP models the decision-making process of the agent as it interacts with the environment. Actions taken by the agent are specified by a policy, in general, which is stochastic and denoted by π(a|s). The policy π : S → P (A) maps the states to a probability distribution over actions, specifying the likelihood of taking each action given a state.
The return, which the agent seeks to maximize, is the total discounted reward from time-step t onwards, as defined in Equation (13).
where γ ∈ (0, 1) is a discount rate used to prevent the return from diverging to infinity. Let ρ^π denote the discounted state distribution for a policy π; the action-value function, also commonly referred to as the Q-function, is defined as the expected total discounted reward. The aim of the agent is to acquire a policy that maximizes the cumulative discounted reward from the initial state, which is expressed as an expectation in Equation (14).
The Bellman equation is a fundamental recursive relationship widely employed in RL [14,30]. Specifically, the Q-function under a stochastic policy is expressed as Equation (15), whereas the same function under a deterministic policy µ : S → A can be rewritten as Equation (16) [14,15].
With a deterministic policy, the expectation depends solely on the environment, allowing Q^µ to be learned off-policy using transitions generated from a distinct stochastic behavior policy β. The actor-critic architecture is a widely utilized off-policy framework [31,32] consisting of two components: an actor and a critic. Considering function approximators parameterized by θ ∈ R^n, a vector of n parameters, the actor updates the parameters θ^µ of the actor function µ using policy gradients, while the critic updates the parameters θ^Q and estimates the unknown true action-value function Q^µ using a policy evaluation algorithm such as temporal-difference (TD) learning. By minimizing the mean squared loss given by Equation (17), we can optimize the parameters of the critic.
where y t is the TD target that the Q-function is updating towards.
By applying the chain rule to the actor parameters θ^µ, the actor is updated using the gradient of the expected return from the start distribution, J, as shown in Equation (19).
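The two updates can be sketched in PyTorch with stand-in networks; the layer sizes and the random minibatch below are placeholders, not the paper's configuration:

```python
import torch
import torch.nn as nn

# Tiny stand-in actor and critic to illustrate the two updates
# (Eqs. (17) and (19)); sizes and data here are illustrative only.
actor  = nn.Linear(3, 1)
critic = nn.Linear(3 + 1, 1)
actor_opt  = torch.optim.Adam(actor.parameters(),  lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

s = torch.randn(64, 3)   # minibatch of observations
a = torch.randn(64, 1)   # actions taken
y = torch.randn(64, 1)   # TD targets, r + gamma * Q'(s', mu'(s'))

# Critic: minimize the mean squared TD error (Eq. (17)).
critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=1)), y)
critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

# Actor: ascend the deterministic policy gradient (Eq. (19)) by
# minimizing the negative critic value of the actor's own actions.
actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```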

Deep Deterministic Policy Gradient for Path-Following
The path-following controller based on the DDPG algorithm makes use of a pair of neural networks, one for learning the policy (actor) and the other for learning the value function (critic), and introduces target networks to reduce estimation bias through delayed updates. Additionally, it employs an experience replay buffer to store and replay samples, which reduces sample correlation, increases sample efficiency and enhances the learning capability of the algorithm. As in the original paper [15], an Ornstein-Uhlenbeck (OU) process [33], which generates temporally correlated noise, is introduced to explore the action space. The overall architecture is depicted in Figure 4.

Figure 4. DDPG-based controller for path following. The optimization process of the two network pairs is illustrated, with the parameters used to optimize the critic and actor networks distinguished by the colors blue and red, respectively. The exploration noise is only added in the training phase. The prime notation indicates that a variable pertains to subsequent steps.

Observation Space and Action Space
As the MDP under consideration is partially observable, we use the term "observation" instead of "state" to denote the information the agent relies on when taking actions. The agent performs an action in the environment and observes the resulting changes in the environment's state; this interaction between action and observation is commonly known as a time step. The observation s, which serves as the input to the path-following controller, is selected in an intuitive and low-dimensional manner: where e_p is the cross-track error in Equation (4), ψ ∈ [−π, π] is the orientation error between the path and the vehicle, and δ is the steering angle of the front tire, as graphically presented in Figure 2. We calculate the cross-track error and heading error based on the C.G. of the vehicle, as described in Section 2.1. The steering angle δ reflects the level of effort exerted by the vehicle. More details are presented in Table 3. It is worth noting that, to limit the sparsity of the reward function, the absolute value of the cross-track error is constrained to below 2.0 m. This constraint helps reduce the size of the observation space. Additionally, any deviation exceeding 2.0 m is considered an unacceptable error and the vehicle should be forcefully brought to a stop.

We choose the rate of the steering angle, δ̇, rather than the steering angle δ itself, as the action, a_t = µ(s_t|θ^µ), to avoid undesired fast angle changes. Adopting an incremental control input to the vehicle makes it easier to achieve smooth vehicle motions. Furthermore, constraining the steering angle rate within a specified range, i.e., δ̇ ∈ [δ̇_min, δ̇_max], provides better stability. The steering angle command δ*_t at each time step during training is computed using Equation (21), which incorporates the sampled noise signal n_t from an OU process N. The noise is excluded outside the training phase.
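A minimal sketch of the steering command computation in Equation (21); the rate and angle limits are illustrative placeholders, not the values in the paper's tables:

```python
import numpy as np

def steering_command(delta, rate_action, noise, dt,
                     rate_lim=(-0.6, 0.6), delta_lim=(-0.52, 0.52)):
    """Integrate the steering-rate action into a steering command.

    The noisy rate is clipped to the action range, then integrated and
    clipped again to the mechanical steering limits (limits here are
    placeholders; noise is zero outside training).
    """
    rate = np.clip(rate_action + noise, *rate_lim)
    return float(np.clip(delta + rate * dt, *delta_lim))

# A saturated rate request is clipped before integration.
cmd = steering_command(delta=0.0, rate_action=1.5, noise=0.0, dt=0.1)
```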
The clipping operation is used to bound values within their ranges, typically to prevent saturation. The constraints on the action and OU process parameters are illustrated in Tables 4 and 5.
where ν_n determines the speed of mean reversion, the drift term µ_n determines the asymptotic mean, and dW_t denotes the increment of a standard Wiener process, scaled by the volatility ζ_n.
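An Euler–Maruyama discretization of this OU process can be sketched as follows; the parameter values are illustrative, not those of Table 5:

```python
import numpy as np

class OUNoise:
    """Temporally correlated exploration noise:
    dn = nu*(mu - n)*dt + zeta*dW, discretized with step dt."""

    def __init__(self, nu=0.15, mu=0.0, zeta=0.2, dt=0.05, seed=0):
        self.nu, self.mu, self.zeta, self.dt = nu, mu, zeta, dt
        self.rng = np.random.default_rng(seed)
        self.n = mu

    def sample(self):
        self.n += (self.nu * (self.mu - self.n) * self.dt
                   + self.zeta * np.sqrt(self.dt) * self.rng.standard_normal())
        return self.n

noise = OUNoise()
samples = [noise.sample() for _ in range(1000)]
```

Because the process is mean-reverting, long sample runs stay centered near the drift term µ_n rather than wandering off like a plain random walk.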

Rewards
Given that the objective of an agent is to maximize long-term returns, the design of the reward function is crucial for satisfactory performance. In the context of path following, a natural and intuitive approach is to reward the agent for minimizing the cross-track error with respect to the desired path. In [5], the agent is rewarded when the vehicle stays on the path and is penalized by an absolute value function when it deviates from the path. In [21], a Gaussian reward function centered at a cross-track error of 0, with a reasonable standard deviation, is employed. However, we believe that the shape of an exponential reward function incentivizes more effective minimization of cross-track error. To ensure smooth steering control, we apply penalties to excessive steering, without considering the consistency between the heading and the path. To mitigate the sparsity of the exponential reward and reduce insignificant experiences, the reward function, as shown in Equation (22), is kept concise. Moreover, if the vehicle deviates from the path beyond a certain distance e_max or moves in the opposite direction of the path, the current training episode is truncated, indicating a failure to complete it. In such cases, a negative reward is given to penalize the failure. On the other hand, if the vehicle successfully completes the path or reaches the maximum time steps allowed for a single episode, a positive reward is given. The scaling parameters associated with the rewards are provided in Table 6.
where c_e > 0 determines the sparsity of the reward and the degree of convergence in training; a high c_e may result in overly sparse rewards. c_δ > 0 and c_δ̇ > 0 scale the penalty terms on the steering angle and its rate, respectively, to penalize excessive steering. H and F are logical values indicating the successful completion of an episode and its failure, respectively.
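Since the exact coefficients appear in Table 6, the following is only a hedged sketch of the reward structure described above, with placeholder values and an assumed squared-error exponent:

```python
import math

def reward(e_p, delta, delta_rate, done_success=False, done_failure=False,
           c_e=0.5, c_delta=0.1, c_rate=0.1, r_success=10.0, r_failure=-10.0):
    """Sketch of Eq. (22): an exponential term peaking at zero cross-track
    error, minus penalties on steering magnitude and rate, plus terminal
    bonuses/penalties. All coefficients are placeholders, and the exact
    functional form of the exponential term is an assumption."""
    r = math.exp(-c_e * e_p ** 2) - c_delta * abs(delta) - c_rate * abs(delta_rate)
    if done_success:
        r += r_success
    if done_failure:
        r += r_failure
    return r

# On the path with no steering, the per-step reward is maximal (1.0).
r0 = reward(e_p=0.0, delta=0.0, delta_rate=0.0)
```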

Environment
The environment in which the agent is expected to perform is an a priori known path. The vehicle kinematics should also be considered as part of the training environment, as it is beyond our control. Moreover, it is crucial for the agent to be trained on a wide variety of challenges to enable it to handle generalized situations instead of overfitting to specific paths. Therefore, we propose an algorithm outlined in Algorithm 1 for generating stochastic reference paths.

Algorithm 1 Stochastic Path Generator
Generated waypoint counter n ← 1, starting waypoint p_1 ← [0, 0]
Number of path waypoints N_w ∈ N, range of length between waypoints [L_min, L_max] ⊂ R+
while n ≤ N_w do
    Sample L_w from U(L_min, L_max)
    Sample θ_w from U(0, 2π)
    New waypoint p_{n+1} ← p_n + L_w [cos(θ_w), sin(θ_w)]^T
    n ← n + 1
end while
Create the parameterized path p_T(ω) = [x_T(ω), y_T(ω)]^T using a cubic spline interpolator

In this work, N_w is sampled from U(2, 6), L_min = 25 and L_max = 50. Some paths randomly generated by this algorithm are shown in Figure 5.
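The waypoint loop of Algorithm 1 can be sketched directly; the cubic spline fitting step is omitted here (scipy.interpolate.CubicSpline over the cumulative chord length is one way to realize it):

```python
import numpy as np

def generate_waypoints(n_w, l_min=25.0, l_max=50.0, seed=0):
    """Waypoint generation of Algorithm 1: each new waypoint is offset
    from the previous one by a uniformly sampled length and direction."""
    rng = np.random.default_rng(seed)
    pts = [np.zeros(2)]
    for _ in range(n_w):
        l_w = rng.uniform(l_min, l_max)
        th_w = rng.uniform(0.0, 2.0 * np.pi)
        pts.append(pts[-1] + l_w * np.array([np.cos(th_w), np.sin(th_w)]))
    return np.array(pts)

wp = generate_waypoints(4)
seg = np.linalg.norm(np.diff(wp, axis=0), axis=1)  # spacing between waypoints
```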

Implementation Details
The actor and critic neural networks both consist of two hidden layers, as illustrated in Figure 6. Each layer includes rectified linear units (ReLU) activation, with 400 neurons in the first hidden layer and 300 neurons in the second. Notably, in the critic's network, the state vector connects to the first hidden layer, whereas the action is concatenated before the second hidden layer, following the structure of the original algorithm. This design allows the action to bypass the first layer, which improves the stability and performance of the networks [15]. The final layer of the actor is a tanh layer used to bound the action.
Figure 6. The actor and critic networks share the same structure and input, except that in the critic network, the action is concatenated before the second hidden layer.
We initialize the weights following the method described in [34], with the exception that we used uniform distributions [−3 × 10 −3 , 3 × 10 −3 ] and [−3 × 10 −4 , 3 × 10 −4 ] to initialize the final layers of the actor and critic networks, respectively. This is done to prevent output saturation during the early stages of training. To optimize neural networks, we employ Adam [35] with a minibatch size of 64.
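The two networks can be sketched in PyTorch as follows; the observation dimension matches the three-element observation [e_p, ψ, δ], while the action bound and the weight initialization details are omitted or assumed:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """400/300 ReLU network with a tanh output scaled to the action bound."""
    def __init__(self, obs_dim=3, act_dim=1, act_limit=1.0):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 400), nn.ReLU(),
                                 nn.Linear(400, 300), nn.ReLU(),
                                 nn.Linear(300, act_dim), nn.Tanh())
        self.act_limit = act_limit  # max steering rate (placeholder value)
    def forward(self, s):
        return self.act_limit * self.net(s)

class Critic(nn.Module):
    """Same sizes; the action joins before the second hidden layer."""
    def __init__(self, obs_dim=3, act_dim=1):
        super().__init__()
        self.fc1 = nn.Linear(obs_dim, 400)
        self.fc2 = nn.Linear(400 + act_dim, 300)
        self.out = nn.Linear(300, 1)
    def forward(self, s, a):
        h = torch.relu(self.fc1(s))
        h = torch.relu(self.fc2(torch.cat([h, a], dim=1)))
        return self.out(h)

q = Critic()(torch.zeros(2, 3), Actor()(torch.zeros(2, 3)))
```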
Algorithm 2 outlines the training process based on DDPG, following the path-following strategy described above. The initial posture of the vehicle is sampled from a uniform distribution, with a position range of [−1, 1] meters and a heading angle range of [−0.2618, 0.2618] radians. We also introduce a warm-up technique to collect completely random experiences. During the initial training period, the agent uniformly samples random actions from the action space. After warming up, the actor network first generates an output, i.e., action, based on the observed state and adds sampled noise, transitions to the next state, computes the corresponding reward and stores it in the experience replay buffer in the form of a tuple until the number of experiences reaches the size of the set minibatch, at which point the networks are optimized and updated. The loss required for updating the parameters of the critic and actor networks can be calculated using Equations (17) and (19). After optimizing the networks, the parameters of the target networks are updated using a soft update strategy, as denoted by Equation (23). Specifically, a fraction of the updated network parameters are blended with the target network parameters, which helps to stabilize the learning process and avoid oscillations.
where the parameter τ determines how fast the update is carried out; the update is performed at each step after training the online networks. During training, a random path is generated for each episode and the agent's action is subject to exploration noise. The final agent is selected based on its performance in evaluation. The difference between training and evaluation is that during evaluation, the actions taken by the agent are based solely on the current learned policy, without added exploration noise. For each evaluation, the agent's performance is assessed across 10 randomly generated paths. Since evaluation occurs at regular intervals, we select the agent that achieves the highest rewards among the evaluations. Table 7 presents the relevant parameters and their values used in training. We also conduct the training 10 times using 10 different seeds to ensure reproducibility. For instance, the uniform sampling of the vehicle's initial position and the random generation of paths rely on a random number generator controlled by a seed.
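The soft update of Equation (23) amounts to Polyak averaging of the parameters; a sketch (the τ used here is illustrative, the paper's value is in Table 7):

```python
import torch
import torch.nn as nn

def soft_update(target, online, tau=0.001):
    """Blend online parameters into the target network:
    theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    with torch.no_grad():
        for p_t, p in zip(target.parameters(), online.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)

online = nn.Linear(2, 1)
target = nn.Linear(2, 1)
before = target.weight.clone()
soft_update(target, online, tau=0.5)
```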

Algorithm 2 Training Process of Path-Following Control Strategy for Car-Like Vehicles
Randomly initialize critic network Q(s, a|θ^Q) and actor µ(s|θ^µ) with weights θ^Q and θ^µ
Initialize target networks Q′ and µ′ with weights θ^Q′ ← θ^Q and θ^µ′ ← θ^µ
Initialize replay buffer R
for t = 1, T do
    Initialize a stochastic path P and a random initial posture [x_1, y_1, φ_1]^T of the vehicle
    Initialize an OU process N for action exploration
    Observe initial state s_1
    while True do
        if t < T_start then
            Select an action randomly from the action space
        else
            Select an action based on the policy: a_t = clip(µ(s_t|θ^µ) + N_t, δ̇_min, δ̇_max)
        end if
        Calculate the steering command δ*_t = clip(δ_t + a_t ∆t, δ_min, δ_max)
        Execute steering control, calculate reward r_t and transition to the new state s_{t+1}
        Store transition (s_t, a_t, r_t, s_{t+1}) in R
        if |e_p(t)| > e_max or |ψ(t)| ≥ π/2 then end the episode end if
        Sample a random minibatch of N transitions from R
        Update the critic by minimizing the loss: L = (1/N) Σ_i (y_i − Q(s_i, a_i|θ^Q))²
        Update the actor policy using the sampled policy gradient
        Update the target networks with the soft update strategy
    end while
end for

Tools and Libraries
Our solution was implemented using the PyTorch library [36], which provides a comprehensive deep learning framework for constructing and training neural networks. In our implementation, we utilized PyTorch's tensor operations and high-level modules, such as the torch.nn module, to construct our actor-critic networks. We also employed PyTorch's optimization algorithms, such as the Adam optimizer, to update the weights of the networks during training. The agent was trained in a custom path-following environment built using Python, which allowed us to simulate a range of scenarios and evaluate the agent's performance under different conditions.

Results
In this section, we discuss the training process of path following and present the test results on three parameterized paths after training. The evaluation criteria focus on the effectiveness of path convergence and whether the agent adopts minimal and smooth steering as much as possible.

Training Process
The learning curve for the path-following problem is illustrated in Figure 7, where the solid line represents the average of the 10 trials. The shaded region indicates half a standard deviation of the average evaluation. In the initial stage, the agent learns quickly and receives high rewards. However, as the warm-up phase ends and the agent starts taking actions based on the current policy, there is a period of decline with relatively unstable results across the 10 trials. Nevertheless, after progressing halfway through the learning process, all 10 agents begin to learn a stable policy that yields high rewards consistently.

Figure-Eight Curve
The figure-eight curve, also known as the Lemniscate of Gerono, possesses a curvature that varies smoothly and covers a wide range of values. The curve can be parameterized via Equation (24).
where a is a constant that determines the size and shape of the curve and is set as 50.
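One common parameterization of the Lemniscate of Gerono is sketched below; the paper's Equation (24) may differ in phase or scaling:

```python
import numpy as np

def lemniscate_of_gerono(t, a=50.0):
    """Figure-eight curve: x = a*cos(t), y = a*sin(t)*cos(t).

    With a = 50, as in the paper, the curve spans 100 m along the x-axis
    and crosses itself at the origin."""
    return a * np.cos(t), a * np.sin(t) * np.cos(t)

t = np.linspace(0.0, 2.0 * np.pi, 400)
x, y = lemniscate_of_gerono(t)
x0, y0 = lemniscate_of_gerono(np.pi / 2)  # self-intersection at the origin
```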
The trajectories of the four methods, including our proposed approach, in tracking the curve are shown in Figure 8. All four methods are able to maintain a small cross-track error while following the path, with the main differences observed at locations with high curvature and sharp turns. From the right subplot, it can be clearly observed that our proposed approach reduces the cross-track error faster and maintains closer proximity to the desired path over the task as a whole. It is noteworthy that this path did not appear in our training environment and, in our experience, our random path generation algorithm is unlikely to produce similar paths. This highlights the generality of the RL-based method. The overall root squared cross-track error for the path-following task is summarized in Table 8. The corresponding steering angles δ are shown in Figure 9. As expected, our proposed approach achieves smooth steering actions while reducing the cross-track error, thereby avoiding jitters. In addition, the agent trained using our approach tends to employ smaller steering angles.

Lane Change
The lane change, one of the most common vehicle maneuvers, is selected to verify and compare the tracking performance of the algorithms. The path can be parameterized as a sigmoid function via Equation (25),
where the parameters are defined as follows: a = 0, b = 40, c = 40 and k = 0.2, representing the starting point, end point, center of the lane change and the steepness of the sigmoid function, respectively.
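Equation (25) is not reproduced here; one common sigmoid form consistent with the stated parameters (an assumption on our part) is y(x) = a + (b − a) / (1 + e^(−k(x − c))), sketched below:

```python
import math

def lane_change_path(x, a=0.0, b=40.0, c=40.0, k=0.2):
    """Assumed sigmoid lateral reference: rises from a (start) to b (end),
    centered at x = c, with steepness k."""
    return a + (b - a) / (1.0 + math.exp(-k * (x - c)))

# Lateral offsets sampled along the maneuver, every 10 m from 0 to 80 m.
ys = [lane_change_path(x) for x in range(0, 81, 10)]
```

With the stated values, the path rises from near 0 to near 40 and passes through the midpoint y = 20 at the lane-change center x = 40.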
The trajectories for each method are shown in Figure 10 and the root squared cross-track error for the overall task is summarized in Table 9. In this scenario, the DDPG-based controller and the Feedback-based controller stand out, with DDPG slightly outperforming the latter. However, both controllers tend to employ relatively larger steering angles than the other approaches, as shown in Figure 11. Compared to the previous scenario, the performance of the 10 agents varies more, with some agents performing well and others poorly.

Return to Lane
The convergence performance of the controller can be assessed by evaluating its ability to execute the return-to-lane path. The return-to-lane path refers to the vehicle's process of returning to a straight line from an offset posture, which frequently happens during normal vehicle operation.
Compared to the other methods, our proposed approach demonstrates superior overall performance in terms of fast convergence and avoidance of overshoot. While achieving rapid path convergence, it also maintains smooth and minimal steering, as illustrated in Figure 12. The numerical comparison is summarized in Table 10, where delay time is defined as the time required for the cross-track error to reach 50% of its steady-state value; settling time as the time required for the cross-track error to enter and remain within the ±5% band around its steady-state value; and overshoot as the percentage difference between the peak value of the cross-track error and its target value, relative to the target value.
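These three metrics can be computed from a response trace as sketched below; the underdamped second-order signal is a hypothetical stand-in for the recorded response, used only to exercise the definitions:

```python
import math

def step_metrics(t, y, y_ss):
    """Delay time, settling time, and overshoot of a step-like response."""
    # Delay time: first instant the response reaches 50% of steady state.
    delay = next(ti for ti, yi in zip(t, y) if yi >= 0.5 * y_ss)
    # Settling time: last instant the response lies outside the +/-5% band.
    band = 0.05 * abs(y_ss)
    settling = 0.0
    for ti, yi in zip(t, y):
        if abs(yi - y_ss) > band:
            settling = ti
    # Overshoot: peak excess over steady state, as a percentage.
    overshoot = max(0.0, (max(y) - y_ss) / abs(y_ss) * 100.0)
    return delay, settling, overshoot

# Hypothetical trace: underdamped second-order step response
# (damping ratio zeta = 0.5, natural frequency wn = 2 rad/s).
zeta, wn = 0.5, 2.0
wd, phi = wn * math.sqrt(1 - zeta ** 2), math.acos(zeta)
t = [i * 0.001 for i in range(10000)]  # 0..10 s at 1 ms
y = [1 - math.exp(-zeta * wn * ti) / math.sqrt(1 - zeta ** 2)
     * math.sin(wd * ti + phi) for ti in t]

delay, settling, overshoot = step_metrics(t, y, 1.0)
```

For this damping ratio the theoretical overshoot is about 16%, which the sampled computation reproduces.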

Conclusions
In this paper, we explored an off-policy algorithm, namely DDPG, based on an actor-critic architecture to address the path-following problem for ground vehicles. Our approach not only minimizes the cross-track error between the vehicle and the path, but also prevents excessive steering that can cause severe oscillations. To train the agent, we used a challenging and varied environment in which each episode generates a random path. For testing, we evaluated the trained agents in terms of fast path convergence and smooth steering on three representative paths.
Conventional methods rely on rules and parameter tuning; the three baseline methods considered in this paper require parameter adjustment for each path to achieve good path-following performance. In contrast, the trained agent has broader applicability and outperforms the baseline methods. Like the baselines, our agent performs only steering control to achieve path following, which reduces the action dimension but also sacrifices some exploration space. An agent that combines speed and steering control may find better solutions, which is our future research direction. Furthermore, control strategies that account for both path following and tyre management on various terrains [37,38], or, more importantly, the pollution due to particles of worn rubber [39,40], are practical aspects we aim to investigate in future work.
In conclusion, our approach achieves smooth path following purely by interacting with the environment and maximizing long-term returns. It has demonstrated satisfactory performance and could contribute to the development of autonomous driving technology.