Article

Vehicle Lateral Control Based on Augmented Lagrangian DDPG Algorithm

Key Laboratory of Smart Manufacturing in Energy Chemical Process, Ministry of Education, East China University of Science and Technology, Shanghai 200237, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(10), 5463; https://doi.org/10.3390/app15105463
Submission received: 7 April 2025 / Revised: 11 May 2025 / Accepted: 12 May 2025 / Published: 13 May 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

This paper studies the safe trajectory tracking control of intelligent vehicles, which remains an open and challenging problem. A deep reinforcement learning algorithm based on augmented Lagrangian safety constraints is proposed for the lateral control of vehicle trajectory tracking. First, the tracking control of intelligent vehicles is described as a reinforcement learning process based on the Constrained Markov Decision Process (CMDP). An actor-critic neural-network-based reinforcement learning framework is established, and the reinforcement learning environment is designed to include the vehicle model, tracking model, road model, and reward function. Second, an augmented Lagrangian Deep Deterministic Policy Gradient (DDPG) method is proposed for the policy update, in which a replay separation buffer method is used to address sample correlation, and a copy of the neural network with the same structure is used to resolve the update divergence problem. Finally, a vehicle lateral control approach is obtained, whose effectiveness and advantages over existing results are verified through simulation.

1. Introduction

The rapid advancement of autonomous driving technology has imposed increasingly stringent requirements on vehicle trajectory tracking control. As a core component of intelligent driving systems for new energy vehicles, trajectory tracking control must balance tracking accuracy, safety, stability, and multi-objective optimization capabilities in complex dynamic environments. Traditional control methods such as model predictive control (MPC) and sliding mode control (SMC) offer advantages including strong interpretability and straightforward implementation. Recent MPC methods achieve sub-centimeter tracking via adaptive parameter optimization (fuzzy particle swarm optimization) and hierarchical control, effectively managing over-actuated systems (four-wheel-steering/four-wheel-drive platforms and agricultural machinery) through real-time quadratic programming in high-dimensional spaces [1,2,3]. Advanced SMC integrates neural disturbance estimation with adaptive feedback, demonstrating robust lateral-error regulation and disturbance rejection under coupled dynamic constraints across variable road conditions [4,5,6]. However, their model-dependent nature fundamentally limits their effectiveness in addressing nonlinear system dynamics and multi-constrained optimization challenges. Deep Reinforcement Learning (DRL) has recently emerged as a transformative alternative through end-to-end learning mechanisms that bypass explicit model requirements.
Recent advances in deep reinforcement learning [7,8,9,10] have enhanced intelligent vehicle control through model-integrated end-to-end frameworks [11], expert-guided pretraining [12], and PID hybrid architectures [13]. However, two fundamental limitations persist: (1) policy instability during early training phases, where random parameter initialization incurs high exploration costs, and (2) frequent safety violations (boundary breaches or collisions) during exploration, revealing critical deficiencies in constraint enforcement. To resolve these challenges, we propose a constrained optimization framework that models lateral control as a CMDP and integrates three key innovations: an augmented Lagrangian safety optimizer, replay-separated experience buffers, and dual-Critic value estimation that prevents over-optimistic Q-values while ensuring stable policy updates.
DRL directly maps environmental states to control actions through deep neural networks, with its core advantages manifested in two aspects: (1) leveraging the feature extraction capabilities of deep learning to handle high-dimensional nonlinear systems, and (2) utilizing the trial-and-error mechanism of reinforcement learning to achieve multi-objective dynamic optimization. Its significance stems from the breakthrough solutions offered by the Deep Q-Network (DQN) [14] and its derivative algorithms, such as Soft Actor-Critic (SAC) [15], Twin Delayed Deep Deterministic Policy Gradient (TD3) [16], Deep Deterministic Policy Gradient (DDPG) [17], Double Deep Q-Network (Double DQN) [18], and Asynchronous Advantage Actor-Critic (A3C) [19], in addressing complex decision-making problems. Deep reinforcement learning in vehicle control evolves through two paradigms: model-free and model-based approaches. Model-free methods such as DDPG [11] employ end-to-end frameworks to directly map states to control commands, enhancing tracking accuracy in complex scenarios, though they are challenged by inefficient initial exploration. While expert-guided pretraining [12] narrows the exploration space, it imposes strict requirements on demonstration data quality. Model-based approaches integrate six-DOF dynamics with DDPG to handle tire-road interactions, yet remain vulnerable to parametric uncertainties (e.g., friction coefficient variations) [20] in real-world deployments.
Safe Deep Reinforcement Learning (Safe DRL) establishes a CMDP to transform physical safety boundaries into hard constraints for policy optimization [21]. By incorporating constraint optimization mechanisms, Safe DRL ensures safety during algorithm training. Existing Safe DRL methods primarily employ approaches such as safety shielding layers [22], finite-state controllers [23], semi-supervised reward learning [24], and control barrier functions [25]. However, these solutions often suffer from high implementation complexity and overly conservative constraint conditions. This study innovatively introduces the augmented Lagrangian method to construct a dynamic constraint optimization model. By adaptively adjusting the penalty factor in real-time, it balances the threshold between safety and performance, thereby reducing safety violation rates. This approach provides a novel methodological framework for safe control in complex dynamic environments.
This paper investigates four-wheel independent steering control for vehicles, applying deep reinforcement learning algorithms to address trajectory tracking challenges. The study aims to enhance vehicle stability and accuracy during high-speed maneuvers and complex trajectory scenarios. A DDPG-based approach is employed to optimize the neural network architecture, with specific designs for the Critic and Actor network structures and parameters. A novel replay-separated buffer method is proposed to improve the efficiency of experience replay, accelerating the training process while ensuring robustness. By leveraging the augmented Lagrangian method, the vehicle trajectory tracking problem is modeled as a CMDP, and a new reward function is designed to enhance training efficiency. Experimental results demonstrate that the proposed algorithm achieves significant improvements in key performance metrics, including training safety, tracking precision, and ride comfort.

2. Preliminaries

To address the controller design issues encountered in vehicle trajectory tracking, this study restructures the problem as a Markov decision process and applies deep reinforcement learning techniques to solve it. Training vehicle trajectory tracking controllers using deep reinforcement learning methods can lead to scenarios where the vehicle collides with lane boundaries, prematurely ending the reinforcement learning training process. To mitigate this challenge, we introduce constrained reinforcement learning within lane boundary limitations, formally establishing a Safe Reinforcement Learning (Safe RL) framework with collision-avoidance guarantees.
A Constrained Markov Decision Process (CMDP) M_C is defined by the tuple (S, A, P, R, C, ρ_0, γ), where S and A are the state and action spaces, respectively; P: S × A → P(S) represents the transition probability P(s′ | s, a) from state s to s′ under action a; R: S × A × S → ℝ is the reward function, yielding the immediate reward r(s, a, s′) when transitioning from s to s′ via action a; C is a set of safety constraints, where each c_i: S × A → ℝ is a safety cost function with an associated threshold b_i for i = 1, …, m; ρ_0: S → [0, 1] is the initial state distribution; and γ is the discount factor for future rewards.
A stationary policy π_θ is a probability distribution over actions given states, with π_θ(a | s) denoting the probability of taking action a in state s. The set of all stationary policies is Π_θ = {π_θ : θ ∈ ℝ^p}. The state-value function for π_θ is defined as V^{π_θ}(s) = E_{π_θ}[∑_{t=0}^{∞} γ^t r_{t+1} | s_0 = s], where E_{π_θ}[· | ·] denotes the expectation over actions selected by π_θ. The objective of reinforcement learning is to maximize J_r(π_θ) = E_{s∼ρ_0(·)}[V^{π_θ}(s)]. In CMDPs, the expected discounted cost for policy π_θ is J_{C_i}(π_θ) = E[∑_{t=0}^{∞} γ^t c_i(s_t, a_t)]. The feasible policy set is defined as Π_C = ∩_{i=1}^{m} {π_θ ∈ Π_θ | J_{C_i}(π_θ) ≤ b_i}. The goal of the CMDP is to find the optimal policy π* = arg max_{π_θ ∈ Π_C} J_r(π_θ).
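To make these definitions concrete, the following minimal Python sketch (illustrative only; the rollout data, discount factors, and threshold are hypothetical) estimates J_r(π_θ) and J_C(π_θ) from sampled rollouts and checks feasibility against a threshold b:

```python
import numpy as np

def discounted_sum(values, gamma):
    """Return sum_t gamma^t * values[t] for one rollout."""
    return sum(v * gamma**t for t, v in enumerate(values))

def estimate_objectives(rollouts, gamma, gamma_c):
    """Monte Carlo estimates of J_r and J_C from rollouts of the current policy.

    Each rollout is a dict with per-step 'rewards' and 'costs' lists.
    """
    J_r = np.mean([discounted_sum(ro["rewards"], gamma) for ro in rollouts])
    J_c = np.mean([discounted_sum(ro["costs"], gamma_c) for ro in rollouts])
    return J_r, J_c

# Hypothetical rollouts: two episodes of three steps each.
rollouts = [
    {"rewards": [1.0, 0.8, 0.9], "costs": [0.0, 0.0, 1.0]},
    {"rewards": [0.7, 0.9, 1.0], "costs": [0.0, 0.0, 0.0]},
]
J_r, J_c = estimate_objectives(rollouts, gamma=0.99, gamma_c=1.0)
b = 0.5  # constraint threshold b_i (illustrative)
print(f"J_r = {J_r:.3f}, J_C = {J_c:.3f}, feasible = {J_c <= b}")
```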
Definition 1.
(Optimization Problem of MDP with Instantaneous Constraints) Consider a Markov Decision Process defined by state transition probabilities s_{t+1} ∼ p(· | s_t, a_t), an initial state distribution s_0 ∼ p_0, a bounded reward function r(s, a), and a bounded constraint function g(s, a). The objective is to find the optimal policy π* that maximizes the expected discounted cumulative reward while satisfying the given constraints. The optimization problem can be formulated as follows:
$$\begin{aligned}
\max_{\pi}\quad & J_r(\pi) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, \pi\Big] \\
\text{s.t.}\quad & s_{t+1} \sim p(\cdot \mid s_t, a_t), \quad t \in \{0, 1, \ldots\}, \\
& a_t \sim \pi(s_t), \quad t \in \{0, 1, \ldots\}, \\
& s_0 \sim p_0, \\
& \mathbb{E}\big[g(s_t, a_t) \mid \pi\big] \le 0, \quad t \in \{0, 1, \ldots\}.
\end{aligned}$$
where the state transition probabilities s_{t+1} ∼ p(· | s_t, a_t) describe how the system evolves over time, the initial state distribution s_0 ∼ p_0 specifies the starting point of the process, the bounded reward function r(s, a) quantifies the immediate reward obtained from taking action a in state s, and the bounded constraint function g(s, a) ensures that certain constraints are met at each time step. The goal is to maximize the expected discounted cumulative reward J_r(π) while ensuring that the constraints are satisfied.
Although only one set of instantaneous constraints is considered in the optimization problem of Definition 1, the results of this paper can be extended to multiple sets of instantaneous constraints by associating each constraint with its own Lagrange multiplier.

3. Problem Formulation

3.1. The Dynamic Model of the Four-Wheel Steering Vehicle

Vehicle dynamics modeling is the foundation of vehicle trajectory tracking control. The four-wheel independent steering vehicle is a highly coupled, nonlinear, and uncertain system. To simplify the vehicle modeling problem, we assume that the vehicle’s longitudinal dynamics are negligible, focusing solely on lateral and yaw dynamics while ignoring vertical dynamics. For four-wheel independent steering vehicles, the following dynamic equations hold:
$$\begin{aligned}
m(\dot{v}_x - v_y w) &= F_x \\
m(\dot{v}_y + v_x w) &= F_y \\
I_z \dot{w} &= M_z
\end{aligned}$$
where m is the vehicle mass, v_x and v_y are the longitudinal and lateral velocities, w is the yaw rate, I_z is the vehicle's yaw inertia, F_x and F_y are the longitudinal and lateral forces, and M_z is the yaw moment. To simplify the vehicle dynamics, we assume that the longitudinal velocity v_x is constant, i.e., \dot{v}_x = 0 and F_x = 0. The lateral force and yaw moment dynamics are as follows:
$$\begin{aligned}
m(\dot{v}_y + v_x w) &= F_y = F_{y,fl} + F_{y,fr} + F_{y,rl} + F_{y,rr} \\
I_z \dot{w} &= M_z = a\,(F_{y,fl} + F_{y,fr}) - b\,(F_{y,rl} + F_{y,rr}) + \frac{B_f}{2}(F_{y,fr} - F_{y,fl}) + \frac{B_r}{2}(F_{y,rr} - F_{y,rl}).
\end{aligned}$$
Based on the small-angle assumption, the tire model can be simplified to a linear model, i.e., F_{y,i} = C_i α_i. The lateral slip angles of the four tires are expressed as follows:
$$\begin{aligned}
\alpha_{fl} &= \delta_{fl} - \frac{v_y + a w - \tfrac{B_f}{2} w}{v_x}, &
\alpha_{fr} &= \delta_{fr} - \frac{v_y + a w + \tfrac{B_f}{2} w}{v_x}, \\
\alpha_{rl} &= \delta_{rl} - \frac{v_y - b w - \tfrac{B_r}{2} w}{v_x}, &
\alpha_{rr} &= \delta_{rr} - \frac{v_y - b w + \tfrac{B_r}{2} w}{v_x}
\end{aligned}$$
where δ_{fl}, δ_{fr}, δ_{rl}, and δ_{rr} are the steering angles of the front-left, front-right, rear-left, and rear-right wheels, respectively; C_i is the cornering stiffness of the corresponding tire (C_f for the front axle and C_r for the rear axle); a and b are the distances from the center of gravity to the front and rear axles; and B_f and B_r are the front and rear track widths.
Based on the above assumptions, the vehicle dynamics model with 6 degrees of freedom (DOF) as shown in Figure 1 was established, including the lateral motion along the y-axis, the yaw motion around the z-axis, and the rotation of the four wheels.
The state space representation of the 6 DOF model of a four-wheel vehicle is as follows:
$$\begin{bmatrix} \dot{v}_y \\ \dot{w} \end{bmatrix} = A \begin{bmatrix} v_y \\ w \end{bmatrix} + B \begin{bmatrix} \delta_{fl} \\ \delta_{fr} \\ \delta_{rl} \\ \delta_{rr} \end{bmatrix}$$
where A is the system matrix:
$$A = \begin{bmatrix} -\dfrac{2(C_f + C_r)}{m v_x} & -v_x - \dfrac{2(a C_f - b C_r)}{m v_x} \\[2mm] -\dfrac{2(a C_f - b C_r)}{I_z v_x} & -\dfrac{2(a^2 C_f + b^2 C_r) + B_f B_r (C_f + C_r)/2}{I_z v_x} \end{bmatrix}$$
and B is the input matrix:
$$B = \begin{bmatrix} \dfrac{C_f}{m} & \dfrac{C_f}{m} & \dfrac{C_r}{m} & \dfrac{C_r}{m} \\[2mm] \dfrac{a C_f - B_f C_f/2}{I_z} & \dfrac{a C_f + B_f C_f/2}{I_z} & \dfrac{-b C_r - B_r C_r/2}{I_z} & \dfrac{-b C_r + B_r C_r/2}{I_z} \end{bmatrix}.$$
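As a numerical illustration of the state-space model above, the following Python sketch assembles A and B and integrates the lateral dynamics with a simple forward-Euler step; all parameter values (mass, inertia, stiffness, geometry, speed) are placeholders rather than the values used in this paper.

```python
import numpy as np

# Illustrative vehicle parameters (placeholders, not the paper's values).
m, Iz = 1500.0, 2500.0          # mass [kg], yaw inertia [kg m^2]
a, b = 1.2, 1.4                 # CG-to-front/rear axle distances [m]
Bf, Br = 1.6, 1.6               # front/rear track widths [m]
Cf, Cr = 80000.0, 80000.0       # front/rear cornering stiffness [N/rad]
vx = 60.0 / 3.6                 # constant longitudinal speed [m/s]

A = np.array([
    [-2 * (Cf + Cr) / (m * vx), -vx - 2 * (a * Cf - b * Cr) / (m * vx)],
    [-2 * (a * Cf - b * Cr) / (Iz * vx),
     -(2 * (a**2 * Cf + b**2 * Cr) + Bf * Br * (Cf + Cr) / 2) / (Iz * vx)],
])
B = np.array([
    [Cf / m, Cf / m, Cr / m, Cr / m],
    [(a * Cf - Bf * Cf / 2) / Iz, (a * Cf + Bf * Cf / 2) / Iz,
     (-b * Cr - Br * Cr / 2) / Iz, (-b * Cr + Br * Cr / 2) / Iz],
])

def step(x, delta, dt=0.001):
    """Forward-Euler step of x = [v_y, w] under the four steering angles delta."""
    return x + dt * (A @ x + B @ delta)

x = np.zeros(2)                                   # [v_y, w]
delta = np.radians([1.0, 1.0, -0.5, -0.5])        # [fl, fr, rl, rr] steering angles
for _ in range(1000):                             # simulate 1 s
    x = step(x, delta)
print("v_y = %.4f m/s, yaw rate = %.4f rad/s" % (x[0], x[1]))
```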

3.2. The Vehicle Trajectory Error System

The vehicle dynamics model converts the vehicle's low-level motion characteristics into a quantifiable high-level control objective, namely trajectory tracking accuracy. In the control architecture of a four-wheel steering vehicle, control commands and trajectory tracking performance are dynamically coupled through a multi-level state transmission chain.
Specifically, the four wheel steering angles δ_i serve as control inputs and first act on the vehicle's physical system through the 6 DOF dynamic model: the lateral tire forces vary with the steering angles, driving the lateral acceleration \dot{v}_y and the yaw acceleration \dot{w}, which in turn update the vehicle's motion state (v_y, w). Therefore, establishing a vehicle trajectory tracking model on top of the vehicle dynamic model effectively addresses the trajectory tracking problem, as illustrated in Figure 2.
The trajectory tracking error model under a single preview-point configuration is derived through kinematic analysis and dynamic coupling. Let L denote the preview distance ahead of the vehicle's center of gravity. The lateral position error y_L, defined as the perpendicular deviation between the vehicle's center of gravity and the reference path, evolves according to three contributions: (1) the longitudinal velocity v_x projected laterally due to the heading misalignment ε_L, (2) the direct lateral velocity v_y, and (3) the yaw-induced displacement ωL. This yields the lateral error dynamics:
$$\dot{y}_L = v_x \varepsilon_L - v_y - \omega L.$$
Concurrently, the heading error ε_L, representing the angular deviation from the path tangent, is governed by the discrepancy between the reference curvature ρ and the vehicle's yaw rate ω:
$$\dot{\varepsilon}_L = v_x \rho - \omega.$$
The dynamic equation of the vehicle trajectory tracking model with preview distance is as follows:
$$\begin{bmatrix} \dot{y}_L \\ \dot{\varepsilon}_L \end{bmatrix} = \begin{bmatrix} 0 & v_x \\ 0 & 0 \end{bmatrix} \begin{bmatrix} y_L \\ \varepsilon_L \end{bmatrix} + \begin{bmatrix} -1 & -L & 0 \\ 0 & -1 & v_x \end{bmatrix} \begin{bmatrix} v_y \\ w \\ \rho \end{bmatrix}.$$
The tracking model thus incorporates the preview distance, which injects the curvature information of the reference trajectory into the error dynamics and thereby addresses the dynamic tracking problem.
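A minimal sketch of the preview-point error dynamics is given below; the preview distance, vehicle states, and curvature profile are illustrative assumptions.

```python
import numpy as np

L = 10.0          # preview distance [m] (assumed value)
vx = 60.0 / 3.6   # longitudinal speed [m/s]

def error_step(e, v_y, w, rho, dt=0.001):
    """Forward-Euler step of e = [y_L, eps_L] given vehicle states and road curvature rho."""
    y_L, eps_L = e
    dy_L = vx * eps_L - v_y - L * w     # lateral error dynamics
    deps_L = vx * rho - w               # heading error dynamics
    return np.array([y_L + dt * dy_L, eps_L + dt * deps_L])

e = np.zeros(2)
for k in range(1000):
    rho = 0.002 if k > 500 else 0.0     # curvature appears mid-horizon (illustrative)
    e = error_step(e, v_y=0.05, w=0.01, rho=rho)
print("lateral error y_L = %.4f m, heading error = %.5f rad" % (e[0], e[1]))
```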

3.3. Constrained Reinforcement Learning-Based Tracking Control Framework

To address the control problem in the vehicle trajectory tracking system, this paper employs deep reinforcement learning methods, modeling the vehicle trajectory tracking system as a constrained Markov decision process. The vehicle is treated as the environment in the RL problem, while the four-wheel steering controller serves as the agent. Leveraging the periodicity of steering control, each control cycle’s steering actions and resulting vehicle state changes are considered as an agent-environment interaction.
When approaching curved road sections, insufficient lane boundary information may cause lateral control response delays, leading to overshooting of control targets and reduced stability. To ensure timely control updates before entering curvature-changing zones, we select the lateral tracking error y_L, the heading deviation ε_L, and the trajectory curvature ρ as state-space components. The state space S is defined as S = {y_L, ε_L, ρ}.
The study focuses on four-wheel independent steering vehicles, thus defining the action space as the steering angles of the four wheels: ϕ_{lf}, ϕ_{lr}, ϕ_{rf}, ϕ_{rr}. The action space A is defined as A = {ϕ_{lf}, ϕ_{lr}, ϕ_{rf}, ϕ_{rr}}. During deep reinforcement learning training, excessively large action ranges in the control policy output degrade training performance, necessitating action range constraints through a linear mapping of the policy network outputs.
To reduce the occurrence of dangerous behaviors such as lane departures and collisions during training, which would otherwise produce numerous ineffective training cycles, this paper introduces lane boundary constraints as state constraints. Ensuring that the vehicle trajectory tracking error strictly satisfies the safety boundary constraint |y_L| < 3 at all times essentially requires the control policy to avoid entering dangerous state regions.
First, define the constraint cost function c: S → {0, 1} to characterize the safety attributes in the state space:
$$c(s_t) = \mathbb{I}\{|y_L| \ge 3\} = \begin{cases} 1, & |y_L| \ge 3 \\ 0, & \text{otherwise.} \end{cases}$$
Based on this, the constraints of the CMDP can be expressed as an upper bound on the expected cumulative constraint cost:
$$\mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma_c^{\,t}\, c(s_t)\Big] \le \epsilon$$
where γ_c ∈ [0, 1] is the constraint discount factor. For strict hard-constraint problems, set γ_c = 1 and ε = 0, in which case the constraint is equivalent to requiring c(s_t) = 0 at all times. Based on Definition 1, the vehicle trajectory tracking problem under safety constraints can be modeled as the following CMDP:
$$\begin{aligned}
\max_{\pi}\quad & J_r(\pi) = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{T} \gamma^t r(s_t, a_t) \,\Big|\, \pi\Big] \\
\text{s.t.}\quad & \mathbb{E}_{\pi}\Big[\sum_{t=0}^{T} c(s_t)\Big] \le 0
\end{aligned}$$
where J_r(π) is the expected reward, r(s_t, a_t) is the immediate reward, c(s_t) is the constraint cost, γ is the discount factor, and π is the policy. The objective of this optimization problem is to find the optimal policy π* that maximizes the expected cumulative reward while satisfying the constraints.
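A small sketch of the indicator cost and the resulting hard-constraint check over an episode is shown below; the lane half-width threshold of 3 follows the definition above, while the episode data are assumed for illustration.

```python
def constraint_cost(y_L, boundary=3.0):
    """Indicator cost: 1 if the lateral error leaves the safe corridor, else 0."""
    return 1.0 if abs(y_L) >= boundary else 0.0

def episode_satisfies_constraint(lateral_errors, eps=0.0, gamma_c=1.0):
    """Hard-constraint check: the cumulative (undiscounted) cost must not exceed eps."""
    total = sum(gamma_c**t * constraint_cost(y) for t, y in enumerate(lateral_errors))
    return total <= eps

print(episode_satisfies_constraint([0.1, 0.4, 2.9]))   # True: stays inside the lane
print(episode_satisfies_constraint([0.1, 3.2, 0.4]))   # False: boundary violation
```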

4. Main Results

4.1. Augmented Lagrangian Method for Safe RL

The instantaneous constraints in safe reinforcement learning are typically handled with the Lagrange multiplier method, which converts the constrained optimization problem into an unconstrained one by adding the weighted constraints to the objective, yielding the Lagrange function L(π, λ) = J_r(π) − λ(J_c(π) − b). This function is then optimized as a min-max problem:
$$\min_{\lambda \ge 0} \max_{\pi} L(\pi, \lambda) = \min_{\lambda \ge 0} \max_{\pi} \big[ J_r(\pi) - \lambda \big( J_c(\pi) - \alpha \big) \big].$$
Existing methods that alternately optimize the policy parameters π and the Lagrange multiplier λ exhibit two theoretical limitations. First, the multiplier update employs gradient ascent with a fixed learning rate and therefore cannot adapt to the curvature of the optimization surface, which leads to unstable convergence and oscillatory divergence when the learning rate is chosen improperly. Second, zero initialization renders the multiplier ineffective during early optimization, weakening the constraint-violation penalty in the initial training phase and thereby reducing the penalty strength during early exploration of the safety boundaries. In traditional constrained optimization there is another approach, the Augmented Lagrangian Method (ALM), which combines the penalty function method and the Lagrange multiplier method to construct an augmented Lagrangian function that includes a quadratic penalty term. For the above constrained optimization problem, the augmented Lagrangian function can be defined as follows:
$$L(\pi, \lambda, \rho) = J_r(\pi) - \lambda \big( J_c(\pi) - \alpha \big) - \frac{\rho}{2} \big( J_c(\pi) - \alpha \big)^2,$$
where ρ > 0 is the penalty factor of the quadratic penalty term (J_c(π) − α)².
Definition 2.
(Augmented Lagrangian MDP Optimization Problem) Using the framework of Definition 1, combined with the augmented Lagrangian multiplier method, the constrained optimization problem of vehicle trajectory tracking is formalized as the following unconstrained optimization problem:
$$\begin{aligned}
\max_{\pi} \min_{\lambda \ge 0}\quad & \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t \Big( r(s_t, a_t) - \big( \lambda_t\, g(s_t) + \frac{\rho}{2} \big[\max\big(0,\, g(s_t, a_t)\big)\big]^2 \big) \Big) \,\Big|\, \pi\Big] \\
\text{s.t.}\quad & s_{t+1} \sim p(\cdot \mid s_t, a_t), \quad t \in \{0, 1, \ldots\}, \\
& a_t \sim \pi(s_t), \quad t \in \{0, 1, \ldots\}, \\
& s_0 \sim p_0.
\end{aligned}$$
The constraint function requires that the vehicle's desired position at each moment does not exceed the lane boundary. Compared with the traditional Lagrange multiplier method, the augmented Lagrangian multiplier method has two main advantages. First, during the iterative optimization process using the augmented Lagrangian function L(π, λ, ρ), the step size for updating λ is no longer fixed but is adaptively adjusted by the variable penalty factor ρ, which increases progressively during optimization. This mechanism effectively adjusts the learning rate dynamically, where the synergistic effect of the multiplier term and the quadratic penalty term enhances the convergence rate and stability of the algorithm. Second, initializing the penalty factor ρ to a positive value enforces strict "warm-start" penalization of safety constraint violations from the outset of training. This design prioritizes safety considerations during the initial learning phases, thereby mitigating the risk of converging to unsafe local optimal solutions.
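The following sketch contrasts the two update mechanisms discussed above: a plain dual-ascent step with a fixed learning rate versus an ALM-style update whose effective step size is scaled by the penalty factor ρ, which grows while the constraint remains violated. The initialization, thresholds, and schedules are illustrative assumptions, not the paper's tuning.

```python
def lagrangian_update(lmbda, Jc, alpha, lr=0.01):
    """Plain dual ascent: fixed learning rate, multiplier kept non-negative."""
    return max(0.0, lmbda + lr * (Jc - alpha))

def augmented_lagrangian_update(lmbda, rho, Jc, alpha, rho_growth=1.5, rho_max=100.0):
    """ALM-style update: the multiplier step is scaled by the penalty factor rho,
    and rho is increased while the constraint is still violated."""
    violation = Jc - alpha
    lmbda = max(0.0, lmbda + rho * violation)
    if violation > 0.0:
        rho = min(rho * rho_growth, rho_max)
    return lmbda, rho

lmbda_plain, lmbda_alm, rho = 0.0, 0.005, 0.01    # zero vs. warm-started initialization
for Jc in [0.8, 0.6, 0.3, 0.05]:                  # hypothetical constraint returns over training
    lmbda_plain = lagrangian_update(lmbda_plain, Jc, alpha=0.1)
    lmbda_alm, rho = augmented_lagrangian_update(lmbda_alm, rho, Jc, alpha=0.1)
    print(f"plain lambda = {lmbda_plain:.4f} | ALM lambda = {lmbda_alm:.4f}, rho = {rho:.4f}")
```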
Transforming the constrained Markov decision process problem into an unconstrained optimization problem using the augmented Lagrangian function enables the solution of constrained optimization problems through the augmented Lagrangian multiplier method. This paper proposes a safe reinforcement learning algorithm based on the augmented Lagrangian function to address the trajectory tracking problem with lane boundary constraints, achieving a better balance between safety and optimal objectives.

4.2. Reward Function Design

In reinforcement learning algorithms, network initialization is random. Consequently, the agent can only interact with the environment based on guidance from the reward function. The reward function is typically a scalar value, where positive values represent rewards and negative values represent penalties. Reinforcement learning algorithms optimize towards maximizing the reward value. The scenario focused on in this paper is lane-keeping for vehicles. This requires that, during driving, the vehicle should stay as close as possible to the center of the lane, and the vehicle’s forward direction should be aligned with the road axis. Additionally, on this basis, the vehicle should aim to increase its speed as much as possible.
Aggressive steering during turns may cause a loss of stability and uncomfortable lateral acceleration fluctuations. To avoid poor local optima, we design an exponential reward function whose value grows rapidly near the optimal solution. The proposed reward systematically integrates trajectory tracking accuracy, steering stability, and ride comfort through the following formulation for each timestep t:
$$r_t(s_t, a_t) = \begin{cases} \underbrace{\delta_1 \cdot e^{-|a_{\text{lat}}|}}_{\text{Lateral stability}} + \underbrace{\delta_2 \cdot e^{-\varphi_{\text{steer}}}}_{\text{Steering efficiency}} + \underbrace{\delta_3 \cdot P_t}_{\text{Oscillation penalty}}, & \text{normal driving}; \\ -C, & \text{hit boundary} \end{cases}$$
where:
Lateral stability term (δ_1 · e^{−|a_lat|}): penalizes excessive lateral acceleration a_lat through exponential decay, where δ_1 determines the penalty intensity. This term ensures vehicle stability during aggressive maneuvers.
Steering efficiency term (δ_2 · e^{−φ_steer}): encourages minimal steering effort by penalizing the cumulative steering angle φ_steer = ∑_{i=1}^{4} |ϕ_i|, where ϕ_i denotes the steering angle of each wheel. The exponential form prioritizes smooth steering operations.
Oscillation penalty term (δ_3 · P_t): suppresses steering oscillations through a sliding-window mechanism:
$$P_t = \sum_{i=1}^{4} \eta_i \Big[ \lambda_1^{(i)} \Delta_{\text{mean}}^{(i)}(t) + \lambda_2^{(i)} \sigma^{(i)}(t) + \lambda_3^{(i)} R^{(i)}(t) \Big]$$
where:
Δ_mean^{(i)}: mean absolute difference of steering channel i over the window W_t
σ^{(i)}: variance of the steering actions in channel i
R^{(i)}: range (max − min) of the steering actions in channel i
η_i: channel-specific weights (∑_i η_i = 1)
λ_{1:3}^{(i)}: penalty coefficients for the temporal variation characteristics
The sliding window W_t = {a_{t−N+1}, …, a_t} maintains the latest N steering actions (a_t ∈ ℝ⁴) using FIFO updating. This multi-timescale penalty simultaneously suppresses high-frequency jitter (via Δ_mean), amplitude fluctuations (via σ), and abrupt changes (via R).
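A Python sketch of this reward structure is given below; the coefficients δ_i, η_i, λ^{(i)}_{1:3}, the boundary penalty C, and the window length are illustrative assumptions.

```python
from collections import deque
import numpy as np

class RewardFunction:
    """Sketch of the tracking reward with a sliding-window oscillation penalty.
    All coefficients (delta_i, eta_i, lambda_i, C, window size) are illustrative."""

    def __init__(self, window=10, delta=(1.0, 0.5, -0.2), C=10.0):
        self.window = deque(maxlen=window)             # latest N steering actions (FIFO)
        self.d1, self.d2, self.d3 = delta
        self.C = C
        self.eta = np.full(4, 0.25)                    # channel weights, summing to 1
        self.lam = np.array([[0.4, 0.3, 0.3]] * 4)     # per-channel penalty coefficients

    def oscillation_penalty(self):
        if len(self.window) < 2:
            return 0.0
        W = np.asarray(self.window)                            # shape (N, 4)
        diff_mean = np.abs(np.diff(W, axis=0)).mean(axis=0)    # high-frequency jitter
        var = W.var(axis=0)                                    # amplitude fluctuation
        rng = W.max(axis=0) - W.min(axis=0)                    # abrupt changes
        per_channel = (self.lam[:, 0] * diff_mean
                       + self.lam[:, 1] * var
                       + self.lam[:, 2] * rng)
        return float(np.dot(self.eta, per_channel))

    def __call__(self, a_lat, steering, hit_boundary):
        self.window.append(np.asarray(steering))
        if hit_boundary:
            return -self.C
        phi = np.abs(steering).sum()                   # cumulative steering angle
        return (self.d1 * np.exp(-abs(a_lat))
                + self.d2 * np.exp(-phi)
                + self.d3 * self.oscillation_penalty())

reward = RewardFunction()
print(reward(a_lat=0.3, steering=[0.02, 0.02, -0.01, -0.01], hit_boundary=False))
```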

4.3. Vehicle Tracking Control Algorithm Based on Safety Constraints

This study addresses the vehicle trajectory tracking problem, where the vehicle controller regulates the steering angles of all four wheels based on available lane trajectory information (including lane boundaries and trajectory curvature) and vehicle state parameters (such as velocity and acceleration). The objective is to ensure precise and safe lane-following performance. The framework of the proposed trajectory tracking decision control algorithm is illustrated in Figure 3.
The reinforcement learning problem can be solved using the Actor-Critic framework, which combines the advantages of policy gradient methods and temporal difference methods, including an Actor network and a Critic network. The structure of the Actor network consists of four fully connected layers: an input layer, two hidden layers, and an output layer.
The input layer of the Critic network is divided into two parts: one part receives the state information from the vehicle simulation environment, and the other receives the action values output by the Actor network. The state information is first processed by one hidden layer, then concatenated with the action values and fed into the second hidden layer. The resulting features are finally processed by the output neuron to obtain the state-action value.
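A PyTorch sketch of the two network structures described above is given below; the layer widths, activation functions, and steering-range mapping are assumptions for illustration.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Four fully connected layers: input layer, two hidden layers, steering output."""
    def __init__(self, state_dim=3, action_dim=4, hidden=256, max_angle=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),   # output in [-1, 1]
        )
        self.max_angle = max_angle                      # linear mapping to the steering range

    def forward(self, s):
        return self.max_angle * self.net(s)

class Critic(nn.Module):
    """The state passes through one hidden layer, is concatenated with the action,
    and a second hidden layer produces the scalar state-action value."""
    def __init__(self, state_dim=3, action_dim=4, hidden=256):
        super().__init__()
        self.state_layer = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.joint_layer = nn.Sequential(
            nn.Linear(hidden + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        h = self.state_layer(s)
        return self.joint_layer(torch.cat([h, a], dim=-1))

actor, critic = Actor(), Critic()
s = torch.randn(8, 3)                 # batch of states {y_L, eps_L, rho}
q = critic(s, actor(s))               # state-action values, shape (8, 1)
```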
To enhance the learning efficiency of reinforcement learning, we introduce a prioritized experience replay mechanism and a dual experience pool mechanism within the framework of constrained reinforcement learning algorithms. The prioritized experience replay mechanism allows the agent to preferentially select experience samples with higher importance or value for training. By focusing on more representative situations or critical learning moments, the agent can learn and optimize its strategy more effectively. By prioritizing the replay of key experiences, the agent can learn faster from past experiences, thereby accelerating the learning process. This is particularly useful in situations where rapid adaptation to environmental changes is required.
The Actor network generates a set of action values, i.e., the steering angles of the four wheels, and random noise is added before they are sent to the MATLAB (R2024b) simulation software. Based on these actions, MATLAB feeds the next state s′ into the reward function module and stores it in the buffer together with the current state s, the reward value r, and the action value a. To utilize high-quality data for neural network training and improve training stability, the buffer is divided into a standard-error buffer R_standard and a high-error buffer R_high. The experience buffer is separated by computing the TD error TDE_t = y_t − \bar{Q}_θ(s_t, a_t), with the sampling probability of R_high increasing as model training progresses.
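A minimal sketch of this replay-separation idea follows; the TD-error threshold and the schedule that raises the sampling share of the high-error buffer are illustrative assumptions.

```python
import random
from collections import deque

class SeparatedReplayBuffer:
    """Transitions with a large TD error go to a high-error buffer whose sampling
    share grows as training progresses. Threshold and schedule are illustrative."""

    def __init__(self, capacity=100_000, td_threshold=1.0):
        self.standard = deque(maxlen=capacity)
        self.high_error = deque(maxlen=capacity)
        self.td_threshold = td_threshold

    def store(self, transition, td_error):
        target = self.high_error if abs(td_error) > self.td_threshold else self.standard
        target.append(transition)

    def sample(self, batch_size, progress):
        """progress in [0, 1]: fraction of training completed; raises the share of
        high-error samples from 20% up to 60% (fewer are returned if buffers are small)."""
        p_high = 0.2 + 0.4 * progress
        n_high = min(int(batch_size * p_high), len(self.high_error))
        n_std = min(batch_size - n_high, len(self.standard))
        return random.sample(self.high_error, n_high) + random.sample(self.standard, n_std)

buffer = SeparatedReplayBuffer()
buffer.store(("s", "a", 1.0, 0.0, "s_next"), td_error=0.3)
buffer.store(("s", "a", 0.2, 1.0, "s_next"), td_error=2.5)
print(len(buffer.sample(batch_size=2, progress=0.5)))
```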
A certain number of samples are proportionally drawn from the experience replay buffers, each sample containing (s, a, r, s′). The states s and s′ are passed to the Actor network, and (s, a, r, s′) is passed to the Critic networks for iterative updates. The Actor network accepts s and s′, outputs the action a with added random noise to MATLAB and, after outputting the next action a′ to the Critic network, receives the gradient of the state-action value Q(s, a) with respect to a to update its parameters. The Critic network accepts s′ and a′ to calculate Q_R(s′, a′), which is combined with r to compute the labels used for the iterative network update. Additionally, s and a are fed into the Critic network, and the mean squared error between its output and the labels is used as the loss function:
$$L_R = \frac{1}{N} \sum_i \big( y_i - Q_R(s_i, a_i \mid \theta_R^Q) \big)^2 \qquad (9)$$
The reward Critic network is updated by minimizing (9),
$$L_C = \frac{1}{N} \sum_i \big( z_i - Q_C(s_i, a_i \mid \theta_C^Q) \big)^2 \qquad (10)$$
while the cost Critic network is updated by minimizing (10).
$$\nabla_{\theta^\mu} \approx \frac{1}{N} \sum_i \nabla_{\theta^\mu} \Big( Q_R\big(s, \mu(s \mid \theta^\mu) \mid \theta_R^Q\big) - \lambda\, Q_C\big(s, \mu(s \mid \theta^\mu) \mid \theta_C^Q\big) \Big) \Big|_{s = s_i} \qquad (11)$$
The sampled policy gradient (11) is used to update the Actor network, and the Lagrange multiplier is updated using the cost Critic network. The pseudocode of the safe reinforcement learning algorithm based on the augmented Lagrangian method is given in Algorithm 1.
Algorithm 1 Safe DDPG based on the Augmented Lagrangian Method (DDPGALM).
1: Randomly initialize the reward Critic Q-network Q_R(s, a | θ_R^Q), the cost Critic Q-network Q_C(s, a | θ_C^Q), and the Actor network μ(s | θ^μ)
2: Initialize the target networks: θ_R^{Q′} ← θ_R^Q, θ_C^{Q′} ← θ_C^Q, θ^{μ′} ← θ^μ
3: Initialize the replay buffer R and the augmented Lagrangian multiplier λ
4: for episode k = 0, 1, … do
5:     Initialize a random process N for action exploration
6:     Get the initial state s_0 ∼ p_0
7:     for t = 1, …, T do
8:         Select the action a_t = μ(s_t | θ^μ) with exploration noise from N
9:         Execute the action a_t and observe r_t, c_t, s_{t+1}
10:        Store the tuple (s_t, a_t, r_t, c_t, s_{t+1}) in the replay buffer R
11:        Randomly sample N tuples {(s_i, a_i, r_i, c_i, s_{i+1})}_{i=1}^{N} from the replay buffer R
12:        Define y_i = r_i + γ Q′_R(s_{i+1}, μ′(s_{i+1} | θ^{μ′}) | θ_R^{Q′}) and z_i = c_i + γ Q′_C(s_{i+1}, μ′(s_{i+1} | θ^{μ′}) | θ_C^{Q′})
13:        Update the reward Critic network by minimizing L_R = (1/N) ∑_i (y_i − Q_R(s_i, a_i | θ_R^Q))², and update the cost Critic network by minimizing L_C = (1/N) ∑_i (z_i − Q_C(s_i, a_i | θ_C^Q))²
14:        Update the Actor network using the sampled policy gradient:
           ∇_{θ^μ} ≈ (1/N) ∑_i ∇_{θ^μ} ( Q_R(s, μ(s | θ^μ) | θ_R^Q) − λ Q_C(s, μ(s | θ^μ) | θ_C^Q) ) |_{s = s_i}
15:        Update the augmented Lagrangian multiplier λ:
           Δλ = (1/N) ∑_i [ Q_C(s_i, μ(s_i | θ^μ) | θ_C^Q) − d ]
16:        Update the target networks:
           θ_R^{Q′} ← τ θ_R^Q + (1 − τ) θ_R^{Q′}
           θ_C^{Q′} ← τ θ_C^Q + (1 − τ) θ_C^{Q′}
           θ^{μ′} ← τ θ^μ + (1 − τ) θ^{μ′}
17:    end for
18: end for
The DDPGALM algorithm addresses the safety constraints in trajectory tracking for four-wheel steering vehicles. Specifically, it mitigates the issue of premature termination during model training caused by collisions with the lane edges. By incorporating the augmented Lagrangian method, the algorithm ensures that the safety constraints are rigorously enforced, thereby enhancing the robustness and reliability of the training process. This approach not only improves the vehicle's ability to track trajectories accurately but also significantly reduces the risk of unsafe behaviors, such as lane boundary violations, during both training and deployment.
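For illustration, the following PyTorch sketch implements one inner update of Algorithm 1 (critic regression, actor ascent on Q_R − λQ_C, and the multiplier step). The dictionary-based bookkeeping, the cost limit d, and the multiplier learning rate are assumptions, and the soft target-network updates of step 16 are assumed to be performed elsewhere.

```python
import torch
import torch.nn.functional as F

def ddpgalm_update(batch, nets, optims, lmbda, gamma=0.99, cost_limit=0.0, lmbda_lr=5e-3):
    """One inner update of Algorithm 1 (sketch).

    `nets` is assumed to hold 'actor', 'critic_r', 'critic_c' and their '*_target'
    copies; `optims` holds one torch optimizer per trained network.
    """
    s, a, r, c, s_next = batch

    with torch.no_grad():
        a_next = nets["actor_target"](s_next)
        y = r + gamma * nets["critic_r_target"](s_next, a_next)   # reward TD target (step 12)
        z = c + gamma * nets["critic_c_target"](s_next, a_next)   # cost TD target (step 12)

    # Step 13: regress both Critics onto their TD targets.
    loss_r = F.mse_loss(nets["critic_r"](s, a), y)
    optims["critic_r"].zero_grad()
    loss_r.backward()
    optims["critic_r"].step()

    loss_c = F.mse_loss(nets["critic_c"](s, a), z)
    optims["critic_c"].zero_grad()
    loss_c.backward()
    optims["critic_c"].step()

    # Step 14: ascend Q_R - lambda * Q_C with respect to the Actor parameters.
    a_pi = nets["actor"](s)
    actor_loss = -(nets["critic_r"](s, a_pi) - lmbda * nets["critic_c"](s, a_pi)).mean()
    optims["actor"].zero_grad()
    actor_loss.backward()
    optims["actor"].step()

    # Step 15: dual ascent on the estimated constraint violation, keeping lambda >= 0.
    with torch.no_grad():
        violation = (nets["critic_c"](s, nets["actor"](s)) - cost_limit).mean().item()
    lmbda = max(0.0, lmbda + lmbda_lr * violation)
    return lmbda
```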

5. Simulation Studies

5.1. Simulation Platform and Model Verification

The vehicle trajectory tracking control algorithm is validated using a co-simulation platform consisting of three main components: CarSim 2020 (Mechanical Simulation Corporation, Ann Arbor, MI, USA), Simulink R2021b (The MathWorks, Inc., Natick, MA, USA), and Python 3.7. The system architecture incorporates CarSim as a high-fidelity vehicle dynamics solver, embedded into the Simulink environment through an S-Function interface and providing multi-degree-of-freedom vehicle models and real-time state feedback; the simulation process in the CarSim software is shown in Figure 4a. Notably, a dedicated 6-DOF vehicle dynamics model is implemented in Simulink, serving as the primary plant model during the deep reinforcement learning training phases. This parallel modeling approach enables comparative validation between the Simulink-based dynamics model and CarSim's S-Function module, which share similar functional capabilities in providing real-time vehicle state feedback while differing in model fidelity and computational efficiency.
The Simulink platform serves as the control framework, implementing signal routing, data preprocessing, and actuator command conversion. The co-simulation strategy employs the Simulink-built 6-DOF model for algorithm training iterations due to its computational efficiency, while reserving the higher-fidelity CarSim S-Function module for the final validation stage. Figure 4b illustrates the integrated simulation framework combining Python-based intelligent decision-making with MATLAB/Simulink-CarSim vehicle dynamics modeling. The Python agent, implemented with PyTorch's deep reinforcement learning toolkit, interfaces with MATLAB via the MatlabEngine package to exchange control actions (wheel_angle) and synchronization signals (pause_flag, pause_time) at a 100 Hz frequency. The Simulink environment processes these inputs through the CarSim vehicle model, returning observation states, reward signals, constraint costs, and episode termination flags via shared memory buffers. Tight temporal synchronization (10 ms timestep) is maintained through event-driven pausing mechanisms, enabling deterministic co-simulation while preserving the fidelity of both the control computations and the vehicle dynamics simulations. This hierarchical modeling architecture ensures both training efficiency, through Simulink's real-time simulation capabilities, and validation credibility, via CarSim's professional vehicle dynamics solver.
The Simulink simulation diagram is given in Figure 4c and includes five main parts. The Act module provides the policy actions obtained through reinforcement learning training; the CarSim Model is the joint-simulation module with the high-fidelity simulation software; the Dynamics Model is the vehicle dynamics simulation module built in Simulink; the Reward module calculates the reward, cost, done, obs, and other values of the current environment according to the states obtained from the CarSim Model or the Dynamics Model; and the Assertion module controls the state communication between Python and Simulink during reinforcement learning training.
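A heavily simplified sketch of one environment step in this co-simulation loop is shown below. The Simulink model name, block path, and the workspace variable names used for the returned observation, reward, cost, and done flag are assumptions; only wheel_angle, pause_flag, and pause_time are taken from the interface described above.

```python
import matlab.engine

# The model name 'vehicle_env' and the read-back variable names ('obs', 'reward',
# 'cost', 'done') are assumptions; wheel_angle, pause_flag, and pause_time follow
# the interface described in the text.
eng = matlab.engine.start_matlab()
eng.load_system("vehicle_env", nargout=0)          # open the Simulink/CarSim model

def env_step(wheel_angles, pause_time=0.01):
    """Send the four steering angles, let the paused model advance by one control
    period, and read the observation, reward, cost, and done flag back."""
    eng.workspace["wheel_angle"] = matlab.double(list(wheel_angles))
    eng.workspace["pause_flag"] = 0.0               # release the event-driven pause
    eng.workspace["pause_time"] = pause_time
    eng.eval("set_param('vehicle_env', 'SimulationCommand', 'continue')", nargout=0)
    obs = list(eng.workspace["obs"][0])
    return obs, eng.workspace["reward"], eng.workspace["cost"], bool(eng.workspace["done"])

obs, reward, cost, done = env_step([0.01, 0.01, -0.005, -0.005])
```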
Our experimental framework establishes the following environmental assumptions. The simulation assumes dry asphalt road conditions with a friction coefficient μ = 0.85, excluding terrain irregularities and elevation changes. Vehicle localization integrates LiDAR, vision sensors, and odometry measurements rather than GPS positioning, reflecting modern autonomous system architectures. Environmental disturbances, including wind effects and electromagnetic interference, are excluded from consideration. The control system implements a 10 ms actuation latency while maintaining a 1 ms time resolution in the Simulink-CarSim co-simulation environment to ensure dynamic fidelity. Training trajectories adopt the ISO 3888-1:1999 [26] emergency lane-change standard augmented with curvature perturbations (Δκ ∼ U(−0.2, 0.2) m⁻¹) and path length variations (Δs ∼ N(0, 0.5 m)), systematically expanding the operational envelope while preserving safety constraints. These explicitly defined parameters delineate the method's operational domain and enable reproducible validation under controlled autonomous driving conditions.
The double lane change (DLC) maneuver, recognized as a representative scenario for evaluating vehicle stability during high-speed overtaking and obstacle avoidance, serves as a standard test for assessing vehicle handling and stability performance. The vehicle trajectory tracking control system is validated through co-simulation using the integrated Simulink and CarSim platform. To verify the simulation accuracy of both the 6-DOF vehicle dynamics model and the CarSim S-Function module, a comparative validation was conducted under double lane change maneuvers at 60 km/h and 90 km/h operating conditions. The CarSim platform's built-in driver controller was employed to generate baseline vehicle responses, with both models receiving identical steering inputs and road signals.
As shown in Figure 5, the comparative analysis of critical state parameters (including lateral velocity and yaw rate) demonstrates that the Simulink-implemented 6-DOF dynamics model achieves a high correlation with CarSim's reference outputs at 60 km/h and 90 km/h. Quantitative comparisons between the simulation platforms demonstrate exceptional agreement across operational scenarios. At 60 km/h, the lateral velocity (v_y) achieves an RMSE of 1.990 × 10⁻⁴ m/s, while the yaw rate (ω) maintains 3.376 × 10⁻⁵ rad/s precision. These metrics scale consistently with velocity, reaching 4.228 × 10⁻⁴ m/s for v_y and 7.023 × 10⁻⁴ rad/s for ω at 90 km/h. The sub-millimeter-per-second discrepancies validate the modeling fidelity of our Simulink implementation relative to the CarSim industry standard, confirming the vehicle dynamics replication accuracy essential for reliable reinforcement learning training.
To verify the superiority of the proposed method, different algorithms were used for training, including SAC (Soft Actor-Critic) [15], TD3 (Twin Delayed Deep Deterministic Policy Gradient) [16], DDPG [17], and the DDPGALM algorithm proposed in this paper; their parameters are listed in Table 1.
Our methodology employs safety-constrained offline reinforcement learning to address the adaptation-safety dilemma. The training process utilizes the ISO 3888-1:1999 emergency lane-change trajectory as the baseline, introducing controlled perturbations during policy optimization to enhance generalization while preserving safety constraints. Highway geometry inherently bounds trajectory deviations through physical road boundaries, ensuring that operational scenarios remain within the trained parameter space. Experimental validation in the next section confirms the policy's robust adaptation to all tested trajectory variations without safety violations, demonstrating effective mitigation of offline training limitations through systematic environment-aware perturbation design.

5.2. Simulation Results and Analysis

To comprehensively evaluate the performance of the proposed DDPGALM algorithm, comparative simulations were conducted against general deep reinforcement learning algorithms under two distinct speed conditions: 60 km/h (low speed) and 90 km/h (high speed). The selection of these speeds aims to validate the algorithm's adaptability across diverse driving scenarios. The low-speed scenario (60 km/h) reflects typical urban driving environments, where precise trajectory tracking and smooth control are essential. In contrast, the high-speed scenario (90 km/h) imposes greater challenges due to increased dynamic complexity, such as higher lateral acceleration and reduced reaction time, thereby testing the robustness and stability of the control strategy.
Figure 6 shows the training process of the different reinforcement learning algorithms in the different scenarios. At 60 km/h (Figure 6a,b), all algorithms demonstrate comparable training stability with similar episode termination frequencies and reward convergence patterns, attributable to reduced lane boundary constraint activations under lower dynamic complexity. The high-speed scenario (90 km/h, Figure 6c,d) reveals a critical differentiation: DDPGALM significantly reduces early terminations while sustaining stable reward progression, in contrast with the general methods, which exhibit frequent training interruptions. The off-road rates of the DDPGALM, SAC, TD3, and DDPG algorithms are 24.8%, 51.4%, 64.4%, and 71.9%, respectively, at 60 km/h, and 14.8%, 35.7%, 75.5%, and 43.4%, respectively, at 90 km/h, as shown in Figure 6a,c. This performance divergence stems from DDPGALM's adaptive learning architecture, which dynamically adjusts exploration strategies and anticipates safety constraints under high-inertia conditions, thereby maintaining coherent policy evolution despite increased dynamic challenges.
As shown in Figure 7, DDPGALM demonstrates remarkable trajectory tracking accuracy in a single episode of the simulation process. At 60 km/h (Figure 7a,b), both algorithms achieve low lateral errors, but DDPGALM exhibits smoother trajectory adjustments, as evidenced by its reduced oscillation in Figure 7b. At 90 km/h (Figure 7c,d), the baseline DRL struggles with accumulating trajectory deviations, particularly during sharp turns, leading to a maximum lateral error of 0.35 m. In contrast, DDPGALM limits the error to 0.15 m, highlighting its enhanced robustness under high-speed perturbations. Our rigorous lateral tracking standard (<0.05 m error threshold) yields 83.24 % and 80.86 % trajectory completion rates at 60 km/h and 90 km/h respectively, as quantitatively validated in Figure 7.
Figure 8 analyzes steering angle and lateral acceleration. At 60 km/h (Figure 8a,b), the steering angle and lateral acceleration profiles of DDPGALM are smoother, ensuring passenger comfort. At 90 km/h (Figure 8c,d), the general DRL generates abrupt steering adjustments, resulting in unstable lateral acceleration peaks (up to 1.8 m/s²). DDPGALM, however, maintains controlled steering inputs, restricting lateral acceleration below 1.0 m/s².
Figure 9 shows the reward values calculated by the reward function based on the vehicle's lateral tracking error, wheel steering angles, and lateral acceleration during the trajectory tracking process. The comparative simulations under the low- and high-speed scenarios demonstrate that DDPGALM outperforms conventional DRL in terms of training efficiency, trajectory tracking precision, and vehicle state stability. The high-speed scenario (90 km/h) particularly underscores the algorithm's advanced adaptability to dynamic challenges, a critical requirement for autonomous driving systems operating in complex environments. This dual-speed validation framework provides a comprehensive assessment of the algorithm's scalability and robustness, ensuring its applicability across diverse real-world scenarios.

6. Conclusions

The vehicle trajectory tracking algorithm based on safe reinforcement learning designed in this paper improves the stability and accuracy of four-wheel steering vehicles during driving. Compared with conventional deep reinforcement learning algorithms, the augmented Lagrangian DDPG algorithm models the vehicle trajectory tracking problem as a constrained Markov decision process and is characterized by faster training and higher safety. By utilizing a prioritized experience replay buffer, the model's ability to learn and explore from experience data is enhanced through flexible and efficient use of the data. Experiments show that unsafe behaviors, such as collisions with lane edges, are significantly reduced during training. By designing a reasonable reward function, both tracking accuracy and passenger comfort are ensured. However, the algorithm still does not fully guarantee training safety, indicating limitations in handling constrained problems. Future work will focus on more complex road scenarios and explore more effective methods for handling constrained problems.

Author Contributions

Conceptualization, Z.L. and M.W.; Methodology, Z.L.; Software, Z.L.; Validation, Z.L.; Formal analysis, Z.L.; Investigation, Z.L.; Writing—original draft, Z.L.; Writing—review & editing, M.W.; Supervision, H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, M.; Niu, C.; Wang, Z.; Jiang, Y.; Jian, J.; Tang, X. Model and Parameter Adaptive MPC Path Tracking Control Study of Rear-Wheel-Steering Agricultural Machinery. Agriculture 2024, 14, 823. [Google Scholar] [CrossRef]
  2. Tan, Q.; Dai, P.; Zhang, Z.; Katupitiya, J. MPC and PSO Based Control Methodology for Path Tracking of 4WS4WD Vehicles. Appl. Sci. 2018, 8, 1000. [Google Scholar] [CrossRef]
  3. Zou, T.; You, Y.; Meng, H.; Chang, Y. Research on Six-Wheel Distributed Unmanned Vehicle Path Tracking Strategy Based on Hierarchical Control. Biomimetics 2022, 7, 238. [Google Scholar] [CrossRef] [PubMed]
  4. Hajjami, L.E.; Mellouli, E.; Žuraulis, V.; Berrada, M.; Boumhidi, I. A Robust Intelligent Controller for Autonomous Ground Vehicle Longitudinal Dynamics. Appl. Sci. 2023, 13, 501. [Google Scholar] [CrossRef]
  5. Wang, Y.; Wang, Z.; Shi, D.; Chu, F.; Guo, J.; Wang, J. Optimized Longitudinal and Lateral Control Strategy of Intelligent Vehicles Based on Adaptive Sliding Mode Control. World Electr. Veh. J. 2024, 15, 387. [Google Scholar] [CrossRef]
  6. Oh, K.; Seo, J. Development of a Sliding-Mode-Control-Based Path-Tracking Algorithm with Model-Free Adaptive Feedback Action for Autonomous Vehicles. Sensors 2023, 23, 405. [Google Scholar] [CrossRef] [PubMed]
  7. Shan, Y.; Zheng, B.; Chen, L.; Chen, L.; Chen, D. A Reinforcement Learning-Based Adaptive Path Tracking Approach for Autonomous Driving. IEEE Trans. Veh. Technol. 2020, 69, 10581–10595. [Google Scholar] [CrossRef]
  8. Zhang, Q.; Wang, H.G.; Ni, L. Improved DDPG Algorithm Based on Offline Model Pre-Training Learning. Comput. Eng. Des. 2022, 43, 1451–1458. [Google Scholar]
  9. Luo, Z.; Zhou, J.; Wen, G. Deep Reinforcement Learning Based Tracking Control of Unmanned Vehicle with Safety Guarantee. In Proceedings of the 2022 13th Asian Control Conference (ASCC), Jeju Island, Republic of Korea, 4–7 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1893–1898. [Google Scholar]
  10. Xiao, H.X.; Zhao, H.X.; Yang, T.J. Special Vehicle Route Optimization Strategy Based on Route Search DQN. Comput. Eng. Des. 2024, 45, 3153–3160. [Google Scholar]
  11. Chen, I.M.; Chan, C.Y. Deep Reinforcement Learning Based Path Tracking Controller for Autonomous Vehicle. Proc. Inst. Mech. Eng. Part J Automob. Eng. 2021, 235, 541–551. [Google Scholar] [CrossRef]
  12. Ning, Q.; Liu, Y.S.; Xie, L.Y. Application of SAC-Based Autonomous Vehicle Control Method. Comput. Eng. Appl. 2023, 59, 306–314. [Google Scholar]
  13. Ye, B.L.; Wang, X.; Li, L.X.; Wu, W.M. Vehicle Intelligent Control Method Based on Deep Reinforcement Learning PPO. Comput. Eng. 2024, 1–14. [Google Scholar] [CrossRef]
  14. Mnih, V. Playing Atari with Deep Reinforcement Learning. arXiv 2013, arXiv:1312.5602. [Google Scholar]
  15. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870. [Google Scholar]
  16. Fujimoto, S.; Hoof, H.; Meger, D. Addressing Function Approximation Error in Actor-Critic Methods. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 1587–1596. [Google Scholar]
  17. Lillicrap, T.P. Continuous Control with Deep Reinforcement Learning. arXiv 2015, arXiv:1509.02971. [Google Scholar]
  18. Hassan, I.A.; Ragheb, H.; Sharaf, A.M.; Attia, T. Reinforcement Learning for Precision Navigation: DDQN-Based Trajectory Tracking in Unmanned Ground Vehicles. In Proceedings of the 2024 14th International Conference on Electrical Engineering (ICEENG), Cairo, Egypt, 21–23 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 54–59. [Google Scholar]
  19. Li, M.; Liu, H.; Wang, H.; Xia, M. Trustworthy Dynamic Object Tracking Using Deep Reinforcement Learning with the Self-Attention Mechanism. In Proceedings of the 2023 IEEE 19th International Conference on Automation Science and Engineering (CASE), Auckland, New Zealand, 26–30 August 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–6. [Google Scholar]
  20. Srikonda, S.; Norris, W.R.; Nottage, D.; Soylemezoglu, A. Deep Reinforcement Learning for Autonomous Dynamic Skid Steer Vehicle Trajectory Tracking. Robotics 2022, 11, 95. [Google Scholar] [CrossRef]
  21. Gu, S.; Yang, L.; Du, Y.; Chen, G.; Walter, F.; Wang, J.; Knoll, A. A Review of Safe Reinforcement Learning: Methods, Theory and Applications. arXiv 2022, arXiv:2205.10330. [Google Scholar] [CrossRef] [PubMed]
  22. Carr, S.; Jansen, N.; Junges, S.; Topcu, U. Safe Reinforcement Learning via Shielding under Partial Observability. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 14748–14756. [Google Scholar]
  23. Simão, T.D.; Suilen, M.; Jansen, N. Safe Policy Improvement for POMDPs via Finite-State Controllers. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 15109–15117. [Google Scholar]
  24. Park, J.; Seo, Y.; Shin, J.; Lee, H.; Abbeel, P.; Lee, K. SURF: Semi-Supervised Reward Learning with Data Augmentation for Feedback-Efficient Preference-Based Reinforcement Learning. arXiv 2022, arXiv:2203.10050. [Google Scholar]
  25. Cohen, M.H.; Belta, C. Safe Exploration in Model-Based Reinforcement Learning Using Control Barrier Functions. Automatica 2023, 147, 110684. [Google Scholar] [CrossRef]
  26. ISO 3888-1:1999; Passenger Cars: Test Tracks for a Severe Lane-Change Manoeuvre: Part 1: Double Lane-Change. International Organisation for Standardization: London, UK, 2018.
Figure 1. 6 DOF model of a four-wheel vehicle.
Figure 2. Vehicle tracking model with preview distance.
Figure 3. Trajectory tracking decision control algorithm framework.
Figure 4. Vehicle system simulation frameworks and processes.
Figure 5. Verification of vehicle modeling at different velocities.
Figure 6. Reinforcement learning training process.
Figure 7. Simulation results of tracking trajectory.
Figure 8. Simulation results of vehicle state.
Figure 9. Simulation results of reward value.
Table 1. Reinforcement learning parameters in the vehicle DLC maneuver.

Parameters                      SAC         TD3         DDPG        DDPGALM
Sampling time/s                 0.01        0.01        0.01        0.01
Batch size                      256         256         256         256
Discount factor γ               0.99        0.99        0.99        0.99
Initial exploration rate        0.2         0.2         0.2         0.2
Final exploration rate          0.01        0.01        0.01        0.01
Critic learning rate            3 × 10⁻⁴    3 × 10⁻⁴    1 × 10⁻³    1 × 10⁻³
Actor learning rate             3 × 10⁻⁴    5 × 10⁻³    1 × 10⁻³    1 × 10⁻³
Soft update coefficient         5 × 10⁻³    2 × 10⁻³    1 × 10⁻²    1 × 10⁻²
Entropy coefficient α           0.2         –           –           –
Target policy noise             –           0.2         –           –
Delayed update frequency        –           2           –           –
Initial penalty factor          –           –           –           0.01
Initial Lagrange multiplier     –           –           –           5 × 10⁻³