Article

A Rule-Guided Distributional Soft Actor–Critic Algorithm for Safe Lane-Changing in Complex Driving Scenarios

1 School of Mechanical and Automotive Engineering, Guangxi University of Science and Technology, Liuzhou 545006, China
2 State Key Laboratory of Intelligent Green Vehicle and Mobility, Tsinghua University, Beijing 100084, China
3 School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China
* Authors to whom correspondence should be addressed.
Vehicles 2026, 8(3), 58; https://doi.org/10.3390/vehicles8030058
Submission received: 29 January 2026 / Revised: 27 February 2026 / Accepted: 11 March 2026 / Published: 13 March 2026
(This article belongs to the Special Issue AI-Empowered Assisted and Autonomous Driving)

Abstract

Mandatory lane-changing in complex driving scenarios poses significant challenges for autonomous driving systems due to complex vehicle interactions and strict safety requirements. Existing methods often rely on handcrafted rules or extensive expert demonstrations, which increase data collection costs and provide limited safety guarantees during learning. To address these issues, this paper proposes a rule-guided reinforcement learning framework for lane-changing policy optimization. A lightweight rule-based controller is employed to generate initial experience, guiding the training of an improved Distributional Soft Actor–Critic with Three Refinements (DSAC-T), while a safety-aware constraint controller filters high-risk actions to ensure stable and safe learning. The proposed method is evaluated in Regular Lane Change and Lane Merging scenarios under mixed traffic composed of aggressive and conservative vehicles within a simulation environment. Simulation results show that although lane-changing success rates decrease as traffic aggressiveness increases, the proposed method consistently outperforms SAC and TD3. Notably, under highly aggressive traffic conditions with an aggressiveness ratio of 0.7, the proposed approach improves the success rate by 17.13% compared to SAC and by 10.49% compared to TD3, demonstrating superior robustness and safety in complex, high-conflict lane-changing scenarios. The present study is conducted solely in simulation and requires further validation before application to real-world traffic environments.

1. Introduction

1.1. Motivation

Lane-changing is one of the most challenging subtasks in autonomous driving and serves as a key indicator of a system’s ability to make complex decisions in dynamic environments. Statistics indicate that approximately 13% of highway traffic accidents in the United States are caused by improper lane-changing behaviors [1]. Based on the underlying motivation, lane changes can generally be classified into two types: those driven by subjective intentions (e.g., seeking speed advantages) and those triggered by objective requirements such as mandatory changes dictated by global route planning [2,3].
Numerous researchers have explored the use of rule-based, game-theoretic, and learning-based approaches to address the lane-changing decision-making problem in autonomous driving. However, achieving safe and reliable lane changes in highly interactive and congested environments remains a major challenge [4,5]. Rule-based methods, while interpretable and easy to implement, often suffer from poor generalization and lack the flexibility to handle complex and adversarial interactions common in congested traffic. Game-theoretic approaches attempt to model interactions among vehicles as strategic behaviors, but they typically require strong assumptions about the intentions and rationality of other drivers, which limits their applicability in uncertain and dynamic real-world settings. On the other hand, pure reinforcement learning (RL)-based methods rely heavily on large-scale expert demonstrations, which significantly increase data collection costs and hinder deployment. Moreover, in the absence of explicit safety constraints, RL agents tend to make unsafe decisions during the early stages of training, posing risks in real-world applications.
To address the limitations of high reliance on expert data and insufficient safety guarantees in existing methods, this paper proposes a rule-guided reinforcement learning approach with integrated safety constraints. The method utilizes a rule-based controller to generate initial experience, thereby reducing dependence on human expert demonstrations during the early stages of training. Furthermore, a safety constraint module is introduced to filter potentially dangerous actions produced by the policy network, effectively preventing unsafe behaviors during exploration. This framework enables safe and stable reinforcement learning, laying the foundation for practical deployment in real-world autonomous driving scenarios.

1.2. Literature Review

Existing lane-change decision methods can be broadly classified into three categories: rule-based, game theory-based, and reinforcement learning-based methods. Rule-based approaches depend on handcrafted decision rules, which provide strong interpretability and high computational efficiency; however, their performance often degrades in highly dynamic or complex traffic scenarios due to limited adaptability. He et al. [6] proposed a rule-driven, safety-critical control method based on a finite state machine (FSM). However, the simplified vehicle dynamics model adopted in their approach exhibits noticeable accuracy deviations under highly dynamic traffic conditions, which limits the generalization and robustness of the strategy in complex real-world scenarios. Cao et al. [7] developed a rule-based lane-changing method incorporating a look-ahead mechanism. While it performs well in sparse traffic, its generalization capability remains limited when dealing with complex traffic environments such as low-speed congestion. Asano et al. [8] proposed a rule-based cooperative lane-changing method, but it lacks robustness in congested scenarios with strong interactions.
Game-theoretic-based lane-changing models emphasize the interactions among drivers and, compared with traditional models, provide a more realistic representation of driving behavior. In multi-agent interactive settings, each participant selects a strategy based on expectations of others’ behavior to maximize their own benefit, leading to a balance of strategies known as equilibrium [9,10,11]. However, they involve complex modeling procedures and rely on idealized assumptions about the behaviors of other agents. Ali et al. [12] incorporated the behavior of following vehicles into game-theoretic modeling and validated their method in a connected environment, demonstrating strong behavioral interpretability. However, the timing of lane changes is constrained by limited acceleration lane space and the response delay of following vehicle strategies, resulting in considerable prediction errors in both lane-change timing and positioning. Lopez et al. [13] formulated a standard-form game model for multi-vehicle interactions, but their approach considers only pairwise games between the ego vehicle and individual surrounding vehicles, without incorporating global traffic context. As a result, the model faces limitations in reward modeling accuracy and strategy generalization under high-density or behaviorally complex traffic scenarios.
Imitation learning (IL) aims to learn policies that mimic expert behavior from demonstrations, without requiring explicit rewards or handcrafted rules [14]. However, it suffers from covariate shift between training and deployment, leading to compounding errors, particularly in end-to-end autonomous driving tasks where raw sensory inputs are directly mapped to control commands. Limited expert data, poor interpretability, and high sensitivity to perception noise significantly hinder policy generalization and safety in such scenarios [15]. A prevailing approach is to first use imitation learning to obtain an initial policy, which is then refined through reinforcement learning. Guo and Liu [16] proposed a hierarchical lane-changing framework combining behavior cloning and reinforcement learning. A decision tree is used at the high level to select driving tasks, while the low level applies behavior cloning to learn control commands from expert data. Feature selection helps reduce model complexity and avoid overfitting. Xiao and Wang et al. [17] proposed a hybrid method that combines behavior cloning (BC) with reinforcement learning (RL), using expert data to pretrain both the policy and value networks. The policy network learns to mimic expert actions via supervised learning, while the value network is updated using temporal-difference errors. This approach accelerates RL training, reduces online interaction time, and enables efficient learning in complex tasks.
Deep reinforcement learning (DRL) combines the powerful representational capabilities of neural networks with the decision-making framework of reinforcement learning (RL) [18,19,20,21], making it well-suited for long-term decision-making tasks in complex environments. However, it suffers from unstable training processes, slow convergence, and a tendency to generate unsafe behaviors during early learning stages. Zhao et al. [22] proposed a Q-learning-based method incorporating expert demonstrations to improve exploration efficiency. Sharma et al. [23] proposed a hierarchical reinforcement learning approach based on the SAC algorithm, achieving integrated high-level decision-making and low-level trajectory control for autonomous driving. Liu et al. [24] proposed a lane-changing trajectory planning method based on the LSTM-TD3 algorithm, which integrates temporal state information to generate smooth and stable acceleration strategies for autonomous lane changing, achieving improved decision reliability and higher success rates in complex traffic scenarios. Katzilieris et al. [25] proposed a reinforcement learning-based dynamic lane reversal method, which employs a dueling Double DQN agent to learn optimal reversal timings under varying traffic demands. Liu et al. [26] proposed a reinforcement learning-based method for learning personalized discretionary lane-change initiation, which leverages in-vehicle user feedback through an offline contextual bandit framework, achieving significantly higher accuracy (86.1%) in reproducing individual driving preferences compared to non-customized models.

1.3. Contribution

To address the limitations of existing lane-changing decision-making approaches, this study proposes a hybrid reinforcement learning framework that integrates rule-guided exploration, distributional Soft Actor–Critic (SAC) learning, and safety-aware control within a closed-loop architecture. During early training, a rule-based model generates structured experience stored in a dedicated Rule Policy Buffer. A high sampling ratio from this buffer allows the agent to incorporate prior knowledge and accelerate policy learning. As training progresses, sampling gradually shifts to the Actor Policy Buffer, enabling a dual-stage replay mechanism that balances expert guidance and autonomous exploration. To improve training stability and reduce sensitivity to reward scaling, we adopt an enhanced SAC algorithm, DSAC-T, which incorporates return distribution estimation, double-distribution Q-learning, adaptive clipping, and gradient scaling. The learned policy is filtered by a safety-aware controller before interacting with the environment. Environmental feedback is then used to update both the rule model and the DSAC-T learner, forming a closed-loop learning cycle.
The key contributions of our work are summarized as follows:
(1) Rule-guided and safety-constrained DSAC-T framework: A Distributional Soft Actor–Critic framework with integrated rule guidance and safety constraints is proposed, enabling rapid policy initialization, progressive policy improvement beyond rule baselines, and effective suppression of unsafe actions.
(2) Enhanced distributional RL algorithm: An improved DSAC-T algorithm is adopted, incorporating double-distribution Q-learning, adaptive clipping, and gradient scaling to ensure stable learning without task-specific target tuning.
(3) Lightweight rule and safety control module: A simple rule-based model combined with a safety-aware action masking mechanism is introduced to guide policy learning and prevent unsafe behaviors during both training and deployment.

1.4. Paper Organization

The paper is organized as follows. Section 2 presents the problem formulation and the reinforcement learning background. Section 3 introduces the proposed rule-guided and safety-aware reinforcement learning framework. Section 4 details the training procedures. Section 5 describes the modeling of surrounding vehicles, simulation parameter settings, and simulation scenario design. Section 6 reports and analyzes the simulation results. Finally, Section 7 concludes the paper.

2. Problem Formulation and Background

Mandatory lane changing in congested traffic is a sequential decision-making problem involving dynamic interactions with surrounding vehicles. In real-world driving, a vehicle must determine whether and when to initiate a lane change based on factors such as relative distance, relative speed, surrounding traffic density, and safety constraints. The decision must balance efficiency (e.g., timely completion of the maneuver) and safety (e.g., collision avoidance and risk minimization), while accounting for the uncertain and potentially aggressive behaviors of neighboring vehicles.
From a control perspective, this process can be naturally formulated as a sequential interaction between the ego vehicle and a stochastic traffic environment. At each time step, the ego vehicle observes its current traffic state, selects a control action (e.g., acceleration and steering adjustments), and transitions to a new state influenced by both its own behavior and that of surrounding vehicles. This sequential nature and uncertainty motivate the formulation of the lane-changing problem within a reinforcement learning framework.
In the deep reinforcement learning (DRL) framework, an agent interacts with an uncertain environment by selecting a sequence of actions over time. At each time step, the agent receives a feedback reward from the environment based on its current state and the action taken. The objective is to learn a policy that maximizes the cumulative expected reward over time. To explicitly establish the connection between the mandatory lane-changing task and reinforcement learning, the decision-making process is formulated as a Markov Decision Process (MDP). The formal definition of the MDP is given in Equation (1):
$\mathcal{M} = \langle S_t, A_t, S_{t+1}, R_t \rangle$ (1)
where $S_t$ denotes the state at time $t$, $A_t$ is the action taken, $S_{t+1}$ is the next state, and $R_t$ is the reward received. In the lane-changing context, the state typically includes the ego vehicle’s kinematic variables and the relative states of surrounding vehicles, the action corresponds to continuous control commands, and the reward encodes safety, efficiency, and maneuver completion objectives.
In this study, the lane-changing task is modeled as an MDP, where the agent aims to learn an optimal policy under dynamic and uncertain traffic conditions. The objective of reinforcement learning is to learn a policy π * that maximizes the expected cumulative reward, which can be formally defined as:
$\pi^{*} = \arg\max_{\pi} \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]$ (2)
where $\gamma \in [0, 1]$ is the discount factor, $r(s_t, a_t)$ is the immediate reward under state $s_t$ and action $a_t$, and $\mathbb{E}_{\pi}$ denotes the expectation under policy $\pi$.
However, directly optimizing this objective in high-dimensional or continuous control tasks often suffers from issues such as insufficient exploration and training instability. To address these limitations, the maximum entropy reinforcement learning framework augments the reward objective with a policy entropy term, encouraging exploration by favoring more stochastic policies. To encourage sufficient exploration in continuous control, the objective is augmented with a policy entropy term $\mathcal{H}(\pi(\cdot\,|\,s_t))$, leading to the maximum entropy formulation shown in Equation (3):
$\pi^{*} = \arg\max_{\pi} \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\left(r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot\,|\,s_t)\big)\right)\right]$ (3)
where $\mathcal{H}(\pi(\cdot\,|\,s_t)) = \mathbb{E}_{a_t \sim \pi}\!\left[-\log \pi(a_t\,|\,s_t)\right]$ denotes the entropy of the policy and $\alpha$ is a temperature coefficient balancing reward and entropy. Under this formulation, the optimal policy naturally emerges as one that maximizes both expected return and entropy, enabling improved stability and exploration.
Building upon the standard Soft Actor–Critic (SAC) algorithm [27,28], the Distributional Soft Actor–Critic with Three Refinements (DSAC-T) is used to mitigate value overestimation, training instability, and sensitivity to reward scaling by introducing three key enhancements within a distributional reinforcement learning framework.
The first refinement is expected value substitution. Instead of relying on a single random sample from the learned value distribution, DSAC-T employs the expected Q-value as the training target, which reduces estimation variance and improves training stability:
$y_q = r + \gamma\left[Q_{\bar{\theta}}(s', a') - \alpha \log \pi_{\bar{\phi}}(a'\,|\,s')\right]$ (4)
The second refinement, twin value distribution learning, extends the clipped double Q-learning strategy to the distributional setting. Two independent value distributions $Z_{\theta_1}(s, a)$ and $Z_{\theta_2}(s, a)$ are trained in parallel, and the distribution with the smaller mean is selected for both critic and actor updates:
$J_{\pi}(\phi) = \mathbb{E}_{s \sim \mathcal{B},\, a \sim \pi_{\phi}}\!\left[\min_{i=1,2} Q_{\theta_i}(s, a) - \alpha \log \pi_{\phi}(a\,|\,s)\right]$ (5)
The third refinement is variance-based critic gradient adjustment. A fixed clipping boundary is replaced by an adaptive one based on the value distribution’s standard deviation:
$b = \xi\, \mathbb{E}_{(s,a) \sim \mathcal{B}}\!\left[\sigma_{\theta}(s, a)\right]$ (6)
A gradient scaling factor $\omega = \mathbb{E}_{(s,a) \sim \mathcal{B}}\!\left[\sigma_{\theta}(s, a)\right]^{2}$ is further introduced to normalize the update size.
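To make the three refinements concrete, the following sketch shows how an expected-value target (Equation (4)), twin value-distribution selection (Equation (5)), and a variance-based boundary (Equation (6)) could be combined when computing a critic target. The function signature, the PyTorch tensor interfaces, and the way the boundary is applied to clamp the target around the current estimate are illustrative assumptions, not the authors' implementation.

```python
import torch

def dsac_t_target(reward, gamma, alpha, next_logp,
                  next_mean1, next_mean2, next_std1, next_std2,
                  current_q_mean, xi=3.0):
    """Illustrative DSAC-T-style critic target (assumed interfaces, not the paper's code).

    next_mean_i / next_std_i : mean and std of the target value distributions
                               Z_{theta_i}(s', a') predicted by the two critics.
    next_logp                : log pi(a' | s') for the sampled next action.
    current_q_mean           : mean predicted by the critic being updated at (s, a).
    """
    # Refinement 2: twin value-distribution learning -- keep the distribution
    # with the smaller mean (clipped double Q-learning in distributional form).
    use_first = next_mean1 <= next_mean2
    q_mean = torch.where(use_first, next_mean1, next_mean2)
    q_std = torch.where(use_first, next_std1, next_std2)

    # Refinement 1: expected value substitution -- build the soft target from
    # the expected Q-value rather than a single sampled return (Eq. (4)).
    y_q = reward + gamma * (q_mean - alpha * next_logp)

    # Refinement 3: variance-based boundary b = xi * E[sigma] (Eq. (6)); here it
    # bounds how far the target may move from the current estimate (assumption).
    b = xi * q_std.mean()
    return torch.clamp(y_q, current_q_mean - b, current_q_mean + b)
```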

3. Proposed Approach

3.1. Framework Overview

To optimize lane-changing strategies in autonomous driving decision-making, we propose a rule-driven reinforcement learning approach. In the initial stage, a simple rule-based lane-changing model is introduced to generate prior experience, thereby accelerating the early training of the DSAC-T algorithm and reducing inefficient exploration. A dual experience replay buffer is then constructed to dynamically sample experiences from both rule-based control and real-time RL interactions. During training, to mitigate fluctuations in critic gradients, we adopt the enhanced DSAC-T algorithm proposed by Duan et al. [29], which replaces the original target return with the expected value. To address the overestimation issue, a dual value distribution learning structure is employed, consisting of two independently trained value distributions. To improve policy generalization across tasks, adaptive clipping and gradient scaling are integrated to enable more stable and transferable learning. In addition, to prevent unsafe behaviors during RL training, a Safety Constraint Module (SCM) and a collision avoidance controller with longitudinal and lateral constraints are introduced to mask hazardous actions in the policy output. The overall control framework is illustrated in Figure 1.

3.2. Rule-Guided Controller for Safe and Efficient Lane Changing

To ensure behavioral stability during the early stages of reinforcement learning training and to provide interpretable and controllable fallback behavior when the learned policy is immature, a heuristic rule-based controller is designed to handle essential driving tasks including lane keeping, car-following, and right-lane changing. The controller takes as input key variables from the observation space, including the ego vehicle’s speed $v_{ego}$, the distance to the leading vehicle $d_{front}$, and the relative speed $\Delta v = v_{ego} - v_{front}$, and outputs a three-dimensional control command $(steer,\ throttle,\ brake)$.
In lateral control, if the lane change trigger condition is met (e.g., the current lane ID is −3 and an external signal is activated), the controller applies a fixed steering command $steer = 0.5$ to execute a right-lane change. Otherwise, it computes the steering angle based on the position of a forward waypoint. Given the waypoint coordinates in the vehicle’s local frame as $(x_{wp}, y_{wp})$, the steering angle is computed as:
$\theta_{steer} = \arctan\!\left(\frac{2\, y_{wp}}{x_{wp}^{2} + y_{wp}^{2}}\right), \qquad steer = \mathrm{clip}\!\left(\frac{4\, \theta_{steer}}{\pi},\ -1.0,\ 1.0\right)$ (7)
This geometry-based heuristic enables smooth directional adjustments and accommodates mild road curvatures.
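A minimal sketch of this lateral rule is shown below, assuming the waypoint is already expressed in the vehicle's local frame; the function name and the boolean lane-change trigger argument are illustrative.

```python
import math

def rule_steering(x_wp, y_wp, lane_change_triggered=False):
    """Heuristic lateral command following Eq. (7) (illustrative sketch)."""
    if lane_change_triggered:
        return 0.5  # fixed steering command for the right-lane change
    # Pure-pursuit-like angle toward the forward waypoint in the local frame.
    theta = math.atan2(2.0 * y_wp, x_wp ** 2 + y_wp ** 2)
    steer = 4.0 * theta / math.pi
    return max(-1.0, min(1.0, steer))  # clip to [-1, 1]
```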
For longitudinal control, the controller is structured into three operational modes according to the relative distance to the preceding vehicle: free-cruising mode, collision-avoidance mode, and car-following regulation mode. The mathematical formulations corresponding to these modes are given in Equations (8)–(11). When no lead vehicle is detected ($d_{front} = 0$), the vehicle is in free-cruising mode. The control logic adjusts throttle and brake according to the speed error $e_v = v_d - v_{ego}$: if $e_v > 0$, the throttle is given by $throttle = \mathrm{clip}(0.3 + 0.1\, e_v,\ 0.0,\ 0.6)$; otherwise, the brake is applied via $brake = \mathrm{clip}(-0.1\, e_v,\ 0.0,\ 1.0)$.
If the front vehicle distance is below a minimum safety threshold $d_{min}$, the controller enters a collision-avoidance mode and enforces strong braking:
$brake = \mathrm{clip}\!\left(0.6 + 0.4\,(d_{min} - d_{front}),\ 0.0,\ 1.0\right)$ (8)
In the car-following regime, when the spacing $d_{front}$ lies between $d_{min}$ and $d_{max}$, the desired velocity $v_d$ is adjusted as defined in Equation (9):
$v_{follow} = v_d - 0.5\,(d_{max} - d_{front})$ (9)
The throttle command is then computed as a function of the speed error $v_{follow} - v_{ego}$ and the relative speed $\Delta v$, as shown in Equation (10):
$throttle = \mathrm{clip}\!\left(0.3 + 0.1\,(v_{follow} - v_{ego}) - 0.2\,\max(\Delta v, 0),\ 0.0,\ 1.0\right)$ (10)
Furthermore, to enhance safety during lane changes, if a slower vehicle is detected in the target lane, the desired speed is proactively limited by:
$v_d = \min(v_d,\ v_{target})$ (11)
where $v_{target}$ denotes the safe velocity estimated from the target-lane leading vehicle, and $v_d$ represents the nominal desired speed before safety adjustment.
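The three longitudinal modes and the target-lane speed limit of Equations (8)–(11) can be collected into a single rule, as in the following sketch; the threshold values for $d_{min}$ and $d_{max}$ and the mode-selection order are assumptions, not the exact controller parameters.

```python
def rule_longitudinal(v_ego, v_d, d_front, delta_v, v_target=None,
                      d_min=5.0, d_max=15.0):
    """Sketch of the rule-based longitudinal control, Eqs. (8)-(11) (assumed thresholds)."""
    throttle, brake = 0.0, 0.0
    if v_target is not None:
        v_d = min(v_d, v_target)          # Eq. (11): respect the target-lane speed
    if d_front == 0.0:                     # free-cruising: no lead vehicle detected
        e_v = v_d - v_ego
        if e_v > 0:
            throttle = min(max(0.3 + 0.1 * e_v, 0.0), 0.6)
        else:
            brake = min(max(-0.1 * e_v, 0.0), 1.0)
    elif d_front < d_min:                  # collision avoidance, Eq. (8)
        brake = min(max(0.6 + 0.4 * (d_min - d_front), 0.0), 1.0)
    else:                                  # car-following regulation, Eqs. (9)-(10)
        v_follow = v_d - 0.5 * (d_max - d_front)
        throttle = min(max(0.3 + 0.1 * (v_follow - v_ego)
                           - 0.2 * max(delta_v, 0.0), 0.0), 1.0)
    return throttle, brake
```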

3.3. Curriculum-Aware Replay Sampling with Decaying Rule Ratio

To improve sample efficiency and early convergence during policy training, a dual dynamic experience replay mechanism is designed to integrate high-quality prior experiences from the rule-based controller with online experiences collected by the RL agent. The proposed mechanism constructs two separate experience buffers, denoted as the rule buffer $\mathcal{D}_{rule}$ and the agent buffer $\mathcal{D}_{agent}$. Each experience is stored in the form of a five-element tuple $(s_t, a_t, r_t, s_{t+1}, d_t)$, representing the state, action, reward, next state, and termination flag, respectively. During training, the sampling ratio is dynamically adjusted over time. When the current time-step $t$ is less than the predefined warm-up threshold $t_{warmup}$, the agent relies entirely on rule-based experiences, i.e., the sampling strategy is set to $\pi_{sample}(\mathcal{B}) = \mathrm{Uniform}(\mathcal{D}_{rule})$. When $t \ge t_{warmup}$, a hybrid sampling strategy is adopted, where experiences are drawn from both buffers proportionally. The sampling probability of rule-based experiences, denoted as $\rho(t)$, decays linearly over time. It is defined as follows:
$\rho(t) = \max\!\left(\rho_{min},\ \rho_{max} - \frac{t - t_{warmup}}{t_{max} - t_{warmup}}\,(\rho_{max} - \rho_{min})\right)$
Here, $\rho_{max} = 0.9$, $\rho_{min} = 0.3$, and $t_{max}$ denotes the maximum training step. A training batch of size $B$ is constructed at each step. The number of samples drawn from the rule buffer is computed as $N_r = \rho(t)\, B$, where $\rho(t)$ denotes the time-dependent sampling ratio. The remaining $N_a = B - N_r$ samples are drawn from the agent buffer. These two subsets are then concatenated to form a unified training batch:
$\mathcal{B}_t = \mathrm{Sample}(\mathcal{D}_{rule}, N_r) \cup \mathrm{Sample}(\mathcal{D}_{agent}, N_a)$
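A compact sketch of the decaying-ratio batch construction is given below; the buffer interfaces are assumed to be plain Python lists of transition tuples, while the default values mirror the text ($\rho_{max} = 0.9$, $\rho_{min} = 0.3$).

```python
import random

def sample_batch(rule_buffer, agent_buffer, t, batch_size,
                 t_warmup, t_max, rho_max=0.9, rho_min=0.3):
    """Dual-buffer sampling with a linearly decaying rule ratio (sketch).

    Both buffers are assumed to be plain lists of transition tuples and to
    contain at least batch_size elements when sampled.
    """
    if t < t_warmup:
        # Warm-up phase: rely entirely on rule-based experience.
        return random.sample(rule_buffer, batch_size)
    # Linear decay of the rule-experience ratio from rho_max to rho_min.
    frac = (t - t_warmup) / max(t_max - t_warmup, 1)
    rho = max(rho_min, rho_max - frac * (rho_max - rho_min))
    n_rule = int(rho * batch_size)
    n_agent = batch_size - n_rule
    return (random.sample(rule_buffer, n_rule)
            + random.sample(agent_buffer, n_agent))
```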
To further improve the efficiency of experience sampling, a reward-based stratified prioritized sampling mechanism is introduced into the agent experience buffer. Specifically, past transition samples in $\mathcal{D}_{agent}$ are partitioned into subsets based on the immediate reward $r_t$: the high-reward subset $\mathcal{D}_{agent}^{+}$, which contains samples with returns greater than the running average, and the low-reward subset $\mathcal{D}_{agent}^{-}$, which contains the remaining transitions.
For each training step, $N_a$ samples are drawn from $\mathcal{D}_{agent}$. With a fixed probability $p_{high} \in [0.6, 0.9]$, samples are preferentially drawn from $\mathcal{D}_{agent}^{+}$ to improve the quality of learning signals for policy updates. This mechanism introduces no explicit sampling weights or replay prioritization, thereby avoiding bias and instability while enhancing the agent’s sensitivity to high-reward experiences.
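The reward-based stratified draw from the agent buffer could be realized as follows; the running-average split and the choice $p_{high} = 0.7$ are illustrative assumptions consistent with the description above.

```python
import random

def stratified_agent_sample(agent_buffer, n_agent, p_high=0.7):
    """Reward-stratified sampling from the agent buffer (illustrative sketch).

    Each transition is a tuple (s, a, r, s_next, done); the buffer is split
    around the running-average reward into high- and low-reward subsets.
    """
    if not agent_buffer or n_agent <= 0:
        return []
    mean_r = sum(tr[2] for tr in agent_buffer) / len(agent_buffer)
    high = [tr for tr in agent_buffer if tr[2] > mean_r]
    low = [tr for tr in agent_buffer if tr[2] <= mean_r]
    batch = []
    for _ in range(n_agent):
        # Prefer high-reward transitions with probability p_high.
        pool = high if (high and random.random() < p_high) else (low or high)
        batch.append(random.choice(pool))
    return batch
```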
Simulation results demonstrate that this reward-based sampling enhancement significantly improves sample efficiency in early-stage training and enhances policy stability and generalization in complex traffic scenarios.

3.4. Safety Shield for Policy Enforcement

To ensure safe execution of lane change maneuvers in dense traffic, we develop a lightweight collision checking module that predicts the short-term motion of the ego vehicle and surrounding traffic participants under a fixed control action. This module serves as a safety gate that filters potentially dangerous decisions before execution.
Given the ego vehicle’s current state and a candidate action, we assume that both the ego vehicle and surrounding vehicles maintain constant velocity during a prediction horizon T . The ego vehicle’s trajectory is computed using a simplified kinematic model, where the position and orientation evolve as follows:
$x_{ego}(t + \Delta t) = x_{ego}(t) + v \cos(\theta(t))\, \Delta t, \qquad y_{ego}(t + \Delta t) = y_{ego}(t) + v \sin(\theta(t))\, \Delta t, \qquad \theta(t + \Delta t) = \theta(t) + \omega\, \Delta t$
where v is the longitudinal velocity and ω is the yaw rate specified by the action. The above equations are iterated over discrete time steps to simulate the ego trajectory within the horizon T .
Simultaneously, the positions of surrounding vehicles—specifically the target lane front vehicle and the adjacent vehicle—are propagated forward under a constant velocity assumption. For a surrounding vehicle with initial relative position $(x_{rel}, y_{rel})$ and velocity $v_s$, its global position at time $t$ is given by
$x_{surr}(t) = x_{rel} + v_s\, t, \qquad y_{surr}(t) = y_{rel}$
At each predicted time step, we evaluate the Euclidean distance between the ego and each surrounding vehicle:
$d(t) = \sqrt{\big(x_{ego}(t) - x_{surr}(t)\big)^{2} + \big(y_{ego}(t) - y_{surr}(t)\big)^{2}}$
If the distance $d(t)$ falls below a predefined safety threshold $D_{safe}$, the action is marked unsafe. Formally, an action is rejected if:
$\min_{t \in [0, T]} d(t) < D_{safe}$
In our implementation, we also consider the physical dimensions of each vehicle and apply a rectangle-based approximation for collision checking. Specifically, if the predicted rectangular envelopes of two vehicles overlap in both longitudinal and lateral directions, a collision is flagged. Let L and W denote the vehicle length and width, respectively. Then a collision occurs if:
$\left|x_{ego}(t) - x_{surr}(t)\right| < \frac{L_{ego} + L_{surr}}{2} \quad \text{and} \quad \left|y_{ego}(t) - y_{surr}(t)\right| < \frac{W_{ego} + W_{surr}}{2}$
The safety decision is binary: if no collision is predicted throughout the horizon, the action is deemed feasible; otherwise, it is filtered. This module is integrated into our control pipeline as a real-time risk evaluator, ensuring that even aggressive or learning-based policies adhere to minimum safety constraints during interaction with surrounding traffic.
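The safety gate can be summarized by the following sketch, which rolls out the ego vehicle kinematically, propagates surrounding vehicles at constant velocity, and rejects the action if the predicted separation drops below $D_{safe}$. The horizon, time step, and threshold defaults are assumptions, and the rectangle-based overlap test described above is replaced here by the simpler distance check for brevity.

```python
import math

def action_is_safe(ego, surrounding, v, omega,
                   horizon=2.0, dt=0.1, d_safe=3.0):
    """Predictive safety gate for a candidate action (illustrative sketch).

    ego         : (x, y, theta) of the ego vehicle.
    surrounding : list of (x_rel, y_rel, v_s) tuples for nearby vehicles.
    v, omega    : longitudinal speed and yaw rate implied by the action.
    """
    x, y, theta = ego
    t = 0.0
    while t <= horizon:
        # Kinematic rollout of the ego vehicle under constant v and omega.
        x += v * math.cos(theta) * dt
        y += v * math.sin(theta) * dt
        theta += omega * dt
        t += dt
        for x_rel, y_rel, v_s in surrounding:
            # Constant-velocity propagation of the surrounding vehicle.
            x_s = x_rel + v_s * t
            y_s = y_rel
            if math.hypot(x - x_s, y - y_s) < d_safe:
                return False  # predicted separation violates the safety threshold
    return True
```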

4. Training Details

4.1. State and Action

To facilitate autonomous lane-changing in dense traffic scenarios, we construct a structured state space that encodes both the ego vehicle’s dynamic state and critical information from surrounding vehicles. The observation is formulated as a 28-dimensional continuous vector, which includes the ego vehicle’s global position $(x, y)$, yaw angle $\psi$, longitudinal velocity $v_x$, and accelerations $(a_x, a_y)$. In addition, the state vector incorporates the kinematic states of up to three surrounding vehicles: the vehicle ahead in the current lane, the parallel vehicle in the target lane, and the target lane’s leading vehicle. Each surrounding vehicle contributes a 7-dimensional vector, including its own position, orientation, velocity, acceleration, and a relative longitudinal distance to the ego vehicle, computed in the local coordinate frame. A final scalar, denoted as $\Delta y_{parallel}$, measures the lateral offset between the ego vehicle and the parallel vehicle along the global y-axis. Formally, the full state vector is defined as:
$s = \left[\,x,\ y,\ \psi,\ \dot{\psi},\ v_x,\ a_x,\ a_y,\ s_{front},\ s_{parallel},\ s_{target},\ \Delta y_{parallel}\,\right] \in \mathbb{R}^{28}$
where each $s_i \in \mathbb{R}^{7}$ denotes the state of a surrounding vehicle, with the last element being the relative distance $d_i$ computed as $d_i = x_{target} - x_{ego}$ in the ego’s local frame.
The action space is constructed as a 2-dimensional continuous vector output by the reinforcement learning policy, denoted by $a_{RL} \in \mathbb{R}^{2}$, where $a_{RL} \in [-1, 1] \times [-1, 1]$. The first dimension $a_{steer}$ directly corresponds to the steering control. The second dimension $a_{accel\_brake}$ represents a unified longitudinal control signal. If $a_{accel\_brake} > 0$, it is interpreted as throttle input, and brake is set to zero; if $a_{accel\_brake} < 0$, throttle is set to zero and the brake intensity is $|a_{accel\_brake}|$. This design can be compactly expressed by the following mapping:
$steer = a_{steer}, \qquad throttle = \max(0,\ a_{accel\_brake}), \qquad brake = \max(0,\ -a_{accel\_brake})$
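In code, this mapping from the two-dimensional policy output to the vehicle control interface reduces to a few lines; the function name is illustrative.

```python
def to_vehicle_control(a_steer, a_accel_brake):
    """Map the 2-D policy output in [-1, 1]^2 to (steer, throttle, brake)."""
    steer = float(a_steer)
    throttle = max(0.0, float(a_accel_brake))   # positive part drives the throttle
    brake = max(0.0, -float(a_accel_brake))     # negative part drives the brake
    return steer, throttle, brake
```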

4.2. Reward Function

To achieve safe and efficient mandatory lane changes, a composite reward function is designed in this work, integrating both discrete and continuous components. At each time step, the total reward is calculated as the summation of individual reward terms. Discrete rewards and penalties are employed to enforce basic behavioral constraints such as avoiding collisions, staying within legal lanes, and maintaining forward motion. Continuous reward terms, on the other hand, are introduced to guide fine-grained control over spatial positioning, vehicle attitude, and task completion quality. The discrete reward components and their corresponding triggering conditions are summarized in Table 1.
It is worth noting that the magnitudes of the discrete reward components are designed based on relative priority rather than absolute scaling. Safety-critical events such as collision, off-road driving, and illegal lane occupancy are assigned large negative penalties (−10.0) to dominate the reward structure and prevent unsafe exploration. In contrast, behavior-shaping rewards such as forward driving and lane-change initiation are assigned smaller positive values (+2.0) to encourage efficient maneuvering without overriding safety constraints. These values were empirically tuned to ensure stable training convergence while preserving a clear safety–efficiency hierarchy.
In addition to the discrete terms shown in Table 1, continuous reward components are defined to guide smooth and accurate control toward the goal lane. To encourage the vehicle to approach the center of the target lane, a proximity reward is introduced:
$r_7 = 10 \times \left(1 - e^{-5 d_x}\right), \qquad d_x = \frac{495 - x}{145}$
where $x$ denotes the current longitudinal position of the vehicle, and $d_x \in [0, 1]$ is a normalized distance to the target zone, computed over a 145-m reference segment. The exponential form is adopted to provide stronger gradient signals when the vehicle is far from the goal, while gradually saturating as it approaches the target region. The scaling factor (10) is introduced to balance this term with other reward components and ensure numerical stability during training.
To maintain proper alignment in both lateral position and heading angle during the lane change, an alignment reward is formulated using a product of two Gaussian functions:
$r_8 = 5 \exp\!\left(-\frac{e_y^{2}}{2\sigma_y^{2}}\right) \exp\!\left(-\frac{e_\psi^{2}}{2\sigma_\psi^{2}}\right)$
Here, $e_y = y + 14$ represents the lateral deviation from the center of the target lane (in meters), and $\sigma_y = 0.5$ controls the lateral tolerance. $e_\psi$ is the heading angle error relative to the goal orientation of 180° (in degrees), and $\sigma_\psi = 10$ specifies the heading tolerance. This reward reaches its maximum only when both position and heading are well aligned, promoting stability during the lane change process.
To assess the quality of task completion near the destination, a final alignment reward is defined as follows:
$r_9 = \begin{cases} \max\!\left(0.5,\ \min\!\left(2.5,\ -0.04\,|x - 300| + 2.5\right)\right), & \text{if } |y + 14| \le 3,\ |e_\psi| \le 20,\ v \ge 2 \\ 0.0, & \text{otherwise} \end{cases}$
This term is only activated when the vehicle’s lateral deviation is less than 3 m, the heading angle error is within 20°, and the forward speed exceeds 2 m/s. The threshold of 3 m is selected considering the typical lane width in the simulation environment (approximately 3.5 m), ensuring that the terminal reward is activated only when the vehicle is sufficiently centered within the target lane. This prevents premature reward activation during partial or incomplete lane-change maneuvers. The reward linearly decreases with the absolute error in the x-position relative to the goal (set at x = 300 ), thereby promoting precise stopping behavior at the target location.
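A minimal sketch of the continuous reward terms $r_7$–$r_9$ is given below, assuming the heading error is provided in degrees and clipping $d_x$ to $[0, 1]$ as an added safeguard; the constants follow the text (goal at $x = 300$, target-lane center at $y = -14$).

```python
import math

def continuous_rewards(x, y, heading_err_deg, v):
    """Continuous reward terms r7-r9 as described above (illustrative sketch)."""
    # r7: progress/proximity shaping over the 145 m reference segment.
    d_x = min(max((495.0 - x) / 145.0, 0.0), 1.0)
    r7 = 10.0 * (1.0 - math.exp(-5.0 * d_x))

    # r8: Gaussian alignment reward in lateral offset and heading.
    e_y = y + 14.0
    r8 = (5.0 * math.exp(-e_y ** 2 / (2.0 * 0.5 ** 2))
              * math.exp(-heading_err_deg ** 2 / (2.0 * 10.0 ** 2)))

    # r9: terminal alignment reward, active only near the goal configuration.
    if abs(y + 14.0) <= 3.0 and abs(heading_err_deg) <= 20.0 and v >= 2.0:
        r9 = max(0.5, min(2.5, -0.04 * abs(x - 300.0) + 2.5))
    else:
        r9 = 0.0
    return r7, r8, r9
```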

5. Simulation

5.1. Traffic Scenario and Vehicle Modeling

In interactive low-speed traffic scenarios (such as roads in industrial parks, ports, or mining areas), accurately modeling surrounding vehicle behavior is essential for constructing representative simulation environments. Unlike highway scenarios, behavioral differences in low-speed settings are primarily reflected in drivers’ willingness to yield to merging vehicles and in the degree of spacing compression during interaction, rather than in intense longitudinal acceleration and deceleration. Therefore, establishing a vehicle behavior model capable of characterizing different interaction intensities is fundamental for evaluating the robustness of autonomous driving decision-making strategies.
Among various car-following models, the Intelligent Driver Model (IDM) is widely adopted due to its clear structure and strong physical interpretability. The longitudinal acceleration of the classical IDM is expressed as
$a = a_{max}\left[1 - \left(\frac{v}{v_0}\right)^{4} - \left(\frac{s^{*}(v, \Delta v)}{s}\right)^{2}\right]$
where $v$ denotes the current vehicle speed, $v_0$ is the desired speed, $\Delta v$ represents the relative speed with respect to the leading vehicle, and $s$ is the actual inter-vehicle spacing. The desired dynamic spacing is defined as
$s^{*}(v, \Delta v) = s_0 + vT + \frac{v\,\Delta v}{2\sqrt{a_{max}\, b}}$
where $s_0$ denotes the minimum static spacing, $T$ is the desired time headway, $a_{max}$ is the maximum acceleration, and $b$ is the comfortable deceleration. This model balances free-flow motion and car-following braking behavior and demonstrates good capability in reproducing homogeneous traffic flow dynamics.
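The classical IDM acceleration can be coded directly from the two equations above; the parameter defaults in this sketch are illustrative and are not the values used in the simulation.

```python
import math

def idm_acceleration(v, v0, s, delta_v, a_max=1.5, b=2.0, s0=2.0, T=1.5):
    """Classical IDM longitudinal acceleration (illustrative parameter defaults)."""
    # Desired dynamic spacing s*(v, delta_v).
    s_star = s0 + v * T + v * delta_v / (2.0 * math.sqrt(a_max * b))
    # Free-flow term minus interaction term.
    return a_max * (1.0 - (v / v0) ** 4 - (s_star / max(s, 1e-3)) ** 2)
```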
However, in scenarios involving merging and lane-changing interactions, the classical IDM assumes homogeneous and passive responses from all vehicles and cannot capture strategic differences among drivers during interaction. In low-speed environments, vehicle interactions mainly manifest as active yielding behavior or competitive spacing compression. Therefore, it is necessary to introduce a merging-awareness mechanism into the IDM framework to construct heterogeneous interaction behavior patterns.
To this end, a merging indicator variable $I_{merge}$ is introduced into the IDM framework, which takes the value 1 when a merging maneuver is detected and 0 otherwise. Based on this mechanism, two behavior modes are constructed: conservative driving and aggressive driving.
In the conservative mode, vehicles tend to yield and moderately increase their desired spacing. The acceleration is formulated as
$a = a_{max}\left[1 - \left(\frac{v}{v_0}\right)^{4} - \left(\frac{s_c^{*}}{s}\right)^{2}\right] + \zeta$
where ζ is a small stochastic disturbance term used to simulate natural fluctuations in human driving behavior. The corresponding desired spacing is defined as
$s_c^{*} = s_0 + vT + \frac{v\,\Delta v}{2\sqrt{a_{max}\, b}} + \lambda_1 s_0 I_{merge}$
where $\lambda_1$ is a dimensionless yielding coefficient. By proportionally enlarging the static spacing $s_0$, it increases the desired spacing while ensuring unit consistency and physical plausibility.
In the aggressive mode, vehicles tend to compress the desired spacing when a merging maneuver is detected in order to reflect competitive behavior. The acceleration is expressed as
$a = a_{max}\left[1 - \left(\frac{v}{v_0}\right)^{4} - \left(\frac{s_a^{*}}{s}\right)^{2}\right]$
The corresponding desired spacing is defined as
$s_a^{*} = \max\!\left(s_{min},\ s_0 + vT + \frac{v\,\Delta v}{2\sqrt{a_{max}\, b}} - \lambda_2 s_0 I_{merge}\right)$
where $s_{min}$ is the safety lower bound used to prevent unrealistically small spacing.
In this study, $\lambda_1$ and $\lambda_2$ are defined as dimensionless behavioral coefficients used to characterize interaction intensity, rather than as simulation variables for parameter sweeping. Specifically, the conservative driving mode adopts $\lambda_1 = 0.2$, indicating that when a merging maneuver is detected, the vehicle proportionally increases the desired spacing relative to $s_0$, thereby reflecting yielding behavior in low-speed scenarios. The aggressive driving mode adopts $\lambda_2 = 0.7$, indicating that the vehicle significantly compresses the desired spacing during interaction to reflect competitive behavior, while the compression is constrained by the lower bound $s_{min}$ to ensure basic safety. These parameter values are determined through preliminary tuning to produce clear behavioral differentiation under low-speed conditions while maintaining traffic flow stability and physical plausibility. The focus of this study is to evaluate autonomous driving decision-making performance under different interaction intensities; therefore, no systematic sensitivity analysis of these parameters is conducted.
Through the above extension, the model is capable of generating continuous behavioral differences ranging from cooperative yielding to competitive spacing compression in low-speed interaction scenarios, thereby providing a controllable traffic environment for robustness evaluation of reinforcement learning–based decision-making policies.
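The merging-aware extension for the two behavior modes can be sketched as follows; $\lambda_1 = 0.2$ and $\lambda_2 = 0.7$ follow the text, while the noise scale, the $s_{min}$ value, and the remaining parameter defaults are illustrative assumptions.

```python
import math
import random

def merging_aware_spacing(v, delta_v, merge_detected, mode,
                          s0=2.0, T=1.5, a_max=1.5, b=2.0,
                          lam1=0.2, lam2=0.7, s_min=1.0):
    """Desired spacing s* for conservative vs. aggressive vehicles (sketch)."""
    i_merge = 1.0 if merge_detected else 0.0
    s_star = s0 + v * T + v * delta_v / (2.0 * math.sqrt(a_max * b))
    if mode == "conservative":
        return s_star + lam1 * s0 * i_merge          # yield: enlarge spacing
    return max(s_min, s_star - lam2 * s0 * i_merge)  # compete: compress spacing

def conservative_acceleration(v, v0, s, delta_v, merge_detected,
                              a_max=1.5, noise=0.05):
    """Conservative-mode acceleration with a small stochastic disturbance zeta."""
    s_c = merging_aware_spacing(v, delta_v, merge_detected, "conservative")
    zeta = random.gauss(0.0, noise)
    return a_max * (1.0 - (v / v0) ** 4 - (s_c / max(s, 1e-3)) ** 2) + zeta
```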

5.2. Parameter Settings

All simulations and training procedures were conducted in a Python-based environment using Python 3.10.16, CARLA 0.9.15, PyTorch 2.7.1, and CUDA 12.8. The system ran on Windows 11 and was executed on a workstation equipped with an NVIDIA RTX 5070 Ti GPU (16 GB VRAM, NVIDIA Corporation, Santa Clara, CA, USA) and 32 GB RAM.
To evaluate autonomous driving strategies in a high-fidelity setting, a multi-lane traffic scenario was constructed in CARLA, where the ego vehicle performs a right-lane change maneuver. The scenario includes essential traffic participants such as a lead vehicle in the ego lane and both leading and following vehicles in the target lane. Control and perception parameters were systematically configured to ensure behavioral consistency and physical realism. A summary of simulation settings and training parameters is presented in Table 2.

5.3. Simulation Cases

To evaluate the performance of the reinforcement learning policy in typical low-speed autonomous driving tasks, two representative scenarios are employed for model training and validation: a Regular Lane Change and a Lane Merging Scenario. In both scenarios, the initial spacing between surrounding vehicles is randomly set within the range of 7 to 13 m to simulate realistic traffic density and ensure sufficient variability in vehicle interactions.

5.3.1. Case 1: Regular Lane Change (RLC)

In this study, a standard two-lane lane-changing scenario is constructed as a baseline environment to evaluate the performance of the decision-making model under typical traffic conditions. The scenario consists of a straight two-lane road without lane narrowing, merging zones, or significant traffic disturbances. Although lane changes can be executed smoothly, the target lane maintains a moderate traffic density, resulting in relatively low speeds after the maneuver. Speed fluctuations remain mild, and overall driving behavior is stable, providing a controlled setting for assessing the trajectory smoothness and response consistency of lane-changing policies under near-ideal conditions. The structure of this scenario is illustrated in Figure 2.

5.3.2. Case 2: Lane Merging Scenario (LMS)

A more complex merging scenario is designed to evaluate the robustness of the lane-changing strategy under dynamic and congested traffic conditions. The ego vehicle must merge into a densely occupied target lane, where vehicles from adjacent lanes or ramps compete for limited space. Frequent yielding interactions near the merging point lead to sudden deceleration and stop-and-go behavior after merging, resulting in pronounced speed fluctuations. This scenario poses significant challenges for maintaining safety margins and smooth longitudinal control, and serves as a critical benchmark for assessing the adaptability and safety-awareness of the proposed model. The scenario configuration is illustrated in Figure 3.

6. Results and Discussion

To validate the effectiveness of the proposed method, we compare it with the standard Soft Actor–Critic (SAC) algorithm as well as the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm. In the primary evaluation setting (used for method comparison and ablation studies), the ego vehicle’s initial longitudinal position is randomly sampled within a predefined range, resulting in varying maneuvering distances to the target or merging point. To further assess the contributions of individual components, ablation studies are conducted to examine the effects of the rule-guided module and the safety constraint module on overall performance. Specifically, comparison models are constructed by selectively removing each module and are trained under the same environmental settings, enabling a quantitative evaluation of their roles in improving learning efficiency and decision-making safety.

6.1. Quantitative Analysis

For each method, we adopt an evaluation protocol where the trained model is tested over 100 episodes per trial, and the number of successful arrivals at the target lane is recorded. This process is repeated across 10 independent trials, and the average success rate is reported.
As shown in Table 3, in the regular lane change scenario, the success rate increases steadily across the evaluated methods. The baseline SAC achieves a success rate of 78.58%, while DSAC-T improves the performance to 83.56%, corresponding to a gain of 4.98%. By incorporating rule-guided learning, Rule-Guidance + DSAC-T further enhances the success rate to 86.11%. Finally, the proposed Rule-Guidance + DSAC-T + Safe Aware method achieves the highest success rate of 88.45%, representing an overall improvement of 9.87% compared to the baseline SAC.
In the more challenging lane merging scenario, overall success rates decrease due to intensified vehicle interactions at merge points, as summarized in Table 3. The baseline SAC achieves a success rate of 75.35%, while DSAC-T slightly underperforms at 74.26%, indicating reduced adaptability in highly interactive traffic environments. Introducing rule-guided learning leads to a noticeable improvement, with Rule-Guidance + DSAC-T reaching 77.36%. The best performance is achieved by Rule-Guidance + DSAC-T + Safe Aware, which attains a success rate of 79.45%, demonstrating superior robustness and safety under complex merging conditions characterized by frequent speed fluctuations and increased rear-end collision risks.
To evaluate the robustness of different lane-changing strategies under varying surrounding traffic behaviors, a fixed scenario template is adopted in which the ego vehicle starts from a constant longitudinal position at a fixed distance to the target or merging point. Three traffic configurations are simulated, with aggressive driving vehicles accounting for 0.2, 0.4, and 0.7 of the traffic and the remaining vehicles following conservative driving patterns. As shown in Table 4, the success rates of all methods decrease as the proportion of aggressive vehicles increases, indicating the growing difficulty of lane changing in more hostile traffic environments. The proposed method consistently outperforms SAC and TD3 across all scenarios. Specifically, when the aggressiveness ratio is 0.2, the proposed method achieves a success rate of 88.21%, compared to 84.31% for SAC and 72.65% for TD3. Even under the most challenging condition with an aggressiveness ratio of 0.7, the proposed method remains significantly superior to SAC and TD3. In highly aggressive traffic, surrounding vehicles tend to actively accelerate and reduce gaps upon detecting the ego vehicle’s lane-change intention, substantially limiting feasible insertion opportunities. Under such conditions, SAC and TD3 frequently suffer from collision-induced failures due to insufficient risk anticipation, or become overly conservative and remain stationary for extended periods, eventually exceeding the maximum number of steps allowed in a single episode and being classified as failures. In contrast, the proposed method effectively suppresses high-risk actions while avoiding unnecessary waiting, demonstrating superior robustness and safety in complex, high-conflict lane-changing scenarios.
To examine the internal learning behavior of the proposed method, we further analyze the training dynamics under a fixed aggressive-vehicle ratio of 0.3. Specifically, Figure 4, Figure 5, Figure 6 and Figure 7 illustrate the evolution of entropy regulation, value learning, and uncertainty-related variables during training. This analysis provides complementary insights into the optimization stability and learning characteristics of the proposed framework under a representative traffic condition.
As shown in Figure 4, the temperature coefficient α and the policy entropy both decrease gradually and converge during training. This behavior indicates that the automatic entropy adjustment mechanism enables strong exploration in the early stage and progressively shifts the policy toward more deterministic actions as learning proceeds, which is consistent with widely adopted maximum-entropy reinforcement learning frameworks.
Figure 4. Evolution of the entropy-related variables during training. (a) Adaptive temperature coefficient α; (b) policy entropy. These curves illustrate the automatic entropy regulation process, which balances exploration and exploitation throughout training.
The learning dynamics of the value function are illustrated in Figure 5, where Figure 5a,b report the average Q-values estimated by Critic 1 and Critic 2, respectively. Both critics exhibit a rapid increase in Q-values during the early training phase, followed by stable convergence, while remaining closely aligned throughout the training process. This consistency confirms that the double-critic architecture effectively mitigates value overestimation and ensures stable value learning under the stochastic Q formulation.
Figure 5. Learning dynamics of the value function estimated by the double-critic architecture. (a) Average Q-value predicted by Critic 1; (b) average Q-value predicted by Critic 2. The close alignment between the two critics indicates stable value learning and effective mitigation of overestimation bias.
Figure 6 presents the evolution of Q-value uncertainty predicted by the stochastic critics. As shown in Figure 6a,b, the average predicted uncertainty decreases sharply at the beginning of training and then stabilizes, indicating that the critics gradually form reliable estimates of the return distribution as experience accumulates. Meanwhile, the minimum predicted uncertainty in Figure 6c,d remains bounded without collapse or divergence, demonstrating that the uncertainty estimation branch maintains numerical stability throughout training.
Figure 6. Uncertainty estimation behavior of the stochastic critics during training. (a) Average Q-value uncertainty predicted by Critic 1; (b) average Q-value uncertainty predicted by Critic 2; (c) minimum Q-value uncertainty of Critic 1; (d) minimum Q-value uncertainty of Critic 2. These results demonstrate stable uncertainty modeling without collapse.
The global uncertainty statistics introduced in DSAC-T are depicted in Figure 7. Specifically, Figure 7a,b show the exponential moving average of the predicted Q-value uncertainty for the two critics. This global uncertainty measure decays rapidly in the early stage and converges smoothly thereafter, providing a stable reference for uncertainty-aware loss reweighting. By attenuating the influence of highly uncertain samples during critic updates, this mechanism enhances robustness against noisy value targets and constitutes a key distinction between DSAC-T and conventional SAC-based methods.
Figure 7. Global uncertainty statistics used for uncertainty-aware critic reweighting in DSAC-T. (a) Exponential moving average of Q-value uncertainty for Critic 1; (b) exponential moving average of Q-value uncertainty for Critic 2. The global uncertainty measures provide a stable reference for adaptive loss weighting.

6.2. Qualitative Analysis

To facilitate qualitative analysis, the trajectory and speed curves in the figures have been smoothed by averaging every 10 steps. The reward and front distance values are presented without any post-processing. Figure 8 and Figure 9 present the evaluation results under model evaluation mode for the Regular Lane Change (RLC) and Lane Merging Scenario (LMS), respectively. From the step–velocity plots, it can be observed that once the ego vehicle detects favorable conditions for lane changing, it quickly accelerates and steers into the target lane. This behavior improves the likelihood of a successful merge and reduces disruption to traffic flow in the original lane.
In the LMS scenario, after the vehicle enters the target lane, frequent speed fluctuations occur due to interactions with vehicles ahead in the merging area. This is a typical characteristic of the lane merging setting and is not commonly observed in the RLC scenario.
Regarding the reward function, the ego vehicle receives a baseline reward while driving in the original lane. As it begins to change lanes and its center of mass approaches the centerline of the target lane, the reward increases continuously, guiding the vehicle to merge smoothly. After completing the merge, the vehicle’s heading gradually aligns with the target lane direction, further increasing the reward and encouraging stable lane-keeping.
The “Front Distance” curve at the bottom of the figure shows the distance between the ego vehicle and the vehicle ahead. A value of zero indicates no vehicle in front. During the merging phase, this distance often drops to zero, indicating a clear lane and a safe merging condition. After merging, the distance remains relatively stable, reflecting safe and consistent car-following behavior.

7. Conclusions

This study proposes a rule-guided and safety-constrained distributional reinforcement learning framework for autonomous lane changing in congested traffic environments, enabling stable and safe policy learning. Simulation results demonstrate that under highly aggressive and adversarial traffic conditions, the proposed method exhibits a clear performance advantage over baseline approaches. Specifically, in scenarios with an aggressiveness ratio of 0.7, the proposed method improves the lane-changing success rate by 17.13% compared to SAC and by 10.49% compared to TD3. These results indicate that incorporating rule-guided learning and safety-aware action constraints is particularly effective in suppressing collision-induced failures and enhancing robustness in high-conflict, safety-critical lane-changing scenarios.
Although the simulation results demonstrate the effectiveness of the proposed method, certain limitations remain. This study was conducted based on the CARLA simulation platform. While the designed scenarios aim to approximate complex low-speed traffic environments, discrepancies may still exist between simulation and real-world traffic conditions. In addition, surrounding vehicle behaviors are modeled using a parameterized extension of the IDM framework. Although this approach enables the representation of different interaction intensities, it may not fully capture the complexity of real human driving behavior. Future work will focus on collecting real-world surrounding vehicle data under dense traffic conditions through field vehicle experiments, with the objective of constructing a more realistic and representative surrounding vehicle behavior model to further enhance the practical applicability and generalization capability of the proposed method.

Author Contributions

Conceptualization, S.C., Y.S. and J.H.; Methodology, S.C., Y.S., K.C. and H.L. (Huiqian Li); Software, H.L. (Hao Li); Validation, H.L. (Hao Li); Formal analysis, S.C.; Investigation, S.C. and H.L. (Hao Li); Resources, Y.S.; Writing—original draft, Y.S.; Writing—review & editing, Y.S. and J.H.; Supervision, J.H.; Project administration, J.H.; Funding acquisition, Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by the Anhui Provincial Major Science and Technology Special Project (Grant No. 202423d120500072); in part by the National Natural Science Foundation of China (NSFC) (Grant No. 52442211); in part by the Independent Research Project of the State Key Laboratory of Intelligent Green Vehicle and Mobility, Tsinghua University (Grant No. ZZ-PY-20250302); and in part by the China Postdoctoral Science Foundation (Grant No. GZC20250902).

Data Availability Statement

The data supporting the findings of this study are available from the corresponding author upon reasonable request. Requests for access to the data should be directed to 20230103030@stdmail.gxust.edu.cn.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
SAC: Soft Actor–Critic
DSAC-T: Distributional Soft Actor–Critic with Three Refinements
LMS: Lane Merging Scenario
RLC: Regular Lane Change

References

1. Johnson, C.B.J. Car Accident Statistics for 2025. 2024. Available online: https://vegasvalleylaw.com/blog/car-crash-statistics/ (accessed on 15 January 2026).
2. Zhao, N.; Zhang, J.; Wang, B.; Lu, Y.; Zhang, K.; Su, R. A Data-Driven Long-Term Prediction Method of Mandatory and Discretionary Lane Change Based on Transformer. In 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC); IEEE: New York, NY, USA, 2023; pp. 2390–2395.
3. Huang, Y.; Gu, Y.; Yuan, K.; Yang, S.; Liu, T.; Chen, H. Human Knowledge Enhanced Reinforcement Learning for Mandatory Lane-Change of Autonomous Vehicles in Congested Traffic. IEEE Trans. Intell. Veh. 2024, 9, 3509–3519.
4. Guo, J.; Harmati, I. Lane-changing decision modelling in congested traffic with a game theory-based decomposition algorithm. Eng. Appl. Artif. Intell. 2022, 107, 104530.
5. Chakraborty, S.; Cui, L.; Ozbay, K.; Jiang, Z.-P. Automated lane changing control in mixed traffic: An adaptive dynamic programming approach. Transp. Res. Part B Methodol. 2024, 187, 103026.
6. He, S.; Zeng, J.; Zhang, B.; Sreenath, K. Rule-Based Safety-Critical Control Design Using Control Barrier Functions with Application to Autonomous Lane Change. In 2021 American Control Conference (ACC); IEEE: New York, NY, USA, 2021.
7. Cao, W.; Zhao, H. Lane change algorithm using rule-based control method based on look-ahead concept for the scenario when emergency vehicle approaching. Artif. Life Robot. 2022, 27, 818–827.
8. Asano, S.; Ishihara, S. Safe, Smooth, and Fair Rule-Based Cooperative Lane Change Control for Sudden Obstacle Avoidance on a Multi-Lane Road. Appl. Sci. 2022, 12, 8528.
9. Qin, Z.; Ji, A.; Sun, Z.; Wu, G.; Hao, P.; Liao, X. Game Theoretic Application to Intersection Management: A Literature Review. IEEE Trans. Intell. Veh. 2024, 1–19.
10. Elvik, R. A review of game-theoretic models of road user behaviour. Accid. Anal. Prev. 2014, 62, 388–396.
11. Ji, A.; Levinson, D. A review of game theory models of lane changing. Transp. A Transp. Sci. 2020, 16, 1628–1647.
12. Ali, Y.; Zheng, Z.; Haque, M.M.; Wang, M. A game theory-based approach for modelling mandatory lane-changing behaviour in a connected environment. Transp. Res. Part C Emerg. Technol. 2019, 106, 220–242.
13. Lopez, V.G.; Lewis, F.L.; Liu, M.; Wan, Y.; Nageshrao, S.; Filev, D. Game-Theoretic Lane-Changing Decision Making and Payoff Learning for Autonomous Vehicles. IEEE Trans. Veh. Technol. 2022, 71, 3609–3620.
14. Zare, M.; Kebria, P.M.; Khosravi, A.; Nahavandi, S. A Survey of Imitation Learning: Algorithms, Recent Developments, and Challenges. IEEE Trans. Cybern. 2024, 54, 7173–7186.
15. Le Mero, L.; Yi, D.; Dianati, M.; Mouzakitis, A. A Survey on Imitation Learning Techniques for End-to-End Autonomous Vehicles. IEEE Trans. Intell. Transp. Syst. 2022, 23, 14128–14147.
16. Guo, L.; Liu, X. Lane-Changing Decisions Making for Autonomous Vehicles via Behavior Cloning and Decision Tree. In 2023 China Automation Congress (CAC); IEEE: New York, NY, USA, 2023; pp. 8648–8652.
17. Xiao, D.; Wang, B.; Sun, Z.; He, X. Behavioral Cloning Based Model Generation Method for Reinforcement Learning. In 2023 China Automation Congress (CAC); IEEE: New York, NY, USA, 2023; pp. 6776–6781.
18. Zhao, R.; Li, Y.; Fan, Y.; Gao, F.; Tsukada, M.; Gao, Z. A Survey on Recent Advancements in Autonomous Driving Using Deep Reinforcement Learning: Applications, Challenges, and Solutions. IEEE Trans. Intell. Transp. Syst. 2024, 25, 19365–19398.
19. Ladosz, P.; Weng, L.; Kim, M.; Oh, H. Exploration in deep reinforcement learning: A survey. Inf. Fusion 2022, 85, 1–22.
20. Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. Deep Reinforcement Learning: A Brief Survey. IEEE Signal Process. Mag. 2017, 34, 26–38.
21. Balhara, S.; Gupta, N.; Alkhayyat, A.; Bharti, I.; Malik, R.Q.; Mahmood, S.N.; Abedi, F. A survey on deep reinforcement learning architectures, applications and emerging trends. IET Commun. 2022, 19, e12447.
22. Zhao, L.; Farhi, N.; Christoforou, Z.; Haddadou, N. Imitation of Real Lane-Change Decisions Using Reinforcement Learning. IFAC-PapersOnLine 2021, 54, 203–209.
23. Sharma, A.K.; Choudhary, A.; Chaudhary, R.; Bhardwaj, A.; Aslam, A.M. Adaptive Trajectory Planning in Autonomous Vehicles: A Hierarchical Reinforcement Learning Approach with Soft Actor-Critic. In 2024 IEEE International Conference on Advanced Networks and Telecommunications Systems (ANTS); IEEE: New York, NY, USA, 2024; pp. 1–6.
24. Liu, J.; Feng, Y.; Jing, S.; Hui, F. Deep Reinforcement Learning-Based Lane-Changing Trajectory Planning for Connected and Automated Vehicles. In 2023 9th International Conference on Mechanical and Electronics Engineering (ICMEE); IEEE: New York, NY, USA, 2023; pp. 406–412.
25. Katzilieris, K.; Kampitakis, E.; Vlahogianni, E.I. Dynamic Lane Reversal: A reinforcement learning approach. In 2023 8th International Conference on Models and Technologies for Intelligent Transportation Systems (MT-ITS); IEEE: New York, NY, USA, 2023; pp. 1–6.
26. Liu, Z. Learning Personalized Discretionary Lane-Change Initiation for Fully Autonomous Driving Based on Reinforcement Learning. In 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC); IEEE: Toronto, ON, Canada, 2020.
27. Haarnoja, T.; Zhou, A.; Hartikainen, K.; Tucker, G.; Ha, S.; Tan, J.; Kumar, V.; Zhu, H.; Gupta, A.; Abbeel, P.; et al. Soft Actor-Critic Algorithms and Applications. arXiv 2018, arXiv:1812.05905.
28. Lim, D.; Joe, I. A DRL-Based Task Offloading Scheme for Server Decision-Making in Multi-Access Edge Computing. Electronics 2023, 12, 3882.
29. Duan, J.; Wang, W.; Xiao, L.; Gao, J.; Li, S.E.; Liu, C. Distributional Soft Actor-Critic With Three Refinements. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 3935–3946.
Figure 1. Framework of the Proposed Method.
Figure 2. Regular Lane Change (RLC).
Figure 3. Lane Merging Scenario (LMS).
Figure 8. Example execution of the learned policy in the Regular Lane Change (RLC) scenario. From top to bottom: lane change trajectory, ego vehicle velocity, reward evolution, and front vehicle distance over time steps.
Figure 9. Example execution of the learned policy in the Lane Merging Scenario (LMS). From top to bottom: lane change trajectory, ego vehicle velocity, reward evolution, and front vehicle distance over time steps.
Table 1. Reward function components and their functional interpretation.
ID | Description | Triggering Condition | Value | Design Purpose
R1 | Forward driving | 0 < v ≤ v_desired | +2.0 | Encourages an appropriate driving speed
R2 | Collision penalty | A collision is detected | −10.0 | Strongly penalizes collisions
R3 | Off-road penalty | The vehicle moves outside the drivable area | −10.0 | Ensures the vehicle stays on the road
R4 | Illegal lane penalty | The current lane ID is −5, −6, or −7 | −10.0 | Prevents illegal or oncoming-lane driving
R5 | Initiate lane change | In lane ID = −3 with 0 < steer < 0.7 and v > 0.1 | +2.0 | Discourages driving on emergency lanes
R6 | Static steering penalty | Vehicle speed < 0.05 with non-zero steering | −2.0 | Discourages turning the wheel while stationary
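A minimal sketch of how the reward terms in Table 1 could be combined into a single scalar is given below; the state field names and the exact triggering logic are assumptions made for illustration, not the paper's code.

```python
from types import SimpleNamespace


def compute_reward(state):
    """Illustrative combination of the reward terms in Table 1 (field names assumed)."""
    reward = 0.0
    if 0.0 < state.speed <= state.v_desired:              # R1: forward driving
        reward += 2.0
    if state.collision:                                    # R2: collision penalty
        reward -= 10.0
    if state.off_road:                                     # R3: off-road penalty
        reward -= 10.0
    if state.lane_id in (-5, -6, -7):                      # R4: illegal lane penalty
        reward -= 10.0
    if state.lane_id == -3 and 0.0 < state.steer < 0.7 and state.speed > 0.1:
        reward += 2.0                                      # R5: initiate lane change
    if state.speed < 0.05 and state.steer != 0.0:          # R6: static steering penalty
        reward -= 2.0
    return reward


# Example: a collision-free step at moderate speed in lane -3 while steering
# collects R1 + R5 = +4.0.
step = SimpleNamespace(speed=5.0, v_desired=8.0, collision=False, off_road=False,
                       lane_id=-3, steer=0.3)
print(compute_reward(step))  # -> 4.0
```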
Table 2. Simulation Environment and Reinforcement Learning Parameters.
Parameter | Value | Description
Time step | 10 Hz | Fixed simulation update interval
Max steps per episode | 1000 | Maximum duration of each episode
Traffic vehicle distribution | 7–13 m | Vehicles positioned on both the ego and target lanes
Max ego vehicle spawn attempts | 100 | Avoids repeated failures during ego spawning
Replay buffer size | 1,000,000 | Maximum number of stored transitions
Batch size | 512 | Mini-batch size for gradient updates
Discount factor (γ) | 0.99 | Temporal reward decay
Soft update coefficient (τ) | 5 × 10⁻³ | Used for updating the target networks
Policy learning rate | 1 × 10⁻⁴ | Learning rate of the actor network
Q-network learning rate | 1 × 10⁻⁴ | Learning rate of the critic networks
Entropy temperature learning rate | 1 × 10⁻⁴ | Learning rate for the entropy coefficient α
Initial entropy coefficient (α) | 0.2 | Controls the entropy regularization strength
Policy update delay | 2 steps | Actor is updated every 2 critic updates
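The parameters in Table 2 can be collected in a single configuration object, as sketched below; the dictionary keys are illustrative assumptions chosen for readability and do not correspond to a specific library's API.

```python
# Hypothetical configuration mirroring Table 2 (key names are assumptions).
DSAC_T_CONFIG = {
    "sim_frequency_hz": 10,          # fixed simulation update interval
    "max_steps_per_episode": 1000,
    "replay_buffer_size": 1_000_000,
    "batch_size": 512,
    "gamma": 0.99,                   # discount factor
    "tau": 5e-3,                     # soft target-network update coefficient
    "actor_lr": 1e-4,
    "critic_lr": 1e-4,
    "alpha_lr": 1e-4,                # entropy temperature learning rate
    "init_alpha": 0.2,               # initial entropy coefficient
    "policy_update_delay": 2,        # actor updated every 2 critic updates
}
```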
Table 3. Success Rate Comparison under Different Lane-Change Scenarios.
Method | Regular Lane Change (RLC) | Lane Merging Scenario (LMS)
SAC | 78.58% | 75.35%
DSAC-T | 83.56% | 74.26%
Rule-Guidance + DSAC-T | 86.11% | 77.36%
Rule-Guidance + DSAC-T + Safe Aware | 88.45% | 79.45%
Table 4. Success Rate Comparison under Different Aggressive Vehicle Ratios.
Aggressiveness Ratio | Rule-Guidance + DSAC-T + Safe Aware | SAC | TD3
0.2 | 88.21% | 84.31% | 72.65%
0.4 | 71.54% | 68.99% | 65.36%
0.7 | 27.45% | 10.32% | 16.96%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
