Enhancing Lane Change Safety and Efficiency in Autonomous Driving Through Improved Reinforcement Learning for Highway Decision-Making

Wang, Zi; Jiang, Mingzuo; Gu, Shaoqiang; Gu, Yunyang; Wang, Jiaxia

doi:10.3390/electronics14050918

Open AccessArticle

Enhancing Lane Change Safety and Efficiency in Autonomous Driving Through Improved Reinforcement Learning for Highway Decision-Making

by

Zi Wang

¹,

Mingzuo Jiang

^1,*,

Shaoqiang Gu

¹,

Yunyang Gu

¹ and

Jiaxia Wang

²

¹

College of Automation, Jiangsu University of Science and Technology, Zhenjiang 212003, China

²

School of Naval Architecture and Ocean Engineering, Jiangsu University of Science and Technology, Zhenjiang 212003, China

^*

Author to whom correspondence should be addressed.

Electronics 2025, 14(5), 918; https://doi.org/10.3390/electronics14050918

Submission received: 3 February 2025 / Revised: 21 February 2025 / Accepted: 24 February 2025 / Published: 25 February 2025

Download

Browse Figures

Versions Notes

Abstract

Autonomous driving (AD) significantly reduces road accidents, providing safer transportation while optimizing traffic flow for greater efficiency and smoothness. However, ensuring safe decision-making in dynamic and complex highway environments, especially during lane-changing maneuvers, remains a challenge. Reinforcement Learning (RL) has become a promising method for developing decision-making systems in AD, particularly Deep Reinforcement Learning (DRL). In this study, we focus on highway lane-change behaviors and propose a novel DRL algorithm, called Huber-regularized Reward-threshold Adaptive Double Deep Q-Network (HRA-DDQN). First, a reward function optimally balances speed, safety, and the necessity of lane changes, ensuring efficient and safe maneuvering in highway scenarios. Second, the dynamic target network update strategy triggered by reward difference is introduced into HRA-DDQN, which enhances the model’s adaptability to varying traffic conditions. Finally, a hybrid loss function, combining Huber loss with L2 regularization, is implemented in HRA-DDQN to improve robustness against outliers and mitigate overfitting. Simulation results demonstrate that the proposed decision framework significantly enhances both driving efficiency and safety, outperforming other methods by yielding higher rewards, lower collision rates, and more stable lane-changing decisions.

Keywords:

autonomous driving; Double Deep Q-Network; highway simulation environment; lane-changing; reinforcement learning; vehicle secure

1. Introduction

Autonomous vehicles have attracted considerable attention and have seen substantial advancements in recent years [1,2,3]. Autonomous driving technology allows vehicles to perform various driving tasks without human intervention. Autonomous driving technology holds the potential to greatly improve road safety by reducing human errors, such as those resulting from fatigue, distraction, or suboptimal decision-making [4,5,6], which are major contributors to traffic accidents. These systems comprise four essential elements: perception, decision-making, planning, and control [7,8].

In RL, neural networks are employed to approximate either the policy or the value function. By taking the state of the environment as an input, the network outputs the probability distribution over actions (in policy gradient methods) or the value of actions (in value-based methods). Through continuous interaction with the environment, the neural networks gradually learn a near-optimal policy or value function, which enhances decision-making performance and robustness. DRL has been widely applied in the AD domain, including perception, decision-making, planning, and control, demonstrating significant potential in enhancing system performance across these critical areas. Ref. [9] proposed enhancing roadside cooperative perception by using multi-modal data from various sensors, including cameras and LiDAR. This DRL-based approach helps vehicles perceive and track objects more accurately. Ref. [10] proposed a Double Deep Q-Network (DDQN) algorithm to improve highway driving decision-making. This method addresses the challenges of large state spaces and continuous actions, enabling autonomous vehicles to perform overtaking maneuvers efficiently and safely. Furthermore, ref. [11] proposed a DRL-based path-planning method using the Deep Deterministic Policy Gradient (DDPG) algorithm to solve the sparse reward problem in AD for mobile robots. Ref. [12] proposed a self-learning algorithm integrating RL with Model Predictive Control (MPC) to improve car-following behavior in autonomous vehicles. The method dynamically adjusts MPC weight coefficients by using RL, allowing adaptation to changing traffic conditions.

In complex and unpredictable environments, it is vital for autonomous vehicles to comprehend the objectives of nearby vehicles and make driving decisions in coordination with this understanding. The proficiency of these decision-making capabilities has a profound impact on the overall performance and safety of autonomous vehicles [13,14,15]. Therefore, decision-making is seen as the “brain” of the AD system [16], playing a pivotal role in ensuring effective vehicle operation.

Theoretically, the decision-making methods of AD primarily include rule-based approaches, optimization-based techniques, and learning-based strategies. Rule-based methods [17,18,19] rely on predefined rules and logic to make decisions, which are suitable for simple and static scenarios but inadequate for managing complex and dynamic environments. Optimization-based methods [4,20,21] employ mathematical models and optimization algorithms to determine optimal decisions under given constraints, providing a solid theoretical foundation but suffering from high computational complexity and limited real-time applicability. It remains challenging to ensure the optimality of solutions for large-scale, complex non-convex optimization problems. The first two decision-making methods are mostly applied in offline settings. In contrast, learning-based methods, particularly DRLs [22,23,24] have attracted considerable attention for their capacity to continuously refine decision-making strategies through interaction with the environment, making them especially well-suited for complex and dynamic driving scenarios.

Among the many tasks that AD cars should handle, lane-change decision-making is particularly important because it directly affects both vehicle safety and operational efficiency. Successful lane changes require accurate environmental perception, precise prediction of the behavior of surrounding vehicles, and smooth execution of the maneuver. Given the complexity and dynamic nature of driving environments, making optimal lane-change decisions is essential for preventing accidents and ensuring smooth driving operations. Ref. [25] applied DRL to optimize lane-change decisions in AD by using a model-based exploration strategy combined with a Conditional Variational Autoencoder (CVAE), enhancing the agent’s ability to explore high-dimensional environments and learn effective decision-making strategies. Ref. [10] proposed a DDQN based decision-making strategy for highway overtaking scenarios. By integrating DRL and a hierarchical control framework, the proposed strategy optimizes decision-making in complex driving environments, significantly improving both efficiency and safety during simulation experiments. Similarly, ref. [26] proposed a decision-making framework for highway driving based on Deep Reinforcement Learning (DRL), which incorporates a safety module. The proposed system framework combines rule-based safety mechanisms with a dynamically learned safety module, which uses a Recurrent Neural Network (RNN) to predict and prevent unsafe future states. Ref. [27] developed a DRL-based highway driving strategy to enable vehicles to make real-time, collision-free decisions.

In this paper, in order to efficiently and safely accomplish highway driving tasks, we propose lane-change decision-making architecture for autonomous vehicles in highway environments. A novel DRL algorithm, called HRA-DDQN, is designed and used to obtain highway decision-making strategies. The performance of the proposed decision-making framework is discussed through a series of simulation experiments. The main contributions of this paper are as follows:

(1): We propose a framework for autonomous highway driving that incorporates the key elements such as speed control, safety, and lane-changing efficiency.
(2): The Huber loss function and L2 regularization are introduced into HRA-DDQN. The Huber loss enhances robustness to outliers, improving stability and convergence, while L2 regularization controls weight growth to prevent overfitting, thus improving generalization and performance in dynamic environments.
(3): A reward difference-triggered target network update strategy is proposed in HRA-DDQN, where the target network parameters are updated when the difference between the current and previous rewards exceeds a predetermined threshold. This enhances adaptability to environmental changes during policy learning.

The paper is structured as follows: Section 2 introduces the preliminary knowledge of the RL algorithm. In Section 3, we present the designed Markov Decision Process (MDP) model and HRA-DDQN. Section 4 demonstrates the vehicle training environment and conducts comparative experiments. Finally, Section 5 provides the conclusion of the paper and outlines potential directions for future work.

2. Preliminaries

This section introduces the RL methods relevant to this study, starting with the MDP. Following this, the principle of Deep Q-Network (DQN) is introduced [28]. Lastly, the Dueling DQN is presented as an improvement algorithm over traditional DQN by decomposing the Q-value function into a state value function and an advantage function.

2.1. Reinforcement Learning

RL is a powerful method for tackling decision-making challenges [29,30], aiming to maximize the overall rewards gained by the agent through its interactions with the environment. The fundamental framework of RL comprises two primary components: the environment and the agent, as shown in Figure 1.

The problem of RL is commonly structured as a MDP, defined by the tuple

(S, A, T, R, γ)

, where

S

is the state space,

A

is the action space,

T

is the state transition model which describes the probability

P (s^{'} | s, a)

of transitioning to the next state

s^{'}

after taking action

a

in the current state

s

.

R (s, a)

denotes the immediate reward after taking action

A

in the current state

S

.

γ \in [0, 1]

is the discount factor used to balance the influence of short-term and long-term rewards by discounting future rewards.

The primary objective of RL is to derive an optimal policy by maximizing the discounted sum of cumulative rewards as follows:

R_{t} = \sum_{k = 0}^{\infty} γ^{k} r_{t + k}

(1)

where

R_{t}

represents the discounted cumulative reward at

t

time step, and

r_{t + k}

is the immediate reward received at

t + k

time step.

The expectation cumulative reward

R_{t}

can be used to evaluate the value of a state, known as the state value function

V_{π} (s)

as follows:

V_{π} (s) = E_{π} (\sum_{k = 0}^{\infty} γ^{k} R_{t + k} | s_{t} = s)

(2)

If the action

a

is taken in the current state

s_{t}

, the state value function

V_{π} (s)

dependents on both the state and the action, known as the state-action value function

Q_{π} (s, a)

as follows:

Q_{π} (s, a) = E_{π} (\sum_{k = 0}^{\infty} γ^{k} R_{t + k} | s_{t} = s, a_{t} = a)

(3)

2.2. DQN

DQN has demonstrated significant performance in addressing RL problems, particularly when handling high-dimensional state spaces [31,32]. DQN consists of two distinct neural networks: an online network and a target network. DQN utilizes neural networks to approximate the Q-function and employs the Bellman equation to guide the learning process. The architecture of DQN is shown in Figure 2.

RL aims to develop an optimal policy that maximizes the total accumulated reward. The Bellman equation captures the relationship between the immediate reward and the expected future reward, expressed as:

Q (s, a) = E [r + γ \max_{a_{t + 1}} Q (s_{t + 1}, a_{t + 1}; θ) | s, a]

(4)

where

θ

denotes the parameters of the online network. The optimal policy

π (s)

is defined as the policy that maximizes the Q-function

Q (s, a)

for each state

S

, given by:

π (s) = \arg \max_{a \in A} Q (s, a)

(5)

The optimal policy guides the agent to select actions that yield the highest expected cumulative reward, thereby guiding the decision-making process in the most effective manner.

In DQN, the current Q-value is estimated through the output

Q (s, a; θ)

of the online network based on the given current state

S

and action

a

, while the target Q-value is determined by the target network using the current reward

r_{t}

and the maximum Q-value of the next state:

y_{t} = r_{t} + γ \max_{a_{t + 1}} Q (s_{t + 1}, a_{t + 1}; θ^{-})

(6)

where

θ^{-}

denotes the parameters of the target network, which are updated at regular intervals by replicating from the current network.

DQN adjusts the parameters

θ

by minimizing the loss function defined as follows:

L (θ) = E [{(y_{t} - Q (s, a; θ))}^{2}]

(7)

The gradient descent method is commonly utilized in an online network to iteratively adjust its parameters. The corresponding gradient is calculated as follows:

\nabla_{θ} L (θ) = E [(y_{i} - Q (s, a; θ)) \nabla_{θ} Q (s, a; θ)]

(8)

The DQN algorithm effectively learns to optimize the Q-values, allowing the agent to make optimal decisions in various scenarios.

2.3. Dueling DQN

The Dueling DQN algorithm is an advanced deep RL method [33,34]. It is especially suitable for complex environments with high-dimensional discrete state spaces. Its core idea is to decompose the Q-value into a state value function and an advantage function, thereby improving the efficiency of policy evaluation and the quality of learning outcomes. The structure of the Dueling DQN is illustrated in Figure 3.

The state value function represents the overall value of a particular state, independent of the specific actions, reflecting the intrinsic value of the state itself. In contrast, the advantage function reflects the relative benefit or disadvantage of taking a specific action in that state compared to other actions. The advantage function is given by:

A (s, a) = Q (s, a) - V (s)

(9)

where

Q (s, a)

represents the action–value function, which indicates the expected return of taking a specific action

a

in a given state

S

On the other hand,

V (s)

represents the state–value function, which signifies the overall expected return when being in state

S

, independent of the specific action taken

The Dueling DQN introduces a dual-stream network architecture where both streams share a common feature learning module (such as convolutional layers for image data). The two streams are combined through an aggregation layer to generate the final Q-value estimation. The traditional DQNs directly estimate Q-values, while Dueling DQN represents a Q-value function by introducing an advantage function and a state value function as follows:

Q (s, a; θ, α, β) = V (s; θ, β) + (A (s, a; θ, α) - \frac{1}{|A|} \sum_{a^{'}} A (s, a^{'}; θ, α))

(10)

where

θ

represents the parameters of the shared convolutional layers, while

α

and

β

are the parameters used to calculate the advantage function

A (s, a)

and state value function

V (s)

, respectively.

A

represents the set of all possible actions, and

|A|

denotes the number of actions in this set.

3. Highway Decision-Making Problem

In this section, we will introduce the proposed decision-making framework designed to address the decision-making challenges of autonomous vehicles in highway scenarios.

Based on the principle of RL algorithm, our goal is to learn an optimal strategy for the ego vehicle to avoid dangerous collisions with obstacles during adversarial situations while minimizing unnecessary lane changes. First, we construct a model based on the MDP to define the decision-making framework of vehicles in highway environments. Secondly, we propose the HRA-DDQN algorithm to solve this MDP problem. The proposed algorithm enhances the model’s performance in handling complex scenarios such as lane changes by designing a refined loss function and introducing a dynamic target network parameter update mechanism. Through the combination of these methods, the approach aims to balance safety and efficiency in autonomous vehicle decision-making on highways.

3.1. Problem Formulation

The highway lane change decision-making problem can be modeled as an MDP [35,36,37], which is generally represented as a tuple

(S, A, T, R, γ)

. We, respectively, define the state space, action space, and reward function as follows.

(1) State Space: In the driving environment on highway discussed in this section, the speed and position are taken as the state variables. The state space can be represented as follows:

s_{i} = {[x_{i}, y_{i}, v_{i}^{x}, v_{i}^{y}]}^{T} i = 0, 1, 2, \dots, N

(11)

where

s_{0}

denotes the state of the ego vehicle,

s_{i} (i = 1, 2, \dots, N)

denotes the states of surrounding vehicles, with

N

representing the number of surrounding vehicles. The definition of each variable is provided as follows:

$x_{i}$ : Refers to the lateral position of the agent’s center.
$y_{i}$ : Refers to the longitudinal position of the agent ’s center.
$v_{i}^{x}$ : Refers to the lateral velocity of the agent.
$v_{i}^{y}$ : Refers to the longitudinal velocity of the agent.

(2) Action Space: The action space for the ego vehicle is defined as a discrete set of five actions: A = {“accelerate”, “decelerate”, “lane change right”, “lane change left”, “lane keep”}.

(3) Reward function: The reward function evaluates the suitability of actions based on the current observations or states at each time step from the environment. To achieve driving behaviors that consider both safety and driving efficiency, we define rewards for various aspects in highway scenarios: speed reward

R_{v}

, collision reward

R_{c}

, lane-keeping reward

R_{l}

, lane-changing reward

R_{d}

, and action stability reward

R_{e}

. Overall, the total reward is designed as the weighted sum as follows:

R^{t o t a l} = c_{v} R_{v} + c_{c} R_{c} + c_{l} R_{l} + c_{d} R_{d} + c_{e} R_{e}

(12)

where

c_{v}

,

c_{c}

,

c_{l}

,

c_{d}

and

c_{e}

are the positive weight coefficients that need to be adjusted according to different driving conditions. The weight coefficients are determined through multiple experiments to balance safety, stability, and efficiency in highway driving scenarios.

Speed reward: A speed reward is given based on the ego vehicle’s target speed as follows:

R_{v} = \{\begin{array}{c} - 2 & i f v < v_{m i n} or v > v_{m a x} \\ 1 - \frac{|v - v_{t a r g e t}|}{v_{t o l e r a n c e}} & o t h e r w i s e \end{array}

(13)

where

v

denotes the current speed of the ego vehicle,

v_{m i n}

denotes the minimum speed,

v_{m a x}

denotes the maximum speed, and

v_{t a r g e t}

denotes the target speed. The term

v_{t o l e r a n c e} = m a x (v_{t a r g e t} - v_{m i n}, v_{m a x} - v_{t a r g e t})

defines the speed tolerance. In this study,

v_{m i n}

,

v_{m a x}

, and

v_{t a r g e t}

are set to 20 m/s, 40 m/s, and 30 m/s, respectively. On the other hand, when the current speed

v

exceeds the maximum speed

v_{m a x}

or falls below the minimum speed

v_{m i n}

, resulting in a negative reward (−2). This indicates that the model penalizes deviations from the desired speed range. On the other hand, the speed reward function evaluates the difference between

v_{t a r g e t}

and

v

, motivating the vehicle to maintain a speed within the specified range. This aligns with the objective of promoting a stable driving behavior. Additionally, by introducing the speed tolerance

v_{t o l e r a n c e}

, the speed reward function is not solely based on the current speed but also considers allowable variations.

Collision reward: To ensure driving safety, the ego vehicle must avoid collisions. Therefore, the collision penalty is designed to be significantly higher compared to the other reward components. The collision reward function is designed as follows:

R_{c} = \{\begin{array}{l} - 5 & i f a g e n t c o l l i d e s \\ 0 & o t h e r w i s e \end{array}

(14)

Lane-keeping reward: To achieve lane-keeping, the ego vehicle needs to drive near the center of the lane. This strategy encourages the vehicle to stay within a safe range, minimizing excessive deviation from the lane center, and reducing the risk of accidents. Therefore, the lane-keeping reward function is defined as follows:

R_{l} = \{\begin{array}{c} 1 & i f | x - c_{i} | \leq k d \\ 0 & o t h e r w i s e \end{array}

(15)

where

c_{i}

represents the center position of the lane. The reward value 1 is assigned when the distance between the vehicle’s position

x

and the lane center

c_{i}

is less than or equal to

k d

(with k and d being predefined constants), indicating that the vehicle remains within the acceptable deviation range. Conversely, if this deviation exceeds the threshold

k d

, a reward value of 0 is given, signifying that the vehicle has drifted away from the lane center.

Lane-changing reward: To effectively guide the vehicle in selecting the optimal lane in complex traffic environments, thereby improving driving efficiency and safety, we encourage reasonable lane changes and penalize unnecessary ones. Thus, the lane-changing reward function is designed as follows:

R_{d} = \{\begin{array}{l} 1.0 & i f t h e l a n e c h a n g e i s a R C \\ - 0.5 & i f t h e l a n e c h a n g e i s a U C \\ 0 & i f t h e l a n e c h a n g e i s a N C \end{array}

(16)

where RC represents a reasonable lane change, UC represents an unreasonable lane change, NC represents no lane change. To qualify a lane change as reasonable, the following criteria should be satisfied:

(1): Speed Differential: The current speed $v_{c u r r e n t}$ needs to be lower than the target speed $v_{t a r g e t}$ .
(2): Adjacent Lane Speed: The average speed of the vehicles in adjacent lanes $v_{l a n e}$ should be at least 10 units higher than the current speed ( $v_{l a n e} > v_{c u r r e n t} + 10$ ) of the ego vehicle.

When both the above two conditions are met, executing a lane change (either to the left or right) is considered a reasonable change (RC) and is rewarded with a value of 1. Conversely, if a lane change is executed without meeting these conditions, it is deemed an unreasonable lane change (UC) with a penalty of −0.5. If no lane change (NC) occurs, the reward is 0.

Action stability reward: In order to enable efficient lane-change decision-making for autonomous vehicles on highways, an action stability reward function is designed. This reward function calculates rewards based on variations in actions, encouraging the ego vehicle to maintain stability. The action stability reward is defined as follows:

R_{e} = \{\begin{array}{l} - e x p (- λ t) & i f a_{t} \neq a_{t - 1} \\ 0 & o t h e r w i s e \end{array}

(17)

where λ is the decay coefficient, t represents the time step. When the action of the vehicle changes (i.e.,

a_{t} \neq a_{t - 1}

), a negative reward is applied, which decays exponentially over time. When the action remains unchanged (i.e.,

a_{t} = a_{t - 1}

), the reward is zero. This reward function encourages the vehicle to maintain stability, ensuring a more efficient and stable driving strategy on the highway.

The selection of λ is critical for balancing stability and responsiveness. A higher value of λ results in a steeper decay, which increases the penalty for rapid action changes and fosters long-term stability. However, this could reduce the system’s ability to respond promptly to dynamic traffic conditions. Conversely, a smaller λ allows for more frequent action changes, which may be beneficial in highly dynamic environments but can lead to erratic behavior if the system changes actions too frequently.

For this study, we empirically determined the optimal value of λ by conducting a series of simulations, which indicated that a moderate value of λ = 0.05 provided a favorable trade-off between stability and responsiveness. This value ensures that the vehicle makes efficient lane changes without excessive delays, while still prioritizing safety by maintaining stability in its decision-making process.

3.2. Huber-Regularized Reward-Threshold Adaptive DDQN

RL has achieved notable progress in tackling complex decision-making problems with the development of DQN [38,39,40,41,42]. However, under certain conditions, DQN may overestimate Q-values, assigning higher Q-values to suboptimal actions rather than optimal ones. This overestimation often leads DQN to select actions based on inflated Q-values, thus potentially impacting performance [43].

To address this overestimation problem, the DDQN was introduced. DDQN improves upon traditional DQN by separating action selection and Q-value estimation into two distinct networks: the online Q-network and the target Q-network. Figure 4 shows the farmwork of DDQN. The online Q-network is tasked with choosing the optimal action by selecting the one with the highest predicted Q-value, while the target Q-network estimates the Q-value of the chosen action. This distinction helps reduce overestimation bias, resulting in a more stable learning process.

In this study, we propose an HRA-DDQN, as shown in Figure 5. This new approach builds upon DDQN, aiming to further improve decision-making accuracy and stability. Initially, the online Q-network

Q (s_{t}, a_{t}; θ)

and the target Q-network

Q (s_{t + 1}, a_{t + 1}; θ^{-})

are established. The experience replay memory stores transition tuples

(s_{t}, a_{t}, r_{t}, s_{t + 1})

, which are used to update the networks. The action

a_{m a x}

is selected by the online Q-network based on the action that yields the maximum Q-value for the next state

s_{t + 1}

:

a_{m a x} = a r g m a x_{a} Q (s_{t + 1}, a; θ)

(18)

The target Q-network is subsequently employed to assess the Q-value of the chosen action:

y_{t} = r_{t} + γ Q^{'} (s_{t + 1}, a_{m a x}; θ^{-})

(19)

where

r_{t}

represents the immediate reward,

γ

is the discount factor, and

Q^{'} (s_{t + 1}, a_{m a x}; θ^{-})

is the estimated Q-value of the next state

s_{t + 1}

using the target Q-network. This target value

y_{t}

is then utilized to update the online Q-network by minimizing the loss function.

The Huber loss function merges the benefits of both the MSE and the MAE. It uses the MSE for smaller errors and the MAE for larger errors, thereby obtaining higher robustness to outliers. The Huber loss is defined as:

H u b e r (x) = \{\begin{array}{l} \frac{1}{2} x^{2} & i f | x | \leq δ \\ δ (| x | - \frac{1}{2} δ) & o t h e r w i s e \end{array}

(20)

where the parameter

δ

serves as a threshold that determines the transition point between the quadratic behavior of MSE and the linear form of MAE. When the absolute error

| x |

is less than

δ

, the Huber loss function is similarly to MSE, being more sensitive to small errors and providing smooth gradients, which facilitates small adjustments during the optimization process. Conversely, when

| x |

exceeds the threshold

δ

,the Huber loss function switches to a linear form, resembling MAE. This linear behavior mitigates the impact of large errors and enhances robustness. By tuning the threshold parameter

δ

, we can balance sensitivity to small errors and robustness to outliers.

To solve the model overfitting problems, an L2 regularization term is added to the loss function in HRA-DQN. This term penalizes the sum of the squared model parameters, encouraging the parameters to take smaller values, thus enhancing the model’s generalization ability. The L2 regularization is defined as:

λ \sum_{j} {‖θ_{j}‖}^{2}

(21)

where

λ

is the regularization coefficient,

θ_{j}

represents the parameters of the online Q-network. The threshold δ of the Huber loss and the regularization coefficient λ of L2 were optimized through grid search and cross-validation on the validation dataset, ensuring a balance between model performance and stability while avoiding overfitting and improving robustness.

In HRA-DDQN, we design the loss function as a combination of the Huber loss function and an L2 regularization term. Compared to the traditional MSE loss function, the Huber loss function is less sensitive to outliers, helping to stabilize the training process. The enhanced loss function can be represented as:

L (θ) = \frac{1}{N} \sum_{i = 1}^{N} Huber (Q (s_{t}, a_{t}; θ) - y_{t}) + λ \sum_{j} {‖θ_{j}‖}^{2}

(22)

where

θ

represents the parameters of the online Q-network, and

N

is the batch size of the experience replay buffer.

The online Q-network parameters

θ

are updated by using the gradient descent algorithm:

θ \leftarrow θ - α \nabla_{θ} L (θ)

(23)

where

α

is the learning rate, and

\nabla_{θ} L (θ)

represents the gradient of the loss function with respect to the network parameters.

In traditional DQN methods, the target network parameters are updated at fixed intervals. However, in HRA-DDQN, the target network parameters are updated based on a specific condition related to the reward difference. This approach enables parameter updates of the target network when significant changes in rewards occur, thereby enhancing the model’s adaptability and improving training effectiveness.

The update condition for the target network parameters is given based on the difference between the previous reward and the current reward. The specific update condition is as follows:

i f |R_{t} - R_{t - 1}| > K, t h e n u p d a t e θ^{-} t o θ

(24)

where

R_{t}

is the current reward at time step t, and

R_{t - 1}

is the previous reward at time step t − 1.

θ^{-}

and

θ

represent the parameters of the target network and the online network, respectively, and K is the predefined reward difference threshold. The pseudo-code for updating the target network parameters based on the reward value difference is shown in Algorithm 1.

Algorithm 1 A reward difference triggered target network update strategy

Initialize: Online Q-network

Q (s_{t}, a_{t}; θ)

,target Q-network

Q (s_{t + 1}, a_{t + 1}, θ^{-})

with random weights, reward difference threshold K, discount factor

γ

, current reward

R_{t}

, previous
reward

R_{t - 1}

. (the value is set to be none)
for episode = 1,2…, M do
for time step t = 1,2…, T do
with a probability of

ε

, a random action

a_{t}

is chosen; otherwise, the action

a_{\max} = a r g m a x_{a} Q (s_{t + 1}, a; θ)

is selected.
current reward

R_{t} = R_{t - 1} + γ Q^{'} (s_{t + 1}, a_{m a x}; θ^{-})

If

R_{t - 1}

is not none then
reward difference =

|R_{t} - R_{t - 1}|

If reward difference > K

update Q (s_{t + 1}, a_{t + 1}; θ^{-})

parameters

θ^{-} \leftarrow θ

else no update to target network parameters
else
previous reward

R_{t - 1} = R_{t}

end for
end for

This target network update strategy, triggered by reward differences, ensures that the target network parameters are updated only when the reward difference exceeds a threshold K. When the reward change is small, the target network parameters are not updated. This approach prevents frequent updates and ensures the stability of the learning process.

To further illustrate the impact of the reward difference threshold, we consider two scenarios in the highway lane-change environment: When the vehicle experiences slight fluctuations between consecutive time steps, the reward difference is smaller than the threshold K. As a result, no update is made to the target network. This ensures that small reward fluctuations do not trigger unnecessary updates, thereby maintaining the stability of the learning process. In contrast, when the reward difference exceeds the threshold, it indicates a significant change in the environment or the vehicle’s decisions, prompting the update of the target network parameters. This helps the model quickly adapt to larger changes, improving decision accuracy and the overall robustness of the model.

After introducing the principle of HRA-DDQN, the pseudocode of HRA-DDQN algorithm used in this work is summarized in the Algorithm 2.

Algorithm 2 HRA-DDQN

Inputs: Replay memory D with capacity N, two Q-networks (online and target). learning rate

α

, discount factor

γ

, regularization coefficient

λ

, Huber loss threshold

δ

,reward difference threshold K.
Initialize: Online Q-network

Q (s_{t}; a_{t}; θ)

, target Q-network

Q (s_{t + 1}, a_{t + 1}; θ^{-})

with random weights, replay memory D.
for episode = 1,2…, M do
for time step t = 1,2…, T do
With a probability of

ε

, a random action

a_{t}

is chosen; otherwise, the action

a_{m a x} = a r g m a x_{a} Q (s_{t + 1}, a; θ)

is selected.
Execute action

a_{t}

, observe reward

r_{t}

and the next state

s_{t + 1}

.
Preprocess the next state

s_{t + 1}

and store the transition

(s_{t}, a_{t}, r_{t}, s_{t + 1})

into the
replay memory D.
Draw a random minibatch of transitions

(s, a, r, s_{t + 1})

from D.
If

s_{t + 1}

is terminal then
Set

y_{t} = r_{t}

.
else
Set

y_{t} = r_{t} + γ Q^{'} (s_{t + 1}, a_{m a x}; θ^{-})

Compute loss L (θ) = \frac{1}{N} \sum_{i = 1}^{N} H u b e r (Q (s_{t}, a_{t}; θ) - y_{t}) + λ {\sum_{j} ‖θ_{j}‖}^{2}

Perform gradient descent on

L (θ) w . r . t . θ

.
If

|R_{t} - R_{t - 1}|

> K then

Update Q (s_{t + 1}, a_{t + 1}; θ^{-})

parameters

θ^{-} \leftarrow θ

end for
end for

4. Case Studies

In this section, the performance of the HRA-DDQN method is evaluated to demonstrate the feasibility and the efficiency of the proposed algorithm. Furthermore, its applicability in highway environments is validated through three different case studies.

4.1. Experimental Setup

4.1.1. Simulator

We utilized the highway-env simulator based on OpenAI Gym [44] to evaluate the performance of the HRA-DDQN. The highway-env offers six distinct driving scenarios: Highway, Intersection, Roundabout, Merge, Racetrack, and Parking. As shown in Figure 6, we construct a four-lane highway environment for training and testing. The silver vehicle denotes the ego vehicle, and the green vehicles denote the surrounding traffic vehicles. The goal of the ego vehicle is to traverse the scenario as quickly as possible while avoiding any collisions on the highway.

4.1.2. Parameter Setting

During training, we train the HRA-DDQN, ETDQN [45], and Dueling DQN models for 10⁵ time steps under identical settings. The neural network architectures for the HRA-DDQN, ETDQN, and Dueling DQN models is a fully connected deep neural network (DNN) consisting of three hidden layers, each with 1024 units.

At every time step t, the interaction data between the agent and the environment are saved in an experience replay buffer with a capacity of N. The network samples use training data from this buffer with a batch size of B. A discount factor γ, which typically takes a value between 0 and 1, is applied to account for the potential future rewards when calculating Q-values. The corresponding training hyperparameters of HRA-DDQN are outlined in Table 1.

4.1.3. Evaluation Metric

To compare the models’ performance, we selected a set of evaluation metrics including: (1) average reward, (2) average speed, (3) average number of steps, (4) average action change frequency, (5) collision rate, and (6) number of lane changes.

The average reward is defined as the total accumulated reward during an episode, divided by the number of episodes. This metric evaluates the overall performance of the algorithm in complete tasks, with higher values indicating better outcomes.

The average speed refers to the vehicle’s mean speed throughout an episode. It is calculated by averaging the speeds at each episode. The target speed is set to 30 m/s, and values close to this target reflect efficient driving behavior.

The average number of steps refers to the total steps taken by the vehicle to complete an episode, averaged across all episodes. It is defined as follows:

\bar{N} = \sum_{k = 1}^{K} N_{k} / K

(25)

where

N_{k}

represents the number of driving steps in the Kth episode. Fewer steps indicate a more efficient driving strategy.

The average action change frequency refers to the average frequency with which the control algorithm initiates actions. It is calculated by dividing the total number of actions by the total number of episodes, and is defined as follows:

\bar{T} = \sum_{k = 1}^{K} T_{k} / K

(26)

where

T_{k}

represents the action change frequency in the Kth episode. Lower values indicate more stable and cautious decision-making.

The collision rate is calculated as the total number of collisions the ego vehicle experiences across all episodes, divided by the total time steps. A lower collision rate implies safer driving performance.

The number of lane changes refers to the total count of lane changes performed by the ego vehicle throughout the episodes. Excessive lane changes may indicate unsafety, so fewer lane changes are generally preferred.

By comparing these metrics, we can thoroughly evaluate the strengths and weaknesses of various algorithms in the context of AD decision-making problems.

4.2. Case 1: Comprehensive Performance Comparison

The purpose of this case is to comprehensively evaluate the safety, decision efficiency, decision stability, and decision accuracy of HRA-DDQN in AD decision-making through the following six aspects: (1) key metrics mentioned in the evaluation metric, (2) trends in the number of lane changes, (3) control actions during a single episode, and (4) Q-value error, (5) statistical analysis, (6) ablation study.

4.2.1. Key Metrics

In the following analysis, we examine the results of key metrics. In Figure 7a, the trends of average reward over training process are shown. We can see that the learning speed and training stability of the HRA-DDQN outperform the other two approaches. The HRA-DDQN shows rapid improvement after approximately 10,000 training steps and consistently demonstrates superior performance, particularly after 20,000 time steps, where the average reward stabilizes above 15. In contrast, both ETDQN and Dueling DQN exhibit relatively lower average rewards with larger fluctuations in the same training range. Notably, Dueling DQN’s average reward frequently drops below zero.

Figure 7b illustrates the variation in average steps throughout the training process. HRA-DDQN demonstrates a significant advantage, with average steps stabilizing between 20 and 40. In contrast, ETDQN and Dueling DQN exhibit substantial fluctuations, particularly ETDQN, which frequently reaches up to 100.

Figure 7c illustrates the average speed variations for the three methods. Significant fluctuations were observed in ETDQN and Dueling DQN during the initial phase of training. However, throughout the training process, HRA-DDQN consistently outperformed the other methods. The average speed of HRA-DDQN converges to the target velocity of 30 m/s and remains relatively stable compared with the other methods.

Figure 7d shows the variation in average action change frequency during training. As training progresses, the action change frequency of the three methods tends to stabilize. However, ETDQN and Dueling DQN exhibit significant fluctuations. In contrast, HRA-DDQN rapidly reduces its action change frequency during the early stages and maintains stability, providing a notable advantage in minimizing unnecessary motion changes and improving efficiency.

Table 2 calculates the overall mean values of the evaluation metrics and the collision rates of the three methods after training 100,000 time steps, and provides a performance comparison of HRA-DDQN, ETDQN, and Dueling DQN.

In Table 2, the average reward for HRA-DDQN is 14.5308, which is significantly higher than ETDQN (4.5162) and Dueling DQN (4.7969). A higher average reward reflects that HRA-DDQN performs better under the given reward mechanism, accomplishing driving objectives more effectively. Additionally, the average speed for HRA-DDQN is closer to the target speed compared to ETDQN and Dueling DQN.

In Table 2, the average number of steps for ETDQN and Dueling DQN are 73.302 and 70.63, respectively, which are considerably higher than HRA-DDQN’s 19.05. Fewer steps suggest that HRA-DDQN is better at guiding the vehicle to select the optimal lane in complex traffic environments, promoting efficient lane changes while penalizing unnecessary maneuvers.

In Table 2, the average action change frequency for HRA-DDQN is 6.938%, lower than ETDQN’s 7.33% and Dueling DQN’s 5.95%. This lower frequency indicates that HRA-DDQN is more effective in avoiding unnecessary lane changes. By minimizing action changes, HRA-DDQN improves decision-making efficiency.

In Table 2, HRA-DDQN also demonstrates the lowest collision rate at 1.37%, significantly outperforming ETDQN and Dueling DQN (7.33% and 5.95%, respectively). It shows that HRA-DDQN can avoid collisions more effectively and ensure higher safety.

4.2.2. Number of Lane Changes

This case specifically assessed the trends in the number of lane changes during training of the three algorithms. Number of lane changes serves as a critical metric for evaluating the stability and efficiency of decision-making. Excessive lane changes may indicate unstable decision-making, while fewer and more rational lane changes reflect an optimized decision-making strategy.

Figure 8 illustrates the change in the number of lane changes for the three methods over 500,000 training steps. ETDQN and Dueling DQN exhibit relatively high lane change counts with significant fluctuations during the initial phase. In contrast, the number of lane changes for HRA-DDQN consistently remains significantly lower than those of ETDQN and Dueling DQN, indicating that HRA-DDQN can gradually optimize decision-making strategies and effectively adapt to complex dynamic environments. These results highlight the significant advantages of HRA-DDQN in terms of decision stability and efficiency.

Figure 9 presents the average lane-change counts of the three methods at different training steps. HRA-DDQN consistently maintains the lowest average lane-change counts, approximately 80 times, compared to around 105 and 95 for ETDQN and Dueling DQN, respectively. These results indicate that HRA-DDQN effectively reduces unnecessary lane-change operations, demonstrating its significant advantage in optimizing lane selection strategies. This performance is attributed to the reward function design of HRA-DDQN, which encourages rational lane changes while penalizing unnecessary ones. Additionally, the incorporation of Huber loss and L2 regularization in HRA-DDQN further enhances the model’s training stability, enabling rapid convergence and a substantial reduction in redundant operations.

4.2.3. Control Actions

To further evaluate the decision-making stability and efficiency of the three algorithms, an analysis of their action transitions within a single episode was conducted. The control actions are represented as follows: 0 represents maintaining the current lane with deceleration, 1 represents changing lanes to the left, 2 represents no action change (maintaining the current lane), 3 represents changing lanes to the right, and 4 represents maintaining the current lane with acceleration. Figure 10 illustrates the variation in action over time steps within a single episode for the three algorithms.

As observed in Figure 10, HRA-DDQN demonstrates significant advantages in terms of action stability. Its action transitions exhibit lower variability, with fewer action switches compared to ETDQN and Dueling DQN. This stability is primarily attributed to its reward function design, which penalizes frequent lane changes and unnecessary acceleration or deceleration, promoting decision consistency. Moreover, compared to the other two algorithms, the curve for HRA-DDQN exhibits a faster convergence to efficient decision-making strategies, requiring fewer action transitions to complete the task. The stable action distribution and rapid convergence not only optimize task execution but also enhance the model’s practicality and reliability in real-world driving scenarios.

4.2.4. Q-Value Errors

To verify the effectiveness of the proposed loss function in improving decision accuracy, we compared the Q-value error trends of HRA-DDQN, Dueling DQN, and ETDQN during the training process. The Q-value error is a core metric of RL algorithm performance, reflecting the deviation between the target Q-value and the predicted Q-value from the online network. A lower Q-value error indicates that the algorithm can more accurately estimate the value of the current state and action, thereby making better decisions.

As shown in Figure 11, the Q-value error of HRA-DDQN decreases rapidly from the early stages of training and eventually stabilizes at approximately 0.1, which is significantly lower than those of Dueling DQN (0.25) and ETDQN (0.3). Furthermore, HRA-DDQN exhibits much smaller error fluctuations during the training process, demonstrating higher convergence and stability. This result can be attributed to the sensitivity of the Huber loss function to small errors and its robustness to outliers, enabling the model to quickly approach the optimal policy.

This experiment further validates the critical role of the proposed loss function in enhancing algorithm performance. In HRA-DDQN, the Huber loss transitions to a linear form when errors are large, effectively mitigating the impact of large errors on training and improving the model’s robustness. Simultaneously, L2 regularization constrains excessive parameter growth, preventing the overfitting problem. These mechanisms collectively make HRA-DDQN significantly more efficient and stable in learning under dynamic traffic scenarios compared to other methods.

4.2.5. Statistical Analysis

To assess whether the observed improvements are statistically significant, we performed a paired t-test on the performance of HRA-DDQN, ETDQN, and Dueling DQN on key metrics such as average reward, average lane changes, and collision rate. We conducted 10 independent experiments for each of the three methods in the same highway scenario and performed paired t-tests. The experimental results are shown in Figure 12.

Figure 12a shows the average reward values during the training process of the three methods. HRA-DDQN achieved an average reward of 15.2, significantly higher than ETDQN (5.1, p < 0.0001) and Dueling DQN (6.2, p < 0.0001), indicating that HRA-DDQN outperforms the other two methods in task completion efficiency and overall performance, and is better at optimizing the decision-making process. In Figure 12b, HRA-DDQN demonstrates statistically significant improvement in reducing the number of lane changes compared to the other two methods (p < 0.0001), further showing its efficient decision-making ability. In terms of collision rate, as shown in Figure 12c, HRA-DDQN performs the best, significantly lower than ETDQN and Dueling DQN (p < 0.0001). The paired t-test results ultimately prove that HRA-DDQN is superior to ETDQN and Dueling DQN, with statistically significant differences. This confirms that the improvements proposed in HRA-DDQN, in terms of reward design, hybrid loss function, and dynamic target network updates, are not only empirically effective but also statistically robust.

4.2.6. Ablation Study

The weight coefficients (

c_{v}

,

c_{c}

,

c_{l}

,

c_{d}

, and

c_{e}

) in the reward function play a crucial role in balancing multiple objectives. We conducted an ablation study on the different weight coefficients in the reward function. The study used comparative experiments to validate the rationality of the selected weight values and their impact on driving efficiency and safety.

All experiments were conducted in the same four-lane highway environment using the HRA-DDQN algorithm. Each training process consisted of 50,000 time steps, and performance metrics were averaged over five independent trials to ensure statistical robustness. The experiments involved changing the value of only one weight coefficient at a time, keeping the others constant, and comparing it with the original weight coefficients set in this paper. The specific weight coefficient settings are shown in Table 3.

Table 4 presents a performance comparison of the ablation experiments for each weight coefficient. The experimental results show that increasing the speed coefficient (

c_{v}

), compared to the base group, brings the vehicle’s speed closer to the desired 30 m/s. However, the collision rate increases by 1.84%, and lane-changing becomes more frequent. This indicates that overemphasizing speed may compromise safety and decision stability. When lowering the collision coefficient (

c_{c}

) compared to the base group, the collision rate surged to 5.41%. The average reward also decreased by 38%, suggesting that strict collision penalties are crucial for safe driving. When increasing the lane-keep coefficient, lane-change coefficient, and action stability coefficient (

c_{l}

,

c_{d}

and

c_{e}

) compared to the base group, it is observed that the collision rates of all three groups significantly increased.

The reward function weight coefficients set in this paper base group achieve the best trade-off between all metrics, validating the rationality of the weight selection. Proper adjustment of the weight coefficients in the reward function is crucial for the performance of autonomous driving systems in complex dynamic environments. By optimizing the weight coefficient settings in the reward function, we can maximize efficiency and decision stability while ensuring safety.

4.3. Case 2: Ego Vehicle Lane Change Process and Safety Analysis

Lane-changing is a complex maneuver influenced by factors such as speed, lane position, and the presence of nearby vehicles. This case first shows the process of an autonomous vehicle executing a lane change in a highway-env with four lanes and then analyzes the number of collisions observed under different methods.

(1): Lane Change Process and Simulation

Figure 13 illustrates the entire process of an autonomous vehicle performing a lane change on a four-lane highway by using the HRA-DDQN algorithm. The ego vehicle (marked in green) starts in the second lane of the highway (as shown in Figure 13a) and transitions to the third lane while avoiding collisions with nearby vehicles (Figure 13b). Once the lane change is complete, the ego vehicle switches back to the second lane without impeding the movement of surrounding vehicles (Figure 13c). This process demonstrates the capability of HRA-DDQN to effectively manage complex traffic scenarios while maintaining safety.

(2): Safety Evaluation

To further evaluate the safety performance of the three algorithms, we first compared the number of collisions for HRA-DDQN, ETDQN, and Dueling DQN. Secondly, the collision counts for the three methods were analyzed and compared.

Figure 14 shows the number of collisions for each algorithm over 50,000 training steps. As shown in Figure 14, the number of collisions decreases for all algorithms as training progresses, indicating that the vehicles progressively learn and refine their driving strategies during training, thereby reducing collision occurrences. Notably, HRA-DDQN demonstrates superior safety performance, particularly after 30,000 training steps.

Figure 15 shows the collision counts for HRA-DDQN, ETDQN, and Dueling DQN after 50,000 training steps. Collision = 1 indicates that the ego vehicle has collided with other vehicles, while Collision = 0 indicates no collisions. Compared to ETDQN and Dueling DQN, HRA-DDQN demonstrates significant advantages in reducing collisions and achieving stability. As training progresses, the collision counts of HRA-DDQN continue to decrease, effectively avoiding collisions after approximately 1600 episodes.

By analyzing the number of collisions and collision counts for the three methods, it is evident that HRA-DDQN leverages the optimized dynamic target network update mechanism and reward function to quickly adapt to dynamic traffic environments and learn more efficient collision avoidance strategies.

(3): Collision performance at different vehicle densities

To evaluate the robustness of HRA-DDQN in dynamic traffic scenarios, we gradually increased the vehicle density in a four-lane highway environment (ranging from 50 to 100 vehicles), with each vehicle density trained for 50,000 time steps. This study investigates the changes in the total number of vehicle collisions under different traffic densities. As the traffic density increases, the model is required to handle more complex driving decision scenarios. Therefore, the experiment aims to assess the performance of different models in high-density traffic environments, particularly their ability to avoid collisions. By observing the variation in collision counts, we gain deeper insights into the model’s adaptability and decision-making efficiency under dynamic and complex traffic conditions, while also providing data support for further model optimization.

Figure 16 shows the collision counts of the three methods under different vehicle densities. As the number of vehicles in the scene increases, the collision counts for all three methods exhibit an upward trend. However, HRA-DDQN demonstrates better stability and a lower collision rate throughout this process. This result indicates that, in complex traffic environments, as the number of vehicles increases, the complexity of the decision-making space also grows. The model needs to make more precise decisions under higher complexity to avoid collisions. Overall, the HRA-DDQN model is able to better adapt to the challenges posed by changes in vehicle density, demonstrating its superiority in dynamic and complex scenarios.

4.4. Case 3: Impact of Key Innovations on Collision Avoidance

To assess the impact of each innovation on the AD decision-making model, we designed Case 3. The simulation setup is consistent with those in Case 1 and Case 2. In this case, one innovation from this study is removed at a time, while the remaining innovations are retained. The key innovations proposed in this study are as follows: (1) the improved design of reward function, (2) the introduction of the Huber loss function and L2 regularization into the HRA-DDQN, and (3) the reward-value-difference-triggered target network update strategy in the HRA-DDQN.

In this case, Method 1 removes Innovation 1, while Innovations 2 and 3 are retained. Method 2 removes Innovation 2, while Innovations 1 and 3 are preserved. Method 3 removes Innovation 3, leaving Innovations 1 and 2 preserved. The baseline retains all the innovative features. This case allows us to observe the individual effects of these innovations on the overall model performance, with a particular emphasis on the collision count, which serves as a critical metric for evaluating the safety of AD systems.

Through the simulation conducted in Case 3, we conclude that the removal of any one of these three innovations leads to a significant deterioration in the model’s safety performance, particularly evident as an increase in collision frequency, as illustrated in Figure 17, the collision counts for Method 1 increased from 377 in the baseline experiment to 595, representing a 58.1% rise. Similarly, the collision counts for Method 2 rose to 702, reflecting an 85.9% increase compared to the baseline. Moreover, the collision counts for Method 3 surged to 865, marking a dramatic increase of 129.4%.

To more comprehensively assess the significance of each innovation, we have summarized the key data from the experiments in Table 5. In Table 5, Method 1 has an average number of lane changes of 107, which is significantly higher than the 82 observed in HRA-DDQN. This indicates that the lack of proper guidance from the reward function leads to more unnecessary lane changes, impacting both efficiency and safety. Method 2 has an average Q-value error of 0.542, which is much higher than HRA-DDQN’s 0.217, highlighting the crucial role of the Huber loss function and L2 regularization in reducing decision errors. Method 3 has a collision rate of 3.84%, which is significantly higher than HRA-DDQN’s 1.16%, suggesting that the target network update mechanism triggered by reward value differences can effectively improve the model’s adaptability to environmental changes, thereby reducing collision risks.

Therefore, the combination of these three innovations is essential to ensuring driving safety. In practical applications, the implementation of these mechanisms is critical to improving the overall performance and safety of AD systems.

To validate the impact of the reward difference threshold K on the overall performance of the HRA-DDQN algorithm, particularly on key metrics such as collision rate, average reward, and average lane changes, we adjusted the value of K to explore the performance of the target network’s dynamic update strategy under different thresholds, thereby determining the optimal K value.

As shown in Table 6, lower K values (such as K = 0.1 and K = 0.25) result in frequent updates to the target network, leading to increased collision rates and unnecessary lane changes, which reduce the efficiency and safety of decision-making. Higher values of K (such as K = 0.8 and K = 1.0) result in slow target network updates, preventing timely responses to environmental changes, which leads to a higher collision rate. K = 0.5 achieves the best balance, with the lowest collision rate (1.37%) and the highest average reward (14.64). This indicates that at this value, target network updates are triggered only when significant changes in the environment occur, avoiding unnecessary updates and maintaining good learning efficiency.

Overall, HRA-DDQN demonstrates exceptional performance across all the key metrics, especially in terms of safety, reaching target speeds, and optimizing lane change strategies, making it highly effective in complex traffic environments.

5. Conclusions

AD decision-making remains a critical challenge for autonomous vehicles. In this study, we propose the HRA-DDQN algorithm to optimize lane-change decisions. The novel reward function balances multiple factors, including driving speed, safety, and lane-change necessity. Additionally, the loss function combines Huber loss with L2 regularization to mitigate outliers and reduce overfitting. Additionally, the dynamic update of target network parameters based on reward difference thresholds, allowing parameter updates only when significant reward changes are detected. This improves adaptability and decision-making without relying on periodic updates, and the parameter does not need to be preset.

The experimental results indicate that the proposed method significantly surpasses the comparison models in several key metrics, including average reward, driving speed, collision rate, decision stability, and so on. These improvements lead to safer and more efficient lane-change maneuvers.

In future work, we aim to extend this approach to more dynamic and complex driving environments, such as urban intersections. Additionally, we plan to integrate surrounding vehicle behavior prediction to further enhance decision accuracy and vehicle safety. Moreover, we recognize the need for real-world feasibility and sensor integration. The model will be adapted to use data from real-world sensors such as LiDAR, cameras, and radar. By applying sensor fusion techniques, the system will improve decision-making accuracy in diverse conditions. We will also focus on handling adversarial driving scenarios, including unexpected pedestrian crossings and aggressive drivers, by incorporating behavior prediction models and robust sensor data processing, ensuring reliable operation in real-world environments.

Author Contributions

Methodology, Z.W. and M.J.; Software, S.G., Y.G. and J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Postgraduate Research & Practice Innovation Program of Jiangsu Province under Grant SJCX24_2490.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

Feng, S.; Xi, J.; Gong, C.; Gong, J.; Hu, S.; Ma, Y. A collaborative decision making approach for multi-unmanned combat vehicles based on the behaviour tree. In Proceedings of the 3rd International Conference on Unmanned Systems (ICUS), Harbin, China, 27–28 November 2020; pp. 395–400. [Google Scholar]
Lee, H.; Kim, N.; Cha, S.W. Model-based reinforcement learning for eco-driving control of electric vehicles. IEEE Access 2020, 8, 202886–202896. [Google Scholar] [CrossRef]
Tian, Y.; Cao, X.; Huang, K.; Fei, C.; Zheng, Z.; Ji, X. Learning to drive like human beings: A method based on deep reinforcement learning. IEEE Trans. Intell. Transp. Syst. 2022, 23, 6357–6367. [Google Scholar] [CrossRef]
Kiran, B.R.; Sobh, I.; Talpaert, V.; Mannion, P.; Sallab, A.A.A.; Yogamani, S.; Pérez, P. Deep reinforcement learning for autonomous driving: A survey. IEEE Trans. Intell. Transp. Syst. 2022, 23, 4909–4926. [Google Scholar] [CrossRef]
Chen, J.; Li, S.E.; Tomizuka, M. Interpretable end-to-end urban autonomous driving with latent deep reinforcement learning. IEEE Trans. Intell. Transp. Syst. 2022, 23, 5068–5078. [Google Scholar] [CrossRef]
Liu, J.; Zhao, L.; Zheng, K.; Zhou, Q. A distributed driving decision scheme based on reinforcement learning for autonomous driving vehicles. In Proceedings of the 91st Vehicular Technology Conference (VTC2020-Spring), Antwerp, Belgium, 25–28 May 2020; pp. 1–5. [Google Scholar]
Tang, J.; Shaoshan, L.; Pei, S.; Zuckerman, S.; Chen, L.; Shi, W.; Gaudiot, J.-L. Teaching autonomous driving using a modular and integrated approach. In Proceedings of the IEEE 42nd Annual Computer Software and Applications Conference (COMPSAC), Tokyo, Japan, 23–27 July 2018. [Google Scholar]
Hoel, C.-J.; Driggs-Campbell, K.; Wolff, K.; Laine, L.; Kochenderfer, M.J. Combining planning and deep reinforcement learning in tactical decision making for autonomous driving. IEEE Trans. Intell. Veh. 2020, 5, 294–305. [Google Scholar] [CrossRef]
Hao, R.; Fan, S.; Dai, Y.; Zhang, Z.; Li, C.; Wang, Y.; Yu, H.; Yang, W.; Yuan, J.; Nie, Z. Rcooper: A real-world large-scale dataset for roadside cooperative perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 22347–22357. [Google Scholar]
Liao, J.; Liu, T.; Tang, X.; Mu, X.; Huang, B.; Cao, D. Decision-making strategy on highway for autonomous vehicles using deep reinforcement learning. IEEE Access 2020, 8, 177804–177814. [Google Scholar] [CrossRef]
Park, M.; Lee, S.Y.; Hong, J.S.; Kwon, N.K. Deep deterministic policy gradient-based autonomous driving for mobile robots in sparse reward environments. Sensors 2022, 22, 9574. [Google Scholar] [CrossRef]
Wang, L.; Yang, S.; Yuan, K.; Huang, Y.; Chen, H. A combined reinforcement learning and model predictive control for car-following maneuver of autonomous vehicles. Chin. J. Mech. Eng. 2023, 36, 80. [Google Scholar] [CrossRef]
Do, Q.H.; Tehrani, H.; Mita, S.; Egawa, M.; Muto, K.; Yoneda, K. Human drivers based active-passive model for automated lane change. IEEE Intell. Transp. Syst. Mag. 2017, 9, 42–56. [Google Scholar] [CrossRef]
Xiong, G.; Kang, Z.; Li, H.; Song, W.; Jin, Y.; Gong, J. Decision–making of lane change behavior based on RCS for automated vehicles in the real environment. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV), Changshu, China, 26–30 June 2018; pp. 1400–1405. [Google Scholar]
Shu, K.; Yu, H.; Chen, X.; Chen, L.; Wang, Q.; Li, L.; Cao, D. Autonomous driving at intersections: A critical-turning-point approach for left turns. In Proceedings of the IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), Rhodes, Greece, 20–23 September 2020; pp. 1–6. [Google Scholar]
Liu, T.; Tian, B.; Ai, Y.; Chen, L.; Liu, F.; Cao, D.; Bian, N.; Wang, F.-Y. Dynamic states prediction in autonomous vehicles: Comparison of three different methods. In Proceedings of the IEEE Intelligent Transportation Systems Conference (ITSC), Auckland, New Zealand, 27–30 October 2019; pp. 3750–3755. [Google Scholar]
Wei, J.; Dolan, J.M.; Litkouhi, B. A prediction- and cost function-based algorithm for robust autonomous freeway driving. In Proceedings of the IEEE Intelligent Vehicles Symposium, San Diego, CA, USA, 21–24 June 2010; pp. 512–517. [Google Scholar]
Sethi, S.P.; Zhang, Q. Hierarchical Decision Making in Stochastic Manufacturing Systems; Springer: Berlin/Heidelberg, Germany, 2012. [Google Scholar]
Paden, B.; Cap, M.; Yong, S.Z.; Yershov, D.; Frazzoli, E. A survey of motion planning and control techniques for self-driving urban vehicles. IEEE Trans. Intell. Veh. 2016, 1, 33–55. [Google Scholar] [CrossRef]
Siboo, S.; Bhattacharyya, A.; Raj, R.N.; Ashwin, S.H. An empirical study of DDPG and PPO-based reinforcement learning algorithms for autonomous driving. IEEE Access 2023, 11, 125094–125108. [Google Scholar] [CrossRef]
Schwarting, W.; Alonso-Mora, J.; Rus, D. Planning and decision-making for autonomous vehicles. Annu. Rev. Control Robot. Auton. Syst. 2018, 1, 187–210. [Google Scholar] [CrossRef]
Chen, Y.; Li, S.; Tang, X.; Yang, K.; Cao, D.; Lin, X. Interaction-aware decision making for autonomous vehicles. IEEE Trans. Transp. Electrif. 2023, 9, 4704–4715. [Google Scholar] [CrossRef]
Lopez, V.G.; Lewis, F.L.; Liu, M.; Wan, Y.; Nageshrao, S.; Filev, D. Game-theoretic lane-changing decision making and payoff learning for autonomous vehicles. IEEE Trans. Veh. Technol. 2022, 71, 3609–3620. [Google Scholar] [CrossRef]
Kendall, A.; Hawke, J.; Janz, D.; Mazur, P.; Reda, D.; Allen, J.-M. Learning to drive in a day. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 8248–8254. [Google Scholar]
Zhang, S.; Peng, H.; Nageshrao, S.; Tseng, E. Discretionary lane change decision making using reinforcement learning with model-based exploration. In Proceedings of the 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), Boca Raton, FL, USA, 16–19 December 2019. [Google Scholar]
Baheri, A.; Nageshrao, S.; Tseng, H.E.; Kolmanovsky, I.; Girard, A.; Filev, D. Deep reinforcement learning with enhanced safety for autonomous highway driving. In Proceedings of the 2020 IEEE Intelligent Vehicles Symposium (IV), Las Vegas, NV, USA, 19 October–13 November 2020; pp. 1550–1555. [Google Scholar]
Makantasis, K.; Kontorinaki, M.; Nikolos, I. Deep reinforcement-learning-based driving policy for autonomous road vehicles. IET Intell. Transp. Syst. 2020, 14, 13–24. [Google Scholar] [CrossRef]
Mnih, V. Playing atari with deep reinforcement learning. arXiv 2013, arXiv:1312.5602. [Google Scholar]
Al-Hamadani, M.N.A.; Fadhel, M.A.; Alzubaidi, L.; Balazs, H. Reinforcement Learning Algorithms and Applications in Healthcare and Robotics: A Comprehensive and Systematic Review. Sensors 2024, 24, 2461. [Google Scholar] [CrossRef]
Bergerot, C.; Barfuss, W.; Romanczuk, P. Moderate confirmation bias enhances decision-making in groups of reinforcement-learning agents. PLoS Comput. Biol. 2024, 20, e1012404. [Google Scholar] [CrossRef]
Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar] [CrossRef]
Iqbal, A.; Tham, M.L.; Chang, Y.C. Double deep Q-network-based energy-efficient resource allocation in cloud radio access network. IEEE Access 2021, 9, 20440–20449. [Google Scholar] [CrossRef]
Huang, L.; Ye, M.; Xue, X.; Wang, Y.; Qiu, H.; Deng, X. Intelligent routing method based on Dueling DQN reinforcement learning and network traffic state prediction in SDN. Wireless Netw. 2024, 30, 4507–4525. [Google Scholar] [CrossRef]
Din, N.M.U.; Assad, A.; Sabha, S.U.; Rasool, M. Optimizing deep reinforcement learning in data-scarce domains: A cross-domain evaluation of double DQN and dueling DQN. Int. J. Syst. Assur. Eng. Manag. 2024. [Google Scholar] [CrossRef]
Guan, Y.; Li, S.E.; Duan, J.; Wang, W.; Cheng, B. Markov probabilistic decision making of self-driving cars in highway with random traffic flow: A simulation study. J. Intell. Connect. Veh. 2018, 1, 77–84. [Google Scholar] [CrossRef]
He, X.; Yang, H.; Hu, Z.; Lv, C. Robust lane change decision making for autonomous vehicles: An observation adversarial reinforcement learning approach. IEEE Trans. Intell. Veh. 2022, 8, 184–193. [Google Scholar] [CrossRef]
Sharma, O.; Sahoo, N.C.; Puhan, N.B. Highway lane-changing prediction using a hierarchical software architecture based on support vector machine and continuous hidden markov model. Int. J. Intell. Transp. Syst. Res. 2022, 20, 519–539. [Google Scholar] [CrossRef]
Zhu, Z.; Zhao, H. A survey of deep RL and IL for autonomous driving policy learning. IEEE Trans. Intell. Transp. Syst. 2021, 23, 14043–14065. [Google Scholar] [CrossRef]
Chae, H.; Kang, C.M.; Kim, B.D.; Kim, J.; Chung, C.C.; Choi, J.W. Autonomous braking system via deep reinforcement learning. In Proceedings of the 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), Yokohama, Japan, 16–19 October 2017; pp. 1–6. [Google Scholar]
Xia, W.; Li, H.; Li, B. A control strategy of autonomous vehicles based on deep reinforcement learning. In Proceedings of the 2016 9th International Symposium on Computational Intelligence and Design (ISCID), Hangzhou, China, 10–11 December 2016; Volume 2, pp. 198–201. [Google Scholar]
Li, G.; Yang, Y.; Li, S.; Qu, X.; Lyu, N.; Li, S.E. Decision making of autonomous vehicles in lane change scenarios: Deep reinforcement learning approaches with risk awareness. Transp. Res. Part C Emerg. Technol. 2022, 134, 103452. [Google Scholar] [CrossRef]
Sharma, S.; Tewolde, G.; Kwon, J. Lateral and longitudinal motion control of autonomous vehicles using deep learning. In Proceedings of the 2019 IEEE International Conference on Electro Information Technology (EIT), Brookings, SD, USA, 20–22 May 2019; pp. 1–5. [Google Scholar]
Kim, K. Enhancing Reinforcement Learning Performance in Delayed Reward System Using DQN and Heuristics. IEEE Access 2022, 10, 50641–50650. [Google Scholar] [CrossRef]
Leurent, E. An Environment for Autonomous Driving Decision-Making. Available online: https://github.com/Farama-Foundation/HighwayEnv (accessed on 26 September 2024).
Lu, J.; Han, L.; Wei, Q.; Wang, X.; Dai, X.; Wang, F.-Y. Event-triggered deep reinforcement learning using parallel control: A case study in autonomous driving. IEEE Trans. Intell. Veh. 2023, 8, 2821–2831. [Google Scholar] [CrossRef]

Figure 1. RL framework.

Figure 2. The structure of DQN.

Figure 3. The framework of Dueling DQN.

Figure 4. The framework of DDQN.

Figure 5. The HRA-DDQN input and output of the framework.

Figure 6. Highway driving scenario for a decision-making task involving four lanes.

Figure 7. Training curves of different performance indices. (a) The average reward values during the training process of the three methods. (b) The average number of steps during the training process of the three methods. (c) The average speed during the training process of the three methods. (d) The average action change frequency during the training process of the three methods.

Figure 8. Comparison of lane changes for three methods.

Figure 9. Average number of lane changes in the three methods.

Figure 10. Control actions in one episode of three methods.

Figure 11. Comparison of Q-value error trends during the training process.

Figure 12. Performance evaluation of three methods using paired t-test. (a) The average reward values during the training process of the three methods. (b) The average number of lane changes during the training process of the three methods. (c) The collision rate during the training process of the three methods. The “****” marks indicate highly significant differences (p < 0.0001) in the respective evaluations.

Figure 13. Ego vehicle lane change simulation in a highway environment. (a) The ego vehicle is in the second lane of a four-lane highway. (b) The ego vehicle switches to the third lane. (c) The ego vehicle transitions back to the second lane without impeding the movement of surrounding vehicles. The green color represents the ego vehicle, and the blue color represents surrounding vehicles.

Figure 14. Comparison of collision counts of the three methods.

Figure 15. Collision conditions of the ego vehicle in each compared method.

Figure 16. Number of collisions for different numbers of traffic vehicles after training.

Figure 17. Effect of removing key Innovation points on collision counts.

Table 1. The training hyperparameters for HRA-DDQN.

Hyperparameters	Symbol	Value
Discount factor	γ	0.97
Learning rate	h	0.1
Batch size	B	256
Replay buffer size	N	8192
Speed reward weight	c_v	0.5
Collision reward weight	C_c	5
Lane-keeping reward weight	c_l	0.1
Lane-changing reward weight	c_d	0.5
Action stability reward weight	c_e	2

Table 2. Performance comparison with different methods.

Evaluation Metric	HRA-DDQN	ETDQN	Dueling DQN
Average reward	14.5308	4.5162	4.7969
Average speed	29.8305	27.8895	28.3431
Average number of steps	19.05	73.302	70.63
Average action change frequency	6.938%	10.27%	9.71%
Collision rate	1.37%	7.33%	5.95%

Table 3. Weight configurations for ablation experiments.

	c_v	c_c	c_l	c_d	c_e
Base group	0.5	5	0.1	0.5	2
Experiment 1	1	5	0.1	0.5	2
Experiment 2	0.5	2	0.1	0.5	2
Experiment 3	0.5	5	0.5	0.5	2

Table 4. Performance comparison across ablation experiments.

	Average Reward	Average Speed	Collision Rate	Average Lane Change
Base group	15.68	29.81	1.29%	79
Experiment 1	11.54	30.14	3.13%	91
Experiment 2	9.64	28.47	5.41%	85
Experiment 3	14.62	29.37	1.78%	75
Experiment 4	10.46	28.85	1.81%	82
Experiment 5	12.28	29.17	2.36%	84

Table 5. Impact of removing key innovations on safety and decision-making performance.

	Collision Rate	Average Number of Lane Changes	Average Q-Value Error
HRA-DDON	1.16%	82	0.217
Method 1	2.27%	107	0.389
Method 2	2.83%	92	0.542
Method 3	3.84%	119	0.309

Table 6. The impact of different reward difference thresholds on decision-making performance.

K	Collision Rate	Average Lane Change	Average Reward
0.1	7.31%	121	4.55
0.25	2.59%	97	11.31
0.5	1.37%	84	14.64
0.8	3.94%	72	9.47
1	5.56%	65	5.13

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wang, Z.; Jiang, M.; Gu, S.; Gu, Y.; Wang, J. Enhancing Lane Change Safety and Efficiency in Autonomous Driving Through Improved Reinforcement Learning for Highway Decision-Making. Electronics 2025, 14, 918. https://doi.org/10.3390/electronics14050918

AMA Style

Wang Z, Jiang M, Gu S, Gu Y, Wang J. Enhancing Lane Change Safety and Efficiency in Autonomous Driving Through Improved Reinforcement Learning for Highway Decision-Making. Electronics. 2025; 14(5):918. https://doi.org/10.3390/electronics14050918

Chicago/Turabian Style

Wang, Zi, Mingzuo Jiang, Shaoqiang Gu, Yunyang Gu, and Jiaxia Wang. 2025. "Enhancing Lane Change Safety and Efficiency in Autonomous Driving Through Improved Reinforcement Learning for Highway Decision-Making" Electronics 14, no. 5: 918. https://doi.org/10.3390/electronics14050918

APA Style

Wang, Z., Jiang, M., Gu, S., Gu, Y., & Wang, J. (2025). Enhancing Lane Change Safety and Efficiency in Autonomous Driving Through Improved Reinforcement Learning for Highway Decision-Making. Electronics, 14(5), 918. https://doi.org/10.3390/electronics14050918

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Enhancing Lane Change Safety and Efficiency in Autonomous Driving Through Improved Reinforcement Learning for Highway Decision-Making

Abstract

1. Introduction

2. Preliminaries

2.1. Reinforcement Learning

2.2. DQN

2.3. Dueling DQN

3. Highway Decision-Making Problem

3.1. Problem Formulation

3.2. Huber-Regularized Reward-Threshold Adaptive DDQN

4. Case Studies

4.1. Experimental Setup

4.1.1. Simulator

4.1.2. Parameter Setting

4.1.3. Evaluation Metric

4.2. Case 1: Comprehensive Performance Comparison

4.2.1. Key Metrics

4.2.2. Number of Lane Changes

4.2.3. Control Actions

4.2.4. Q-Value Errors

4.2.5. Statistical Analysis

4.2.6. Ablation Study

4.3. Case 2: Ego Vehicle Lane Change Process and Safety Analysis

4.4. Case 3: Impact of Key Innovations on Collision Avoidance

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI