1. Introduction
Autonomous driving technology has broad application prospects owing to its advantages in safety, efficiency, energy saving, and environmental protection. As the technology matures, autonomous driving is expected to reach large-scale commercial deployment and reshape how people travel [1,2,3,4]. However, in real, complex traffic environments, formulating reliable decision-making and control strategies for autonomous vehicles remains a difficult technical problem.
As a key application scenario of fully autonomous driving, vehicle merging in a highway merging area refers to the process in which CAVs spontaneously coordinate to complete merging without hard constraints such as traffic control [5,6]. In this setting, the ego vehicle needs to interact with surrounding vehicles, negotiate the right of way in the merging area, and minimize its impact on merging vehicles. This poses several difficulties and challenges for autonomous driving. First, the autonomous driving system must judge the behavior of surrounding vehicles and predict their likely next actions [7]. This game is not merely a matter of computing physical positions; it also involves both “cooperation” and “competition” between vehicles [8,9]. An autonomous vehicle needs to seize the right opportunity to complete the lane change without intruding on the space of other vehicles, and the uncertainty of other vehicles’ behavior further increases the complexity of the game. Second, safety must be prioritized when merging, and behaviors that may cause accidents, such as sudden acceleration and deceleration, should be avoided as much as possible. At the same time, an overly conservative driving style, such as frequent waiting or failure to merge smoothly, reduces passenger comfort [10]. Finding a balance that ensures safety while providing a smooth driving experience is therefore a key issue for autonomous driving systems. Third, autonomous vehicles often focus only on their own merging needs and ignore the smoothness of the overall traffic system, which may cause congestion and reduce traffic efficiency [11]. Autonomous vehicles must therefore balance individual and group interests: each vehicle must not only find the right time to merge for itself but also “cooperate” to maintain smooth traffic flow [12].
To address these problems, researchers have proposed a number of cooperative driving control strategies, which fall mainly into rule-based, planning-based [13], and learning-based strategies. Rule-based strategies use pre-defined rules and heuristic algorithms to allocate vehicle right of way within a very short time. Meng et al. [14] compared the “temporary negotiation” and “planning” cooperative driving strategies at non-signalized intersections and found that the main difference lies in how the order in which vehicles pass through the intersection is determined, and that both methods often find only locally optimal solutions. Liu et al. [15] used a local conflict graph to determine the passing order and introduced a distributed conflict resolution mechanism to reduce vehicle delay and improve intersection throughput. In most cases, however, such strategies adopt a first-in-first-out rule, which is very rigid when facing unknown situations or emergencies [16].
In multi-vehicle collaborative driving, optimization-based (planning-based) strategies seek the optimal solution for the overall system through mathematical models and optimization algorithms. Lu et al. [17] proposed a dynamically adjustable game model to resolve conflicts between vehicles, mainly by designing a game entry mechanism and re-planning the game sequence, and constructed a personalized payoff function that incorporates driving efficiency, safety [18], and comfort. Pei et al. [19] applied dynamic programming to cooperative decision-making at intersections: they constructed a small-scale state space to describe the solution space of a large-scale planning problem and then searched for the global optimum step by step through dynamic programming. Optimization-based strategies can account for the global benefit of the entire system, but they rely on accurate mathematical models and require real-time global optimization at every time step. When a large number of vehicles is involved, the computation time and resource consumption of the optimization algorithm increase significantly [20].
Depending on whether the data are labeled, learning-based strategies can be divided into supervised learning and reinforcement learning. Supervised learning takes driving behavior recorded in real scenes as input, and the model learns to output the corresponding action for a given input; its performance depends heavily on the quality of the labeled training data, and it is applicable only to deterministic scenarios. In reinforcement learning, each vehicle learns a policy from the reward or penalty signals it receives while interacting with the environment and repeatedly trying different actions [21]. Building on reinforcement learning, deep neural networks are used to represent the state and action spaces, addressing the poor performance of traditional reinforcement learning in high-dimensional, complex environments. Ye et al. [22] designed an integrated framework based on DRL and the VISSIM simulation platform, using the DDPG algorithm to handle continuous state and action spaces; they explored the impact of different reward functions on training and proposed a regularized reward function to improve convergence and stability. Their experiments show that, compared with a traditional adaptive cruise control (ACC) model, the DRL model greatly improves the average vehicle speed. Wang et al. [23] proposed a reinforcement learning model based on a deep Q-network (DQN), in which the vehicle learns lane-changing decisions by observing the surrounding traffic and the reward function considers both individual efficiency and overall traffic efficiency to achieve harmonious lane changing. Compared with rule-based and planning-based strategies, reinforcement learning can improve the decision-making capability of autonomous vehicles to a certain extent [24]. However, most of the above studies are designed for single-vehicle driving and treat background vehicles as obstacles during decision-making; there is no interaction between the ego vehicle and the background vehicles, which is inconsistent with real driving scenarios.
To address this problem, multi-vehicle cooperative driving has emerged. It is based on information sharing and decision interaction between vehicles and aims to achieve collective intelligent decision-making that improves traffic safety and efficiency, with multi-agent reinforcement learning (MARL) as the main approach. Chen et al. [25] modeled the two-vehicle cooperative driving problem as a decentralized partially observable Markov decision process (Dec-POMDP), adopted the QMIX (Monotonic Value Function Factorization for Deep Multi-Agent Reinforcement Learning) algorithm, and minimized the difference between the contribution-value distributions of the two vehicles to ensure that they adopt fair, cooperative strategies. Chen et al. [26] proposed an efficient and scalable MARL framework that uses parameter sharing and local reward mechanisms to enhance cooperation between agents, and designed a priority-based safety supervision mechanism that significantly reduces the collision rate during training and improves training efficiency. However, as the number of vehicles increases, such methods suffer from the curse of dimensionality. In addition, value-decomposition methods such as QMIX are subject to additivity and monotonicity constraints [27] and cannot decompose action-value functions that violate these constraints, which makes them difficult to apply in practical scenarios.
Therefore, for the highway on-ramp merging scenario, this paper proposes a novel decentralized MARL framework. It combines global and local rewards to balance the interests of individual vehicles against overall traffic efficiency, and it uses centralized training with decentralized execution, parameter sharing, and other strategies to accelerate model training. The main contributions of this paper are summarized as follows.
- (1) In the context of fully automated driving, the merging problem on highway on-ramps is formulated as a partially observable Markov decision process, and a decentralized MARL framework is proposed.
- (2) In our framework, centralized training is used to share global information, decentralized execution preserves decision independence, and grouping and partial parameter sharing accelerate model training. A hybrid reward that combines local and global rewards ensures efficient, smooth driving of each vehicle while keeping the traffic flow safe and efficient.
- (3) Experimental results show that our approach outperforms several state-of-the-art methods in terms of safety, traffic efficiency, and comfort.
The rest of this paper is organized as follows. Section 2 briefly introduces the basics and features of RL and MARL. Section 3 describes the MARL framework used in this paper in detail. Section 4 presents the experiments, results, and discussion. Section 5 concludes the paper and discusses future work.
2. Materials and Methods
This section introduces the algorithms involved in our work, including the actor–critic framework, centralized training and decentralized execution (CTDE), and the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm [28].
2.1. Actor–Critic Framework
In MARL, agents learn how to take optimal actions in different states so as to maximize cumulative returns by interacting with the environment. There are three main families of methods: policy-based, value-based, and actor–critic. Policy-based architectures directly optimize the policy itself; they suit complex continuous control tasks and explore well, but policy gradient methods tend to have large variance, which weakens training stability, and they have difficulty capturing the complex interactions between agents. Value-based architectures have high sample efficiency and can train stably in discrete action spaces, but their exploration efficiency is low in high-dimensional scenarios [29]. The actor–critic architecture therefore emerged, combining the advantages of policy gradients and value-function approximation [30]. It uses an “actor” network to select actions and a “critic” network to evaluate the value of those actions, improving learning efficiency and performance [31]. This framework performs well in many complex tasks, especially for continuous action spaces in high-dimensional environments, with representative algorithms such as A3C (Asynchronous Advantage Actor–Critic) and DDPG [32].
As shown in Figure 1, the actor is responsible for determining the action that the agent should take in a given state. It directly learns and optimizes a parameterized policy $\pi_\theta(a \mid s)$, i.e., a probability distribution over actions (or a deterministic action) for selecting an action in state $s$. The actor improves its policy by maximizing the expected cumulative reward, typically using a policy gradient method to update the parameters $\theta$.
After the actor learns the policy $\pi_\theta$, the critic module learns the action value function $Q^{\pi}(s, a)$ to evaluate the value of the current state. The agent selects actions and interacts with the environment according to the policy $\pi_\theta$ provided by the actor, observes the next state $s_{t+1}$, and obtains the reward $r_t$. The critic obtains the action value function as

$$Q^{\pi}\left(s_t, a_t\right) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t, a_t\right] \tag{1}$$
The task of the critic is to evaluate the quality of the current policy, which is achieved by minimizing the estimation error of the value function. This error is the temporal-difference term shown in Equation (2):

$$\delta_t = r_t + \gamma V^{\pi}\left(s_{t+1}\right) - V^{\pi}\left(s_t\right) \tag{2}$$

where $V^{\pi}(s)$ denotes the average value obtained by performing all possible actions according to the policy $\pi_\theta$ in state $s$, and $\gamma$ is a discount factor used to weigh the importance of future returns. The difference $\delta_t$ indicates the advantage of the selected action over the average behavior of the policy in the current state: if it is positive, the action is better than the average level; if it is negative, it is below the average level. Therefore, the advantage is often used as the basis for adjusting the policy parameters in the actor and the value-function parameters in the critic.
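To make the actor–critic interplay concrete, the following minimal sketch (an illustration in PyTorch, not the authors' implementation; network sizes and hyperparameters are assumed) shows an actor that outputs a softmax policy, a critic that estimates the state value, and an update driven by the TD-error advantage of Equation (2).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Maps a state to a probability distribution over discrete actions."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, action_dim))

    def forward(self, state):
        return F.softmax(self.net(state), dim=-1)  # pi_theta(a|s)

class Critic(nn.Module):
    """Estimates the state value V(s) used to form the advantage."""
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, state):
        return self.net(state).squeeze(-1)  # V(s)

def actor_critic_step(actor, critic, opt_a, opt_c, s, a, r, s_next, gamma=0.99):
    # TD error as in Equation (2): delta = r + gamma * V(s') - V(s)
    v_s, v_next = critic(s), critic(s_next).detach()
    delta = r + gamma * v_next - v_s

    # Critic: minimize the squared TD error (value estimation error).
    critic_loss = delta.pow(2).mean()
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

    # Actor: policy gradient weighted by the (detached) advantage estimate.
    log_prob = torch.log(actor(s).gather(-1, a.unsqueeze(-1)).squeeze(-1) + 1e-8)
    actor_loss = -(log_prob * delta.detach()).mean()
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
```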
2.2. Centralized Training and Decentralized Execution (CTDE)
CTDE is an important framework in MARL for handling collaboration and competition among multiple agents. In multi-vehicle collaborative driving, each CAV can only obtain environmental information in its vicinity and cannot observe the global state of the entire road. CTDE is therefore well suited to this setting, in which training can exploit centralized information while execution must remain decentralized.
During the training process, all CAVs can access global information, such as the complete state of the environment, the actions of other agents, rewards, etc., to promote cooperation or confrontation between CAVs and better optimize strategies. In the execution phase, each CAV can only make decisions based on its own local observations to ensure decentralized organization in practical applications.
To make CTDE suitable for multi-agent environments, it is crucial to properly handle the relationship between individuals and the group. Recent actor–critic-based multi-agent reinforcement learning methods, such as MADDPG and MAPPO, use global information to train the critic network during the training phase to evaluate individual performance, as shown in Formula (3):

$$Q_i^{\pi}\!\left(s, a_1, \ldots, a_N\right), \quad i = 1, \ldots, N \tag{3}$$

where $s$ is the global state and $a_1, \ldots, a_N$ are the actions of all $N$ agents.
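As an illustration of this structure (a minimal sketch, assuming PyTorch and simple fully connected networks; shapes and names are ours, not the paper's), the actor below consumes only the agent's local observation, while the critic consumes the global state concatenated with the joint action of all agents.

```python
import torch
import torch.nn as nn

class DecentralizedActor(nn.Module):
    """Execution-time policy: sees only the agent's local observation."""
    def __init__(self, obs_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, action_dim))

    def forward(self, local_obs):                      # (batch, obs_dim)
        return torch.softmax(self.net(local_obs), -1)  # action probabilities

class CentralizedCritic(nn.Module):
    """Training-time critic: sees the global state and all agents' actions."""
    def __init__(self, global_state_dim, joint_action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(global_state_dim + joint_action_dim, 256), nn.ReLU(),
            nn.Linear(256, 1))

    def forward(self, global_state, joint_actions):    # Q_i(s, a_1, ..., a_N) as in Formula (3)
        return self.net(torch.cat([global_state, joint_actions], dim=-1))
```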
2.3. Multi-Agent Deep Deterministic Policy Gradient (MADDPG) Algorithm
MADDPG is an extension of DDPG in MARL; it integrates the ideas of CTDE. Each agent updates its own strategy (actor) and value function (critic) through the actor–critic method. During the training process, since the actions of multiple agents will affect each other, the critic module obtains global information to more accurately evaluate the behavior of each agent and estimate the value function. In the execution phase, the actor network can only make decisions based on the local observations of the agent, ensuring the independence of the decision.
When there are $N$ agents in the environment, each agent $i$ has a local observation $o_i^t$ at time $t$ and selects an action $a_i^t$ through its policy $\mu_{\theta_i}$. The joint action of all agents is $a^t = \left(a_1^t, \ldots, a_N^t\right)$ and the joint state is $s^t$. During training, the critic network is used to estimate the Q-value function $Q_i\left(s^t, a_1^t, \ldots, a_N^t\right)$ of each agent, that is, the expected cumulative reward that agent $i$ can obtain in the future after performing an action in the given state.
To update the parameters of the critic network, MADDPG uses the temporal-difference (TD) error to measure the difference between the current Q value and the target return, and it updates the critic parameters with the goal of minimizing this TD error.
The goal of the actor network is to maximize the expected cumulative reward. The policy is updated with the policy gradient method, the aim being to select actions that maximize the Q value estimated by the critic. The update goal of the actor network is

$$\nabla_{\theta_i} J\left(\mu_{\theta_i}\right) = \mathbb{E}\!\left[\nabla_{\theta_i}\, \mu_{\theta_i}\!\left(o_i\right)\, \nabla_{a_i} Q_i\!\left(s, a_1, \ldots, a_N\right)\Big|_{a_i = \mu_{\theta_i}(o_i)}\right]$$

where $\nabla_{a_i} Q_i\left(s, a_1, \ldots, a_N\right)$ is the gradient of the Q value with respect to the action $a_i$, indicating how the action affects the agent's long-term reward, and $\nabla_{\theta_i}\mu_{\theta_i}\left(o_i\right)$ indicates how the policy parameters affect the action selected by the agent through the policy network $\mu_{\theta_i}$. MADDPG interleaves learning an approximately optimal Q function with learning the optimal action, which addresses both the difficulty of representing a continuous state space and the difficulty of evaluating a continuous action space.
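The sketch below illustrates these two updates (a simplified, illustrative PyTorch fragment under our own naming; replay buffers, target-network updates, and exploration noise are omitted). The `actor_i`/`critic_i` modules are assumed to follow the centralized-critic, decentralized-actor structure shown earlier.

```python
import torch
import torch.nn.functional as F

def maddpg_update(actor_i, critic_i, target_critic_i, target_actors,
                  opt_actor, opt_critic, batch, agent_idx, gamma=0.95):
    """One MADDPG-style update for agent i on a sampled mini-batch."""
    obs, actions, rewards, next_obs, global_s, next_global_s = batch

    # --- Critic update: minimize the TD error against the target return ---
    with torch.no_grad():
        next_actions = torch.cat(
            [ta(o) for ta, o in zip(target_actors, next_obs)], dim=-1)
        y = rewards[agent_idx] + gamma * target_critic_i(next_global_s, next_actions)
    q = critic_i(global_s, torch.cat(actions, dim=-1))
    critic_loss = F.mse_loss(q, y)
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    # --- Actor update: ascend the Q value of agent i's own action ---
    new_action_i = actor_i(obs[agent_idx])
    joint = torch.cat(
        [new_action_i if j == agent_idx else actions[j].detach()
         for j in range(len(actions))], dim=-1)
    actor_loss = -critic_i(global_s, joint).mean()   # only actor parameters are stepped
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
```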
3. Methodology
The traffic scenario is a merging area on a highway, which consists of four parts: the section before the merging area, the acceleration area, the merging area, and the section after the merging area. The main road contains two lanes and allows lane changes. The acceleration lane brings vehicles entering from the ramp up to a speed close to that of the main-lane traffic. The merging lane is the area where the acceleration lane and the main lane gradually merge, and vehicles complete the merge and enter the main lane there. The lengths of the four areas are 120 m, 80 m, 80 m, and 120 m, respectively, the lane width is 3.75 m, and each vehicle measures 5 m × 2 m, as shown in Figure 2. The environment is built on Gymnasium 0.26.3 [33] (the maintained successor of OpenAI Gym) and is implemented in Python.
In our proposed multi-vehicle cooperative decision-making, each vehicle exchanges information with others within 100 m via V2V communication and makes its own decision. Meanwhile, the central platform exchanges information with the vehicles via V2I communication to collect the traffic conditions in the merging area and to evaluate the decision-making behavior of these vehicles.
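For reference, the scenario geometry described above can be captured with a few constants (an illustrative Python sketch with names chosen here, not the paper's code):

```python
# Road geometry of the merging scenario (all lengths in meters).
UPSTREAM_LEN = 120.0      # section before the merging area
ACCEL_LEN = 80.0          # acceleration area
MERGE_LEN = 80.0          # merging area
DOWNSTREAM_LEN = 120.0    # section after the merging area
LANE_WIDTH = 3.75
VEHICLE_LENGTH, VEHICLE_WIDTH = 5.0, 2.0
V2V_RANGE = 100.0         # communication/observation radius

def road_segment(x):
    """Return which longitudinal segment a position x (measured from the scenario start) lies in."""
    segments = [("upstream", UPSTREAM_LEN), ("acceleration", ACCEL_LEN),
                ("merging", MERGE_LEN), ("downstream", DOWNSTREAM_LEN)]
    for name, length in segments:
        if x <= length:
            return name
        x -= length
    return "off-road"
```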
3.1. Problem Description
This paper focuses on the high-level decision-making behaviors of mainline vehicles and ramp vehicles. As mentioned above, we model the multi-vehicle collaborative decision-making problem as a multi-agent Markov decision process (Multi-Agent MDP). It is defined by the tuple $\langle S, O, A, P, R, \gamma, N \rangle$, where $S$ is the global state space, describing the state information of the entire environment; $O = \{O_1, \ldots, O_N\}$ is the set of local observations, each agent observing only part of the environment; $A = \{A_1, \ldots, A_N\}$ is the set of action spaces, with $A_i$ representing all possible actions that agent $i$ can take; $P\left(s' \mid s, a\right)$ is the joint state transition probability, i.e., the probability that the system moves to the next state $s'$ after all agents take the joint action $a$ in state $s$; $R = \{R_1, \ldots, R_N\}$ is the set of reward functions, with $R_i$ representing the reward obtained by agent $i$ after performing its action in the global state; $\gamma$ is a discount factor used to weigh the relative importance of current and future rewards; and $N$ is the number of autonomous vehicles, each of which has its own actor and critic network.
- (1)
State space: The state space of an intelligent connected vehicle is a matrix of size $N_{o} \times F$, where $N_{o}$ is the number of vehicles that can be observed within the observation range of the current vehicle (ego vehicle) and $F$ is the number of observed vehicle features, mainly including the following:
- ➀
Present—Indicates whether there is an observed vehicle near the current vehicle, represented by a 0–1 variable.
- ➁
X—The longitudinal position of the observed vehicle from the current vehicle.
- ➂
Y—The lateral position of the observed vehicle from the current vehicle.
- ➃
Vx—The longitudinal velocity of the observed vehicle relative to the current vehicle.
- ➄
Vy—The lateral velocity of the observed vehicle relative to the current vehicle.
- ➅
Heading—The heading angle of the observed vehicle.
In this paper, since the ego vehicle can only observe nearby vehicles (the vehicles in front of and behind it, and those in front of and behind it in the left and right lanes, 7 vehicles in total), and only those within 100 m of it, the decision-making of a single autonomous vehicle can be modeled as a partially observable Markov decision process. A minimal sketch of how this observation matrix can be assembled is given after this list.
- (2)
Action space: Action $a_i^t$ represents the driving decision of intelligent connected vehicle $i$ at time step $t$, including turning left, turning right, cruising, acceleration, and deceleration. All actions of the vehicle at each time step constitute its action set $A_i$, and the action sets of all vehicles constitute the overall action space of the system $A = A_1 \times \cdots \times A_N$. After the reinforcement learning algorithm learns the high-level decision-making behavior of the vehicle, a low-level controller generates the corresponding steering and throttle control signals to maneuver the autonomous vehicle. In this paper, the longitudinal acceleration/deceleration behavior of the vehicle is given by the intelligent driver model (IDM), and the lane-changing behavior is given by the minimizing overall braking induced by lane change (MOBIL) model [25].
- (3)
Reward function: The reward function defines the feedback a vehicle obtains after performing certain actions, which directly affects the behavior and learning strategy of the agent. In this paper, our goal is to make each CAV pass through the merging area safely and quickly while keeping the traffic flow smooth and avoiding congestion; thus, the reward function of CAV $i$ at time step $t$ is defined as a combination of an individual (local) reward and a group (global) reward, where $r_{i}^{\mathrm{local}}$ and $r^{\mathrm{global}}$ represent the individual and group rewards, respectively.
The local reward is composed of five individual assessments: a collision assessment, which imposes a high penalty on vehicle collisions; a safe-distance assessment; a fast-passing assessment, which encourages vehicles to pass through the merging zone at a relatively high speed while avoiding speeding or excessively low speeds; a smoothness assessment, which penalizes rapid acceleration and deceleration to keep driving smooth; and a lane-changing assessment, which imposes an appropriate penalty on each lane change, since frequent lane changing increases disorder in the merging zone.
In the global reward, in order to improve the speed of the overall traffic flow and to prevent excessively slow individual vehicles from degrading the efficiency of the whole flow, we use an average-speed assessment: the larger this value, the higher the traffic efficiency. A traffic-density assessment gives the group a negative incentive when the local vehicle density is too high, encouraging vehicles to keep a reasonable distance from each other and avoid congestion. The architecture of our MARL framework is shown in Figure 3.
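As referenced in the state-space description above, the following sketch shows one way to assemble the per-vehicle observation matrix from the listed features (an illustrative Python/NumPy fragment with assumed field names; the actual environment implementation may differ):

```python
import numpy as np

FEATURES = ["present", "x", "y", "vx", "vy", "heading"]  # features per observed vehicle
MAX_OBSERVED = 7                                          # observed vehicles per ego vehicle
OBS_RANGE = 100.0                                         # meters

def build_observation(ego, neighbors):
    """Stack nearby vehicles into an (MAX_OBSERVED x len(FEATURES)) matrix.

    `ego` and each element of `neighbors` are assumed to expose absolute
    position (x, y), velocity (vx, vy), and heading attributes.
    """
    obs = np.zeros((MAX_OBSERVED, len(FEATURES)), dtype=np.float32)
    in_range = [v for v in neighbors if abs(v.x - ego.x) <= OBS_RANGE]
    for row, v in enumerate(in_range[:MAX_OBSERVED]):
        obs[row] = [1.0,                 # present flag
                    v.x - ego.x,         # relative longitudinal position
                    v.y - ego.y,         # relative lateral position
                    v.vx - ego.vx,       # relative longitudinal velocity
                    v.vy - ego.vy,       # relative lateral velocity
                    v.heading]           # heading angle
    return obs
```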
3.2. Balance Between Individual and Group Benefits
In highway merging scenarios, road resources are limited and each vehicle has its own goals. For example, ramp vehicles need to merge into the main road as quickly and safely as possible, while main-road vehicles want to maintain their speed without being interrupted. To achieve their respective goals, the behaviors of the vehicles are inevitably competitive. At the same time, in order to ensure the safety and smoothness of traffic, vehicles must also cooperate to a certain extent [34].
In the reinforcement learning of self-driving cars, each car only considers its surrounding environment. This method can encourage the car to pursue its own optimal strategy and improve its local performance. However, since each agent only focuses on its own short-term interests, the local reward mechanism may lead to “selfish” behavior, which in turn leads to conflicts or incoordination. In contrast, the global reward mechanism can guide the agent to make decisions from the global perspective of the system and avoid conflicts between individuals. However, the impact of a single agent’s behavior on the global reward is relatively indirect and weak, resulting in the sparsity of the reward signal, and it is difficult for the agent to accurately evaluate the contribution of its behavior to the overall system.
To overcome the shortcomings of purely local and purely global reward mechanisms, a hybrid reward mechanism is introduced that combines the advantages of both. Hybrid rewards take into account the behavioral performance of individual agents while reflecting the goals of the overall system. In the merging area, for example, hybrid rewards not only focus on factors such as the speed and spacing of individual vehicles but also encourage them to optimize traffic flow through coordinated actions. By reasonably adjusting the weights of the local and global rewards, the hybrid reward balances individual optimization against system optimization and encourages agents to take the overall efficiency and safety of the system into account while safeguarding their own interests. The hybrid reward of agent $i$ in this scenario is therefore given by Equation (8):

$$r_i^t = \omega_1\, r_{i}^{\mathrm{local},t} + \omega_2\, r^{\mathrm{global},t} \tag{8}$$

By adjusting the weights $\omega_1$ and $\omega_2$, the priority of individual interests versus group interests can be tuned dynamically: a larger $\omega_1$ places more emphasis on the local performance of the individual, while a larger $\omega_2$ makes the agent pay more attention to the overall performance of the system.
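A minimal sketch of this hybrid reward follows (illustrative Python; the component names and weight values are placeholders, not the paper's exact definitions):

```python
def hybrid_reward(local_terms, global_terms, w_local=0.6, w_global=0.4):
    """Combine per-vehicle and traffic-level reward terms as in Equation (8).

    `local_terms`  : dict with collision, safe-distance, fast-passing,
                     smoothness, and lane-change assessments for one CAV.
    `global_terms` : dict with average-speed and traffic-density assessments.
    The weights w_local / w_global play the role of omega_1 / omega_2.
    """
    r_local = sum(local_terms.values())
    r_global = sum(global_terms.values())
    return w_local * r_local + w_global * r_global

# Example usage with placeholder component values:
r = hybrid_reward(
    {"collision": 0.0, "safe_distance": 0.2, "fast_pass": 0.5,
     "smoothness": -0.1, "lane_change": -0.05},
    {"avg_speed": 0.4, "density": -0.2})
```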
In order to balance individual and group interests, each CAV has its own policy network $\pi_{\theta_i}$ and value-assessment network $Q_{\phi_i}$ and selects actions $a_i$ based on its local observation $o_i$. The core goal of the policy network is to learn, for each CAV, a strategy that maximizes the cumulative reward in a complex environment:

$$J\left(\theta_i\right) = \mathbb{E}_{\pi_{\theta_i}}\!\left[\sum_{t=0}^{T} \gamma^{t} r_i^{t}\right]$$

MADDPG uses the policy gradient method to optimize the parameters of the policy network. Since the Q network provides the policy network with value information about the actions, the policy network is updated by maximizing the Q value. To optimize the policy network, the policy gradient method is used to compute the gradient with respect to the parameters. The gradient of the policy network is as follows:

$$\nabla_{\theta_i} J\left(\theta_i\right) = \mathbb{E}\!\left[\nabla_{\theta_i} \log \pi_{\theta_i}\!\left(a_i \mid o_i\right)\, Q_{\phi_i}\!\left(s, a_1, \ldots, a_N\right)\right]$$

Here, $\nabla_{\theta_i} \log \pi_{\theta_i}\left(a_i \mid o_i\right)$ indicates how changing the parameters of the policy network adjusts the selection probability of each action, and the action value function $Q_{\phi_i}\left(s, a_1, \ldots, a_N\right)$ guides the gradient update of the policy network, indicating the contribution of each action selection to the future cumulative return under the current policy.
Although the actor network makes decisions based on local information, it learns how to adjust its strategy from the global feedback of the critic network. The critic's task is to estimate the Q value of each state–action pair, and its main goal is to minimize the prediction error of the Q value. Based on the Bellman equation, the critic network gradually learns the correct Q value of each state–action pair by minimizing the error between the target Q value and the predicted Q value, that is,

$$L\left(\phi_i\right) = \mathbb{E}\!\left[\left(r_i + \gamma\, Q_{\phi_i}'\!\left(s', a_1', \ldots, a_N'\right) - Q_{\phi_i}\!\left(s, a_1, \ldots, a_N\right)\right)^{2}\right]$$

where $Q_{\phi_i}'$ denotes the target critic network and $s'$ and $a_j'$ are the next state and the corresponding next actions. This Q value reflects the impact of each agent's actions on the overall system under the global state, and the strategy update of each individual agent is adjusted on the basis of this value.
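Putting the two updates of this framework together (a compact, illustrative PyTorch fragment consistent with the equations above; variable names and the softmax actor follow our earlier sketches, not a verbatim implementation):

```python
import torch

def agent_losses(actor_i, critic_i, target_critic_i,
                 obs_i, action_i, reward_i, global_s, joint_a,
                 next_global_s, next_joint_a, gamma=0.99):
    """Per-agent critic (TD) loss and actor (policy-gradient) loss."""
    # Critic: squared error between the predicted Q and the Bellman target.
    with torch.no_grad():
        target = reward_i + gamma * target_critic_i(next_global_s, next_joint_a)
    q = critic_i(global_s, joint_a)
    critic_loss = (q - target).pow(2).mean()

    # Actor: log-probability of the taken action weighted by the centralized Q.
    probs = actor_i(obs_i)                                  # softmax policy pi_theta(a|o)
    log_prob = torch.log(probs.gather(-1, action_i) + 1e-8)
    actor_loss = -(log_prob * q.detach()).mean()
    return critic_loss, actor_loss
```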
3.3. Accelerating the Training Process
In the merging area of the highway, vehicle behavior exhibits a complex mix of competition and cooperation. Ramp vehicles want to merge into the main road quickly and safely, while main-road vehicles want to maintain their own speed and trajectory. This conflict leads to competition for resources, especially when the traffic volume is large. At the same time, to ensure traffic safety, main-road and ramp vehicles must also cooperate to a certain extent; for example, main-road vehicles should slow down or change lanes appropriately to provide merging space for ramp vehicles, and vehicles need to coordinate speeds and maintain reasonable gaps to jointly keep the traffic flow stable and avoid conflicts. To balance individual and global interests under this competition and cooperation, traditional MADDPG equips each vehicle with an independent actor and critic network so that it can learn the optimal strategy from its own observations. However, as the number of agents increases, training multiple independent actor–critic networks greatly enlarges the search space, increasing computational cost and training complexity.
In general, vehicles from the same source usually face similar environments and tasks, so their network parameters can be shared to reduce redundant computation. Specifically, multiple vehicles share the first few layers of the actor–critic network, which handle low-level feature extraction from the environment, such as the vehicle's speed, position, and distance to surrounding vehicles, while the high-level decision-making part of each vehicle remains independent so that it can make personalized decisions based on its specific local environment. The structure of the neural network is shown in Figure 4. For the actor network, the input is the information of the five vehicles within the surrounding observation range. The first layer consists of 256 neurons, which are shared by vehicles from the same source; the second layer consists of 128 neurons and corresponds to the high-level decision of each vehicle. A softmax output layer produces the probability distribution over the vehicle's 5 actions.
The critic network takes the global observation and the agents' action information as input; its fully connected layers have the same neuron structure as the actor network, except that the output layer produces a single value, namely the state–action value Q. The architecture of the proposed network is shown in Figure 4.
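The sketch below illustrates this partial parameter sharing (PyTorch, assuming the layer sizes stated above; the grouping of "vehicles from the same source" into a shared trunk is our simplified reading, not the exact implementation):

```python
import torch
import torch.nn as nn

class SharedTrunkActor(nn.Module):
    """Actor with a shared low-level feature layer and per-vehicle decision heads."""
    def __init__(self, obs_dim, n_actions=5, n_vehicles_in_group=4):
        super().__init__()
        # First layer (256 units) shared by all vehicles from the same source.
        self.shared = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU())
        # Second layer (128 units) kept independent for each vehicle's decision.
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(256, 128), nn.ReLU(),
                          nn.Linear(128, n_actions))
            for _ in range(n_vehicles_in_group)])

    def forward(self, obs, vehicle_idx):
        features = self.shared(obs)                 # shared low-level feature extraction
        logits = self.heads[vehicle_idx](features)  # per-vehicle high-level decision head
        return torch.softmax(logits, dim=-1)        # probability distribution over 5 actions
```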
4. Experiments and Results
This section describes the design and results of the experiments. We set up three groups of experiments according to road traffic density, with the following numbers of CAVs:
- (1)
Low density: 6–10 CAVs;
- (2)
Medium density: 9–13 CAVs;
- (3)
High density: 12–16 CAVs.
Vehicles are randomly generated in the section before the merging area, with two-thirds of the CAVs generated on the main line and the rest on the ramp. The initial speed of main-line vehicles is drawn randomly from 25 to 27 m/s, the initial speed of ramp vehicles from 12 to 15 m/s, and the vehicle decision frequency is 10 Hz. The average reward is calculated every 200 training cycles, and the other network parameters are listed in Table 1. The experiments were conducted on a Windows 10 platform with an NVIDIA GeForce RTX 3060 Ti GPU and 64 GB of memory, and the programming language is Python.
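For clarity, the experimental settings above can be summarized as a configuration dictionary (illustrative Python; the key names are ours, and Table 1's network hyperparameters are not reproduced here):

```python
EXPERIMENT_CONFIG = {
    "traffic_density": {             # number of CAVs per episode
        "low": (6, 10),
        "medium": (9, 13),
        "high": (12, 16),
    },
    "mainline_fraction": 2 / 3,      # share of CAVs spawned on the main line
    "initial_speed_mps": {
        "mainline": (25.0, 27.0),
        "ramp": (12.0, 15.0),
    },
    "decision_frequency_hz": 10,
    "reward_logging_interval": 200,  # training cycles between average-reward reports
}
```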
4.1. Study of the Learning Curve
In this subsection, using the average reward as the indicator, the proposed method is compared with baseline methods under the same conditions. Specifically, we design two sets of comparisons: the first examines the overall learning curves under traffic flows of different densities, showing the convergence efficiency and effectiveness of the network; the second compares different methods to show the superiority of ours in performing cooperative driving tasks in merging areas.
Figure 5 compares the results of our method and the QMIX algorithm under the three levels of traffic density. In the low-density case, both algorithms converge to a relatively stable reward level in a relatively short time, but the QMIX results still show larger volatility. In the medium-density case, our algorithm converges noticeably faster than QMIX; in the first 20 episodes in particular, our reward value increases rapidly while QMIX lags behind, and our final reward is clearly higher, with smaller fluctuations and more stable performance. The high-density scenario shows a similar pattern. In summary, the proposed algorithm demonstrates strong adaptability and strategy-learning capability under different traffic flows.
Comparing across densities, both algorithms converge fastest and most stably in the low-density scenario. In the medium- and high-density scenarios they still perform well, but convergence is slower and fluctuations increase slightly. This is mainly because, as the environment becomes more complex, the task becomes harder and the agents need more exploration and learning, so the learning process fluctuates more.
4.2. Traffic Efficiency
In this subsection, we use the average speed of vehicles in the merging area as the indicator. When the traffic density is not high, road resources are relatively abundant and conflicts between vehicles are not pronounced, so the average vehicle speeds under different methods do not differ noticeably. We therefore compare the average vehicle speed only in the high-density traffic environment, as shown in Figure 6.
It can be seen that when there are more vehicles on the road, our proposed method raises the average vehicle speed from 22.5 m/s to 23.5 m/s compared with the QMIX method. In addition, our algorithm converges noticeably faster than QMIX and fluctuates over a smaller range, making it more effective at improving traffic efficiency.