1. Introduction
Autonomous driving technology has broad application prospects owing to its advantages in safety, efficiency, energy saving, and environmental protection. As the technology matures, autonomous driving is expected to reach large-scale commercial deployment and reshape how people travel [1,2,3,4]. However, in real, complex traffic environments, formulating reliable decision-making and control strategies for autonomous vehicles remains a difficult technical problem.
As a key application scenario of fully autonomous driving, vehicle merging in a highway merging area refers to the process in which CAVs spontaneously coordinate to complete merging without hard constraints such as traffic control [5,6]. In this setting, the ego vehicle needs to interact with surrounding vehicles, negotiate the right of way in the merging area, and minimize its impact on merging vehicles. This poses several difficulties and challenges for autonomous driving. First, the autonomous driving system must judge the behavior of surrounding vehicles and predict their likely next actions [7]. This game is not merely a matter of computing physical positions; it also involves both “cooperation” and “competition” between vehicles [8,9]. An autonomous vehicle needs to seize the right opportunity to complete the lane change without intruding on the space of other vehicles, and the uncertainty of other vehicles’ behavior further increases the complexity of the game. Second, safety must be prioritized when merging, and behaviors that may cause accidents, such as sudden acceleration and deceleration, should be avoided as much as possible. At the same time, an overly conservative driving style, such as frequent waiting or failure to merge smoothly, reduces passenger comfort [10]. Finding a balance that ensures safety while providing a smooth driving experience is therefore a key issue for autonomous driving systems. Third, autonomous vehicles often focus only on their own merging needs and ignore the smoothness of the overall traffic system, which may cause congestion and reduce traffic efficiency [11]. Autonomous vehicles must therefore balance individual and group interests: each vehicle must not only find the right time to merge for itself but also “cooperate” to maintain smooth traffic flow [12].
To address these problems, researchers have proposed a number of cooperative driving control strategies, which fall mainly into rule-based, planning-based [13], and learning-based strategies. Rule-based strategies use pre-defined rules and heuristic algorithms to allocate vehicle right of way within a very short time. Meng et al. [14] compared the “temporary negotiation” and “planning” cooperative driving strategies at non-signalized intersections and found that the main difference lies in how the order in which vehicles pass through the intersection is determined, and that both methods often find only locally optimal solutions. Liu et al. [15] used a local conflict graph to determine the passing order and introduced a distributed conflict resolution mechanism to reduce vehicle delay and improve intersection throughput. In most cases, however, such strategies adopt a first-in-first-out rule, which is very rigid when facing unknown situations or emergencies [16].
In multi-vehicle collaborative driving, optimization-based (planning-based) strategies seek the optimal solution for the overall system through mathematical models and optimization algorithms. Lu et al. [17] proposed a dynamically adjustable game model to resolve conflicts between vehicles, mainly by designing a game entry mechanism and re-planning the game sequence, and constructed a personalized payoff function that incorporates driving efficiency, safety [18], and comfort. Pei et al. [19] applied dynamic programming to cooperative decision-making at intersections: they constructed a small-scale state space to describe the solution space of a large-scale planning problem and then searched for the global optimum step by step through dynamic programming. Optimization-based strategies can account for the global benefit of the entire system, but they rely on accurate mathematical models and require real-time global optimization at every time step. When a large number of vehicles is involved, the computation time and resource consumption of the optimization algorithm increase significantly [20].
Depending on whether the data are labeled, learning-based strategies can be divided into supervised learning and reinforcement learning. Supervised learning takes driving behavior recorded in real scenes as input, and the model learns to output the corresponding action for a given input; its performance depends heavily on the quality of the labeled training data, and it is applicable only to deterministic scenarios. In reinforcement learning, each vehicle learns a policy from the reward or penalty signals it receives while interacting with the environment and repeatedly trying different actions [21]. Building on reinforcement learning, deep neural networks are used to represent the state and action spaces, addressing the poor performance of traditional reinforcement learning in high-dimensional, complex environments. Ye et al. [22] designed an integrated framework based on DRL and the VISSIM simulation platform, using the DDPG algorithm to handle continuous state and action spaces; they explored the impact of different reward functions on training and proposed a regularized reward function to improve convergence and stability. Their experiments show that, compared with a traditional adaptive cruise control (ACC) model, the DRL model greatly improves the average vehicle speed. Wang et al. [23] proposed a reinforcement learning model based on a deep Q-network (DQN), in which the vehicle learns lane-changing decisions by observing the surrounding traffic and the reward function considers both individual efficiency and overall traffic efficiency to achieve harmonious lane changing. Compared with rule-based and planning-based strategies, reinforcement learning can improve the decision-making capability of autonomous vehicles to a certain extent [24]. However, most of the above studies are designed for single-vehicle driving and treat background vehicles as obstacles during decision-making; there is no interaction between the ego vehicle and the background vehicles, which is inconsistent with real driving scenarios.
To address this problem, multi-vehicle cooperative driving has emerged. It is based on information sharing and decision interaction between vehicles and aims to achieve collective intelligent decision-making that improves traffic safety and efficiency, with multi-agent reinforcement learning (MARL) as the main approach. Chen et al. [25] modeled the two-vehicle cooperative driving problem as a decentralized partially observable Markov decision process (Dec-POMDP), adopted the QMIX (Monotonic Value Function Factorization for Deep Multi-Agent Reinforcement Learning) algorithm, and minimized the difference between the contribution-value distributions of the two vehicles to ensure that they adopt fair, cooperative strategies. Chen et al. [26] proposed an efficient and scalable MARL framework that uses parameter sharing and local reward mechanisms to enhance cooperation between agents, and designed a priority-based safety supervision mechanism that significantly reduces the collision rate during training and improves training efficiency. However, as the number of vehicles increases, such methods suffer from the curse of dimensionality. In addition, value-decomposition methods such as QMIX are subject to additivity and monotonicity constraints [27] and cannot decompose action-value functions that violate these constraints, which makes them difficult to apply in practical scenarios.
Therefore, for the highway on-ramp merging scenario, this paper proposes a novel decentralized MARL framework. It combines global and local rewards to balance the interests of individual vehicles against overall traffic efficiency, and it uses centralized training with decentralized execution, parameter sharing, and other strategies to accelerate model training. The main contributions of this paper are summarized as follows.
- (1) In the context of fully automated driving, the merging problem on highway on-ramps is formulated as a partially observable Markov decision process, and a decentralized MARL framework is proposed.
- (2) In our framework, centralized training is used to share global information, decentralized execution preserves decision independence, and grouping and partial parameter sharing accelerate model training. A hybrid reward that combines local and global rewards ensures efficient, smooth driving of each vehicle while keeping the traffic flow safe and efficient.
- (3) Experimental results show that our approach outperforms several state-of-the-art methods in terms of safety, traffic efficiency, and comfort.
The rest of this paper is organized as follows. Section 2 briefly introduces the basics and features of RL and MARL. Section 3 describes the MARL framework used in this paper in detail. Section 4 presents the experiments, results, and discussion. Section 5 concludes the paper and discusses future work.
2. Materials and Methods
This section introduces the algorithms involved in our work, including the actor–critic framework, centralized training and decentralized execution (CTDE), and the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm [28].
2.1. Actor–Critic Framework
In MARL, agents learn how to take optimal actions in different states so as to maximize cumulative returns by interacting with the environment. There are three main families of methods: policy-based, value-based, and actor–critic. Policy-based architectures directly optimize the policy itself; they suit complex continuous control tasks and explore well, but policy gradient methods tend to have large variance, which weakens training stability, and they have difficulty capturing the complex interactions between agents. Value-based architectures have high sample efficiency and can train stably in discrete action spaces, but their exploration efficiency is low in high-dimensional scenarios [29]. The actor–critic architecture therefore emerged, combining the advantages of policy gradients and value-function approximation [30]. It uses an “actor” network to select actions and a “critic” network to evaluate the value of those actions, improving learning efficiency and performance [31]. This framework performs well in many complex tasks, especially for continuous action spaces in high-dimensional environments, with representative algorithms such as A3C (Asynchronous Advantage Actor–Critic) and DDPG [32].
As shown in Figure 1, the actor is responsible for determining the action that the agent should take in a given state. It directly learns and optimizes a parameterized policy $\pi_\theta(a \mid s)$, i.e., a probability distribution over actions (or a deterministic action) for selecting an action in state $s$. The actor improves its policy by maximizing the expected cumulative reward, typically using a policy gradient method to update the parameters $\theta$.
After the actor learns the policy $\pi_\theta$, the critic module learns the action value function $Q^{\pi}(s, a)$ to evaluate the value of the current state. The agent selects actions and interacts with the environment according to the policy $\pi_\theta$ provided by the actor, observes the next state $s_{t+1}$, and obtains the reward $r_t$. The critic obtains the action value function as

$$Q^{\pi}\left(s_t, a_t\right) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t, a_t\right] \tag{1}$$
The task of the critic is to evaluate the quality of the current policy, which is achieved by minimizing the estimation error of the value function. This error is the temporal-difference term shown in Equation (2):

$$\delta_t = r_t + \gamma V^{\pi}\left(s_{t+1}\right) - V^{\pi}\left(s_t\right) \tag{2}$$

where $V^{\pi}(s)$ denotes the average value obtained by performing all possible actions according to the policy $\pi_\theta$ in state $s$, and $\gamma$ is a discount factor used to weigh the importance of future returns. The difference $\delta_t$ indicates the advantage of the selected action over the average behavior of the policy in the current state: if it is positive, the action is better than the average level; if it is negative, it is below the average level. Therefore, the advantage is often used as the basis for adjusting the policy parameters in the actor and the value-function parameters in the critic.
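To make the actor–critic interplay concrete, the following minimal sketch (an illustration in PyTorch, not the authors' implementation; network sizes and hyperparameters are assumed) shows an actor that outputs a softmax policy, a critic that estimates the state value, and an update driven by the TD-error advantage of Equation (2).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Maps a state to a probability distribution over discrete actions."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, action_dim))

    def forward(self, state):
        return F.softmax(self.net(state), dim=-1)  # pi_theta(a|s)

class Critic(nn.Module):
    """Estimates the state value V(s) used to form the advantage."""
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, state):
        return self.net(state).squeeze(-1)  # V(s)

def actor_critic_step(actor, critic, opt_a, opt_c, s, a, r, s_next, gamma=0.99):
    # TD error as in Equation (2): delta = r + gamma * V(s') - V(s)
    v_s, v_next = critic(s), critic(s_next).detach()
    delta = r + gamma * v_next - v_s

    # Critic: minimize the squared TD error (value estimation error).
    critic_loss = delta.pow(2).mean()
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

    # Actor: policy gradient weighted by the (detached) advantage estimate.
    log_prob = torch.log(actor(s).gather(-1, a.unsqueeze(-1)).squeeze(-1) + 1e-8)
    actor_loss = -(log_prob * delta.detach()).mean()
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
```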
2.2. Centralized Training and Decentralized Execution (CTDE)
CTDE is an important framework in MARL for handling collaboration and competition among multiple agents. In multi-vehicle collaborative driving, each CAV can only obtain environmental information in its vicinity and cannot observe the global state of the entire road. CTDE is therefore well suited to this setting, in which training can exploit centralized information while execution must remain decentralized.
During the training process, all CAVs can access global information, such as the complete state of the environment, the actions of other agents, rewards, etc., to promote cooperation or confrontation between CAVs and better optimize strategies. In the execution phase, each CAV can only make decisions based on its own local observations to ensure decentralized organization in practical applications.
To make CTDE suitable for multi-agent environments, it is crucial to properly handle the relationship between individuals and the group. Recent actor–critic-based multi-agent reinforcement learning methods, such as MADDPG and MAPPO, use global information to train the critic network during the training phase to evaluate individual performance, as shown in Formula (3):

$$Q_i^{\pi}\!\left(s, a_1, \ldots, a_N\right), \quad i = 1, \ldots, N \tag{3}$$

where $s$ is the global state and $a_1, \ldots, a_N$ are the actions of all $N$ agents.
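As an illustration of this structure (a minimal sketch, assuming PyTorch and simple fully connected networks; shapes and names are ours, not the paper's), the actor below consumes only the agent's local observation, while the critic consumes the global state concatenated with the joint action of all agents.

```python
import torch
import torch.nn as nn

class DecentralizedActor(nn.Module):
    """Execution-time policy: sees only the agent's local observation."""
    def __init__(self, obs_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, action_dim))

    def forward(self, local_obs):                      # (batch, obs_dim)
        return torch.softmax(self.net(local_obs), -1)  # action probabilities

class CentralizedCritic(nn.Module):
    """Training-time critic: sees the global state and all agents' actions."""
    def __init__(self, global_state_dim, joint_action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(global_state_dim + joint_action_dim, 256), nn.ReLU(),
            nn.Linear(256, 1))

    def forward(self, global_state, joint_actions):    # Q_i(s, a_1, ..., a_N) as in Formula (3)
        return self.net(torch.cat([global_state, joint_actions], dim=-1))
```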
2.3. Multi-Agent Deep Deterministic Policy Gradient (MADDPG) Algorithm
MADDPG is an extension of DDPG in MARL; it integrates the ideas of CTDE. Each agent updates its own strategy (actor) and value function (critic) through the actor–critic method. During the training process, since the actions of multiple agents will affect each other, the critic module obtains global information to more accurately evaluate the behavior of each agent and estimate the value function. In the execution phase, the actor network can only make decisions based on the local observations of the agent, ensuring the independence of the decision.
When there are $N$ agents in the environment, each agent $i$ has a local observation $o_i^t$ at time $t$ and selects an action $a_i^t$ through its policy $\mu_{\theta_i}$. The joint action of all agents is $a^t = \left(a_1^t, \ldots, a_N^t\right)$ and the joint state is $s^t$. During training, the critic network is used to estimate the Q-value function $Q_i\left(s^t, a_1^t, \ldots, a_N^t\right)$ of each agent, that is, the expected cumulative reward that agent $i$ can obtain in the future after performing an action in the given state.
To update the parameters of the critic network, MADDPG uses the temporal-difference (TD) error to measure the difference between the current Q value and the target return, and it updates the critic parameters with the goal of minimizing this TD error.
The goal of the actor network is to maximize the expected cumulative reward. The policy is updated with the policy gradient method, the aim being to select actions that maximize the Q value estimated by the critic. The update goal of the actor network is

$$\nabla_{\theta_i} J\left(\mu_{\theta_i}\right) = \mathbb{E}\!\left[\nabla_{\theta_i}\, \mu_{\theta_i}\!\left(o_i\right)\, \nabla_{a_i} Q_i\!\left(s, a_1, \ldots, a_N\right)\Big|_{a_i = \mu_{\theta_i}(o_i)}\right]$$

where $\nabla_{a_i} Q_i\left(s, a_1, \ldots, a_N\right)$ is the gradient of the Q value with respect to the action $a_i$, indicating how the action affects the agent's long-term reward, and $\nabla_{\theta_i}\mu_{\theta_i}\left(o_i\right)$ indicates how the policy parameters affect the action selected by the agent through the policy network $\mu_{\theta_i}$. MADDPG interleaves learning an approximately optimal Q function with learning the optimal action, which addresses both the difficulty of representing a continuous state space and the difficulty of evaluating a continuous action space.
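The sketch below illustrates these two updates (a simplified, illustrative PyTorch fragment under our own naming; replay buffers, target-network updates, and exploration noise are omitted). The `actor_i`/`critic_i` modules are assumed to follow the centralized-critic, decentralized-actor structure shown earlier.

```python
import torch
import torch.nn.functional as F

def maddpg_update(actor_i, critic_i, target_critic_i, target_actors,
                  opt_actor, opt_critic, batch, agent_idx, gamma=0.95):
    """One MADDPG-style update for agent i on a sampled mini-batch."""
    obs, actions, rewards, next_obs, global_s, next_global_s = batch

    # --- Critic update: minimize the TD error against the target return ---
    with torch.no_grad():
        next_actions = torch.cat(
            [ta(o) for ta, o in zip(target_actors, next_obs)], dim=-1)
        y = rewards[agent_idx] + gamma * target_critic_i(next_global_s, next_actions)
    q = critic_i(global_s, torch.cat(actions, dim=-1))
    critic_loss = F.mse_loss(q, y)
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    # --- Actor update: ascend the Q value of agent i's own action ---
    new_action_i = actor_i(obs[agent_idx])
    joint = torch.cat(
        [new_action_i if j == agent_idx else actions[j].detach()
         for j in range(len(actions))], dim=-1)
    actor_loss = -critic_i(global_s, joint).mean()   # only actor parameters are stepped
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
```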
3. Methodology
The traffic scenario is a merging area on a highway, which consists of four parts: the section before the merging area, the acceleration area, the merging area, and the section after the merging area. The main road contains two lanes and allows lane changes. The acceleration lane brings vehicles entering from the ramp up to a speed close to that of the main-lane traffic. The merging lane is the area where the acceleration lane and the main lane gradually merge, and vehicles complete the merge and enter the main lane there. The lengths of the four areas are 120 m, 80 m, 80 m, and 120 m, respectively, the lane width is 3.75 m, and each vehicle measures 5 m × 2 m, as shown in Figure 2. The environment is built on Gymnasium 0.26.3 [33] (the maintained successor of OpenAI Gym) and is implemented in Python.
In our proposed multi-vehicle cooperative decision-making, each vehicle exchanges information with others within 100 m via V2V communication and makes its own decision. Meanwhile, the central platform exchanges information with the vehicles via V2I communication to collect the traffic conditions in the merging area and to evaluate the decision-making behavior of these vehicles.
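For reference, the scenario geometry described above can be captured with a few constants (an illustrative Python sketch with names chosen here, not the paper's code):

```python
# Road geometry of the merging scenario (all lengths in meters).
UPSTREAM_LEN = 120.0      # section before the merging area
ACCEL_LEN = 80.0          # acceleration area
MERGE_LEN = 80.0          # merging area
DOWNSTREAM_LEN = 120.0    # section after the merging area
LANE_WIDTH = 3.75
VEHICLE_LENGTH, VEHICLE_WIDTH = 5.0, 2.0
V2V_RANGE = 100.0         # communication/observation radius

def road_segment(x):
    """Return which longitudinal segment a position x (measured from the scenario start) lies in."""
    segments = [("upstream", UPSTREAM_LEN), ("acceleration", ACCEL_LEN),
                ("merging", MERGE_LEN), ("downstream", DOWNSTREAM_LEN)]
    for name, length in segments:
        if x <= length:
            return name
        x -= length
    return "off-road"
```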
3.1. Problem Description
This paper focuses on the high-level decision-making behaviors of mainline vehicles and ramp vehicles. As mentioned above, we model the multi-vehicle collaborative decision-making problem as a multi-agent Markov decision process (Multi-Agent MDP). It is defined by the tuple $\langle S, O, A, P, R, \gamma, N \rangle$, where $S$ is the global state space, describing the state information of the entire environment; $O = \{O_1, \ldots, O_N\}$ is the set of local observations, each agent observing only part of the environment; $A = \{A_1, \ldots, A_N\}$ is the set of action spaces, with $A_i$ representing all possible actions that agent $i$ can take; $P\left(s' \mid s, a\right)$ is the joint state transition probability, i.e., the probability that the system moves to the next state $s'$ after all agents take the joint action $a$ in state $s$; $R = \{R_1, \ldots, R_N\}$ is the set of reward functions, with $R_i$ representing the reward obtained by agent $i$ after performing its action in the global state; $\gamma$ is a discount factor used to weigh the relative importance of current and future rewards; and $N$ is the number of autonomous vehicles, each of which has its own actor and critic network.
- (1)
State space: The state space of an intelligent connected vehicle is a matrix of size $N_{o} \times F$, where $N_{o}$ is the number of vehicles that can be observed within the observation range of the current vehicle (ego vehicle) and $F$ is the number of observed vehicle features, mainly including the following:
- ➀
Present—Indicates whether there is an observed vehicle near the current vehicle, represented by a 0–1 variable.
- ➁
X—The longitudinal position of the observed vehicle from the current vehicle.
- ➂
Y—The lateral position of the observed vehicle from the current vehicle.
- ➃
Vx—The longitudinal velocity of the observed vehicle relative to the current vehicle.
- ➄
Vy—The lateral velocity of the observed vehicle relative to the current vehicle.
- ➅
Heading—The heading angle of the observed vehicle.
In this paper, since the ego vehicle can only observe nearby vehicles (the vehicles in front of and behind it, and those in front of and behind it in the left and right lanes, 7 vehicles in total), and only those within 100 m of it, the decision-making of a single autonomous vehicle can be modeled as a partially observable Markov decision process. A minimal sketch of how this observation matrix can be assembled is given after this list.
- (2)
Action space: Action $a_i^t$ represents the driving decision of intelligent connected vehicle $i$ at time step $t$, including turning left, turning right, cruising, acceleration, and deceleration. All actions of the vehicle at each time step constitute its action set $A_i$, and the action sets of all vehicles constitute the overall action space of the system $A = A_1 \times \cdots \times A_N$. After the reinforcement learning algorithm learns the high-level decision-making behavior of the vehicle, a low-level controller generates the corresponding steering and throttle control signals to maneuver the autonomous vehicle. In this paper, the longitudinal acceleration/deceleration behavior of the vehicle is given by the intelligent driver model (IDM), and the lane-changing behavior is given by the minimizing overall braking induced by lane change (MOBIL) model [25].
- (3)
Reward function: The reward function defines the feedback a vehicle obtains after performing certain actions, which directly affects the behavior and learning strategy of the agent. In this paper, our goal is to make each CAV pass through the merging area safely and quickly while keeping the traffic flow smooth and avoiding congestion; thus, the reward function of CAV $i$ at time step $t$ is defined as a combination of an individual (local) reward and a group (global) reward, where $r_{i}^{\mathrm{local}}$ and $r^{\mathrm{global}}$ represent the individual and group rewards, respectively.
The local reward is composed of five individual assessments: a collision assessment, which imposes a high penalty on vehicle collisions; a safe-distance assessment; a fast-passing assessment, which encourages vehicles to pass through the merging zone at a relatively high speed while avoiding speeding or excessively low speeds; a smoothness assessment, which penalizes rapid acceleration and deceleration to keep driving smooth; and a lane-changing assessment, which imposes an appropriate penalty on each lane change, since frequent lane changing increases disorder in the merging zone.
In the global reward, in order to improve the speed of the overall traffic flow and to prevent excessively slow individual vehicles from degrading the efficiency of the whole flow, we use an average-speed assessment: the larger this value, the higher the traffic efficiency. A traffic-density assessment gives the group a negative incentive when the local vehicle density is too high, encouraging vehicles to keep a reasonable distance from each other and avoid congestion. The architecture of our MARL framework is shown in Figure 3.
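As referenced in the state-space description above, the following sketch shows one way to assemble the per-vehicle observation matrix from the listed features (an illustrative Python/NumPy fragment with assumed field names; the actual environment implementation may differ):

```python
import numpy as np

FEATURES = ["present", "x", "y", "vx", "vy", "heading"]  # features per observed vehicle
MAX_OBSERVED = 7                                          # observed vehicles per ego vehicle
OBS_RANGE = 100.0                                         # meters

def build_observation(ego, neighbors):
    """Stack nearby vehicles into an (MAX_OBSERVED x len(FEATURES)) matrix.

    `ego` and each element of `neighbors` are assumed to expose absolute
    position (x, y), velocity (vx, vy), and heading attributes.
    """
    obs = np.zeros((MAX_OBSERVED, len(FEATURES)), dtype=np.float32)
    in_range = [v for v in neighbors if abs(v.x - ego.x) <= OBS_RANGE]
    for row, v in enumerate(in_range[:MAX_OBSERVED]):
        obs[row] = [1.0,                 # present flag
                    v.x - ego.x,         # relative longitudinal position
                    v.y - ego.y,         # relative lateral position
                    v.vx - ego.vx,       # relative longitudinal velocity
                    v.vy - ego.vy,       # relative lateral velocity
                    v.heading]           # heading angle
    return obs
```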
3.2. Balance Between Individual and Group Benefits
In highway merging scenarios, road resources are limited and each vehicle has its own goals. For example, ramp vehicles need to merge into the main road as quickly and safely as possible, while main-road vehicles want to maintain their speed without being interrupted. To achieve their respective goals, the behaviors of the vehicles are inevitably competitive. At the same time, in order to ensure the safety and smoothness of traffic, vehicles must also cooperate to a certain extent [34].
In the reinforcement learning of self-driving cars, each car only considers its surrounding environment. This method can encourage the car to pursue its own optimal strategy and improve its local performance. However, since each agent only focuses on its own short-term interests, the local reward mechanism may lead to “selfish” behavior, which in turn leads to conflicts or incoordination. In contrast, the global reward mechanism can guide the agent to make decisions from the global perspective of the system and avoid conflicts between individuals. However, the impact of a single agent’s behavior on the global reward is relatively indirect and weak, resulting in the sparsity of the reward signal, and it is difficult for the agent to accurately evaluate the contribution of its behavior to the overall system.
To overcome the shortcomings of purely local and purely global reward mechanisms, a hybrid reward mechanism is introduced that combines the advantages of both. Hybrid rewards take into account the behavioral performance of individual agents while reflecting the goals of the overall system. In the merging area, for example, hybrid rewards not only focus on factors such as the speed and spacing of individual vehicles but also encourage them to optimize traffic flow through coordinated actions. By reasonably adjusting the weights of the local and global rewards, the hybrid reward balances individual optimization against system optimization and encourages agents to take the overall efficiency and safety of the system into account while safeguarding their own interests. The hybrid reward of agent $i$ in this scenario is therefore given by Equation (8):

$$r_i^t = \omega_1\, r_{i}^{\mathrm{local},t} + \omega_2\, r^{\mathrm{global},t} \tag{8}$$

By adjusting the weights $\omega_1$ and $\omega_2$, the priority of individual interests versus group interests can be tuned dynamically: a larger $\omega_1$ places more emphasis on the local performance of the individual, while a larger $\omega_2$ makes the agent pay more attention to the overall performance of the system.
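A minimal sketch of this hybrid reward follows (illustrative Python; the component names and weight values are placeholders, not the paper's exact definitions):

```python
def hybrid_reward(local_terms, global_terms, w_local=0.6, w_global=0.4):
    """Combine per-vehicle and traffic-level reward terms as in Equation (8).

    `local_terms`  : dict with collision, safe-distance, fast-passing,
                     smoothness, and lane-change assessments for one CAV.
    `global_terms` : dict with average-speed and traffic-density assessments.
    The weights w_local / w_global play the role of omega_1 / omega_2.
    """
    r_local = sum(local_terms.values())
    r_global = sum(global_terms.values())
    return w_local * r_local + w_global * r_global

# Example usage with placeholder component values:
r = hybrid_reward(
    {"collision": 0.0, "safe_distance": 0.2, "fast_pass": 0.5,
     "smoothness": -0.1, "lane_change": -0.05},
    {"avg_speed": 0.4, "density": -0.2})
```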
In order to balance individual and group interests, each CAV has its own policy network $\pi_{\theta_i}$ and value-assessment network $Q_{\phi_i}$ and selects actions $a_i$ based on its local observation $o_i$. The core goal of the policy network is to learn, for each CAV, a strategy that maximizes the cumulative reward in a complex environment:

$$J\left(\theta_i\right) = \mathbb{E}_{\pi_{\theta_i}}\!\left[\sum_{t=0}^{T} \gamma^{t} r_i^{t}\right]$$

MADDPG uses the policy gradient method to optimize the parameters of the policy network. Since the Q network provides the policy network with value information about the actions, the policy network is updated by maximizing the Q value. To optimize the policy network, the policy gradient method is used to compute the gradient with respect to the parameters. The gradient of the policy network is as follows:

$$\nabla_{\theta_i} J\left(\theta_i\right) = \mathbb{E}\!\left[\nabla_{\theta_i} \log \pi_{\theta_i}\!\left(a_i \mid o_i\right)\, Q_{\phi_i}\!\left(s, a_1, \ldots, a_N\right)\right]$$

Here, $\nabla_{\theta_i} \log \pi_{\theta_i}\left(a_i \mid o_i\right)$ indicates how changing the parameters of the policy network adjusts the selection probability of each action, and the action value function $Q_{\phi_i}\left(s, a_1, \ldots, a_N\right)$ guides the gradient update of the policy network, indicating the contribution of each action selection to the future cumulative return under the current policy.
Although the actor network makes decisions based on local information, it learns how to adjust its strategy from the global feedback of the critic network. The critic's task is to estimate the Q value of each state–action pair, and its main goal is to minimize the prediction error of the Q value. Based on the Bellman equation, the critic network gradually learns the correct Q value of each state–action pair by minimizing the error between the target Q value and the predicted Q value, that is,

$$L\left(\phi_i\right) = \mathbb{E}\!\left[\left(r_i + \gamma\, Q_{\phi_i}'\!\left(s', a_1', \ldots, a_N'\right) - Q_{\phi_i}\!\left(s, a_1, \ldots, a_N\right)\right)^{2}\right]$$

where $Q_{\phi_i}'$ denotes the target critic network and $s'$ and $a_j'$ are the next state and the corresponding next actions. This Q value reflects the impact of each agent's actions on the overall system under the global state, and the strategy update of each individual agent is adjusted on the basis of this value.
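Putting the two updates of this framework together (a compact, illustrative PyTorch fragment consistent with the equations above; variable names and the softmax actor follow our earlier sketches, not a verbatim implementation):

```python
import torch

def agent_losses(actor_i, critic_i, target_critic_i,
                 obs_i, action_i, reward_i, global_s, joint_a,
                 next_global_s, next_joint_a, gamma=0.99):
    """Per-agent critic (TD) loss and actor (policy-gradient) loss."""
    # Critic: squared error between the predicted Q and the Bellman target.
    with torch.no_grad():
        target = reward_i + gamma * target_critic_i(next_global_s, next_joint_a)
    q = critic_i(global_s, joint_a)
    critic_loss = (q - target).pow(2).mean()

    # Actor: log-probability of the taken action weighted by the centralized Q.
    probs = actor_i(obs_i)                                  # softmax policy pi_theta(a|o)
    log_prob = torch.log(probs.gather(-1, action_i) + 1e-8)
    actor_loss = -(log_prob * q.detach()).mean()
    return critic_loss, actor_loss
```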
3.3. Accelerating the Training Process
In the merging area of the highway, vehicle behavior exhibits a complex mix of competition and cooperation. Ramp vehicles want to merge into the main road quickly and safely, while main-road vehicles want to maintain their own speed and trajectory. This conflict leads to competition for resources, especially when the traffic volume is large. At the same time, to ensure traffic safety, main-road and ramp vehicles must also cooperate to a certain extent; for example, main-road vehicles should slow down or change lanes appropriately to provide merging space for ramp vehicles, and vehicles need to coordinate speeds and maintain reasonable gaps to jointly keep the traffic flow stable and avoid conflicts. To balance individual and global interests under this competition and cooperation, traditional MADDPG equips each vehicle with an independent actor and critic network so that it can learn the optimal strategy from its own observations. However, as the number of agents increases, training multiple independent actor–critic networks greatly enlarges the search space, increasing computational cost and training complexity.
In general, vehicles from the same source usually face similar environments and tasks, so their network parameters can be shared to reduce redundant computation. Specifically, multiple vehicles share the first few layers of the actor–critic network, which handle low-level feature extraction from the environment, such as the vehicle's speed, position, and distance to surrounding vehicles, while the high-level decision-making part of each vehicle remains independent so that it can make personalized decisions based on its specific local environment. The structure of the neural network is shown in Figure 4. For the actor network, the input is the information of the five vehicles within the surrounding observation range. The first layer consists of 256 neurons, which are shared by vehicles from the same source; the second layer consists of 128 neurons and corresponds to the high-level decision of each vehicle. A softmax output layer produces the probability distribution over the vehicle's 5 actions.
The critic network takes the global observation and the agents' action information as input; its fully connected layers have the same neuron structure as the actor network, except that the output layer produces a single value, namely the state–action value Q. The architecture of the proposed network is shown in Figure 4.
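The sketch below illustrates this partial parameter sharing (PyTorch, assuming the layer sizes stated above; the grouping of "vehicles from the same source" into a shared trunk is our simplified reading, not the exact implementation):

```python
import torch
import torch.nn as nn

class SharedTrunkActor(nn.Module):
    """Actor with a shared low-level feature layer and per-vehicle decision heads."""
    def __init__(self, obs_dim, n_actions=5, n_vehicles_in_group=4):
        super().__init__()
        # First layer (256 units) shared by all vehicles from the same source.
        self.shared = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU())
        # Second layer (128 units) kept independent for each vehicle's decision.
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(256, 128), nn.ReLU(),
                          nn.Linear(128, n_actions))
            for _ in range(n_vehicles_in_group)])

    def forward(self, obs, vehicle_idx):
        features = self.shared(obs)                 # shared low-level feature extraction
        logits = self.heads[vehicle_idx](features)  # per-vehicle high-level decision head
        return torch.softmax(logits, dim=-1)        # probability distribution over 5 actions
```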
4. Experiments and Results
This section describes the design and results of the experiments. We set up three groups of experiments according to road traffic density, with the following numbers of CAVs:
- (1)
Low density: 6–10 CAVs;
- (2)
Medium density: 9–13 CAVs;
- (3)
High density: 12–16 CAVs.
Vehicles are randomly generated in the section before the merging area, with two-thirds of the CAVs generated on the main line and the rest on the ramp. The initial speed of main-line vehicles is drawn randomly from 25 to 27 m/s, the initial speed of ramp vehicles from 12 to 15 m/s, and the vehicle decision frequency is 10 Hz. The average reward is calculated every 200 training cycles, and the other network parameters are listed in Table 1. The experiments were conducted on a Windows 10 platform with an NVIDIA GeForce RTX 3060 Ti GPU and 64 GB of memory, and the programming language is Python.
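For clarity, the experimental settings above can be summarized as a configuration dictionary (illustrative Python; the key names are ours, and Table 1's network hyperparameters are not reproduced here):

```python
EXPERIMENT_CONFIG = {
    "traffic_density": {             # number of CAVs per episode
        "low": (6, 10),
        "medium": (9, 13),
        "high": (12, 16),
    },
    "mainline_fraction": 2 / 3,      # share of CAVs spawned on the main line
    "initial_speed_mps": {
        "mainline": (25.0, 27.0),
        "ramp": (12.0, 15.0),
    },
    "decision_frequency_hz": 10,
    "reward_logging_interval": 200,  # training cycles between average-reward reports
}
```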
4.1. Study of the Learning Curve
In this subsection, using the average reward as the indicator, the proposed method is compared with baseline methods under the same conditions. Specifically, we design two sets of comparisons: the first examines the overall learning curves under traffic flows of different densities, showing the convergence efficiency and effectiveness of the network; the second compares different methods to show the superiority of ours in performing cooperative driving tasks in merging areas.
Figure 5 compares the results of our method and the QMIX algorithm under the three levels of traffic density. In the low-density case, both algorithms converge to a relatively stable reward level in a relatively short time, but the QMIX results still show larger volatility. In the medium-density case, our algorithm converges noticeably faster than QMIX; in the first 20 episodes in particular, our reward value increases rapidly while QMIX lags behind, and our final reward is clearly higher, with smaller fluctuations and more stable performance. The high-density scenario shows a similar pattern. In summary, the proposed algorithm demonstrates strong adaptability and strategy-learning capability under different traffic flows.
Comparing across densities, both algorithms converge fastest and most stably in the low-density scenario. In the medium- and high-density scenarios they still perform well, but convergence is slower and fluctuations increase slightly. This is mainly because, as the environment becomes more complex, the task becomes harder and the agents need more exploration and learning, so the learning process fluctuates more.
4.2. Traffic Efficiency
In this subsection, we use the average speed of vehicles in the merging area as the indicator. When the traffic density is not high, road resources are relatively abundant and conflicts between vehicles are not pronounced, so the average vehicle speeds under different methods do not differ noticeably. We therefore compare the average vehicle speed only in the high-density traffic environment, as shown in Figure 6.
It can be seen that when there are more vehicles on the road, our proposed method raises the average vehicle speed from 22.5 m/s to 23.5 m/s compared with the QMIX method. In addition, our algorithm converges noticeably faster than QMIX and fluctuates over a smaller range, making it more effective at improving traffic efficiency.