Multi-Agent Decision-Making Modes in Uncertain Interactive Traffic Scenarios via Graph Convolution-Based Deep Reinforcement Learning

As one of the main elements of reinforcement learning, the design of the reward function is often not given enough attention when reinforcement learning is used in concrete applications, which leads to unsatisfactory performances. In this study, a reward function matrix is proposed for training various decision-making modes with emphasis on decision-making styles and further emphasis on incentives and punishments. Additionally, we model a traffic scene via graph model to better represent the interaction between vehicles, and adopt the graph convolutional network (GCN) to extract the features of the graph structure to help the connected autonomous vehicles perform decision-making directly. Furthermore, we combine GCN with deep Q-learning and multi-step double deep Q-learning to train four decision-making modes, which are named the graph convolutional deep Q-network (GQN) and the multi-step double graph convolutional deep Q-network (MDGQN). In the simulation, the superiority of the reward function matrix is proved by comparing it with the baseline, and evaluation metrics are proposed to verify the performance differences among decision-making modes. Results show that the trained decision-making modes can satisfy various driving requirements, including task completion rate, safety requirements, comfort level, and completion efficiency, by adjusting the weight values in the reward function matrix. Finally, the decision-making modes trained by MDGQN had better performance in an uncertain highway exit scene than those trained by GQN.


Introduction
Artificial intelligence is in its golden development age due to the exponential increase in data production and the continuous improvements in computing power [1]. Autonomous driving is one of the main uses. A comprehensive autonomous driving system integrates sensing, decision-making, and motion-controlling modules [2][3][4]. As the "brains" of connected autonomous vehicles (CAVs) [5], the decision-making module formulates the most reasonable control strategy according to the state feature matrix transmitted by the sensing module, the vehicle state, and the cloud transmission information [6]. Moreover, it sends the determined control strategy to the motion-controlling module, including high-level behavior and low-level control requirements [7,8]. It is crucial to complete autonomous driving tasks safely and efficiently by making reasonable decisions based on other modules [9].
In an uncertain interactive traffic scenario, the driving environment has rigorous dynamic characteristics and high uncertainty, and the influences of driving behaviors of different traffic participants will be transmitted continuously [10]. On the level of transportation overall, all traffic participants need to cooperate efficiently [11]. At the traffic participant level, individuals need to judge the risk factors and make appropriate decisions sensitively based on dynamic scene changes [12]. In [13], a Gaussian mixture model and important weighted least squares probability classifier were combined and used for scene modeling. That model can identify the braking strength levels of new drivers under the condition of insufficient driving data.
The key to participating in traffic scenarios for CAVs is that each CAV needs to generate appropriate and cooperative behavior to match human vehicles and other CAVs. Therefore, CAVs urgently demand efficient and accurate multi-agent decision-making technology to effectively handle the interactions between different traffic participants.
The current multi-agent decision-making technologies mainly focus on deep reinforcement learning (DRL) due to the excellent performance of DRL in high-dimensional dynamic state space [14]. The keys of DRL in uncertain interactive traffic scenarios can be summarized as follows: (1) Efficient modeling of interactive traffic scenes and accurate representation of state features. (2) Generating reasonable and cooperative decision-making behaviors based on uncertain scene changes and individual task requirements. The design of the reward function is an essential part of the DRL application. Concretizing and numericizing task objectives realizes the communication between objectives and algorithms. Therefore, the design of the reward function determines whether the agent can generate reasonable and cooperative decision-making behaviors [15]. In addition, the accurate modeling of the interactive traffic environment and representation of state characteristics are the requirements for agents to generate expected behaviors.
In studies of traffic scenarios using the DRL method, researchers found that in uncertain interactive traffic scenarios, sparse reward problems lead to agents having a lack of effective information guidance [16], making the algorithm difficult to converge. In order to solve the sparse reward problem, researchers divide the original goal into different sub-goals and give reasonable rewards or punishments. In [17], the DDPG was adopted to settle the autonomous braking problem. The reward function was split into three parts to solve the problems: braking too early, braking too late, and braking too quickly. In [18], the reward function was divided into efficiency and safety rewards to train the comprehensive safety and efficiency decision model. In addition, considering the changes in driving task requirements and scene complexity, some studies have given different weights to the sub-reward functions based on the decomposition of the reward function, so as to train different decision-making modes. In [19], the reward function was divided into sub-reward functions based on security, efficiency, and comfort. The system can realize different control targets by adjusting the weight values in the reward function.
In the decision-making research involving uncertain interactive traffic scenarios, the DRL method only takes the individual characteristics of each vehicle as the input. It ignores the interactive influence of transitivity between vehicles. This will result in CAVs not generating reasonable and cooperative behavior, which may reduce total traffic efficiency and the occurrence of traffic accidents. Graph representation can accurately describe the interactions between agents, representing the relationship between vehicles in uncertain interactive traffic scenarios. Therefore, some researchers focused on the graph reinforcement learning (GRL) method and modeled the interaction with graph representation [20]. In [21], a hierarchical GNN framework was proposed and combined with LSTM to model the interactions of heterogeneous traffic participants (vehicles, pedestrians, and riders) to predict their trajectories. The GRL method combines GNN and DRL: the features of interactive scenes are processed by GNN, and cooperative behaviors are generated by the DRL framework [22]. In [23], the traffic network was modeled by dynamic traffic flow probability graphs, and a graph convolutional policy network was used in reinforcement learning.
This paper proposes an innovative, dynamic reward function matrix, and various decision-making modes can be trained by adjusting the weight coefficient of the reward function matrix. Additionally, the weight coefficient is further set as a function of reward, forming an internal dynamic reward function. In the traffic environment adopted in this paper, the randomness and interactions between HVs and CAVs are strengthened by using some human vehicles making uncertain lane changes. Two GRL algorithms are used in this paper, GQN and MDGQN. Finally, we report a simulation based on the SUMO platform and a comparative analysis from various perspectives, such as reward functions, algorithms, and decision-making modes. The schematic diagram of the designed framework is shown in Figure 1.

CAV Mask Matrix
Adjacency Matrix 1 0  To summarize, the contributions of this paper are as follows: 1.
Innovative dynamic reward function matrix: we propose a reward function matrix including a decision-weighted coefficient matrix, an incentive-punished-weighted coefficient matrix, and a reward-penalty function matrix. By adjusting the decisionweighted coefficient matrix, the decision-making modes of different emphases among driving task, traffic efficiency, ride comfort, and safety can be realized. Based on the premise that the incentive-punished function matrix separates the reward and the penalty, the optimization of individual performance can be achieved by adjusting the incentive and punishment ratio of each sub-reward function. In addition, the weight coefficient of the incentive-punished-weighted coefficient matrix is further set as a function of reward functions, which can reduce the impact of proportional adjustment on important operations, such as collision rate.

2.
Adjust the parameters to train multiple decision-making modes: We compare the proposed reward function matrix with the traditional reward function under the same conditions. By adjusting the parameters of the decision-weighted coefficient matrix and the incentive-punished-weighted coefficient matrix, we can achieve aggressive or conservative incentive and punitive decision-making modes, respectively. Specifically, the four decision-making modes trained in this paper are the aggressive incentive (AGGI), aggressive punishment (AGGP), conservative incentive (CONI), and conservative punishment (CONP).

3.
Modeling of interactive traffic scene and evaluation of decision-making modes: We designed a highway exit scene with solid interactions between CAVs and human vehicles (HVs) and adopted two algorithms to verify their differences. Addition-ally, we also propose a set of indicators to evaluate the performance of driverless decision-making and used them to verify the performance differences among various algorithms and decision-making modes.
This article is organized as follows. Section 2 introduces the problem formulation. Section 3 introduces the methods used. Section 4 proposes the reward function matrix. Section 5 describes and analyzes the simulation results. Section 6 summarizes this paper and gives the future development directions.

Scenario Construction
As is shown in Figure 2, a 3-lane highway containing two exits was designed, and the three lanes were sorted from bottom to top as first, second, and third lanes. All the vehicles (HVs and CAVs) enter the road segment from the left of "Highway 0". The color of vehicle indicates its type and intention. In this paper, the vehicles were set as follows for safety principles and actual human driving rules: • A white body with a colored roof represents an HV, and an all-blue vehicle represents a CAV; • The HV with the red roof (HV1) is set to appear in three lanes randomly and can only go out from "Highway 2"; • The HV with the orange roof (HV2) is set to emerge from the second lane and can go out from "Exit 1" when it is in the first lane and go out from "Highway 2" when it is in the other lanes; • The HV with the green roof (HV3) is set to only appear from the first lane and can only go out from "Exit 0"; • When there is no car at the longitudinal distance of 10 m in the lane to be changed, the HV2 will be switched to the right; • The CAV is set to appear in three lanes randomly and can go out from "Highway 2", "Exit 0", and "Exit 1".
The interaction between CAVs and HVs is enhanced through the above setting. The setting of HV1 enhances the uncertainty of the scene considerably, increasing the requirements for CAVs' decision-making. Straightening away from "Highway 2" is the most straightforward choice with the lowest collision risk, owing to CAVs' least lane change decisions. In addition, due to the uncertain lane change behavior of HV2, it is difficult for CAVs to explore "Exit 1" safely, and the collision risk of leaving the highway from "Exit 1" is higher than it is for other exits. Specific scenario setting parameters are shown in Table 1:

State Representation
The scenario is modeled as an undirected graph. Each vehicle in this scenario is regarded as the node of the graph, and the interaction between vehicles is considered the edge of the graph. Three matrices can represent the state space: a node features matrix X t , an adjacency matrix A t , and a CAV mask matrix M t ; each of them are described in the following.
Node Features Matrix: The vehicle's features are speed, position, location, and intention, denoted as The node features matrix represents the features of each vehicle in the constructed scenario, which can be described as follows: Adjacency matrix: In this paper, the interactions between vehicles are embodied in the information sharing between vehicles, expressed by the adjacency matrix. The calculation of the adjacency matrix is based on three assumptions: • All CAVs can share information in the constructed scenario; • Information cannot be shared between HVs; • Vehicles can share information with themselves a ii = 1. • All CAVs can share information with HVs in their sensing range.
The derivation of the adjacency matrix is as follows: where a ij denotes the edge value of the ith vehicle and the jth vehicle. a ij = 1 denotes that the ith vehicle and the jth vehicle share information; a ij = 0 denotes that the ith vehicle and the jth vehicle share no information. CAV mask matrix: CAV mask is used to filter out the embeddings of HVs after the GCN fusion block. The specific mathematical expression is as follows: where m i = 0 or 1. If the ith vehicle is controlled by the GRL algorithm, m i = 1; otherwise, m i = 0.

Action Space
At each time step, each CAV has a discrete action space representing potential actions to be executed at the next time step, as follows: where a lane−change indicates that the lane change action can be taken, including a left lane change, right lane change, or straight line; a acceleration denotes the discrete acceleration that CAV can take, and its value is equalized by

Methods
This section describes the principles of the methods used, including graph convolutional neural networks, Q-learning, GQN, and MDGQN.

Graph Convolutional Neural Network
GCN is a neural network model that directly encodes graph structure. The goal is to learn a function of features on a graph. A graph can be represented by G = (V, E) in theory, where V denotes the set of nodes in the graph, the number of nodes in the graph is denoted by N, and E denotes the set of edges in the graph. The state of G is considered a tuple of 3 matrices of information: feature matrix X, adjacency matrix A, and degree matrix D.

•
The adjacency matrix A is used to represent connections between nodes; • The degree matrix D is a diagonal matrix, and D ii = ∑ j A ij ; • The feature matrix X is applied to represent node features, X ∈ R N×F , where F represents the dimensions of the feature.
GCN is a multi-layer graph convolution neural network, a first-order local approximation of spectral graph convolution. Each convolution layer only deals with first-order neighborhood information, and multi-order neighborhood information transmission can be realized by adding several convolution layers.
The propagation rules for each convolution layer are as follows [24]: whereÂ = A + I N is the adjacency matrix of the undirected graph G with added selfconnections. I N is the identity matrix;D ii = ∑ jÂij and W (l) are layer-specific trainable weight matrices. σ(·) denotes an activation function, such as the ReLU(·) = max(0, ·). H (l) ∈ R N×D is the matrix of activations in the lth layer, H(0) = X.

Deep Q-Learning
Q-learning [25] is a value-based reinforcement learning algorithm. Q t is the expectation that Q(s t , a t ) can obtain benefits by taking action a t (a t ∈ A t ) under the state of s t (s t ∈ S t ) at time t. The environment will feed back the corresponding reward R t according to the agent's action. Each time step produces a quadruplet (s t , a t , r t , s t+1 ), and it is stored in the experience replay. Deep Q-learning [26] replaces the optimal action-value function with the deep neural network Q(s, a, ω). The following is a description of the principle of the algorithm.
The predicted values of DQN can be calculated according to the given four-tuple (s t , a t , r t , s t+1 ):q The TD target can be calculated based on the actual observation reward r t : DQN updates network parameters according to the following formula: where α represents the learning rate.

Graph Convolutional Q-Network (GQN)
As described in Section 3.2, Q-learning uses a Q-Table to store the Q value of each state-action pair. However, if the state and action space are high-dimensionally continuous, there will be the curse of dimensionality; that is, with a linear increase in dimensions, the calculation load increases exponentially. GQN [27] replaces the optimal action-value function Q (s, a) with the graph convolutional neural network Q(s, a, θ): where Z is the node embeddings output from graph convolution layer. ρ represents the neural network block, including the fully connected layer. θ is the aggregation of all the weights. The specific training process of GQN is the same as DQN [26]. Firstly, the predictive value of GQN can be calculated according to the four-tuple (s t , a t , r t , s t+1 ) sampled from the experience replay:q However, since the Q-network uses the same estimate to update itself, it will cause bootstrapping and lead to deviation propagation. Therefore, another neural network can be used to calculate the TD target, called the target network Q(s, a, θ − now ). Its network structure is precisely the same as that of the Q-network, but the parameter θ − now is different from θ now . Selecting the optimal action and forward propagation of the target network: Calculation of TD targetŷ t and TD error δ t : The gradient ∇ θ Q(s t , a t , θ now ) is calculated by the backpropagation of a Q-Network, and the parameter of the Q-Network is updated by gradient descent: where θ new is the parameter of the updated Q-Network. α represents the learning rate. Finally, GQN adopts the weighted average of the two networks to update the target network parameters: where τ represents soft update rate.

Multi-Step Double Graph Convolutional Q-Network (MDGQN)
MDGQN further adopts double Q-learning and a multi-step TD target algorithm based on GQN. As shown in Section 3.3, the target network cannot completely avoid bootstrapping, since the parameters of the target network are still related to the Q-Network. The double Q-learning algorithm is improved based on the target network. The Q-learning with the target network uses the target network to select the optimal action and calculate the Q value of the optimal action. However, the double Q-learning selects the optimal action according to the Q-Network and uses the target network to calculate the Q value of the optimal action. Equation (11) is modified as follows: The multi-step TD target algorithm can balance the significant variance caused by Monte Carlo and the significant deviation caused by bootstrapping. Equation (13) is modified as follows:ŷ whereŷ t is called the m-step TD target.

Reward Functions
The design of the reward function is an important criterion and goal of the DRL training process. Based on four aspects-the results the of the driving task, traffic efficiency, ride comfort, and safety-the reward function is divided into four blocks. Further, we propose a reward function matrix including a decision-weighted coefficient matrix, an incentive-punished-weighted coefficient matrix, and a reward-penalty function matrix.
Specific parameters are described below:

1.
A k1 is the decision-weighted coefficient matrix. k r , k e , k c , and k s denote the weights of reward and penalty functions based on the results of the driving task, traffic efficiency, ride comfort, and safety.

2.
A k2 is the incentive-punished-weighted coefficient matrix. k rI , k eI , k cI , and k sI denote the weights of reward functions based on the results of the driving task, traffic efficiency, ride comfort, and safety, respectively. k rP , k eP , k cP , and k sP denote the weights of penalty functions based on the results of the driving task, traffic efficiency, ride comfort, and safety, respectively.

3.
A r is a reward-penalty function matrix. r result−I , r e f f iciency−I , r com f ort−I , and r sa f e−I are reward functions based on the results of the driving task, traffic efficiency, ride comfort, and safety, respectively. r result−P , r e f f iciency−P , r com f ort−P , and r sa f e−P are penalty functions based on the results of the driving task, traffic efficiency, ride comfort, and safety, respectively.
By adjusting the weight coefficient of the decision-weighted coefficient matrix and incentive-punished-weighted coefficient matrix, DRL can train different goal-oriented decision-making modes. In the decision-making process of autonomous vehicles, the upper control module can choose different decision-making modes according to different needs. In order to select a decision-making model with excellent comprehensive performance and strong contrast, we conducted multiple sets of experiments, and some experimental data were put in Appendix A. This paper determined four decision modes: AGGI, AGGP, CONI, and CONP, by adjusting the parameters in Tables 2 and 3. Table 2. Weight coefficient of the decision-weighted coefficient matrix.

Parameters
Conservative Aggressive The weight of task k r 0.  Table 3. Weight coefficient of the incentive-punished-weighted coefficient matrix.

Parameters Incentive Punished
The weight of incentive sub-reward function k rI0 , k eI0 , k cI0 , k sI0 0.6 0.4 The weight of punished sub-reward function.k rP0 , k eP0 , k cP0 , k sP0 0.4 0.6 Since the change of weight coefficient will dilute some essential rewards or punishments, this paper improves the reward function against this defect. The weight coefficient of the incentive-punished-weighted coefficient matrix is further set as a functional of reward functions, which forms an internal dynamic reward function. The specific formula is as follows: k rI = k rI0 · exp (r result−P + r com f ort−P )/100,000 k eI = k eI0 · exp (r result−P + r com f ort−P )/80,000 k sI = k sI0 · exp (r result−P + r com f ort−P )/150,000 Based on [3], the specific reward functions and penalty functions were designed. Firstly, we designed the corresponding reward function and penalty function based on the results of the driving tasks. The independent variable of the reward function is the number of CAVs and HV2 reaching destinations, which aims to train decisions that can assist HVs in completing driving tasks. The penalty function is designed based on collisions.
r result−I = 300(n r1 + n r2 ) (21) r result−P = −60,000, i f collision (22) where n r1 is the number of CAVs leaving from "Exit 1". n r2 is the number of CAVs leaving from "Exit 1". To train the decision-making model to improve traffic efficiency, this paper divides the speed interval of CAVs into three parts. The corresponding reward and penalty functions were designed to curb speeding, encourage high-speed driving, and punish low-speed blocking for these three-speed ranges. In order to make CAVs faster and more stable to explore the optimal speed, we used the exponential function to design a soft reward function [28].
where v y is the velocity of the CAV. v y max represents the maximum velocity allowed by the current lane; its value is 25 m · s −1 , or 15 m · s −1 .
In order to improve the ride comfort of all vehicles in this traffic section, the corresponding reward function and penalty function are designed based on the acceleration and lane change times of all vehicles. r com f ort−I = exp(3) × n c1 (25) r com f ort−P = −2000 × n c2 − exp m 2 (26) where n c1 is the number of vehicles with acceleration of [−2 m · s −2 , 2 m · s −2 ]. n c2 is the number of vehicles with acceleration of (−∞, −4.5 m · s −2 ]. m is the number of lane changes in this traffic section within 0.5 s. Superior security performance is the premise of developing decision technology [29]. In [30], the length of CAVs' safety time is one of the most important factors affecting road safety. This paper introduces the safety time of the CAVs into the corresponding reward function. The definition of the safety time is as follows: where y AV is the longitudinal position of the CAV. y leader and y f ollower are the longitudinal positions of the front and rear vehicles of CAV, respectively. v leader and v f ollower are the longitudinal speeds of the front and rear vehicles of CAV, respectively. A driving hazard diagram is proposed to represent the degree of danger of the vehicle's state based on safety time t sa f e− f ollower and t sa f e−leader . As shown in Figure 3, three primary colors are used to represent the degrees of danger in this state. The red region represents collision accident danger, and the deeper the color, the greater the likelihood. The yellow area indicates that the vehicle needs to pay attention to the occurrence of a possible emergency. The green area indicates that the vehicle is in a safe state. By dividing Figure 3 into five categories, the sub-reward functions were designed. On the basis of the above principles, a security-based reward function is proposed for training security decisions. The formula is shown in Table 4.

Parameter Setting
In this section, we show the solid interaction scene designed through SUMO. Firstly, based on this scene, the proposed reward function is compared with the traditional reward function. The two algorithms were used to train four decision-making modes, and the differences between different algorithms and modes are compared. Finally, the performances of four decision-making modes based on the two algorithms are evaluated. The main parameters for algorithms are listed in Table 5.

Evaluation Indexes
In order to further evaluate the test performance of each decision-making mode, four kinds of evaluation indexes are proposed, namely, efficiency index, safety index, comfort index, and task index.

1.
Efficiency: The average longitudinal velocity of CAVs and all vehicles in this scenario is proposed and used to evaluate the traffic efficiency of CAVs and the comprehensive efficiency of total traffic flow under different decision modes. Their functions are defined as follows:v where v AV−ij represents the longitudinal velocity of the ith CAV detected at the jth time step. v ij is the longitudinal velocity of the ith vehicle detected at the jth time step. N indicates the total number of CAVs' longitudinal velocities detected in all-time steps. N all indicates the total number of all vehicles' longitudinal velocities detected in all-time steps.

2.
Safety: Due to the presence of multiple CAVs in the test scenario, we define the collision rate as the probability of each CAV collision accident in each episode.
where N collision is the number of collisions in a single episode. N CAV represents the number of CAVs.

3.
Comfort: For the evaluation of comfort, we mainly studied the deceleration of all vehicles under different decision-making modes. All vehicles' total emergency braking times N braking in a single episode are defined as comfort evaluation indexes. It should be noted that the deceleration between (−∞, −4.5 m · s −2 ] is regarded as emergency braking.

4.
Task: Based on the solid interaction between CAVs and HVs in the test scenario, the driving task completion rates of CAVs and HV2 are used as the task indexes. Their functions are defined as follows.
where N CAV−Exit1 is the number of CAVs that left via "Exit 1". N HV2−Exit1 indicates the number of HV2 that left via "Exit 1". N HV2 represents the number of HV2.

Validation of the Reward Function
The proposed reward function was used in GQN and MDGQN to train different decision-making modes while improving training efficiency. In [27], the traditional reward function only considers intention reward, speed reward, lane-changing penalty, and collision penalty. In this article, this traditional reward function is used as the baseline, and the reward of the CONP decision-making mode training process is compared. In order to allow them to make a fair comparison based on the maximum and minimum reward values they can achieve, the reward values in the training process are normalized.
As shown in Figure 4, using the proposed reward function for training can explore the maximum reward rapidly. The traditional reward function's training process experiences repeated reward reductions, and the maximum reward cannot be explored in the specified number of rounds. In addition, the training stability is greatly improved by comparing the reward fluctuation in the training process with the baseline. After the 200th episode, the average normalized reward values of Baseline-GCQ, CONP-GCQ, and CONP-MDGCQ were 0.7399, 0.9882, and 0.9886, respectively; and their variances were 0.0871, 0.0028, and 0.0049. Under the condition of using the GQN algorithm, the CONP decision-making model was 35.40% better than the average of baseline, and the variance is only 3.27% higher than that of the baseline. In summary, the proposed reward function can promote the fast convergence of the algorithm and greatly enhance the stability of the training process.

Training of Different Decision-Making Modes
As shown in Figure 5, under the four decision-making modes, the MDGQN algorithm converged faster than GQN, and the fluctuation of reward decline decreased significantly. This verified that MDGQN effectively alleviates the overestimation problem after using the multi-step TD target and double Q-learning algorithm to explore the optimal action more quickly. Based on MDGQN algorithm analysis, the AGGI decision-making mode converged in the 80th episode. In contrast, the AGGP decision-making mode did not fully converge until the 160th episode, which further verifies that the reward and punishment ratio of the reward function affects the convergence speed of the algorithm. In summary, by comparing the four decision-making modes, the aggressive decision-making mode uses more punishment than the conservative decision-making mode. The incentive decisionmaking mode can promote the convergence of the algorithm faster than the punished decision-making mode. Furthermore, the training decision-making mode was tested, and the reward mean and variance are compared. As shown in Figure 6, MDGQN explored higher reward thresholds and averages than GQN in all four decision modes. Under AGGP, AGGI, CONP, and CONI decision-making modes, the corresponding test reward variances of MDGQN were 93.91, 80.26, 64.71, and 87.10; these were less than 108.51, 108.19, 93.78, and 97.01 for GQN. It can be concluded that the test stability of the CONP decision mode is the strongest, but the ability to distinguish algorithm differences is poor. AGGI is the decision-making mode that can best reflect the differences between algorithms, but the test stability decreases slightly.

Evaluation of Decision-Making Modes
Based on Figure 7, the two algorithms are first compared. In the four modes, the average speed of CAVs was better than that of GQN except for the CONI decision mode. MDGQN performed better than GQN in the speed of total traffic flow under the four modes. However, the proportion of CAVs' average speed increase in the total traffic flow's average speed increase is less than 1. In the CONI decision-making mode, the average speed of CAVs trained by MDGQN was lower than that of GQN, but the total traffic flow speed increased. This shows that based on the designed reward function, the MDGQN algorithm can obtain decisions that improve the total traffic efficiency by training CAVs. As shown in Table 6, MDGQN had a lower collision rate, fewer emergency braking times, and a higher arrival rate of CAVs and HV2 than GQN under the same decision-making mode, except for a slight increase in the number of emergency braking incidents under CONP decision-making mode. This can be analyzed from the corresponding data. Under the CONP decision-making mode, the p HV2 of MDGQN was 0.717, which was 18.12% higher than that of GQN. When the HV2 increases in "Exit 1", the uncertainty of the scene will enhance, increasing the number of emergency brake events. From the above results, it can be concluded that the comprehensive performance of the MDGQN algorithm is superior to that of GQN in this specific autonomous driving scene, and the MDGQN is more suitable for solid, interactive, uncertain scenes.
Additionally, the four decision-making modes are compared. Compared with the conservative decision-making mode, the aggressive decision-making mode had a higher average speed, a higher collision rate, more emergency braking incidents, and a higher vehicle arrival rate. Compared with the punished decision-making model, the average speed of the incentive decision-making mode was slightly higher, the collision rate was higher, the number of emergency braking incidents was higher, and the arrival rate of HV2 was further improved. Among the four decision-making modes, AGGI decision-making mode had the highest traffic efficiency; CAVs and HV2 had the highest arrival rate, but their collision rate and number of emergency braking incidents were are also the highest. The collision rate and emergency braking incidents of CONP decision mode were the lowest, but the traffic efficiency was the lowest, and the arrival rates of CAVs and HV2 were the lowest. Those performance differences are consistent with the weight coefficient allocation of the proposed reward function matrix. This fully proves that the proposed reward function matrix can effectively train various decision-making modes to adapt to autonomous driving scenarios.

Conclusions
We proposed a reward function matrix, including a decision-weighted coefficient matrix, an incentive-punished-weighted coefficient matrix, and a reward-penalty function matrix. By adjusting the weight coefficient of the reward function matrix, various decisionmaking modes can be trained. We trained four decision-making modes, namely, AGGI, AGGP, CONI, and CONP; and the GQN algorithm and the MDGQN algorithm based on GQN improvement were used for verification by comparison. A large number of simulation results proved the following three conclusions. Firstly, the proposed reward function can promote the fast convergence of the algorithm and greatly improve the stability of the training process. Taking the CONP decision-making mode as an example, the average normalized reward value after the 200th round was 35.40% higher than that of the baseline, and the variance was only 3.27% greater than that of the baseline. Secondly, the comprehensive performance of the MDGQN algorithm is superior to that of GQN. Under the four decision-making modes, the averages and variances of test reward values of MDGQN are better than those of GQN. In terms of driving performance, MDGQN performs better than GQN, except that the number of brakes increases slightly in CONP decision-making mode. Finally, the proposed reward function matrix can effectively train various decision-making modes to adapt to different autonomous driving scenarios. With an increase in incentive weight, the comparison effect of the algorithm is more obvious, but the security will decrease. In our future work, we will further study the interactions of autonomous vehicles and decision mode switching.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
In the training process of decision-making mode, we adjusted different weight coefficients to find a mode with high comprehensive performance. To better validate the effectiveness of the proposed reward function matrix at training different decision patterns, we list four other sets of parameter settings, as shown in the Table A1. The simulation results are shown in Table A2. From this we can conclude that too much of an increase in the proportion of safety will lead to a decline in the completion rate of the driving tasks, especially CAVs' loss of the driving task objectives. Excessive increases in the proportions of the driving tasks and efficiency can lead to a rapid decline in safety. An increase in incentive rewards will lead to a decrease in security, and higher task completion rate and efficiency. Based on the above results, four groups of reasonable parameters were selected, and four decision models, CONP, CONI, AGGP, and AGGI, were trained.