Generalized Single-Vehicle-Based Graph Reinforcement Learning for Decision-Making in Autonomous Driving

In the autonomous driving process, the decision-making system provides macro-control instructions based on the information captured by the sensing system. Learning-based algorithms have clear advantages in processing and understanding information from an increasingly complex driving environment. To incorporate the interactive information between agents in the environment into the decision-making process, this paper proposes a generalized single-vehicle-based graph neural network reinforcement learning algorithm (SGRL algorithm). The SGRL algorithm introduces graph convolution into the traditional deep Q-network (DQN) algorithm, adopts a single-agent training method, designs a more explicit incentive reward function, and significantly expands the dimension of the action space. The SGRL algorithm is compared with the traditional DQN algorithm (NGRL) and the multi-agent training algorithm (MGRL) in a highway ramp scenario. Results show that the SGRL algorithm has outstanding advantages in network convergence, decision-making effect, and training efficiency.


Introduction
In autonomous driving, the decision-making system is mainly used to produce high-level actions of vehicles, such as lane changing, acceleration, and braking. Tactical decision-making for autonomous driving is challenging due to the diversity of environments, the uncertainty in sensor information, and the complex interactions with other road users [1,2]. Traditional vehicle trajectory modeling and tracking control methods, such as genetic algorithms, neural networks, and their optimizations, have played a positive role in decision-making research [3].
The operational space of an autonomous vehicle (AV) is diverse and can vary significantly, so formulating a rule-based decision-maker for selecting driving maneuvers may not be ideal [4]. With the development of deep learning, reinforcement learning (RL) has become a robust learning framework capable of learning complex policies in high-dimensional environments [5]. Using reinforcement learning to solve decision-making problems has therefore gradually become the mainstream of research. Carl-Johan Hoel et al. introduced a method based on deep reinforcement learning for automatically generating a general-purpose decision-making function [6]; they trained an RL agent to handle the speed and lane change decisions of a truck-trailer combination in a simulated environment. Hongbo Gao et al. solved sequential decision optimization problems with an inverse reinforcement learning algorithm, and the proposed method was verified in terms of efficiency [7]. Other algorithms for planning and decision-making are based on analyzing driver behavior [8,9].
The realization of decision-making is based on understanding and analyzing high-dimensional environmental information. However, traditional reinforcement learning algorithms only make good decisions from low-dimensional features [10] and understand more complex scenario information insufficiently. Deep neural networks (DNNs) have a strong ability to learn representations and to generalize matching patterns from high-dimensional data. Therefore, deep reinforcement learning (DRL) algorithms are effective in tasks requiring both feature representation and policy learning, e.g., autonomous driving decision-making [11]. Using the function-approximation ability of a DNN, intelligent controllers integrating artificial intelligence technologies such as deep learning (DL) and reinforcement learning (RL) have been designed for lane keeping and obstacle avoidance [12][13][14], decision-making [15][16][17][18][19][20][21], longitudinal control [22], merging maneuvers [23], human-like driving strategies [24][25][26], and other large-scale autonomous driving control tasks. Yingjun Ye et al. put forward a framework for decision-making training and learning consisting of a deep reinforcement learning (DRL) training program and a high-fidelity virtual simulation environment [27]. Compared with the value-based DQN algorithm, the policy-based deep deterministic policy gradient (DDPG) algorithm can handle continuous action spaces. Haifei Zhang et al. used the DDPG algorithm to solve the control problem of automatic driving based on a reasonable reward function, a deep convolutional network, and an exploration policy [28].
Interaction between vehicles in common public transport scenarios is necessary and pervasive. However, there are relatively few studies on how autonomous vehicles interact in public environments with reinforcement learning. Realizing coordination between vehicles in a shared environment is challenging due to the unique feature of vehicular mobility, which makes it infeasible to apply the existing reinforcement learning methods directly. Chao Yu et al. proposed using a dynamic coordination graph to model the continuously changing topology during vehicles' interactions and developed two basic learning approaches to coordinate the driving maneuvers for a group of vehicles [29].
Cooperative vehicle-infrastructure systems and automatic driving technology are developing rapidly [30]. The graph neural network (GNN) has gained increasing popularity in various domains, including social network analysis [31,32]. A GNN can extract relational data representations and generate useful node embeddings from each node's features and those of its neighbors. The interactions between the ego vehicle and surrounding vehicles can also be represented by a dynamic potential field (DPF) and embedded in a gap acceptance model to ensure safety and personalization during driving [33]. Jiqian Dong et al. proposed a novel deep reinforcement learning (DRL)-based approach combining a graph convolutional neural network (GCN) and a deep Q-network (DQN), namely the graph convolutional Q-network, as the information fusion module and decision processor [34]. The proposed model can aggregate the information obtained through collaborative perception and output collaborative lane change decisions for multiple vehicles; even in highly dynamic and partially observed mixed traffic, the driving intention can be satisfied. However, this multi-agent-training-based GNN reinforcement learning (MGRL) exhibits the following problems in practical verification (in the highway ramp-exiting scenario):
1. Multi-agent simultaneous training increases the computational complexity of the network, resulting in a higher overall training time cost. Each parameter modification therefore requires a longer verification time, which hinders the development and tuning of the algorithm.

2. Rewards and punishments offset each other in multi-agent joint training, resulting in poor network convergence. Because the reward values of multiple agents influence one another, the current state cannot be evaluated accurately, which causes highly unstable fluctuation of the loss curve.

To address these problems, the SGRL algorithm proposed in this paper makes the following improvements:

1. While retaining interactive feature extraction (GNN), the trained object is shifted from multi-agent to single-agent. Through internal processing of the network model, the single-agent training result can be applied in multi-agent scenarios. This training method significantly reduces the time cost of model training and eliminates the interference between agents' rewards and punishments, improving the training effect.

2. An inductive reward function is designed to improve the convergence speed of the model. The reward function incorporates driving intention, collision, lane change frequency, and vehicle speed. The trained model simultaneously performs well in task success rate, safety, driving stability, and traffic efficiency.

3. The dimension of the action space is enhanced by changing the network structure. Adding longitudinal velocity control gives the model more robust control over task success rate, safety, and traffic efficiency.

4. In the training process, real-time screening of stored data improves the training speed. Randomly generating the number, type, and position of vehicles in the training scenario improves the model's generalization ability and avoids overfitting.
The paper is organized as follows. Section 2 introduces the proposed SGRL algorithm in detail. Section 3 shows the model training and testing of the SGRL algorithm and comparison algorithms (MGRL and NGRL). Section 4 shows the results of training and testing and analyzes the comparison. Finally, Section 5 derives the conclusions and proposes future improvement directions.

Method
The proposed SGRL method models decision-making as a Markov Decision Process (MDP). The agent explores the environment by observing states, taking actions, and receiving rewards, as shown in Figure 1. The core of the SGRL implementation comprises the preprocessing of input data, the structure of the deep neural network, and the design of the output end.

Network Input
At each time step of the training process, the vehicles in the scenario are divided into human-driven vehicles (HVs) and autonomous vehicles (AVs). At each step t, the input of the training model consists of two parts: the feature matrix X_t and the correlation matrix C_t, so the state is s_t = (X_t, C_t).

The feature matrix X_t contains the necessary basic information for highway scenarios. The total number of HVs and AVs is N = N_HV + N_AV, so the matrix has specification N × 8. The eight characteristic parameters of each vehicle describe its speed, lateral position, longitudinal position, and destination information, denoted x_i = (v_i, p_i, l_i, I_i). To serialize multiple data at the same scale, the parameters are normalized.

Based on the GNN, a correlation matrix C_t is introduced to capture the relationships between vehicle nodes. Since the sensors installed on autonomous vehicles have fixed sensing ranges, only vehicles within the sensing range are recorded in the correlation matrix. C_t is a square matrix of specification N × N: row i represents the relationship between vehicle i and the other vehicles as binary data, and the entry C_ij is set to 1 if vehicle j is within the set distance of the target vehicle i, as shown in Figure 2.

Because the number of vehicles in the scene is dynamic, this is accounted for when constructing both matrices. The vehicles in the environment are divided into HVs and AVs, located in the upper and lower parts of the feature matrix; when the number of vehicles is less than N, the remaining positions in the matrix are filled with 0.
To ensure the authenticity of the simulation process, vehicles appear in and disappear from the scenario successively, so the feature information and correlation information obtained at step t may not fill the predetermined matrix size. To ensure the standard calculation of the GNN, matrix data filling is needed. Matrix segmentation prevents the data confusion and erroneous correspondence that random packing would cause, as shown in Figure 3.
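The construction of the padded state s_t = (X_t, C_t) described above can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the sensing range, the feature-column layout, and the use of a purely longitudinal distance test are assumptions.

```python
import numpy as np

# Illustrative sketch: build the state s_t = (X_t, C_t) for N slots,
# with HVs in the upper rows, AVs in the lower rows, and zero-padding
# for empty slots. Feature layout and sensing range are assumptions.
N, F = 10, 8          # max vehicles, features per vehicle
SENSE_RANGE = 50.0    # assumed sensor range in meters

def build_state(hvs, avs):
    """hvs/avs: lists of length-F feature vectors; returns (X_t, C_t)."""
    X = np.zeros((N, F))
    n_hv_slots = N // 2
    for i, x in enumerate(hvs):          # HVs occupy the upper part
        X[i] = x
    for i, x in enumerate(avs):          # AVs occupy the lower part
        X[n_hv_slots + i] = x
    present = np.any(X != 0.0, axis=1)   # all-zero rows are empty slots
    # Binary correlation matrix: C[i, j] = 1 if j is within range of i.
    lon = X[:, 2]                        # assume column 2 = longitudinal pos
    C = (np.abs(lon[:, None] - lon[None, :]) <= SENSE_RANGE).astype(float)
    C *= np.outer(present, present)      # empty slots relate to nothing
    return X, C
```

Splitting the matrix into a fixed HV block and a fixed AV block is what the paper calls matrix segmentation: padding rows never migrate between the two blocks, so vehicle identities cannot be confused.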

Network Structure
In the overall network structure, the function of the graph convolution is to obtain the interaction information between vehicles, and the function of the fully connected layers is to parse the matrix information. Each newly obtained matrix in the network is fed to a fully connected layer for recording and information analysis. In addition, the number of nodes in each layer plays a decisive role in the model's performance; the optimal number of nodes is selected through comparative experiments.
Data records for the comparative tests are shown in Table 1. At each step t, the feature matrix X_t is first fed to a fully connected network (FCN) encoder ψ, yielding the matrix H_t ∈ R^(N×128) (Equation (1)). The original eight characteristic values are re-encoded into 128 values, and the FCN is executed twice consecutively.
Next is the calculation of graph convolution. The input of the convolution function is the newly obtained feature matrix H t and correlation matrix C t . The function output matrix K t has the same specification as the original matrix H t (Equation (2)).
The original feature information and the new information obtained by graph convolution are fused by matrix concatenation and serve as the total input of the subsequent neural network η.
The overall structure of the SGRL algorithm, including the change in feature data density at each stage, is shown in Figure 4.
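The forward pass described above can be sketched compactly. Since Equations (1) and (2) are not reproduced in this text, the activation function, weight initialization, and row-normalized aggregation below are illustrative assumptions; only the shapes (N × 8 → N × 128 → N × 256) follow the paper.

```python
import numpy as np

# Minimal sketch of the SGRL forward pass: double FCN encoder, one graph
# convolution over the correlation matrix, then concatenation. ReLU and
# row normalization are assumptions, not the paper's exact Equations (1)-(2).
rng = np.random.default_rng(0)
N, F, H = 10, 8, 128

W1 = rng.normal(0, 0.1, (F, H))     # FCN encoder psi, layer 1
W2 = rng.normal(0, 0.1, (H, H))     # FCN encoder psi, layer 2
Wg = rng.normal(0, 0.1, (H, H))     # graph-convolution weights

def relu(x):
    return np.maximum(x, 0.0)

def forward(X, C):
    Ht = relu(relu(X @ W1) @ W2)         # Eq. (1): H_t in R^(N x 128)
    deg = C.sum(axis=1, keepdims=True)
    A = C / np.maximum(deg, 1.0)         # row-normalized correlation matrix
    Kt = relu(A @ Ht @ Wg)               # Eq. (2): K_t has H_t's shape
    return np.concatenate([Ht, Kt], axis=1)  # fused input to network eta

Z = forward(rng.normal(size=(N, F)), np.ones((N, N)))
```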

Network Output
Since the vehicle longitudinal control model built into the simulation environment is still rule-based, it cannot incorporate the interaction between vehicles into the control model. The SGRL algorithm fuses longitudinal and lateral control into the same model by increasing the output dimension, which significantly simplifies the complexity of the control process and solves the coupling problem of two directions.
The SGRL algorithm is trained based on DQN, so the output action space must be discrete. For AVs, longitudinal control is mainly reflected in the acceleration, and lateral control is mainly reflected in the lane change direction. Increasing the output action resolution improves control sensitivity, but it also increases computational complexity and reduces the control frequency, so this paper selects a compromise: the longitudinal control is set to 11 discrete values in the interval [−5, 5], and the lateral control is set to three actions: keeping the lane, turning left, and turning right. The combination yields 11 × 3 = 33 discrete actions, so the output matrix of the model is A_t ∈ R^(N×33), as shown in Figure 5.
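A discrete action index can be mapped back to an (acceleration, lane-change) pair as sketched below. The 11 × 3 layout follows the text; the specific index ordering is an assumption, since the paper does not state it.

```python
# Sketch of decoding an action index in [0, 33) into an (acceleration,
# lane-change) pair, assuming the 11 x 3 layout described above; the
# paper's actual index ordering is an assumption.
N_ACCEL, N_LANE = 11, 3
ACCELS = [-5 + i for i in range(N_ACCEL)]   # 11 values spanning [-5, 5] m/s^2
LANES = ["keep", "left", "right"]

def decode_action(a):
    """Return (acceleration, lane action) for discrete index a."""
    assert 0 <= a < N_ACCEL * N_LANE
    return ACCELS[a // N_LANE], LANES[a % N_LANE]
```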

Reward Function
The setting of the reward function needs clear guidance and a strong correlation with training objectives. In the SGRL algorithm, the task success rate, security, and traffic efficiency should be considered simultaneously. The corresponding reward values are set as intention reward, crash reward, and speed reward.

Intention Reward
To guide autonomous vehicles to exit the highway from the correct ramps, the SGRL algorithm constructs reward gradients for the different lanes, as shown in Figure 6 (intention reward gradient diagram). Merge_0 and merge_1 refer to the autonomous vehicles that must leave the highway via ramp_0 and ramp_1, respectively, according to the task requirements. In the simulation experiment, the driving route of every autonomous vehicle is entirely determined by the reinforcement learning controller; therefore, vehicles in the merge_0 category will not necessarily exit from ramp_0, i.e., they may fail to complete the driving task. Task completion is treated as a factor for judging the quality of the current model, expressed through the reward values a vehicle obtains in different driving sections and lanes. Each strip area in Figure 6 represents the reward type that the corresponding vehicles obtain in that area: green represents reward (1 × R_I-Base), blue represents punishment (−1 × R_I-Base), and orange represents serious punishment (−2 × R_I-Base). R_I-Base is set to 1 for normalization.
To provide practical guidance at every location, the reward value R_I-t for each fixed area is set to a constant. For the various rewards to interact properly during training, their numerical scales must be kept consistent when setting R_I-t. Moreover, R_I-t distinguishes between positive and negative values, which is more conducive to training. In addition, if an autonomous vehicle completes the task, it obtains a larger positive R_I-t; if the task fails, it receives a larger negative R_I-t.

Crash Reward
The safety of autonomous driving is the basis of all other characteristics. The simulation platform SUMO can detect collisions at step t and output the number N collision−t of vehicles involved in collisions. The collision reward is calculated according to N collision−t (Equation (3)).

Speed Reward
To keep the reward scales consistent, the speed reward value R_S-t is normalized, where v_max = max(v_max-vehicle, v_max-highway) is the maximum speed for the AV. To ensure that R_S-t takes both positive and negative values, the calculation is shown in Equation (4).

Total Reward
The general method of calculating the total reward is a weighted sum of the components, but direct addition leads to the rewards covering one another, resulting in a poor training effect. The SGRL algorithm therefore uses a new aggregation method, shown in Equation (5).
This calculation method takes the task success rate and safety as the primary considerations while also accounting for traffic efficiency, and it completely avoids the problem of vehicles stopping to accumulate intention rewards. Moreover, the weight relationship between the rewards can be adjusted through the parameters ω_I and ω_C; this paper uses ω_I = 1 and ω_C = 2.
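Since Equations (3)-(5) are not reproduced in this text, the following is only an illustrative sketch of how the three reward terms and the weights ω_I and ω_C could combine; it uses a plain weighted form and therefore does not reproduce the paper's actual (non-additive) Equation (5). Every formula in it is an assumption.

```python
# Hedged sketch only: the paper's Eq. (5) replaces direct summation with a
# different aggregation, which is not shown here. This illustrates the role
# of the terms and weights, not the paper's exact formula.
def total_reward(r_intention, n_collisions, v, v_max, w_i=1.0, w_c=2.0):
    r_crash = -float(n_collisions)       # assumed shape of Eq. (3)
    r_speed = 2.0 * v / v_max - 1.0      # assumed Eq. (4): normalized to [-1, 1]
    return w_i * r_intention + w_c * r_crash + r_speed
```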

Model Training and Testing
The most prominent feature of the SGRL algorithm is that it is trained for a single agent, while the obtained model can be applied in a multi-agent environment. For a trained agent, the vehicles around it can all be treated as the same category: surrounding vehicles. Therefore, the work done by the GNN-based reinforcement learning model can be interpreted as planning and decision-making based on the characteristics and relationships of the target agent and its surrounding vehicles. The model obtained by single-agent training can be directly transplanted to the other agents in the same scenario, as shown in Figure 7. To prevent overfitting of the reinforcement learning process, we randomize the distribution of training vehicles and the tasks of the autonomous vehicles in each episode based on fixed rules, as shown in Figure 8.

Algorithm 1 SGRL Training Steps

Initialize the simulation environment
## Exploration step ##
For step t = 1 to T_0 do
    Choose a random action a_t for agent i
    Get the transition (s_t, a_t, r_t, s_t+1)
    Store it in the buffer
## Training step ##
For step t = T_0 + 1 to T (total steps) do
    With probability e, choose a random action a_t-r for agent i
    With probability 1 − e do:
        State decoding: (X_t, C_t) = s_t
        Double FCN and graph convolution: compute the Q values
        Select a*_t = arg max Q_θ(s_t)
    Execute a*_t and get a new state s_t+1
    Store the transition in the buffer
    Set s_t = s_t+1
    ## Training model at training step ##
    Sample a batch from the buffer and calculate the Q target
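The epsilon-greedy loop with a replay buffer in Algorithm 1 can be sketched as a minimal runnable example. A linear Q-function and a toy random environment stand in for the GNN model and the SUMO simulation; all dimensions and hyperparameters are illustrative assumptions.

```python
import random
from collections import deque

import numpy as np

# Minimal sketch of the DQN-style loop in Algorithm 1. A linear Q(s) = s @ W
# replaces the FCN + graph-convolution network, and toy_env_step replaces
# the SUMO environment; dimensions and hyperparameters are assumptions.
STATE_DIM, N_ACTIONS = 8, 33
GAMMA, EPS, LR = 0.9, 0.1, 1e-3
random.seed(0)
rng = np.random.default_rng(0)
W = rng.normal(0, 0.01, (STATE_DIM, N_ACTIONS))

buffer = deque(maxlen=10_000)   # replay buffer for transitions

def toy_env_step(s, a):
    """Stand-in environment: random next state and reward."""
    return rng.normal(size=STATE_DIM), rng.uniform(-1, 1)

s = rng.normal(size=STATE_DIM)
for t in range(200):
    if random.random() < EPS:                 # epsilon-greedy exploration
        a = random.randrange(N_ACTIONS)
    else:
        a = int(np.argmax(s @ W))             # greedy w.r.t. current Q
    s_next, r = toy_env_step(s, a)
    buffer.append((s, a, r, s_next))          # store the transition
    s = s_next
    if len(buffer) >= 32:                     # train on a sampled batch
        batch = random.sample(buffer, 32)
        for bs, ba, br, bs2 in batch:
            target = br + GAMMA * np.max(bs2 @ W)   # Q target
            td = target - (bs @ W)[ba]              # TD error
            W[:, ba] += LR * td * bs                # gradient step on Q(s, a)
```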

Algorithm 2 SGRL Testing Steps
Initialize the simulation environment
## Testing step ##
For step t = 1 to T (total test steps) do
    State decoding: (X_t, C_t) = s_t
    For AV i = 1 to N_AV do
        Move the other AVs' features to the back of the HVs
        Calculate the Q values based on the trained model and select the action
    Execute the actions and get a new state s_t+1
    Set s_t = s_t+1
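The reordering step in Algorithm 2 is what lets the single-agent model control every AV in turn: before evaluating AV i, the other AVs' rows are moved behind the HV rows so that, from AV i's perspective, every other vehicle is just a surrounding vehicle. A sketch, with the "ego AV last" convention as an assumption:

```python
import numpy as np

# Sketch of the Algorithm 2 reordering: keep AV `av_index` as the ego row
# and move the other AVs behind the HV rows. Row layout (HVs on top, AVs
# below) follows the Network Input section; placing the ego row last is
# an assumed convention.
def reorder_for_av(X, C, n_hv, av_index):
    """Return X, C permuted so other AVs follow the HVs and the ego AV is last."""
    n = X.shape[0]
    others = [j for j in range(n_hv, n) if j != av_index]
    order = list(range(n_hv)) + others + [av_index]
    return X[order], C[np.ix_(order, order)]   # permute rows and columns
```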

Baseline Models
Two baseline models are introduced for comparative analysis in the simulation: the traditional DQN algorithm (NGRL) and the multi-agent training algorithm (MGRL). In the NGRL algorithm, the GNN part is removed, and the correlation matrix is used as input by splicing with the feature matrix.
The model of the MGRL algorithm is consistent with the SGRL proposed in this paper in terms of network structure, but its training process is based on multi-agent environment interaction.
By comparing the three models, the specific effects of the GNN structure and the single-agent training of SGRL can be effectively analyzed. The simulation scenario is shown in Figure 9.

Simulator Parameters
The simulation scenario is a long highway with three lanes and exit ramps at one-third and two-thirds of the total length. The speed limit for the whole road is set to 20 m/s (72 km/h) for all vehicles. The AVs and HVs enter the scenario from the left side of the road with probabilities of 0.1 and 0.4, respectively, and their initial speed and lane position are random.
There are six types of vehicles in the scenario, including HVs that travel straight through the highway, vehicles (HVs and AVs) that want to exit from ramp_0 and vehicles (HVs and AVs) that want to exit from ramp_1. The simulation environment controls the driving of HVs, and the AVs are completely controlled by the reinforcement learning model SGRL in real time.
The number of vehicles in the experiment is set as shown in Table 2.

Training Results
All three models were trained for 1000 episodes. The key indicators of the training process are the reward, the average Q value, and the loss value; average rewards are obtained by averaging the rewards over steps. The specific changes are shown in Figures 10-13.
Comparing the reward values in the training process, under the same reward calculation, the reward values and average reward values of the SGRL algorithm converge faster and reach a better final convergence.
The changing trends of the Q value and loss value of the SGRL algorithm during training are consistent with the learning process, and the loss results show that SGRL converges faster. The SGRL algorithm has obvious advantages in task success rate and average vehicle speed, and it also meets driving safety requirements in terms of collision counts and average training step length.
In addition, SGRL has apparent advantages in training efficiency under the premise of the same training episodes. The specific hardware parameters and training time are shown in Table 3. The comparison of the training time is shown in Table 4.

Testing Results
The trained model must be verified through testing. To fully verify the algorithm's effectiveness, we adjust the total length of the road while keeping the same traffic flow (20 vehicles per episode): the three algorithms are tested on 1000 m, 750 m, and 500 m roads to simulate different traffic flows and congestion levels.
Each test process includes 1000 episodes. In testing, the reward value still serves as an essential evaluation of model performance; the simulation results for the reward value and average reward value are shown in Figures 18 and 19. For the most complex and congested 500 m highway scenario, the longitudinal action distributions of the three algorithms over the whole test cycle are shown in Figure 20. Longitudinal control of autonomous vehicles becomes increasingly complex as road congestion rises, so the results of the 500 m scenario are the most informative reference. The test results show that the action output of the SGRL algorithm is concentrated near 0, and the probability of large accelerations is acceptably low. To compare task success rate, safety, and traffic efficiency, the average velocity, number of collisions, success rate, and average steps per episode are collected.
The data for the different methods are listed in Table 5, and the means of the above data are shown in Figures 21-24. The figures and table show that SGRL has clear advantages in task success rate and average velocity. In terms of the number of collisions and the test steps, although SGRL's values are not the best, they are comparable to the best values.

Results Discussion
It can be seen that the SGRL algorithm has outstanding advantages over the MGRL algorithm and the NGRL algorithm. The SGRL algorithm can converge faster during the training process and achieve better data performance. In the testing process, the SGRL algorithm has outstanding data performance in terms of task success rate, average vehicle speed, and security.
The MGRL algorithm directly sums the reward values of all autonomous vehicles when calculating the reward. As a result, well-performing and poorly performing vehicle behaviors offset each other, which hinders adequate parameter updates, slows convergence, and degrades the final convergence quality.
The NGRL algorithm lacks the calculation consideration of the interaction process between vehicle individuals. Therefore, in the decision-making process, due to the relatively simple understanding of the environment, it cannot obtain sufficient data support, so the data performance is poor.
The SGRL algorithm considers and solves the above problems and optimizes the network structure. According to the comparison of data, it can be verified that the improvement of SGRL is obviously effective.

Conclusions
This paper proposes a generalized single-vehicle-based graph neural network reinforcement learning algorithm (SGRL algorithm). This algorithm combines GNN with deep reinforcement learning to solve the vehicle planning problem in the scenario of highway driving out of the ramp. The SGRL model is trained for a single agent and can be tested in multi-agent scenarios. At the same time, the algorithm sets up an improved reward function to provide a clear direction for training.
Comparing the three algorithms, the conclusions are as follows:
• Firstly, a training mode in which single-agent training is extended to multi-agent scenarios is proposed and verified in terms of training effectiveness and performance. The algorithm improves the analytical ability of the DRL network by increasing the number of network nodes, thereby extending the control dimensions of vehicle decision-making to both the longitudinal and lateral directions.
• Secondly, the proposed SGRL algorithm simplifies the training mode and improves training efficiency without degrading the training effect, which helps in tuning complex parameters and reducing time costs.
• Thirdly, SGRL trains more thoroughly and achieves a better convergence effect, with the smallest fluctuation after the training data stabilize. This shows that the proposed SGRL algorithm has outstanding training ability and is well suited to reinforcement-learning-based decision-making in multi-agent scenarios.
• Finally, the newly designed reward function effectively solves the problem of mutual influence between longitudinal and lateral control. SGRL achieves higher task success rates and average velocities in both training and testing, showing that the new reward function, the single-agent training method, and the incorporation of GNN effectively improve the decision performance of the model.
In future research, continuity of the model's action space can be added to this algorithm, which would effectively improve the driving fluency of vehicles. In addition, the relationships between the agents in the scenario should not be limited to physical characteristics: decision-making, driving intention, and task priority can also be incorporated into the calculation process.

Funding: This research received no external funding.

Conflicts of Interest:
The authors declare no conflict of interest.