Autonomous vehicle decision and control through reinforcement learning with traffic flow randomization

Most current studies on reinforcement learning-based decision-making and control for autonomous vehicles are conducted in simulated environments. Training and testing are typically carried out under rule-based microscopic traffic flow, with little consideration of transferring the resulting models to real or near-real environments to evaluate their performance. This can cause performance degradation when the trained model is tested in more realistic traffic scenes. In this study, we propose a method that randomizes the driving style and behavior of surrounding vehicles by randomizing selected parameters of the car-following and lane-changing models of rule-based microscopic traffic flow in SUMO. We trained policies with deep reinforcement learning algorithms under the domain-randomized rule-based microscopic traffic flow in freeway and merging scenes, and then tested them separately under rule-based microscopic traffic flow and high-fidelity microscopic traffic flow. The results indicate that the policy trained under domain-randomized traffic flow achieves a significantly higher success rate and cumulative reward than models trained under the other microscopic traffic flows.


Introduction
In recent years, autonomous vehicles have received increasing attention for their potential to free drivers from the fatigue of driving and to make road traffic more efficient [1]. With the development of machine learning, rapid progress has been achieved in the development of autonomous vehicles. In particular, reinforcement learning enables vehicles to learn driving tasks through trial and error, continuously improving the learned policies. Compared to supervised learning, reinforcement learning does not require the manual labeling or supervision of sample data [2][3][4][5]. However, reinforcement learning models require tens of thousands of trial-and-error iterations for policy learning, which real vehicles on the road can hardly withstand. Therefore, current mainstream research on autonomous driving with reinforcement learning focuses on training in virtual driving simulators.
Lin et al. [6] utilized deep reinforcement learning within a driving simulator, Simulation of Urban Mobility (SUMO), to train autonomous vehicles, enabling them to merge safely and smoothly at on-ramps. Peng et al. [7] also employed deep reinforcement learning algorithms within SUMO to train a model for lane changing and car following. They tested the model by reconstructing scenes using NGSIM data, and the results indicate that models based on reinforcement learning are more effective than rule-based approaches. Mirchevska et al. [8] used fitted Q-learning for high-level decision-making on a busy simulated highway. However, the microscopic traffic flows in these studies are based on rule-based models, such as the Intelligent Driver Model (IDM) [9][10][11] and the Minimize Overall Braking Induced by Lane Change (MOBIL) model. These are mathematical models based on traffic flow theory [12]. They tend to simplify vehicle motion behavior and do not consider the interaction of multiple vehicles. Autonomous vehicles trained with reinforcement learning in such microscopic traffic flows may perform exceptionally well when tested in the same environments. However, when the trained models are applied to more realistic or real-world traffic flows, their performance may significantly deteriorate, and they could even cause traffic accidents. This is due to the discrepancies between simulated and real-world traffic flows.
Numerous methods for sim-to-real transfer have been proposed to date. For instance, robust reinforcement learning has been explored to develop strategies that account for the mismatch between simulated and real-world scenes [13]. Meta-learning is another approach, which seeks to learn adaptability to potential test tasks from multiple training tasks [14]. Additionally, the domain randomization method used in this article is acknowledged as one of the most extensively used techniques for improving adaptability to real-world scenes [15]. Domain randomization relies on randomized parameters aimed at encompassing the true distribution of real-world data. Sheckells et al. [16] applied domain randomization to vehicle dynamics, using stochastic dynamic models to optimize the control strategies for vehicles maneuvering on elliptical tracks. Real-world experiments indicated that the strategy was able to maintain performance levels similar to those achieved in simulation. However, few studies have applied domain randomization to microscopic traffic flows and investigated its efficacy.
In recent years, many driving simulators have been moving towards more realistic scenes. One type comprises data-based driving simulators (InterSim [17] and TrafficGen [18]), which train neural network models by extracting vehicle motion characteristics from real-world traffic datasets, resulting in interactive microscopic traffic flows. However, their simulation time is much longer than that of most rule-based driving simulators due to the complexity of the models. The other kind comprises theory-based interactive traffic simulators, which can generate long-term interactive high-fidelity traffic flows by combining multiple modules (LimSim [19]). The traffic flow generated by LimSim closely resembles an actual dataset with a normal distribution, sharing similar means and standard deviations [20].
This paper proposes a domain randomization method for rule-based microscopic traffic flows for reinforcement learning-based decision and control. The parameters of the car-following and lane-changing models are randomized with Gaussian distributions, making the microscopic traffic flows more random and behaviorally uncertain and thus exposing the agent to a more complex and variable driving environment during training. To investigate the impact of domain randomization, we train and test agents under microscopic traffic flow without randomization, high-fidelity microscopic traffic flow, and domain-randomized traffic flow in freeway and merging scenes.
The rest of this paper is structured as follows: Section 2 introduces the relevant microscopic traffic flows. Section 3 describes the proposed domain randomization method. Section 4 presents the simulation experiments and the analysis of the results for the freeway and merging scenes. Finally, the conclusions are drawn in Section 5.

Microscopic Traffic Flow
Microscopic traffic flow models take individual vehicles as the research subject and mathematically describe the driving behaviors of the vehicles, such as acceleration, overtaking, and lane changing.

Rule-Based Microscopic Traffic Flow
This paper utilizes IDM and SL2015 as the default car-following and lane-changing models, respectively. They are introduced in detail below.

IDM Car-Following Model
IDM was originally proposed by Treiber in [9] and can describe various traffic states, from free flow to complete congestion, with a single unified formulation. The model takes the preceding vehicle's speed, the ego vehicle's speed, and the distance to the preceding vehicle as inputs and outputs a safe acceleration for the ego vehicle. The acceleration of the ego vehicle at each timestep is

v̇(t) = a [ 1 − (v(t)/v0)^δ − (s*(v(t), ∆v(t))/s)² ]

where a represents the maximum acceleration of the ego vehicle, v(t) is the current speed of the ego vehicle, v0 is the desired speed of the ego vehicle, δ is the acceleration exponent, ∆v(t) is the speed difference between the ego vehicle and the preceding vehicle, s is the current distance between the ego vehicle and the preceding vehicle, and s*(v(t), ∆v(t)) is the desired following distance. The desired distance is defined as

s*(v(t), ∆v(t)) = s0 + v(t)T + v(t)∆v(t) / (2√(ab))

where s0 is the minimum gap, T is the bumper-to-bumper time gap, and b represents the maximum deceleration.
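As a minimal sketch, the two IDM formulas above can be implemented directly; the default parameter values below are illustrative placeholders, not SUMO's or the paper's settings.

```python
import math

def idm_acceleration(v, v_lead, gap,
                     a_max=2.5, b=4.5, v0=33.3,
                     s0=2.0, T=1.5, delta=4.0):
    """IDM acceleration [m/s^2] of the ego vehicle (illustrative parameters).

    v      : current ego speed [m/s]
    v_lead : speed of the preceding vehicle [m/s]
    gap    : current distance s to the preceding vehicle [m]
    """
    dv = v - v_lead  # speed difference ∆v(t), positive when closing in
    # Desired following distance s*(v, ∆v)
    s_star = s0 + v * T + v * dv / (2.0 * math.sqrt(a_max * b))
    # Free-flow term minus interaction term
    return a_max * (1.0 - (v / v0) ** delta - (s_star / gap) ** 2)
```

With a large gap and a stationary ego vehicle the interaction term vanishes, so the output approaches the maximum acceleration; closing in fast on a slow leader yields strong braking.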

SL2015 Lane-Changing Model
The safety distance d lc,veh (t) required for the lane-changing process is computed from the velocity v(t) of the vehicle at time t, the vehicle length l veh, the safety factors a 1 and a 2, and a threshold speed v c that differentiates between urban roads and highways.
The profit b ln (t) at time t for changing lanes is computed from v(t, ln), the velocity of the vehicle in the target lane at the next timestep; v(t, lc), the safe velocity in the current lane; and v max (lc), the maximum velocity allowed in the current lane. The goal is to maximize the velocity difference, thereby increasing the benefit of changing lanes.
If the profit b ln (t) for the current timestep is greater than zero, it is added to the cumulative profit. Conversely, if the profit for the current timestep is less than zero, the cumulative profit is halved to moderate the desire to change to the target lane. A lane change can be initiated once the cumulative profit exceeds a threshold.
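The accumulate-or-halve bookkeeping described above can be sketched as follows; the threshold value is illustrative, not SUMO's actual setting.

```python
def update_lane_change_desire(cum_profit, step_profit, threshold=1.0):
    """Sketch of the SL2015 cumulative-profit rule (illustrative threshold).

    Returns the updated cumulative profit and whether a lane change
    may be initiated at this timestep.
    """
    if step_profit > 0:
        cum_profit += step_profit  # positive profit reinforces the desire
    else:
        cum_profit *= 0.5          # negative profit halves the accumulated desire
    return cum_profit, cum_profit > threshold
```

A run of small positive profits therefore eventually triggers a lane change, while a single negative step quickly damps the accumulated desire.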

LimSim High-Fidelity Microscopic Traffic Flow
The study employs the high-fidelity microscopic traffic flow of the LimSim driving simulation platform. The high-fidelity microscopic traffic flow in LimSim is based on optimal trajectories in the Frenet frame [21]. Within a circular area around the ego vehicle, the microscopic traffic flow is updated based on each vehicle's optimal trajectory.

Trajectory Generation
In the Frenet coordinate system, the motion state of a vehicle can be described by the tuple [s, ṡ, s̈, d, ḋ, d̈], where s represents the longitudinal displacement, ṡ the longitudinal velocity, s̈ the longitudinal acceleration, d the lateral displacement, ḋ the lateral velocity, and d̈ the lateral acceleration.

Lateral Trajectory Generation
The lateral trajectory curve can be expressed by a fifth-order polynomial d(t) = c0 + c1 t + c2 t² + c3 t³ + c4 t⁴ + c5 t⁵. The trajectory start point is known as D0 = [d0, ḋ0, d̈0], and a complete polynomial trajectory is determined once the end point D1 = [d1, ḋ1, d̈1] is specified. As vehicles travel on the road, they use the road centerline as the reference line for navigation, and the optimal terminal state is motion parallel to the centerline, which means the end point is D1 = [d1, 0, 0]. Equidistant sampling points are selected between the start point and end point, and the multiple polynomial segments are connected to form many complete lateral trajectories.
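Assuming the standard quintic boundary-value formulation, the six coefficients follow from the start and end states by solving a small linear system. This is a sketch of that computation, not LimSim's implementation; the longitudinal quartic case is analogous with one fewer end condition.

```python
import numpy as np

def quintic_coeffs(d0, dd0, ddd0, d1, T):
    """Coefficients c0..c5 of d(t) = sum(c_i * t**i) on [0, T].

    Start state [d0, dd0, ddd0] at t = 0; end state [d1, 0, 0] at t = T
    (moving parallel to the centerline, as described above).
    """
    # The start conditions fix the first three coefficients directly.
    c0, c1, c2 = d0, dd0, ddd0 / 2.0
    # The three end conditions (position, velocity, acceleration at t = T)
    # determine c3, c4, c5 via a 3x3 linear system.
    A = np.array([[T**3,    T**4,     T**5],
                  [3*T**2,  4*T**3,   5*T**4],
                  [6*T,     12*T**2,  20*T**3]])
    b = np.array([d1 - (c0 + c1*T + c2*T**2),
                  -(c1 + 2*c2*T),
                  -2*c2])
    c3, c4, c5 = np.linalg.solve(A, b)
    return np.array([c0, c1, c2, c3, c4, c5])
```

Evaluating the resulting polynomial at t = T reproduces the requested end displacement with zero lateral velocity, which is exactly the "parallel to the centerline" condition.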

Longitudinal Trajectory Generation
The longitudinal trajectory curve can be expressed with a fourth-degree polynomial, with S0 = [s0, ṡ0, s̈0] as the start point and S1 = [ṡ1, s̈1] as the end point (the terminal longitudinal displacement is left free, which is why a quartic suffices). Equidistant sampling points are selected between the start point and end point, and the multiple polynomial segments are connected to form many complete longitudinal trajectories.

Optimal Trajectory Selection
The trajectory selection process involves evaluating a cost function that includes key components: trajectory smoothness, which is determined by the heading and curvature differences between the actual and reference trajectories; vehicle stability, indicated by the differences in acceleration and jerk between the actual and reference trajectories; collision risk, assessed by the risk level of collision with surrounding vehicles; speed deviation, gauged by the velocity difference between the actual trajectory and the reference speed; and lateral trajectory deviation, measured by the lateral distance difference between the actual trajectory and the reference trajectory.
The total cost function is utilized to evaluate the set of candidate trajectories in Section 2.2.1, followed by an assessment of their compliance with vehicle dynamics constraints, such as turning radius and speed/acceleration limits.The trajectory that not only satisfies the vehicle dynamics constraints but also incurs the minimum cost is selected as the final valid trajectory.
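The select-the-cheapest-feasible-candidate step can be sketched as follows; the cost-term names, weights, and feasibility check are placeholders for illustration, not LimSim's actual interface.

```python
def select_trajectory(candidates, weights, dynamics_ok):
    """Pick the feasible candidate trajectory with the minimum weighted cost.

    candidates  : dict mapping a trajectory id to its cost terms
                  (e.g. smoothness, stability, collision risk, ...)
    weights     : dict mapping each cost-term name to its weight
    dynamics_ok : callable checking vehicle dynamics constraints
                  (turning radius, speed/acceleration limits)
    """
    best_id, best_cost = None, float("inf")
    for traj_id, terms in candidates.items():
        if not dynamics_ok(traj_id):  # discard dynamically infeasible candidates
            continue
        cost = sum(weights[k] * terms[k] for k in weights)
        if cost < best_cost:
            best_id, best_cost = traj_id, cost
    return best_id, best_cost
```

Filtering on feasibility before comparing costs mirrors the two-stage evaluation described above: constraints first, minimum total cost second.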
Vehicles within a 50 m perception range of the ego vehicle will be subject to the Frenet optimal trajectory control, with a trajectory being planned every 0.5 s and having a duration of 5 s.

Domain Randomization for Rule-Based Microscopic Traffic Flow
The domain randomization method is based on randomizing the model parameters in the IDM car-following model and the SL2015 lane-changing model.The randomized parameters are shown in Table 1 and are described below.
There are five randomized parameters in the IDM model: "δ" is the acceleration exponent, "T" is the time gap, and "a max", "a min", and "v max" are the upper and lower limits of vehicle acceleration and the upper limit of vehicle speed, respectively.
There are two randomized parameters in the SL2015 model. "lcSpeedGain" indicates how eager a vehicle is to change lanes in order to gain speed; the larger the value, the more inclined the vehicle is to change lanes. "lcAssertive" is another parameter that significantly influences the driver's lane-changing model [22]; a higher "lcAssertive" value makes the vehicle more inclined to accept smaller lane-changing gaps, leading to more aggressive lane-changing behavior.
Ref. [23] found that the parameters δ, T, a max, a min, and v max are approximately Gaussian distributed. Consequently, we adopt Gaussian distributions for all the domain-randomized parameters. Each randomized parameter follows a Gaussian distribution N(µ, σ²) truncated to the interval [s min, s max], where s min and s max are the lower and upper bounds of the randomization interval. µ is set to (s max + s min)/2 and σ to (s max − s min)/6, so that when a vehicle is generated, a sample from the untruncated distribution falls within [s min, s max] with probability 99.73%.
When each vehicle is initialized on the road in each episode, these randomized parameters are sampled and assigned to it.
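A minimal sketch of the sampling scheme described above; the intervals below are illustrative placeholders, not the values in Table 1.

```python
import random

# Illustrative randomization intervals [s_min, s_max] per parameter.
PARAM_RANGES = {
    "delta": (2.0, 6.0), "T": (1.0, 2.0),
    "a_max": (1.5, 3.5), "a_min": (3.0, 6.0), "v_max": (25.0, 40.0),
    "lcSpeedGain": (0.5, 2.0), "lcAssertive": (0.5, 2.0),
}

def sample_parameters(ranges=PARAM_RANGES):
    """Draw each parameter from N(mu, sigma^2) with mu = (hi+lo)/2 and
    sigma = (hi-lo)/6, rejecting the ~0.27% of samples outside [lo, hi]
    (i.e. a truncated Gaussian over the randomization interval)."""
    sampled = {}
    for name, (lo, hi) in ranges.items():
        mu, sigma = (lo + hi) / 2.0, (hi - lo) / 6.0
        x = random.gauss(mu, sigma)
        while not (lo <= x <= hi):  # truncate to [s_min, s_max]
            x = random.gauss(mu, sigma)
        sampled[name] = x
    return sampled
```

Each sampled value could then be pushed to a newly inserted vehicle through TraCI setters such as `traci.vehicle.setTau`, `traci.vehicle.setAccel`, `traci.vehicle.setDecel`, `traci.vehicle.setMaxSpeed`, or `traci.vehicle.setParameter(veh_id, "lcAssertive", str(value))`; the exact mapping used in the paper is not reproduced here.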

Simulation Experiment
In this section, we create freeway and merging environments in the open-source SUMO driving simulator [24] and establish communication between SUMO and the reinforcement learning algorithm via TraCI [25]. The timestep at which the agent selects actions and observes the environment state is set to 0.1 s. We create the non-randomized microscopic traffic flow, the high-fidelity microscopic traffic flow of LimSim, and the domain-randomized microscopic traffic flow, and we train reinforcement learning-based autonomous vehicles under these different microscopic traffic flows in the freeway and merging scenes, respectively.

Merging

4.1.1. Merging Environment
We establish the merging environment inspired by Lin et al. [6]. A control zone for the merging vehicle is established, spanning from 100 m upstream of the on-ramp's merging point to 100 m downstream of it, as depicted in Figure 1. The red vehicle, operating under reinforcement learning control, is tasked with executing smooth and safe merging within the designated control area.

State
In defining the state of the reinforcement learning environment, the merging vehicle is projected onto the main road to produce a projected vehicle, and a total of five vehicles are then considered: the two vehicles ahead of the projected vehicle, the two vehicles behind it, and the projected vehicle itself. To make reasonable use of the observable information, the distances of these five vehicles to the merging point, as well as their velocities, are included in the state representation. These parameters form a state representation with eleven variables.

Action
The action space we have defined is a continuous variable: acceleration within [−4.5, 2.5] m/s 2 .This range is consistent with the normal acceleration range of surrounding vehicles.

Reward
We aim for the merging vehicle to maintain a safe distance from the preceding and following vehicles, ensure comfort, and avoid coming to a stop or forcing the following vehicle to brake sharply. The reward function is composed of the following terms. After merging, the merging vehicle is safest when it is positioned midway between the preceding and following vehicles; the corresponding penalizing reward is parameterized by a weight factor w m and the maximum allowable speed difference ∆v max. The variable w measures the distance gaps among the merging vehicle, its first preceding vehicle, and its first following vehicle, where l p1 and l m represent the lengths of the first preceding vehicle and the merging vehicle, both measuring 5 m. When the first following vehicle brakes in the control zone, a penalizing reward with weight w b is applied based on the acceleration a f1 of the first following vehicle. To improve the comfort of the merging vehicle, we also define a penalizing reward for jerk, where w j is the weight, j max is the maximum allowed jerk, and ȧ m is the jerk of the merging vehicle.
In addition, if the merging vehicle comes to a stop, a penalty of R stop = −0.5 is imposed. When the merging vehicle collides with any vehicle, a penalty of R collision = −1 is applied. Conversely, if the merging vehicle successfully reaches its destination, a reward of R success = 1 is granted. Table 2 shows the values of the above-mentioned parameters for the merging vehicle.

SAC Algorithm
We train the merging policy with the Soft Actor-Critic (SAC) algorithm [26]. SAC uses the classical actor-critic framework of reinforcement learning, which optimizes the value function and the policy at the same time, and it consists of a parameterized soft Q-function Q θ (s t , a t ) and a tractable policy π ϕ (a t |s t ), with network parameters θ and ϕ. This approach considers a more general maximum entropy objective that not only seeks to maximize rewards but also maintains a degree of randomness in action selection:

J(π) = Σ_t E_{(s t , a t )∼ρ π} [ r(s t , a t ) + α H(π(·|s t )) ]

where ρ π denotes the state-action distribution under the policy π, and H(π(·|s t )) signifies the entropy of the policy at state s t , thereby enhancing the unpredictability of the chosen actions. The temperature parameter α plays a pivotal role, as it calibrates the balance between entropy and reward within the objective function and subsequently influences the formulation of the optimal policy. The hyperparameters of SAC are the same as in Ref. [6].
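As an illustration of how the entropy term enters SAC's update in practice, the soft Bellman target augments the reward with −α log π before discounting; the α and γ values below are illustrative hyperparameters, not the paper's settings.

```python
def soft_q_target(reward, next_min_q, next_log_prob,
                  alpha=0.2, gamma=0.99, done=False):
    """Soft Bellman target used by SAC (illustrative sketch).

    next_min_q    : min over the twin Q-networks at the next state-action
    next_log_prob : log pi(a'|s') for the sampled next action
    The entropy bonus -alpha * log pi trades return against randomness.
    """
    soft_value = next_min_q - alpha * next_log_prob
    return reward + (0.0 if done else gamma * soft_value)
```

Setting α = 0 recovers the ordinary Bellman target, which makes the role of the temperature parameter in the objective above explicit.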

Results under Different Microscopic Traffic Flows

Training
In the merging environment, we trained for 200,000 timesteps under each of the three different microscopic traffic flows. The training was carried out on an NVIDIA RTX 3060 graphics card paired with an Intel i7-12700F processor. It required approximately 1 h to complete the training under both SUMO's default non-randomized and the domain-randomized traffic flows. In contrast, the training under high-fidelity traffic flow took 3.5 h. The vehicle generation probability was 0.56, and the traffic density on the main road was approximately 16 vehicles per kilometer.

Testing
The trained policy was tested for 1000 episodes in the merging environment. We evaluated the trained policy based on the merging vehicle's success rate, defined as the completion of an episode without any collisions, and the average reward over the entire testing period.
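The two test metrics can be computed with a simple helper; the episode bookkeeping (collision flag and episode reward) is assumed to be collected elsewhere during testing.

```python
def evaluate(episodes):
    """Compute (success_rate, average_reward) over a test run.

    episodes : list of (collided: bool, episode_reward: float) tuples,
               one per test episode.
    An episode counts as a success when it finishes without a collision.
    """
    successes = sum(1 for collided, _ in episodes if not collided)
    avg_reward = sum(r for _, r in episodes) / len(episodes)
    return successes / len(episodes), avg_reward
```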

Comparison and Analysis
The training curves depicted in Figure 2 suggest that there is minimal visible difference in the rate of convergence and the rewards achieved by strategies trained under different microscopic traffic flows.
Table 3 shows that the policies trained under non-randomized rule-based traffic flow and under high-fidelity microscopic traffic flow yield poor results when transferred to domain-randomized rule-based traffic flow. Conversely, the policy trained under domain-randomized rule-based traffic flow consistently achieves success rates above 90% when tested across all three traffic flows. High-fidelity microscopic traffic flows closely resemble actual traffic scenes, so we used them as the test traffic flow with increased traffic densities. The impact of changes in traffic density is shown in Table 4. It can be observed that the policy trained under non-randomized rule-based traffic flow experiences a gradual decline in success rate and reward as traffic density increases. In contrast, the policy trained under domain-randomized rule-based traffic flow consistently maintains a higher success rate. Here, ϕ is the vehicle generation probability of the microscopic traffic flow, defined as the number of vehicles generated from the lane starting point per second.

Ablation Study
To better understand the role of individual domain-randomized parameters in the model's performance, we analyzed their individual impact on the training outcomes through an ablation study. We separately ablated each of the domain-randomized parameters, then trained policies individually under the traffic flows with the corresponding parameter ablated. Finally, the trained policies were tested under both the fully domain-randomized (all parameters randomized) and high-fidelity traffic flows. The results of the ablation study are shown in Table 5.
It can be observed that the performance of the policies trained under the ablated traffic flows declines when tested under the high-fidelity traffic flow. Moreover, the ablation of v max affects performance most significantly.

Freeway

4.2.1. Freeway Environment
We used a straight two-lane freeway measuring 1000 m in length, inspired by Lin et al. [27]; the scenario is depicted in Figure 3.

State
The state of the environment is centered on the ego vehicle and four nearby vehicles: one directly in front of and one directly behind it in the same lane, and two similarly positioned vehicles in the adjacent lane. At time t, the state comprises the longitudinal distances (d p t , d f t , d adjacent p t , d adjacent f t ) of these four vehicles from the ego vehicle, their respective velocities, and the speed and acceleration (v ego t , a ego t ) of the ego vehicle. These parameters form a state representation with ten variables.

Action
The action space combines a continuous action acc ego t , the acceleration of the ego vehicle, with the discrete actions '0' and '1' that dictate lane-changing behavior: '0' keeps the current lane and '1' performs an instantaneous lane change to the other lane.

Reward
We have formulated a reward function aligned with practical driving objectives, incentivizing behaviors such as avoiding collisions, obeying speed limits, preserving comfortable driving conditions, and maintaining a safe following distance. The total reward R total is the sum of the terms below. To penalize frequent lane changes, the penalty R act depends on the ego vehicle's lateral position y t , with weights ω 0 < ω 1 : if the vehicle changes lanes within the safety distance d safe , it incurs a penalty of ω 0 , while changing lanes outside d safe results in a penalty of ω 1 .
It is essential that the ego vehicle maintain a safe following distance from the preceding vehicle, which is enforced by the penalizing reward R distance . The objective of R jerk is to ensure driving comfort; it is computed from a t and a t−1 , the accelerations at the current and previous timesteps. To promote an ego-vehicle speed that enables overtaking, the penalty R v is applied. When there is no opportunity to overtake the vehicle ahead, the ego vehicle should travel at a steady speed similar to that of the preceding vehicle. Consequently, we introduce a threshold d* : as long as d p ∈ [d safe , d safe + d*], the ego vehicle incurs neither R distance nor R v .
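Since only the shape of these penalties is described here, the following is an illustrative sketch of the d safe / d* band logic, not the paper's exact formulas or weights; all numeric values are assumptions.

```python
def freeway_penalties(d_p, v_ego, v_max, jerk,
                      d_safe=30.0, d_star=10.0,
                      w_dist=0.1, w_v=0.05, w_jerk=0.01):
    """Illustrative shapes for R_distance, R_v and R_jerk (sketch only).

    Inside the steady-following band [d_safe, d_safe + d_star] neither the
    distance penalty nor the speed penalty applies, as described above.
    """
    r_dist = r_v = 0.0
    if not (d_safe <= d_p <= d_safe + d_star):
        if d_p < d_safe:                            # too close: distance penalty
            r_dist = -w_dist * (d_safe - d_p)
        else:                                       # room to overtake: speed penalty
            r_v = -w_v * max(0.0, v_max - v_ego)
    r_jerk = -w_jerk * abs(jerk)                    # comfort term
    return r_dist + r_v + r_jerk
```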
In the equations presented above, ω i denotes the corresponding weights. The key parameters for the freeway scene are presented in Table 6. The SAC algorithm in Section 4.1.2 can only solve continuous-action space problems. To deal with the continuous-discrete hybrid action space of freeway lane changing, we adopt the Parameterized SAC (PASAC) algorithm, inspired by Lin et al. [27].
PASAC is based on SAC. The actor network produces continuous outputs, which include both the continuous actions and the weights for the discrete actions. An argmax function selects the discrete action associated with the maximum weight.
The freeway environment, having a hybrid continuous-discrete action space, therefore requires the agent to be trained with the PASAC algorithm. The hyperparameters of PASAC are the same as those of SAC.
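The weight-plus-argmax action split described above can be sketched as follows; the actor output layout (discrete weights first, then continuous actions) is an assumption for illustration.

```python
import numpy as np

def split_hybrid_action(actor_output, n_discrete=2):
    """PASAC-style split of the actor's continuous output vector.

    The first n_discrete entries are interpreted as weights for the
    discrete actions (lane keep / lane change); the remaining entries
    are the continuous actions (here, acceleration).
    """
    weights = actor_output[:n_discrete]
    continuous = actor_output[n_discrete:]
    discrete_action = int(np.argmax(weights))  # pick max-weight discrete action
    return discrete_action, continuous
```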

Results Under Different Microscopic Traffic Flows

Training
In the freeway environment, we trained for 400,000 timesteps under each of the three different microscopic traffic flows. It required 1.5 h to complete the training under both the rule-based traffic flows without randomization and with domain randomization. In contrast, the training under high-fidelity traffic flow took 5 h. The vehicle generation probability was 0.14 vehicles per second, and the traffic density on the main road was approximately 11 vehicles per kilometer on each straightaway.

Testing
The trained policy was tested for 1000 episodes in the freeway environment. We evaluated the trained policy based on the ego vehicle's success rate, defined as the completion of an episode without any collisions, and the average reward over the entire testing period.

Comparison and Analysis
In Figure 4, it can be observed that all policies tend to converge after around 200 episodes. Throughout the training process, aside from the initially lower reward under the domain-randomized traffic flow, the convergence rates and final rewards of the three curves are closely aligned.
The results of testing are shown in Table 7. It can be observed that the policy trained under domain-randomized rule-based traffic flow achieves the highest success rates when tested under the different microscopic traffic flows. The policies trained under non-randomized rule-based traffic flow and under high-fidelity traffic flow cannot adapt to the domain-randomized rule-based traffic flow.

Figure 2. Undiscounted episode reward during training under the three traffic flows.
Figure 3. The ego vehicle overtakes along the arrow trajectory in the freeway. The ego vehicle is indicated by the red car and the surrounding vehicles are represented by the green cars.

Table 2. Parameter values for the merging vehicle.

Table 3. Results of testing the trained policies in the merging scene.

Table 4. The impact of traffic densities on the three trained policies under high-fidelity traffic flow.

Table 6. Parameters for the freeway simulation.