Deep Reinforcement Learning for Traffic Signal Control Model and Adaptation Study

Deep reinforcement learning provides a new approach to solving complex signal optimization problems at intersections. Earlier studies were limited to traditional traffic detection techniques, and the traffic information obtained was not accurate. With advances in detector technology, highly accurate information on traffic states can now be obtained, which provides an accurate source of data for deep reinforcement learning. An urban network contains many intersections. To apply deep reinforcement learning successfully in a situation closer to reality, we need to consider how to extend the knowledge gained from training to new scenarios. This study used advanced sensor technology as a data source to explore the variation pattern of the state space under different traffic scenarios, and analyzed the relationship between the traffic demand and the actual traffic states. The model learned more from a more comprehensive state space of traffic and was successfully applied to new traffic scenarios without additional training. Comparing our proposed DQN model with the popular SAC signal control model, the results show that the average delay of the DQN model is 5.13 s and that of the SAC model is 6.52 s. Therefore, our model exhibits better control performance.


Introduction
Due to the limited space on urban roads, a series of traffic problems, including traffic congestion and traffic accidents, have arisen. This causes serious economic losses and constrains the sustainable development of cities. Increasing traffic congestion has become a common problem in cities. To address this problem, some researchers have proposed measures to build an intelligent traffic system using intelligent technologies. Traffic signal control is the core element of an intelligent traffic system and an important means to solve traffic problems [1].
With the development of communication technology, sensor technology in urban traffic systems has been enhanced to efficiently and accurately acquire complex traffic information to improve traffic signal control strategies [2]. Previous researchers have utilized a loop coil sensor to collect traffic flow data. The traffic data are utilized as the basis for setting the parameters of the traffic signal to achieve fixed timing optimization [3]. Traditional loop coil sensor technology can detect the number of vehicles but cannot identify the vehicle type and continuous traffic flow [4]. Manual surveys are often required to determine the distribution of vehicle types. The signal control strategies implemented with such traffic data are not accurate.
To obtain traffic data conveniently, video and radar sensor technologies have been widely applied. They detect a larger range than the loop coil sensor and obtain traffic information for a cross-section. The video sensor captures a real-time scene of the intersection by camera, which is passed to the handler for processing to obtain traffic counts and speeds.

Table 2. Comparison of sensors, detailing the range of applications and advantages and disadvantages of different sensors.

| | Loop Coil Sensor | Video, Radar Sensors | Connected Vehicles |
|---|---|---|---|
| Scope of detection | section detection | area detection | holographic detection |
| Advantage | obtains traffic information of one segment | obtains the entire intersection's information with high accuracy | detection area is limited |
| Disadvantage | prone to break down | highly influenced by the environment | technology is not mature |

In past decades, researchers relied on loop coil sensors to obtain traffic status information. Fixed-timing signal control methods were utilized to improve traffic efficiency and capacity. Subsequently, adaptive control methods have been widely applied in signal control systems. Compared with fixed timing methods, these improved the flexibility of signal control and vehicle throughput efficiency [11]. Since these traditional signal control methods optimize signals across time periods and cycles [12], it is difficult for them to deal with complex time-varying traffic demand. Therefore, some researchers have proposed data-driven models [13,14]. However, traffic information at intersections is complex and diverse. Limited by the sensor technology, the effect of data-driven models with low-precision traffic status information is not ideal. With the development of technologies such as vehicle networking, the traffic information data that can be obtained are more refined. This enables researchers to develop new signal control models. Among these, the application of deep reinforcement learning (DRL) to the field of traffic signal control has become a research hotspot.
Much research has been conducted to solve traffic signal control problems with the DRL algorithm based on advanced sensor technology, and it has achieved good results in practice. However, cities have many intersections and a high dimensionality of traffic demand. Each intersection needs to be trained with a corresponding set of signal control models [15]. Retraining the model to calibrate the parameters takes a significant amount of time [16][17][18]. Hence, the cost of training a separate model for each intersection is a practical problem.
Therefore, several important issues remain to be solved, which include: (1) Finding a suitable model that can be applied to more than one intersection scenario; (2) Determining the relationship between the traffic demand at the intersection and the actual traffic state; (3) Designing a traffic scenario that represents a more comprehensive traffic demand.
In this paper, we investigate an adaptable DRL signal control model to deal with these problems.

Related Work
The DRL algorithm is a combination of reinforcement learning and deep learning. Reinforcement learning acquires knowledge by autonomously interacting with the environment through trial and error, which is similar to how a human being learns [19]. Mnih et al. first proposed the deep Q-network model [20], which is suitable for processing high-dimensional data. Some researchers have applied deep reinforcement learning (DRL) to the field of signal control and achieved positive effects in solving complex traffic congestion problems.
Much of the existing work uses deep reinforcement learning techniques to solve complex signal optimization problems at intersections. DRL traffic signal control methods can be divided into three kinds: value-based, policy-based, and actor-critic approaches, which are shown in Table 3. According to the training policy, the DRL signal control methods can also be divided into on-policy and off-policy. The deep Q-network (DQN) is an off-policy algorithm. An off-policy signal control model does not have to learn from its real-time interaction with the environment; it takes a batch of samples from the experience replay buffer for learning [21][22][23]. Kim proposed a cooperative traffic signal control with traffic flow prediction (TFP-CTSC) for a multi-intersection. The results indicated that the model improved the travel efficiency of vehicles on the road network [24]. Rasheed proposed a multiagent DQN (MADQN) and investigated its use to further address the problem of dimensionality under traffic network scenarios with high traffic volume and disturbances. The simulation results showed that the proposed scheme reduced the total travel time of the vehicles [25]. Song transferred the well-trained action policy of a previous DQN model into a target model under similar traffic scenarios. The results indicated that, compared to the directly trained DQN, transfer-based models could improve both the training efficiency and model performance [18].
The on-policy signal control model interacts with the environment in real time to obtain an optimized policy, so it can optimize while learning [26]. The policy gradient (PG) method is an on-policy algorithm that learns a parameterized policy function from sampled episode returns [27]. Rizzo et al. proposed the time critic PG technique to avoid jams in heavy traffic volumes. Policy-based algorithms are effective in high-dimensional and continuous action spaces [28].
The SAC algorithm is an off-policy actor-critic approach that has excellent sampling efficiency [29]. The off-policy approach solves the problems of difficult data collection and the high cost and risky implementation process in an on-policy approach. It is of great importance in practical applications. Mao evaluated seven prevailing DRL algorithms from two aspects: training and execution performance. The testing results indicated that the soft actor-critic (SAC) outperformed other DRL algorithms and the maximum pressure method in most cases [30]. Chu presented the advantage actor-critic (A2C) to stabilize the learning procedure, and the results demonstrated its optimality, robustness, and sample efficiency over other decentralized MARL algorithms [15]. Li proposed a PPO algorithm to optimize the fairness of all drivers' waiting times, and the results showed the algorithm efficiently optimized the fairness criterion and had a more stable performance than the A2C model [31]. These researchers enhanced the performance of the DRL method by different means. The model usually performed better in the same or similar traffic scenarios. However, it cannot be adapted to new traffic scenarios. It takes time and effort to train a specific DQN model for each intersection in reality [18]. Therefore, the question of how to find a signal control model that can be adapted to new traffic scenarios needs to be addressed. In this research, we search for a method that can train adaptive models based on the DRL approach. First and foremost, a suitable DRL algorithm needs to be selected as the testing algorithm. Mao et al. showed that the value-based/off-policy optimization approach had good sampling efficiency. It showed better performance than other approaches for a discrete action space decision-making problem, such as traffic signal control [30]. Due to the strong correlation between traffic states, the algorithm training needs to maintain the independence between samples. 
In addition, the corresponding road facilities are not yet available, and there are technical difficulties in the implementation of an on-policy strategy. Therefore, an off-policy approach was selected to optimize the signal control model in this paper. The focus of the research was not to optimize the performance of the algorithm itself but to explore how to train a DRL signal control model that could be generalized to new traffic scenarios with adaptable performance. We examined several value-based/off-policy algorithms with similar effects. Among them, the DQN algorithm is a mature approach to urban signal control [32,33]. We used this method as the test algorithm for this research.
In the following section, we describe the construction of a traffic simulation platform based on the DQN algorithm to simulate the signal control. From the perspective of enriching the sample, a comprehensive traffic demand was designed so that the model could "see" as many traffic states as possible to improve the model's adaptability to new traffic scenarios. The relationship between the input traffic demand and the actual traffic flow state was analyzed to aid in designing efficient traffic demand. The feasibility and effectiveness of the model proposed in the research were verified by comparing it with other signal control methods. Ultimately, we obtained a model that could be adapted to new traffic scenarios, which saves the time and effort consumed by repeated training at different intersections. The overall framework of the research is shown in Figure 1.

Materials and Methods
In this section, we first describe the framework of the DQN traffic signal model, including the overall architecture of the DQN algorithm, the definition of the reward function, and the setting of the observation state and action. Secondly, we explain the training process and method of the model. Finally, we analyze the relationship between traffic demand and traffic states.

Architecture of the DQN Model
The traffic signal control problem is described as an optimization problem solved with deep reinforcement learning. The overall framework of the DQN signal control model is shown in Figure 2. First, we observed the environment to obtain the traffic status of the intersection, such as vehicle speed and queue length. The real-time traffic status information was input into the Q network, which output the Q value corresponding to each signal decision action. According to the action selection strategy, the signal decision action with the maximum Q value was selected and sent to the signal controller for execution. Then, feedback was provided to measure the traffic operation of the intersection. The process was repeated with the objective of maximizing the cumulative reward $R_t = \sum_{n=0}^{\infty} \gamma^n r_{t+n}$. The traffic state, the signal decision action, the reward value, and the traffic state at the next time step were stored as a quadruple in the experience replay buffer. The Q network was trained with batches of samples until convergence, and the optimal "state-action" mapping was obtained.
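As a small worked illustration of the cumulative reward above, the following sketch evaluates the discounted sum for a finite reward sequence. The reward values are hypothetical, and truncating the infinite sum to a finite episode is an assumption made only for the example.

```python
def discounted_return(rewards, gamma=0.9):
    """Return sum_n gamma**n * rewards[n] for a finite reward sequence."""
    total = 0.0
    # Work backwards using the recursion R_t = r_t + gamma * R_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical rewards (negative delays, in seconds) over three decisions.
example_return = discounted_return([-4.0, -2.0, -1.0], gamma=0.9)
```

With these values the sum is -4.0 + 0.9 × (-2.0) + 0.81 × (-1.0) = -6.61, so later delays weigh less than immediate ones.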
It is necessary to determine the traffic parameter indicators of the model, such as the traffic state information, the signal actions, and the reward function for evaluating the control effect. The model observes the intersection environment at time $t$ to obtain the traffic state $s_t$. The queue length was chosen to characterize the state of the intersection; it describes the distribution of the vehicle queue lengths on each lane, $s_t \in S$. The model chose a corresponding action $a_t$, which is defined as the selection of the signal phase according to the traffic state, $a_t \in A$. The environment provided feedback on the action taken by the signal at the next moment. The reward function $r(s_t, a_t)$ is defined as the negative of the average vehicle cumulative delay time, which is used to quantify the impact of the action.

Reward Function
The reward value reflects the impact of the model's decision, and the reward function influences the final performance of the model. Candidate reward quantities include the vehicle delay time, queue length, waiting time, and vehicle speed. The objective of the study is to improve the efficiency of vehicle movement and reduce the delay time of vehicles at the intersection. Therefore, the negative of the average vehicle delay time is set as the reward function. In its expression, $\alpha$ is the weighting factor and $d_t^{l_i}$ is the average delay time of lane $l_i$ at time $t$, $l_i \in L$.
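The exact expression is not reproduced here, so the following is only a sketch of one form consistent with the description: the weighted negative of the lane-averaged delay. Averaging over lanes with equal weight, the function name `delay_reward`, and the default `alpha=1.0` are assumptions, not the authors' stated formula.

```python
def delay_reward(lane_delays, alpha=1.0):
    """Negative weighted mean of per-lane average delays (assumed form).

    lane_delays: average delay (s) of each lane l_i in L at time t.
    alpha: weighting factor.
    """
    if not lane_delays:
        return 0.0  # no detected lanes: treat as zero delay
    return -alpha * sum(lane_delays) / len(lane_delays)


# Hypothetical per-lane delays in seconds.
example_reward = delay_reward([2.0, 4.0, 6.0])
```

A larger delay gives a more negative reward, so maximizing the reward is equivalent to minimizing the average delay.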

Observation State
Traffic observation states are the critical factors for signals to make decisions, and each observation state can contain one or more substates, $s_t = (s_t^1, s_t^2, s_t^3, \ldots, s_t^j)$. Researchers usually select traffic information such as the vehicle location, the average speed, the traffic throughput, the queue length, and the average waiting time as the observed states. The corresponding observation matrix was constructed as the input of the DQN algorithm.
The queue length is selected as the observed traffic state. The queue length indicates the number of vehicles waiting in the queue on the lane, which changes with the arrival and departure of vehicles. The queue length $q_i$ of each lane $l_i$ at the intersection is collected. As shown in Figure 3, the traffic flow at the intersection is divided into eight flow directions. The traffic observation state is an eight-dimensional vector, $s_t = [q_1, q_2, q_3, q_4, q_5, q_6, q_7, q_8]$. Among these, $q_1 = \max\{q_{11}, q_{12}, q_{13}\}$, $q_3 = \max\{q_{31}, q_{32}, q_{33}\}$, $q_5 = \max\{q_{51}, q_{52}, q_{53}\}$, and $q_7 = \max\{q_{71}, q_{72}, q_{73}\}$.
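To make the construction of the observation concrete, the sketch below collapses per-lane queue lengths into one value per flow direction by taking the maximum, as in the definitions of $q_1$, $q_3$, $q_5$, and $q_7$. The dictionary layout and the sample numbers are hypothetical, and treating the even-numbered directions as served by a single lane is an assumption.

```python
def build_state(lane_queues):
    """Collapse per-lane queue lengths into one value per flow direction.

    lane_queues maps a flow-direction index (1..8) to the list of queue
    lengths measured on the lanes serving that direction.
    """
    return [max(lane_queues[i]) for i in range(1, 9)]


# Hypothetical queue lengths; directions 1, 3, 5, 7 are served by three lanes.
lane_queues = {
    1: [12, 30, 18], 2: [6],
    3: [0, 5, 2],    4: [9],
    5: [20, 20, 25], 6: [3],
    7: [7, 1, 4],    8: [0],
}
state = build_state(lane_queues)
```

Taking the maximum means that the longest queue among the lanes sharing a flow direction determines the state of that direction.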

Action Settings
Signals at intersections make appropriate decisions based on the current traffic situation. The decision variables for signal control include the phase green time, the phase switching, and the phase selection. The phase green time optimizes the signal by adjusting the green duration. Phase switching follows a predefined phase sequence: the model decides how long the current green lasts by deciding whether to switch the signal to the next preset phase. Phase selection is more flexible than phase switching in that it selects the phase to be performed from a set containing multiple phases.
Phase selection is used as the action of the model, which decides the next phase according to the traffic states [34]. First, four feasible phases are defined for the signal system, as shown in Figure 4. The set of phases is Phase = {NSL, NSS, WEL, WES}, and the set of phase selections is Action = {0, 1, 2, 3}. The phase to be executed is selected according to the action output by the Q network. The phase sequence is shown in Figure 5. The specific decision process proceeds as follows: when $a_t = 0$, phase 0 is executed; when $a_t = 1$, phase 2 is executed; when $a_t = 2$, phase 4 is executed; when $a_t = 3$, phase 6 is executed.
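The decision rule above can be sketched as a lookup from the greedy action to the phase index to execute. Pairing the phase names with actions in their listed order is an assumption, and `select_phase` is a hypothetical helper name.

```python
# Phase names in the order listed in the text (pairing with actions assumed).
PHASES = ("NSL", "NSS", "WEL", "WES")

# Action a_t -> phase index to execute, as stated in the decision process.
ACTION_TO_PHASE_INDEX = {0: 0, 1: 2, 2: 4, 3: 6}


def select_phase(q_values):
    """Pick the action with the largest Q value and return (action, phase index)."""
    action = max(range(len(q_values)), key=lambda a: q_values[a])
    return action, ACTION_TO_PHASE_INDEX[action]


# Hypothetical Q values output by the network for the four actions.
action, phase_index = select_phase([0.1, 0.7, 0.3, 0.2])
```

Here the second action has the largest Q value, so action 1 is chosen and phase 2 is executed.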

Training of the DQN Model
The SUMO and Python software are utilized to construct a traffic simulation platform for the experiments. The main functions of SUMO include building road networks, generating traffic demands, and obtaining various traffic evaluation indicators. The function of Python is to implement the DQN algorithm and interact with SUMO in real-time.
We set up two main files to run the SUMO simulation, which include the following: (1) Road network file (net.xml): We built the road network and set up the road details in this file.
(2) Traffic routing file (rou.xml): We input the traffic requirements and generated the traffic scenarios in this file.
Other files, such as vehicle description files and detector description files, can be added to run a more detailed simulation.
We designed the DQN algorithm and determined the hyperparameters. When setting the traffic simulation time, we considered that, after the traffic demand is generated in the road network, the input traffic flow needs a certain time to enter the steady state. A sufficiently long simulation keeps the traffic demand stable for the desired duration and gives the model a better chance of exploring the knowledge fully. Therefore, the simulation duration must be sufficient, and the number of simulation runs must be large enough to ensure the convergence of the neural network. The recommended value for the discount factor was 0.9, the recommended value for the batch size was 400, the recommended value for the learning rate was $3 \times 10^{-4}$, and the recommended value for the number of neural network iterations at each sampling was 4 [15,29,35].
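For reference, the recommended values quoted above could be collected in a single configuration object. The class and field names below are hypothetical; only the four numeric values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DQNConfig:
    """Recommended hyperparameter values quoted in the text (names assumed)."""
    gamma: float = 0.9             # discount factor
    batch_size: int = 400          # samples drawn per training batch
    learning_rate: float = 3e-4    # optimizer step size
    updates_per_sample: int = 4    # network iterations at each sampling


config = DQNConfig()
```

Freezing the dataclass keeps the settings from being changed by accident during a long training run.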
After initializing the parameters, Python interacted with SUMO in real time to obtain real-time traffic status information of the intersection. The traffic status was fed into the neural network, which then output the Q values corresponding to each phase. We selected the phase to be executed with an ε-greedy strategy and sent it to the signal for execution. The average delay at the current moment was obtained to evaluate the control effect of the current phase. The traffic state, the selected phase, the average delay, and the traffic state at the next time step were stored in the experience replay buffer. Finally, a batch of samples was randomly drawn from the experience replay buffer and used to update the weights of the neural network.
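The action selection and sample storage steps described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the authors' implementation: the class and function names, the buffer capacity, and the sample values are all hypothetical.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity):
        # A bounded deque drops the oldest sample once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Draw without replacement; shrink the batch if the buffer is small.
        k = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Explore with probability epsilon, otherwise take the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical usage: store three transitions in a buffer of capacity 2.
demo_buffer = ReplayBuffer(capacity=2)
for step in range(3):
    demo_buffer.push((step,), 0, -1.0, (step + 1,))
```

Sampling uniformly at random from the buffer is what breaks the strong temporal correlation between consecutive traffic states.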
When the input of the neural network is $s_t$, the output is $Q^{\pi}(s_t)$. When the input of the neural network is $s_{t+1}$, the output is $Q^{\pi}(s_{t+1})$. The goal of updating the weights of the neural network was to bring the difference between $Q^{\pi}(s_t)$ and $Q^{\pi}(s_{t+1})$ closer to $r_t$. The neural network output the actual Q-value, while the target value was approximated using the value corresponding to the action with the largest Q-value in the next state. The equation for updating is:

$$Q^{\pi}(s_t, a_t) \leftarrow Q^{\pi}(s_t, a_t) + \alpha \left[ r(s_t, a_t) + \gamma \max_{a'} Q^{\pi}(s_{t+1}, a') - Q^{\pi}(s_t, a_t) \right]$$

where $r(s_t, a_t)$ is the reward, $\alpha$ is the learning rate, and $\gamma$ is the discount factor.

In addition, we designed suitable traffic scenarios to test the trained model. The model testing was divided into the same demand scenario and a new demand scenario. Same demand scenario testing used a test scenario identical to the training scenario and was used to verify the performance of the trained model. New demand scenario testing used test scenarios different from the training scenarios and was used to verify the adaptive performance of the model. Finally, the DQN model of the research was analyzed and compared to several existing traffic signal control methods.
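The update described in this subsection, in which the target is built from the largest Q-value in the next state, can be illustrated in scalar form. This sketch applies the standard Q-learning step to a single value rather than to network weights, which is a simplification; the function names and the numbers are hypothetical.

```python
def td_target(reward, next_q_values, gamma=0.9):
    """Target value: reward plus discounted largest Q value in the next state."""
    return reward + gamma * max(next_q_values)


def q_update(q_sa, reward, next_q_values, alpha=0.5, gamma=0.9):
    """Move the current estimate q_sa a fraction alpha towards the target."""
    target = td_target(reward, next_q_values, gamma)
    return q_sa + alpha * (target - q_sa)


# Hypothetical step: reward of -5 s delay, next-state Q values for four phases.
target = td_target(-5.0, [1.0, 2.0, 0.5, 0.0], gamma=0.9)
updated = q_update(0.0, -5.0, [1.0, 2.0, 0.5, 0.0], alpha=0.5, gamma=0.9)
```

In a DQN the same target is used, but the step towards it is taken by gradient descent on the squared difference between the network output and the target.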

The Relationship between the Traffic Demand and Traffic States
To construct traffic states with various degrees of equilibrium, we designed n traffic scenarios. In each traffic scenario, different percentages of traffic flow were assigned to multiple directions of the intersection. The designed traffic scenarios represented several typical traffic flow states: extreme traffic states, perfectly balanced traffic states, perfectly balanced and mildly unbalanced traffic states, and fully balanced traffic states.
Exploring empirical knowledge plays a key role in deep reinforcement learning. The generalized application of the model can be limited by restricted empirical knowledge. To obtain models with adaptability, it is necessary to provide rich empirical knowledge for model exploration. Therefore, the study classified the traffic demand by the balance level of traffic distribution. Each model was equipped with traffic demand scenarios with different levels of balance.
There are n periods in which traffic is input, and the total number of vehicles in period t is $Q_t$. Assume that the simulation time for each period is s, so that one simulation run lasts $T = n \cdot s$. To distinguish between traffic states with different levels of balance, suppose the intersection has m directions and the traffic flow assigned to direction i is $q_i(t)$. The percentage of the assigned traffic flow in direction i is then $r_i(t) = q_i(t) / Q_t$.

Traffic scenarios set up traffic arrivals by time series. However, the actual traffic state is influenced by the signal control and has a strong time correlation. Therefore, to study how to set a more comprehensive and effective demand scenario, the actual traffic state must be analyzed. Since the queue length is a continuous value, the dimensionality of its solution space is large. In addition, traffic states with similar queue lengths are similar, and there is no abrupt variation in traffic states. Therefore, we decided to discretize the queue length of intersections to improve the efficiency of the traffic state analysis and reduce the computational consumption. The specific discretization process is as follows: we divided the intersection into x flow directions and divided the queue length into k segments. The unit queue length interval of the i-th segment is $l_i$, for $q \in [q_i, q_{i+1})$.

The final number of state spaces for a certain flow direction is denoted $a$. Therefore, the number of state spaces at the intersection is $a^x$.
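The discretization and the resulting state count can be sketched as follows. The expression for the per-direction count is not reproduced in the text, so summing the number of unit intervals over the segments is an assumption, and the segment boundaries below are hypothetical values chosen only so that the example yields fifteen bins over 750 m.

```python
# Hypothetical (start, end, unit) triples in metres: k = 3 segments.
SEGMENTS = [(0, 100, 20), (100, 300, 40), (300, 750, 90)]


def bins_per_direction(segments):
    """Number of discrete states for one flow direction (assumed formula)."""
    return sum((end - start) // unit for start, end, unit in segments)


def discretize(q, segments):
    """Map a continuous queue length q to a bin index."""
    q = max(q, 0)  # clamp negative readings to the first bin
    offset = 0
    for start, end, unit in segments:
        if start <= q < end:
            return offset + int((q - start) // unit)
        offset += (end - start) // unit
    return offset - 1  # at or beyond the upper limit: last bin


a = bins_per_direction(SEGMENTS)  # states per flow direction
x = 8                             # flow directions at the intersection
total_states = a ** x
```

With fifteen bins per direction and eight flow directions, the theoretical state space has $15^8 = 2{,}562{,}890{,}625$ states, far more than a single simulation run can visit.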

Experimental Settings
The intersection is designed as a "cross-shaped" intersection. Each of the four approaches has four lanes in each direction, including a left-turn lane, two straight lanes, and a right-turn lane. The length of each lane is 750 m. The length of the vehicle is 5 m, the maximum driving speed is 25 m/s, and the average speed is 10 m/s. Detailed information about the traffic demand is described in the next subsection.
The initialized hyperparameters of the DQN model are set as shown in Table 4.

Traffic Demand Settings
In this section, we describe the design of four representative traffic scenarios as the input for the models. The intersection had four directions: east, west, north, and south. The percentages of the traffic flow were $r_1(t)$, $r_2(t)$, $r_3(t)$, and $r_4(t)$. The traffic flow was input in five time periods. The simulation time for one period was set to 1200 s to ensure that the model explored sufficiently. The designed traffic scenarios represented several typical traffic flow states, which were extreme traffic states, perfectly balanced traffic states, perfectly balanced and mildly unbalanced traffic states, and fully balanced traffic states.
Model 1 represented a traffic scenario containing extreme traffic conditions, where all vehicles at the intersection were assigned to the south and north directions, and no vehicles were assigned to any other direction. It had extreme unevenness in vehicle distribution. Model 2 represented a traffic scenario that contained perfectly balanced traffic states. The distribution of vehicles at the intersection was identical in each direction. Model 3 represented a traffic scenario with perfectly balanced and lightly unbalanced traffic. The distribution of vehicles at the intersection was identical for a portion of the time periods. In the remaining time periods, slightly more traffic was assigned to the south and north directions than to the other directions. Model 4 represented a traffic scenario with a fully balanced traffic condition. In all five time periods, the input traffic flows were the same. However, the percentage of vehicles distributed in the south and north directions gradually became larger, and the percentage of vehicles distributed in the east and west directions gradually became smaller. Therefore, Model 4 included directional distributions ranging from fully balanced through mildly unbalanced to more extreme traffic states. The vehicle allocation is shown in Table 5 and Figure 6.
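The directional shares $r_i(t) = q_i(t) / Q_t$ that distinguish these scenarios can be computed directly from per-direction vehicle counts. The counts below are hypothetical and only mimic the descriptions of Model 1 (all vehicles on south and north) and Model 2 (identical distribution); the equal split between south and north is an assumption, and the actual allocations are those of Table 5.

```python
def shares_from_counts(counts):
    """Return r_i = q_i / Q for each direction, where Q is the period total."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("total demand must be positive")
    return {direction: q / total for direction, q in counts.items()}


# Hypothetical vehicle counts for one 1200 s period.
extreme = shares_from_counts({"N": 500, "S": 500, "E": 0, "W": 0})
balanced = shares_from_counts({"N": 250, "S": 250, "E": 250, "W": 250})
```

The shares of one period always sum to one, so a scenario is fully described by the period totals together with these percentages.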
The model is trained based on four demand scenarios. The observed states are discretized. There are eight flow directions at the intersection. The upper limit of the queue length is the lane length, $q_{max} = 750$ m, and the queue length of each flow direction is divided into three segments, which are shown in Tables 6 and 7.
After the data are discretized, each flow direction contains fifteen state spaces, and the intersection has $15^8$ state spaces. The number of state spaces of each model is shown in Figure 7.

Table 5. Design of traffic scenarios for training: allocation of vehicles in each direction.

Percentage | t = 1 | t = 2 | t = 3 | t = 4 | t = 5

Model 1 represents the extreme traffic state, which contains a significantly lower number of state spaces than the other models. Models 2-4 have increasingly comprehensive traffic scenario designs, and the number of their state spaces is generally proportional. The number of state spaces does not depend entirely on the design of the traffic scenario, but is also related to the interaction between vehicles while they are moving. In addition, the temporal correlation between traffic states also affects the distribution of the state space. Therefore, we should not only design more comprehensive traffic scenarios, but also pay attention to the coverage of the actual traffic state.
To further analyze the distribution of the state space, the frequency distribution of the state spaces of the four models was counted, as shown in Figure 8. The average frequency of each state space in Model 1 is 17.16. The average frequencies of each state space in Models 2-4 are 4.27, 4.69, and 4.92, respectively. The average state space frequencies of the four models are compared in Figure 9.
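The two statistics discussed here, the number of distinct state spaces visited and their average frequency, can be sketched as below. Defining the average frequency as the number of recorded samples divided by the number of distinct states is an assumption about how the figures were computed, and the sample data are hypothetical.

```python
from collections import Counter


def state_space_stats(visited_states):
    """Return (number of distinct states, average visits per distinct state)."""
    counts = Counter(visited_states)
    n_states = len(counts)
    avg_freq = sum(counts.values()) / n_states if n_states else 0.0
    return n_states, avg_freq


# Hypothetical discretized observations (shortened to two dimensions).
visited = [(0, 1), (0, 1), (2, 3), (0, 1)]
n_states, avg_freq = state_space_stats(visited)
```

A concentrated distribution, as described for Model 1, shows up as few distinct states with a high average frequency.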
As shown in Figure 9, Model 1 only inputs vehicles in the south and north directions. Its distribution of state spaces is more concentrated and its average frequency is larger. Models 2-4 have a similar number of state space categories. As the balance of vehicle distribution becomes more comprehensive, the average frequency of each state space becomes larger. This indicates that the range of knowledge that the model can explore is becoming more complete.

Performance and Adaptation Analysis
The loss function reflects how well the model is trained. The smaller the value of the loss function, the better the model is trained. Based on the above training process, the variation of the loss function of each model is shown in Figure 10. As shown in Figure 10, the four DQN models eventually converge. The loss function of Model 1 is of a larger order of magnitude, due to the extreme case of the traffic scenario. The loss functions of Models 2-4 show a trend of gradually growing larger. The traffic scenario of Model 2 is the most balanced, so the empirical knowledge used by the neural network for training is similar and the fluctuation of the loss function is slight. The traffic scenario of Model 4, in contrast, covers different levels of balance, so the empirical knowledge used by the neural network for training is complex and diverse and the loss function fluctuates drastically.
The average delay, loss time, and average cumulative delay are selected as indicators to evaluate the control performance of the DQN model. The average delay refers to the average delay time of all detected vehicles at a certain moment. The loss time refers to the cumulative loss time of all detected vehicles within a certain time interval. The average cumulative delay refers to the average cumulative delay time of all detected vehicles within a certain time interval.
Simulation experiments are conducted using the traffic scenarios used during training, and the evaluation indicators are presented in Table 8. Experiments are also conducted using the new traffic scenarios, and the evaluation indicators are presented in Tables 10-12.

In the same test traffic scenario, comparing the average delays of all four models, Model 1 shows poor performance. This indicates that the knowledge learned by the model trained under extreme traffic scenarios is very limited and cannot be adapted to other traffic scenarios. Model 2 shows better performance in test traffic scenarios 1 and 4, but poorer performance in test traffic scenario 2. Model 3 and Model 4 are basically adaptable to all test traffic scenarios. However, the comparison reveals that Model 4 has better test results, indicating that it can achieve excellent control under other traffic demands as well. In the same test demand scenario, a comparison of lost time and average cumulative delay for the four models shows a similar pattern to the average delay.
The results of the experiments show that Model 1 performs poorly on all evaluation indicators in each new test scenario compared to the other models. Therefore, it is less adaptable to new scenarios. Model 4 performs better on each evaluation metric in each new test scenario compared to the other models. Thus, it is more adaptable than the other models. The study shows that traffic scenarios should be designed to be comprehensive so that both the number of state spaces and their frequency of occurrence are high. Once the model has learned from as comprehensive a set of state spaces as possible, it will have the ability to adapt to new scenarios.

Execution Performance Comparison
In this research, the DQN model with the best adaptation is compared with the SAC signal control model [30], the adaptive signal control method, and the multi-time fixed timing method. Average cumulative delay and average delay are selected as the indicators for evaluation. Two traffic scenarios are designed for testing. The first traffic scenario for testing is designed as shown in Table 13 and Figure 12.

The result of the average cumulative delay is shown in Figure 13. In the first test scenario, the average cumulative delay of the DQN model is 13.75 s and that of the SAC model is 14.02 s. The average cumulative delay of the adaptive control method is 11.34 s and that of the multi-time fixed timing method is 19.87 s. The result shows that the adaptive signal control method achieves a better control effect for the signal control problem of an isolated intersection. The average cumulative delay of the DQN model is slightly worse than that of the adaptive method, but the gap is small. In addition, its control performance is better than that of the multi-time fixed timing method.
The average delay results are shown in Figure 14. The average delay is 4.11 s for the DQN model and 4.42 s for the SAC model. The average delays of the adaptive control method and the multi-time fixed timing method are 5.48 s and 14.15 s, respectively. Compared with the popular SAC signal control model, the average delay of the DQN model is reduced by 8%. Compared with the adaptive signal control method, the average delay is reduced by 33%. The multi-time fixed timing signal control method adapts poorly to complex traffic scenarios, indicating that it cannot respond to diverse traffic demands and has some limitations. In the first, second, and fourth time periods, the traffic demand has a more balanced distribution. Our proposed model and the SAC model exhibit similar control performance in this case. In the third and fifth time periods, the traffic demand has a more extreme distribution. Our proposed model exhibits stable control performance there, whereas the SAC model performs poorly and cannot adapt to the traffic scenario. Therefore, our model shows better performance than the advanced SAC model.
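The percentage comparisons above can be reproduced from the reported average delays. The following is a minimal sketch for illustration only; the function names are ours and are not part of the study. It suggests that the quoted 8% and 33% correspond to the gap expressed relative to the DQN delay, whereas the same gap expressed relative to each baseline's own delay is about 7% and 25%.

```python
def gap_relative_to_model(model_delay, baseline_delay):
    """Extra delay of the baseline, as a percentage of the model's delay."""
    return 100.0 * (baseline_delay - model_delay) / model_delay


def reduction_relative_to_baseline(model_delay, baseline_delay):
    """Delay saved by the model, as a percentage of the baseline's delay."""
    return 100.0 * (baseline_delay - model_delay) / baseline_delay


# Average delays (s) reported for the first test scenario.
DQN, SAC, ADAPTIVE = 4.11, 4.42, 5.48

for name, baseline in [("SAC", SAC), ("adaptive", ADAPTIVE)]:
    print(
        name,
        round(gap_relative_to_model(DQN, baseline), 1),
        round(reduction_relative_to_baseline(DQN, baseline), 1),
    )
```

Which convention is used matters when the figures are compared with other studies, so it is worth stating the denominator alongside any percentage.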
The second test scenario is shown in Table 14. The average cumulative delay results are shown in Figure 15. In the second test scenario, the average cumulative delay is 15.13 s for the DQN model, 15.75 s for the SAC model, 15.69 s for the adaptive control method, and 20.99 s for the multi-time fixed timing method. The result shows that the DQN model performs better than the other methods. The average delay results are shown in Figure 16. The average delay is 5.13 s for the DQN model and 6.52 s for the SAC model. The average delays of the adaptive control method and the multi-time fixed timing method are 9.08 s and 14.78 s, respectively. Compared with the popular SAC model and the adaptive signal control method, the average delay of the DQN model is reduced. The multi-time fixed timing signal control method cannot adapt to complex traffic scenarios and performs worst.
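For reference, the second-scenario figures can be collected and ranked per indicator. This is a small illustrative sketch, not part of the study: the dictionary layout and the helper names `rank_methods` and `gap_to_best` are our own, and the SAC value of 15.75 s is read here as an average cumulative delay, in line with the surrounding sentences.

```python
# Delays (s) reported for the second test scenario.
results = {
    "DQN": {"avg_cumulative_delay": 15.13, "avg_delay": 5.13},
    "SAC": {"avg_cumulative_delay": 15.75, "avg_delay": 6.52},
    "Adaptive control": {"avg_cumulative_delay": 15.69, "avg_delay": 9.08},
    "Multi-time fixed timing": {"avg_cumulative_delay": 20.99, "avg_delay": 14.78},
}


def rank_methods(results, indicator):
    """Return method names sorted from lowest (best) to highest delay."""
    return sorted(results, key=lambda name: results[name][indicator])


def gap_to_best(results, indicator):
    """Return each method's delay minus the best delay, in seconds."""
    best = min(values[indicator] for values in results.values())
    return {
        name: round(values[indicator] - best, 2)
        for name, values in results.items()
    }


for indicator in ("avg_cumulative_delay", "avg_delay"):
    print(indicator, rank_methods(results, indicator))
    print(indicator, gap_to_best(results, indicator))
```

On these figures the three non-fixed methods lie within 0.62 s of each other on average cumulative delay, while the spread on average delay is considerably wider.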
Overall, for the indicator of average cumulative delay, our proposed DQN model exhibits performance similar to that of the SAC signal control model and the adaptive signal control method. For the indicator of average delay, the DQN model exhibits the best signal control results. This result is related to the optimization objective selected to train the DQN model. Since the reward function is set to minimize the average delay of the vehicles, the DQN model performs significantly better on the average delay indicator.
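To make the link between the reward and the average delay indicator concrete, the following is a minimal sketch of a delay-based reward. It is an assumption for illustration only: this section does not give the exact reward definition, and the function name `average_delay_reward`, its input (a list of per-vehicle delays in seconds observed during one control step), and the zero reward for an empty intersection are all hypothetical choices.

```python
def average_delay_reward(vehicle_delays):
    """Hypothetical reward: negative mean per-vehicle delay (s) for one step.

    Assumption: the study's actual reward may be defined differently; this
    only illustrates why maximizing the return lowers the average delay.
    """
    if not vehicle_delays:
        # No vehicles observed in this step, so no delay is incurred.
        return 0.0
    return -sum(vehicle_delays) / len(vehicle_delays)


# A step with shorter delays yields a larger (less negative) reward.
print(average_delay_reward([2.0, 4.0]))
print(average_delay_reward([6.0, 10.0]))
```

Under such a reward, an agent that maximizes the cumulative return is directly optimizing the quantity reported as average delay, whereas average cumulative delay is only improved indirectly.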

Discussion and Conclusions
In this article, we provided a reasonable platform to test and compare various DRL algorithms for isolated traffic signal control. The main contents of this research are summarized as follows.
(1) To find a suitable model that could be applied to more than one intersection scenario, we designed four typical traffic scenarios as the input of the model. The designed traffic scenarios were extreme traffic states, perfectly balanced traffic states, perfectly balanced and mildly unbalanced traffic states, and fully balanced traffic states. It was found that the more complex the designed traffic demand, the more traffic states the traffic scenarios contained, and the greater the possibility for the model to explore the knowledge fully. The result showed that the model trained with the traffic scenario of fully balanced traffic states had a better ability to adapt to new traffic scenarios.
(2) To verify the performance of our proposed signal control model, we compared it with other signal control methods, which included the multi-time fixed timing method, the adaptive control method, and the SAC model. The test results showed that our model achieved excellent performance. It worked better for signal control at an intersection than the popular SAC model.
(3) In real traffic systems, traffic demand has a high dimensionality. It takes much time and effort to train and calibrate the parameters for each intersection to obtain a signal control model. In this research, a training method that yields an adaptable model was provided by enriching the samples used in learning. This enables the model to be extended to more intersections and provides a potential and feasible solution for urban signal control.
These test results are useful for many intelligent traffic control tasks, such as using the DRL algorithm to solve complex traffic demand problems, including multi-intersection signal control. There is an autonomous driving demonstration area at Yizhuang, Beijing, where some of the advanced sensors have been installed. Our proposed model is expected to integrate well with real traffic systems for implementation purposes. For future work, we will investigate how to reasonably implement an adaptable DQN model in a real road network. Constrained by page limits, we will discuss these issues in subsequent articles.