Real-Time Adaptive Traffic Signal Control in a Connected and Automated Vehicle Environment: Optimisation of Signal Planning with Reinforcement Learning under Vehicle Speed Guidance

Adaptive traffic signal control (ATSC) is an effective method to reduce traffic congestion in modern urban areas. Many studies have adopted various approaches to adjust traffic signal plans according to real-time traffic in response to demand fluctuations and improve urban network performance (e.g., minimise delay). Recently, learning-based methods such as reinforcement learning (RL) have achieved promising results in signal plan optimisation. However, adopting these self-learning techniques in future traffic environments in the presence of connected and automated vehicles (CAVs) remains largely an open challenge. This study develops a real-time RL-based adaptive traffic signal control that optimises the signal plan to minimise the total queue length, while allowing CAVs to adjust their speed based on a fixed timing strategy to decrease total stop delays. The highlight of this work is the combination of a speed guidance system with reinforcement-learning-based traffic signal control. Two performance measures are implemented to minimise total queue length and total stop delays. Results indicate that the proposed method outperforms a fixed timing plan (with optimal speed advisory in a CAV environment) and traditional actuated control in terms of average vehicle stop delay and queue length, particularly under saturated and oversaturated conditions.


Introduction
Traffic congestion has been a key urban issue, causing high economic costs in many cities worldwide. A report by the Institute of Economic Affairs (IEA) claims that just a two-minute delay to every car journey costs the UK economy approximately 16 billion GBP a year, or nearly one percent of GDP (gross domestic product) [1]. Traffic lights play an essential role in controlling traffic flow and minimising delay, especially in urban areas. As a result, the number of traffic lights in England has increased by 25% since 2000, while the number of cars on the road grew by just 5% [1]. However, most of them use offline control systems based on historical traffic flow and can neither respond efficiently to unexpected traffic situations (e.g., car accidents) nor predict future traffic flows. Thus, adaptive traffic signal control (ATSC) approaches have been developed by numerous researchers [2][3][4]. Recently, advancements in computer performance and new optimisation methods have allowed researchers and practitioners to adopt heuristic methods for the ATSC process (e.g., the real-time hierarchical optimising distributed effective system, RHODES [5]). These systems try to optimise traffic signal parameters in real time without being constrained to a cyclic time interval. Thus, the signal plan can change at any time step depending on rapidly changing traffic conditions [6].

As a benchmark, we also implement a program for actuated control based on the guidelines of the Federal Highway Administration (FHWA) [24]. The actuated control prioritises the phase serving the primary approach (main road), and detector actuations partially control each phase's duration.
The rest of this paper is organised as follows. Section 2 presents the relevant literature review. Section 3 presents our integrated reinforcement learning (RL) and speed guidance approach under the CAVs environment. In Section 4, the simulation setup is presented, followed by our evaluation results in VISSIM microsimulation software. Finally, in Section 5, we conclude and provide ideas for future work.

Related Work
Adaptive traffic signal control approaches, including SCOOT [25] and SCATS [26], have been widely used in real-world traffic networks. They are based on an open-loop control system that does not consider feedback control in the traffic network (Figure 1). They use a cyclic system with pre-determined time intervals, meaning the controller updates the signal timing plan (cycle length, green signal ratio, and phase difference) at a specific time interval [27]. Studies show that traffic flow at intersections may vary significantly in major cities due to fluctuations in traffic demand [28,29]. Nevertheless, these typical adaptive traffic signal control systems cannot respond to such demand variations (traffic flow varying at shorter time intervals) [6] and require complex computation schemes that make their implementation costly [30]. This can increase travel delays for road users. As mentioned in the introduction, adaptive traffic signal control in a connected vehicle environment has shown a positive effect on network efficiency [10][11][12][13]. Connected vehicle technology is a mobile data platform that allows real-time information to be exchanged among vehicles and between vehicles and infrastructure [31].

Traffic Signal Control under CAVs
As mentioned in [15], there are many novel approaches for CAV-based traffic control. One of the most common methods, seen in various studies, is 'advanced driver guidance'. In this approach, vehicle speeds and positions are adjusted to minimise some performance measure. Reference [32] proposed an integrated traffic control model to optimise total delay, with vehicle arrival time and signal timing as the decision variables. They simulated the speed guidance model of CVs in VISSIM microsimulation software and concluded that this method could significantly decrease vehicle delays and the number of stops. Reference [33] introduced GLOSA and assessed its benefits in reducing vehicles' stop time behind a traffic light and fuel consumption using an integrated cooperative ITS simulation platform. Under two simulated scenarios, reference [34] investigated the positive impacts of speed guidance on fuel consumption and driving behaviour for multiple signalised intersections. Other methods can be found in the literature, for instance, planning-based traffic signal control [35], platoon-based traffic signal control [36], and signal vehicle coupled control (SVCC) [15,37]. Unfortunately, most existing CAV studies assumed fixed traffic signal timings to optimise the trajectories of CAVs [18]. Unlike these methods, this paper presents an adaptive traffic signal control under CAVs combined with advanced traffic control (a speed guidance model).

Closed-Loop Signal Control
Compared to typical adaptive traffic signal control (open-loop) systems, two general closed-loop signal control approaches exist. The traditional signal control approaches (i.e., non-learning-based approaches) use a simple feedback control loop and utilise only the current traffic flow, not historical traffic flow data. In addition, these approaches do not have underlying models for state prediction and optimisation. A learning-based method such as reinforcement learning can learn from the traffic environment by taking actions (i.e., cycle length and phase split) and observing the feedback, thereby predicting traffic flow and optimising the signal plan [32]. Some studies propose a simple closed-loop system (non-learning-based approach) for traffic signal control. Reference [38] presented a decentralised feedback control mechanism aiming to equalise the degree of saturation and queue length on different approaches toward intersections in a network. Reference [39] showed a localised feedback speed control for mainline traffic on motorways. The backpressure controller is a distributed feedback system that does not require knowledge of global network inflow. Reference [40] introduced the traffic-responsive urban control (TUC) strategy as a network-wide feedback approach. Based on store-and-forward modelling of urban network traffic and using linear-quadratic regulator theory, the design of TUC leads to a multivariable regulator for traffic-responsive coordinated network-wide signal control that is also particularly suitable for saturated traffic conditions. Reference [6] proposed an upgraded closed-loop feedback signal control strategy that takes the total amount of instantaneous stopped delay (ISD Total), measured frame-by-frame from traffic flow video, as input detector data; it switches the signal status in real time when this amount reaches a threshold and adaptively distributes the green interval to the approaches that need it most (east-west and north-south) without a regular traffic signal cycle time. In other words, they still used a non-learning-based method but with new forms of data.

Learning-Based Approach (Reinforcement Learning)
Reinforcement learning-based adaptive traffic signal control changes traffic signals based on feedback from the traffic demand, which can be hypothetical dynamic demand [41][42][43][44] or based on real-world data [45,46]. The existing literature on the reinforcement learning approach can be categorised into two groups: networks consisting of CVs [44,47,48] and non-CV environments [45,46,49]. Moreover, two general classifications (i.e., vehicle positions and queue length) are available for state representation. References [41,43,49,50] proposed the state as discrete values such as the positions of vehicles or different levels of queue length. Nevertheless, this kind of state needs massive storage space when solving large problems. Therefore, recent studies recommend the use of continuous states such as queue length [42,45,51], average delay [44], and waiting time [30,42,48]. Furthermore, all of the papers relevant to this topic have used simulation platforms to obtain their results. SUMO [43,44,49,50], VISSIM [46,47], AIMSUN [45], and PARAMICS [30] are the most common software packages in which the combination of traffic simulation and RL can be executed appropriately. Finally, it is worth mentioning that all of the previous papers took the signal controller as the agent of their RL algorithm except [47], which used connected vehicles as its agents.
This research aims to bridge the gap between RL-based adaptive traffic signal control and advanced traffic control (a speed guidance model) under CAVs. The RL-based approach treats the intersection as an agent that optimises total queue length, while the CAVs adjust their speed based on a fixed timing plan to minimise total stop delays; the agent thus adapts the traffic signal control to the adjusted CAV speeds in real time (updated traffic flow).

Methods
This section discusses the proposed framework for RL, the PTV VISSIM microsimulation platform, and implementation details. The RL framework is implemented in Python and integrated into VISSIM through the component object model (COM) interface of PTV VISSIM.
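For context, the following is a minimal sketch of this Python-VISSIM coupling, assuming the pywin32 package and a local, licensed VISSIM installation; the model path is hypothetical, and attribute names can vary across VISSIM versions.

```python
# Minimal sketch: drive PTV VISSIM step-by-step from Python via COM.
# Requires pywin32 and a registered VISSIM COM server; the path below
# is a placeholder, not the network used in this study.
import win32com.client

vissim = win32com.client.Dispatch("Vissim.Vissim")       # launch/attach to VISSIM
vissim.LoadNet(r"C:\models\isolated_intersection.inpx")  # hypothetical model file

sim = vissim.Simulation
sim.SetAttValue("SimRes", 10)        # simulation resolution: 10 steps per second

for step in range(7200 * 10):        # two simulated hours at 10 steps/s
    sim.RunSingleStep()              # advance one simulation step
    # ... read queue counters / detectors and set signal states here ...
```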


Reinforcement Learning (RL)
RL is an area of adaptive control that encompasses algorithms for learning optimal behaviour policies in sequential decision-making problems from scalar rewards. It is assumed that a decision-making problem is time-discrete and can be represented as a Markov decision process (MDP) (S, A, Pa, Ra), where S is the state space, A is the space of possible actions, Pa(st, st+1) = P(st+1|st, at) is the probability of transitioning into state st+1 by taking action at in state st, and Ra(st, st+1) is the immediate reward received after transitioning from st to st+1 by taking action at. The agent is specified by a policy π(a|s) mapping each state to a probability distribution over actions. An optimal policy maximises the discounted return Gt, defined as the sum of discounted future rewards (see (1)). The state-value function Vπ(s) is the expected return of a state s, assuming the agent follows policy π in all time steps (see (2)):

Gt = ∑k≥0 γ^k Rt+k+1 (1)

Vπ(s) = Eπ[Gt | st = s] (2)

Model-based RL algorithms assume Pa and Ra to be known, while model-free RL algorithms learn about them implicitly from the interaction between agents and the MDP. This paper uses the advantage actor-critic algorithm (A2C), a model-free RL algorithm suitable for stochastic environments with partially observed states. The policy π(a|s; θ) and state-value function V(s; θv) are approximated by neural networks with parameters θ and θv, respectively. Initialised at random, the policy is optimised through iterative policy evaluation (gathering experience by interacting with the environment using the current policy) and policy improvement (reinforcing actions that led to greater than expected rewards and discouraging others) by gradient descent using (3) and (4).
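While (3) and (4) are not reproduced here, the following is a minimal sketch of the standard A2C losses they correspond to (PyTorch; the batch tensors and network handles are illustrative assumptions, not the authors' code).

```python
# Sketch of one A2C loss computation: policy gradient weighted by the
# advantage, value regression towards the return, and an entropy bonus.
import torch
import torch.nn.functional as F

def a2c_losses(policy_net, value_net, states, actions, returns):
    """states: [B, state_dim]; actions: [B] (int64); returns: [B] discounted G_t."""
    logits = policy_net(states)                       # unnormalised pi(a|s; theta)
    log_probs = F.log_softmax(logits, dim=-1)
    values = value_net(states).squeeze(-1)            # V(s; theta_v)

    advantage = returns - values                      # A(s, a) = G_t - V(s)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    policy_loss = -(chosen * advantage.detach()).mean()   # reinforce good actions
    value_loss = advantage.pow(2).mean()                  # critic regression
    entropy = -(log_probs.exp() * log_probs).sum(-1).mean()  # exploration bonus
    return policy_loss, value_loss, entropy
```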
In order to apply A2C to traffic signal control, we need to define the state space S and the action space A. Monte Carlo samples of Pa(s, s′) and Ra(s, s′) are generated by applying the agent policy in an episodic microscopic traffic simulation. We defined the action space of an agent that controls a single intersection as the set of possible signal phases. In this paper, all intersections have four approaches controlled in three phases. At every decision time step, the agent selects the next signal phase. An inter-stage is executed (switching the active signal group to red and the selected signal group to green) if required, and the chosen phase is applied for a minimum duration before the agent receives its immediate reward and the subsequent environment state. The state is represented as the concatenation of the following features (a sketch of assembling this vector is given after the list):
a. a vector encoding the current queue lengths on all incoming lanes;
b. a one-hot vector encoding of the last chosen signal phase at time t − 1;
c. the elapsed time since the last signal phase change;
d. for each signal phase, the elapsed time since it was last active.
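The sketch below illustrates how features (a)-(d) might be concatenated into one observation vector; the inputs are assumed to be read from the simulation at each decision step, and the function signature is illustrative.

```python
import numpy as np

def build_state(queue_lengths, last_phase, n_phases,
                steps_since_change, steps_since_active):
    """Concatenate features (a)-(d) into one observation vector.

    queue_lengths:      queue length per incoming lane (feature a)
    last_phase:         index of the phase chosen at t-1 (feature b, one-hot)
    steps_since_change: decision steps since the last phase change (feature c)
    steps_since_active: per-phase decision steps since last active (feature d)
    """
    one_hot = np.zeros(n_phases)
    one_hot[last_phase] = 1.0
    return np.concatenate([np.asarray(queue_lengths, dtype=float),
                           one_hot,
                           [float(steps_since_change)],
                           np.asarray(steps_since_active, dtype=float)])
```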
Elapsed time here is measured from the agent's perspective in the number of decision time steps. The queue lengths along all approaches represent a partial observation of current traffic demand, and all other state information summarises aspects of recent signalling history. The latter allows the agent to anticipate demand from elapsed time or learn a deterministic signal program (if that were an optimal solution). The instantaneous reward Ra(s, s′) is defined as the difference between the average queue length across all incoming lanes in state s and the average queue length in state s′. Thus, a positive instantaneous reward is given if the average queue length after an action is reduced, and vice versa. We also explored using average vehicle delay as the immediate reward in preliminary experiments but discovered a simulation artifact that resulted in reward hacking: as the vehicle delay in PTV VISSIM is only counted after a vehicle passes the intersection, agents would converge to suboptimal policies that prevented vehicles from crossing the intersection to avoid the associated penalty.

The policy network π(s; θ) consisted of three dense hidden layers with 64 units each and leaky ReLU activations (α = 0.05), followed by a Softmax layer with one unit per signal phase. The value network V(s; θv) also consisted of three dense hidden layers with 64 units each and leaky ReLU activations, followed by a single-unit linear layer. The state was normalised to [0, 1] independently along each dimension before it was fed into these networks. During training, the agent gathered experience by interacting simultaneously with two environment instances, which has been shown to stabilise training by de-correlating batches of experience. We used the Adam optimiser with a learning rate of 1 × 10⁻⁴, gradient clipping at value 1.0, and entropy regularisation. The contributions of the policy loss, value loss, and entropy loss were weighted with wπ = 1.0, wv = 0.5, and wh = 0.01, respectively. The agent was able to choose a signal phase once every 5 s of green time (ignoring the time spent changing signal phases). With this configuration, all agents were trained for 100 simulation episodes, each simulating two hours of traffic.
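Under the stated configuration, the two networks and the weighted loss could be set up as in this sketch (PyTorch; the state dimension is an illustrative assumption, and norm-based gradient clipping is assumed for the stated clipping value).

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim):
    # Three dense hidden layers with 64 units and leaky ReLU (alpha = 0.05),
    # as stated in the text, followed by a linear output layer.
    layers, d = [], in_dim
    for _ in range(3):
        layers += [nn.Linear(d, 64), nn.LeakyReLU(0.05)]
        d = 64
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

n_phases, state_dim = 3, 16                 # state_dim is illustrative
policy_net = nn.Sequential(mlp(state_dim, n_phases), nn.Softmax(dim=-1))
value_net = mlp(state_dim, 1)               # single-unit linear output

params = list(policy_net.parameters()) + list(value_net.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)

# Combined loss with the stated weights (losses as in the earlier A2C sketch):
w_pi, w_v, w_h = 1.0, 0.5, 0.01
# loss = w_pi * policy_loss + w_v * value_loss - w_h * entropy
# loss.backward(); nn.utils.clip_grad_norm_(params, 1.0); optimizer.step()
```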

Simulation Platform
In this paper, PTV VISSIM microsimulation software was used to investigate the impact of the proposed framework on performance measures such as queue length and stop delay. Akin to some relevant studies [47,52], an isolated four-leg signalised intersection was employed (the north entry has one lane, whereas the other approaches have two lanes). This intersection was taken from a sample example in VISSIM (used as a template to show the benefits of three-stage vehicle-actuated signal control over fixed time) to demonstrate the functionality of the proposed method under dynamic traffic demand. As shown in Figure 2, time-varying arrival rates were generated based on the VISSIM example at 15-min intervals over two hours to account for the fluctuation of traffic demand in a real-world network.

Driving Behaviours
In this study, two vehicle classes were defined for the simulation:
(1) Conventional vehicles: This type of vehicle has the typical characteristics of a human-driven car. The default VISSIM car-following model (Wiedemann 74) was used. Furthermore, a uniform distribution between 45 km/h and 55 km/h was used to generate the speeds of conventional vehicles.
(2) Connected and automated vehicles (CAVs): The driving behaviour of this vehicle class consists of two major components, autonomous behaviour and connected behaviour, which are explained below.

Autonomous Behaviour
Numerous studies have investigated the characteristics of autonomous vehicles (AVs) through various experiments [53][54][55]. AVs have some common features that should be considered in the simulation. For instance, AVs can accept a smaller headway than conventional cars [53,55]. They can also keep their speed constant without fluctuation during free flow [56], so a constant speed of 50 km/h, the mean speed of conventional vehicles, was used for CAVs in this research. Furthermore, autonomous vehicles accelerate more smoothly than conventional cars; unlike conventional cars, each particular speed has a unique acceleration and deceleration rate for this vehicle type, lying between minimum and maximum values [57]. Figure 3 depicts these acceleration characteristics. In this study, the default AV behaviour of VISSIM, which is based on a European project called CoEXist [59], was utilised. CoEXist tested its cars with four driving patterns (Rail Safe, Cautious, Normal, and All-Knowing). The main difference between these types of autonomous vehicles is the headway they accept. Cautious AVs are more conservative than conventional cars; thus, they keep larger headways than the other AV types. Cautious AVs were selected for simulating the autonomous behaviour of CAVs in this paper because this vehicle class is likely to be the first generation of highly automated vehicles and may penetrate transportation networks sooner than the other AV types.

Connected Behaviour
CAVs continuously pay attention to the data transmitted to them from other vehicles (V2V) and traffic infrastructure (V2I). In particular, CAVs receive information about the upcoming signal and modify their speed to arrive during the green phase without stopping at signalised intersections. Therefore, an internal script is used that reads this information from PTV VISSIM [58] and updates the desired speed of CAVs so that they arrive at the signal within green. This speed guidance is a rule-based algorithm [34]; the version developed here mainly comprises the following steps.

• Step 1. The first question asked of every vehicle entering the network is whether the car is able to receive signal data. Conventional vehicles proceed at their desired speed (the speed at which the driver wants to drive). However, if the vehicle is connected, Step 2 is executed.

• Step 2. The vehicle continues at its current speed if it has already passed the intersection or if no signal controller lies ahead of it; otherwise, Step 3 is performed.

• Step 3. In this step, the following question is answered: "Is the signal in its green phase?" If the response is negative and the signal controller is in its red phase, the vehicle speed must be adjusted using (5). Otherwise, go to Step 4.
Vopt = max(min(Vmax for green start, Vdes) − Vdiff, Vmin) (5)

In this equation, Vopt is the optimal speed of the vehicle and Vdes is its desired speed. The role of Vdiff is to adjust the vehicle speed so that the vehicle arrives shortly before the signal head; it was assumed to be 2 km/h in the simulation. Vmin is the lowest feasible speed of vehicles in the network, set to 5 km/h in this paper. Finally, the vehicle should not drive above Vmax for green start in order to arrive just when the next green starts. This speed is obtained from (6).
Vmax for green start = (vehicle distance to signal head) / (time until the next green phase starts) (6)

• Step 4. If Vmin for green end (the minimum speed required to arrive at the intersection during the current green) is lower than the desired speed of the vehicle, then Vopt is set equal to Vdes. Otherwise, Step 5 is executed. Vmin for green end is calculated by (7).
Vmin for green end = (vehicle distance to signal head) / (time until the next red phase starts) (7)

• Step 5. If Vmax for green start is greater than the desired speed of the vehicle, then Vopt is set equal to Vdes; otherwise, Vopt = Vmax for green start. Through this procedure, the optimal speed of every CAV in the network can be calculated; a sketch of the procedure is given below.
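The following is a minimal sketch of Steps 3-5 as a single function (speeds in km/h, distances in metres). Steps 1-2 (connectivity and look-ahead checks) are assumed to be handled by the caller, which would then write the returned value to the vehicle's desired speed via the COM interface; the function signature is an illustrative assumption.

```python
def advised_speed(v_des, dist_m, t_to_green_s, t_to_red_s, is_green,
                  v_min=5.0, v_diff=2.0):
    """Rule-based speed advisory (Steps 3-5); all speeds in km/h.

    dist_m:       vehicle distance to the signal head (numerator of (6)/(7))
    t_to_green_s: time until the next green phase starts, assumed > 0
    t_to_red_s:   time until the current green ends, assumed > 0
    """
    to_kmh = 3.6  # m/s -> km/h conversion
    v_max_green_start = dist_m / t_to_green_s * to_kmh        # eq. (6)
    if not is_green:
        # Step 3: signal is red -- arrive shortly after the green starts, eq. (5)
        return max(min(v_max_green_start, v_des) - v_diff, v_min)
    # Step 4: signal is green -- can the vehicle reach it before the green ends?
    v_min_green_end = dist_m / t_to_red_s * to_kmh            # eq. (7)
    if v_min_green_end < v_des:
        return v_des
    # Step 5: current green is unreachable -- aim for the start of the next green
    return v_des if v_max_green_start > v_des else v_max_green_start
```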
The proposed behaviours were entered into VISSIM for both conventional vehicles and CAVs. The next subsection describes the simulation scenarios designed for this study.

Simulation Scenarios
It is important to perform sensitivity analysis on various parameters to achieve more accurate and comprehensive findings. Therefore, several simulation scenarios were defined in this paper. Different market penetration rates of 0%, 25%, 50%, 75%, and 100% for CAVs formed the first set of scenarios. Moreover, three scenarios for the total demand were designated. The first scenario was called saturated, in which the travel demand equals the capacity of each signal phase. The second and third scenarios were called oversaturated and unsaturated, with travel demands equal to 1.1 and 0.7 of the saturated condition, respectively. The phase capacity can be calculated by (8):
c = s × (g/C) × N (8)

where c is the capacity (veh/h) and s is the saturation flow rate, taken as 1900 veh/h per lane based on the Highway Capacity Manual [60]. Furthermore, g is the effective green time of the phase in seconds, C is the total cycle length in seconds, and N is the number of lanes in the lane group. Finally, different signal plans should be compared to measure the relative improvement of our approach over other widely used methods. First, the intersection with fixed signal timing alongside AVs was investigated. This scenario assumes that vehicles cannot receive any information from the signal controller (the AV scenarios do not include the speed guidance approach); this situation was called "Fixed AV". Next, fixed signal timing was considered again, but the AVs in this case are able to receive signal data and adjust their speed based on the speed guidance approach (i.e., CAVs were used instead of simple AVs); this situation was called "Fixed CAV". Finally, the actuated signal plan was examined under the different AV penetration-rate scenarios (AVs without speed guidance, i.e., not CAVs); this case was named "Actuated".
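As an illustration, the sketch below computes the phase capacity in (8) and the three demand levels; the cycle and green times are placeholders, not the values used in the study.

```python
def phase_capacity(s, g, c_len, n):
    """Lane-group capacity, eq. (8): c = s * (g / C) * N  [veh/h]."""
    return s * (g / c_len) * n

S = 1900.0   # saturation flow rate, veh/h per lane (HCM value from the text)
C = 90.0     # cycle length in seconds (illustrative placeholder)
g = 40.0     # effective green of one phase in seconds (illustrative placeholder)

saturated = phase_capacity(S, g, C, n=2)   # two-lane approach
demands = {"unsaturated": 0.7 * saturated,  # demand factors from the text
           "saturated": 1.0 * saturated,
           "oversaturated": 1.1 * saturated}
```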
Thus, 60 scenarios (five vehicle compositions × three demand levels × four control conditions: Fixed AV, Fixed CAV, Actuated (AV), and RL (CAV)) were generated. Each scenario was executed in VISSIM with up to five different random seeds. The same process was carried out for the RL approach, and the expected improvements in the performance measures were assessed under the CAV environment.

Results
Two different performance measures are implemented: minimising total queue length and total stop delays. The main purpose of this paper is to develop a framework that minimises these performance measures. The minimisation of queue length is the first part of the performance index (PI). Queues occur when vehicles cannot pass the intersection during their green share and must stop behind the red light. The average queue length over all entries across the whole simulation period is used as the reward in the RL feedback loop. The second part of the PI is the stop delay of vehicles, a common performance measure in optimal speed guidance systems under a connected environment [28]. It is defined as the mean stop time (in seconds) of vehicles waiting behind red lights at the intersection. In order to use the two aforementioned performance measures simultaneously in one expression, a transformation strategy was used to convert each measure into a non-dimensional term. According to [61], the average queue length (q) can be divided by its maximum value (Qmax).
Moreover, stop delay (d) can be assessed per cycle, so it is divided by the signal cycle time (C) discussed in the previous section. Therefore, (9) to (11) illustrate how to determine the PI value for each scenario; in these equations, n represents the number of simulation replications.
Queue Length ratio = (1/n) ∑ (q / Qmax) (9)

Stop Delay ratio = (1/n) ∑ (d / C) (10)

PI = Queue Length ratio + Stop Delay ratio (11)
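A minimal sketch of evaluating (9) to (11), assuming arrays holding the per-replication average queue length and stop delay:

```python
import numpy as np

def performance_index(avg_queue, stop_delay, q_max, cycle_s):
    """PI per eqs. (9)-(11); avg_queue and stop_delay are per-replication arrays."""
    queue_ratio = np.mean(np.asarray(avg_queue) / q_max)      # eq. (9)
    delay_ratio = np.mean(np.asarray(stop_delay) / cycle_s)   # eq. (10)
    return queue_ratio + delay_ratio                          # eq. (11)
```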
In the first step, it is worth comparing the PI values of the various signal plans for both saturated and oversaturated conditions. The results for the average PI value are shown in Figures 4 and 5. As can be observed, the speed guidance approach works as intended: the PI values for the fixed CAV condition are lower than the corresponding values for the fixed AV condition in every scenario. In other words, a 100% penetration rate of CAVs, compared to a conventional environment (0% penetration rate), decreases the PI in the saturated scenario by 18% and 43% for the fixed AV and fixed CAV conditions, respectively. Moreover, the fixed CAV situation performed better than the actuated case. Therefore, the fixed CAV condition is the best case among all non-learning approaches, and the RL approach should be compared against it in order to assess its functionality.
As shown, the RL approach noticeably reduces the PI compared to the fixed CAV condition, especially in the 0% and 25% scenarios. However, only a slight improvement occurred in the last three vehicle-composition scenarios (i.e., 50%, 75%, and 100%), particularly in the oversaturated scenario. This is mainly because of the assumption that CAVs in this paper can only obtain the signal data, which makes the simulation simpler and faster. The results show that total network performance can still be improved under this simple CAV environment. Future work should consider CAVs that can also receive and interpret information transmitted from other CAVs (V2V). In other words, the driving behaviour implemented in this study for CAVs only models V2I; it cannot cover the V2V connection, whose missing contribution becomes apparent from the 50% penetration rate onwards.
Another point to note is that, in most cases, the queue length ratio for the RL approach increases as the CAV penetration rate increases. For instance, the maximum value of the queue length ratio for the RL in the saturated scenario is 43% higher than its minimum value in the 25% scenario. Although this seems counter-intuitive at first glance, it may stem from the fact that the speed guidance approach defined for this research does not consider vehicle positions: CAVs receive the signal data as soon as they enter the network. Moreover, the algorithm does not specify different ranges for the minimum speed of CAVs, which was assumed to be 5 km/h in this study. Such a wide reception range for the signal information, in conjunction with this very low minimum speed, can cause long queues behind CAVs.
So far, PI values have been analysed for the saturated and oversaturated scenarios. One unforeseen outcome highlighted in Figure 4 is that the PI value escalates suddenly when moving from 25% to 50% CAVs for the RL approach. In order to better understand this trend, we present the PI variation in the unsaturated condition for these two CAV scenarios; the results are displayed in Figure 6. As expected, the fixed CAV, actuated, and RL approaches all improve the conditions of this intersection compared to the simple fixed AV situation, with average improvements of 23%, 40%, and 70%, respectively. Therefore, RL has the best performance in unsaturated conditions among all the algorithms. The main reason is that, despite the optimal fixed timing used for CAV speed guidance, the proposed real-time RL could optimise the signal plan based on the time-varying traffic demand. This means that more vehicles can pass the intersection during their green time, and each entry is cleared within its green share, so queue length and stop delay are minimised and the intersection operates close to the ideal condition.
Furthermore, all approaches show a decreasing PI when transitioning from 25% CAVs to 50% CAVs. Hence, it is concluded that, unlike in saturated and oversaturated conditions, vehicle automation can play a beneficial role in unsaturated traffic conditions.
The results of this analysis also highlight that the PI value for the proposed RL framework with speed guidance under a CAV environment is much lower than that of the other control systems at all penetration rates of CAVs (AVs) and all demand scenarios. The proposed framework is a novel way of framing the RL state in a CAV environment with speed guidance, and it works more flexibly than RL under a conventional environment or a speed guidance system under CAVs with fixed-timing signal control.

Conclusions
This study develops a learning-based framework that optimises the signal plan by training a traffic signal controller as an RL agent to minimise total queue length (the RL reward) while CAVs receive speed guidance under a fixed-time strategy to minimise total stop delays. The signal controller (agent) was trained in a fully dynamic traffic environment (traffic flows and CAV speeds) under different demand levels (unsaturated, saturated, oversaturated) and CAV penetration rates (0%, 25%, 50%, 75%, 100%) to reveal potential interaction effects between the signal timing plan, CAV penetration rate, and traffic flow.
An isolated four-leg signalised intersection with three phases (a training example in VISSIM provided as a template for vehicle-actuated signal control) was modelled to test the performance. The results were compared to a well-tuned fixed timing plan (with optimal speed advisory in a CAV environment) and actuated signal control. Two objective functions (queue length and stop delay) are combined in a performance index. An important finding from this study is that the proposed framework reduced stop delay significantly under all scenarios and was comparable to the other control strategies in terms of queue length.
Finally, the proposed framework only considers simple V2I communication at an isolated intersection. The framework is designed to handle urban networks, and evaluation of the interaction between adjacent intersections (for example, interchanges) under a V2X environment is planned. The main limitation of a network-wide application is the computational time required for RL training. In future work, offset optimisation could be added to the signal timing optimisation to implement the proposed approach in an arterial corridor with multiple traffic signals or a city-wide traffic network; this can improve the overall performance of the proposed approach but increases the complexity of the RL training process. Future research will also make use of V2V communication and dynamic speed guidance strategies. Lastly, with dynamic speed guidance strategies, which have different communication and control levels compared to simple V2I, it may be helpful to analyse the reward and state options of RL with different performance measures.