1. Introduction
In the age of the Internet of Things (IoT), there are complex systems and many interconnected devices that produce large amounts of data. In recent years, Artificial Intelligence (AI) has been used to handle these devices and data and to supply intelligent control for complex scenarios due to its capacity and ability to manage complex tasks. Numerous studies have attempted to assess the efficacy of artificial intelligence and the Internet of Things (IoT) in relation to Intelligent Transportation Systems (ITS), which are crucial to the development of smart cities and the quality of people’s daily lives [
1,
2,
3].
On the other hand, traffic in urban areas is constantly increasing and the resulting congestion is a major concern for transportation management. One of the most important goals of research in the field of transportation at the global level is to optimize traffic flow. According to a report by the U.S. Department of Transportation [
4], many cities face challenges in controlling traffic flows and reducing congestion. Thus, many algorithms and approaches have been proposed to solve traffic signal control challenges. Many cities use a fixed-time traffic signal control system that works according to a predefined schedule. Nevertheless, they are not optimal because they are not affected by the traffic conditions, and traffic changes over time and is not constant. Hence, this method cannot adapt to the dynamics and changes in the environment and results in traffic [
5].
With the advent of sensors (for example, loop detectors, radar, cameras, etc.) at intersections, it has become possible to implement Adaptive Traffic Signal Controllers (ATSC) which aim to optimize traffic flows by accommodating signal timing based on real-world traffic conditions. ATSC methods such as the Split, Cycle, Offset, Optimization Technique (SCOOT) [
6], and the Sydney Coordinated Adaptive Traffic System (SCATS) [
7], which are both centralized, have been used in many cities around the world to reduce congestion. However, these methods have some problems: they need many sensors, networks of computers for implementation, and a control center with a human operator to manage. Considering the fact that implementation and preservation costs are too high, it is not an optimal solution to deploy in metropolitan cities [
8]. Optimization Policies for Adaptive Control (OPAC) [
9] and the Real-Time Hierarchical Optimized Distributed Effective System (RHODES) [
10], which are decentralized with wrapped computation [
1] and a PRODYN algorithm [
11], are similar methods that have been used for traffic control, but they have high computing costs. Although these controllers can change their phase duration or sequence, they cannot do it in real-time and dynamically. Adaptive approaches can potentially improve performance, but they are difficult to develop [
12]. Reference [
13] compares these methods and discusses their advantages and disadvantages.
IoT, on the other hand, is an environment and a platform that links people, devices, and computers by exchanging data through machine-to-machine or machine-to-human interaction [
14], which alone has many complex systems, network infrastructures, and a large number of interconnected devices that generate vast amounts of data. In order to not only handle these devices and traffic data but also to control complicated scenarios intelligently, Machine Learning (ML) methods as AI techniques have been widely used in recent years due to their ability to perform complex tasks [
1]. The ML methods, such as neuro-fuzzy [
15], immune network algorithms [
16], neural networks [
17], and genetic algorithms [
18] have been used for ATSC in recent years. Nevertheless, these methods require a lot of computational costs.
On the other hand, Reinforcement Learning (RL) is an effective ML approach among all the ML algorithms that has the powerful advantage of learning from experience and is used in the design of ATSC [
19,
20].
Although many Reinforcement Learning algorithms have been used to design ATSC, centralized RL-based control methods are not suitable when there are numerous intersections, because the state of all intersections must be collected as a general state and an action must be applied at each intersection. This will increase the delay and, in turn, the state space increases exponentially. Therefore, model training becomes difficult in such a complex situation. This is why a decentralized method is needed [
21]. By combining ML and the decentralized nature of IoT, it is important to measure the usefulness of Machine Learning-based ITS because the ITS depends on people’s daily lives and is one of the most crucial aspects of creating a smart city [
1].
The purpose of this research is to apply a combination of ML and IoT approaches to provide an intelligent traffic signal control solution for multiple intersections. This is done by using Reinforcement Learning techniques where the RL agent learns the best control policy through collaboration with the environment. The observations of each intersection are exchanged with its neighboring intersection in a distributed way to obtain the optimal global schedule for the whole system [
1]. Moreover, due to the limited communication among agents, the exchange of information between intersections becomes difficult. Therefore, a method of observations and fingerprints of neighboring agents is used to stabilize each local agent’s learning [
22]. To evaluate the effectiveness of the implemented algorithm, we conducted numerical simulations for two synthetic intersections by simulated data and a real-world map of Shiraz City (from the Open Street Map (Openstreetmap. [Online]. Available:
https://www.openstreetmap.org (accessed on 1 January 2022)) (OSM)) with real-world traffic data received from the transportation and municipality traffic organization. The implementation is done on an opensource simulation platform, SUMO (Simulation of Urban MObility) [
23]. The simulation results indicated that our proposed approach performs more efficiently than the fixed-time traffic signal control scheduling implemented in Shiraz in terms of vehicle average queue lengths and waiting times at intersections.
The contributions of this article are as follows:
IoT technology and Machine Learning methods have been used to make intelligent traffic light control systems at Shiraz City intersections.
Real traffic data have been used to implement a real-world scenario in Shiraz City to localize the system and consider all the challenges of a real-world scenario.
A distributed Multi-Agent Reinforcement Learning (MARL) algorithm has been used at each intersection for traffic signal control.
The cutting-edge advantage actor-critic (A2C) algorithm has been applied where deep neural networks (DNN) are used for policy and value approximations.
The currently running traffic control system of Shiraz City, which has been implemented using SCATS, has been compared with the proposed method.
The remainder of this paper is structured as follows:
Section 2 examines related works. Background and formulations are described in
Section 3. In
Section 4, the proposed method is introduced.
Section 5 describes numerical experiments and evaluation results and finally, in
Section 6, conclusion of the research is outlined.
2. Related Work
Researchers have always been interested in traffic signal control systems. Studies on this subject are divided into various traffic light control methods.
A. G. Sims et al. (1980) [
24] introduced a system called SCATS, which is an urban traffic control system. The system consists of several small computers in the control center in Sydney. Specifically, it is an intelligent transportation system that manages real-time signal timing in traffic lights and uses traffic light sensors to detect vehicles in each lane. In this system, induction loops are used to detect the presence of vehicles. Information about the passage of vehicles is collected at intersections and transmitted to the traffic control center. The center analyzes this information then the appropriate green time, which is the system output, is reached. The SCATS system offers many benefits by reducing travel time, reducing accidents, saving fuel, and reducing air pollution. Moreover, implementing this control method, in addition to the high cost of purchase and installation, requires a human operator to control the system remotely, so it can be disrupted due to a lack of proper maintenance.
Hosur et al. (2019) [
25] proposed a framework using IoT technologies that evaluate the traffic density via IR sensors to achieve dynamic timings for the traffic light. In their proposed system, they considered some threshold distance when the sensor detects any vehicle within this distance using IoT technologies. When other roads are empty of vehicles, it switches to a green light. The IoT can help to access components from far places, and their proposed system is beneficial for non-peak hours and saves power during non-peak hours. The disadvantage of their work is that they do not consider peak hours because most vehicles will only be present during these rush hours, which is an essential factor for traffic system control.
Liang et al. (2019) [
26] changed the traffic light signal durations according to the discrete values of the actions. They collected the data from sensors and divided the whole intersection into small grids [
21]. The information received from these sensors is difficult to process to find the duration of green and red lights. Such algorithms have low performance in peak traffic conditions, and the main reason is to ignore the impact of the current phase time on future traffic [
27]. Value-based methods are more suitable for solving problems with discrete states than with continuous states such as traffic flows [
21].
Lillicrap et al. (2015) [
28] extended the idea of Q-learning to the continuous action domain. Although their proposed algorithm was able to discover policies whose performance was competitive with predefined scheduling algorithms due to the dynamics of the environment, it required a complete state sequence to update the policy. In fact, policy-based methods can work with continuous states, but the convergence of the training process is complicated, which is an important factor for continuous traffic flow [
21]. In addition, this method has a high bias and variance [
22].
Aslani et al. (2017) [
29] proposed actor-critic adaptive traffic signal controllers to optimize traffic signal controllers in the traffic network of Tehran city for 24 h. They also developed different actor-critic algorithms based on different function approximations and compared them with six different scenarios. They showed that actor-critic in ATSC with centralized agents performed better than Q-learning. This work focused on discrete action RL but did not realize continuous actions.
Chu et al. (2019) [
22] proposed a decentralized and fully scalable MARL algorithm for the deep RL agent called the advanced actor-critic (A2C) in the ATSC. They also proposed two methods to improve learning by enhancing observability and reducing learning difficulty for each local agent: the fingerprint of neighboring agents and spatial discount factor. They compared their multi-agent A2C algorithm with the independent A2C and IQL in both the synthetic traffic scenario and the real-world scenario of Monaco. The results of their work showed the optimality of the proposed algorithm compared to other decentralized MARL algorithms.
Wang et al. (2021) [
21] also proposed an A2C algorithm, but they applied a region-aware cooperative strategy based on a graph attention network to overcome the problem of partial observability of each local agent.
Hongwei Ge et al. (2021) [
30] proposed a MARL algorithm for traffic signal control. They also proposed transfer and encoder paradigms to enhance the agent’s learning ability. Specifically, they improved the cooperation strategies of the algorithm. Their focus in this work was to increase the capability of agents to learn. They showed the robustness of their algorithm, but it limits its implementation in real scenarios. According to the research presented, Cases [
24] attempted to control the traffic signals with SCATS which is an adaptive traffic signal control, but it cannot do it in dynamic conditions. Moreover, Case [
25] proposed a framework using IoT. Cases [
26,
28,
29] provided traffic light control based on reinforcing learning methods. Cases [
21,
22,
30] proposed multi-agent Reinforcement Learning for traffic signal control. It should be mentioned that most previous research focused mainly on single intersections and simulated data, while it is quite rational to use real-world data from IoT sensors for multiple intersections. Moreover, there is no research available in the literature for Shiraz City that addresses traffic signal control using Machine Learning algorithms with real-world data.
In this paper, we used the Advantage Actor-Critic (A2C) algorithm combined with IoT approaches for the global control of each local agent. We also used observations and fingerprints inspired by neighboring agents in the state. Therefore, each local agent has more additional information about the distribution of regional traffic and cooperative strategy. In addition, this algorithm is implemented in a real-world scenario with six intersections of Shiraz City to consider all the challenges of a real-world scenario.
4. The Proposed Method
Shiraz is one of Iran’s most crowded metropolitan cities, with 63 junctions. One of the cons of the city is that there are still fixed-time traffic light systems that are implemented with the SCATS system and work according to a predefined schedule that causes increased travel time, fuel consumption, and air pollution, in addition to which, heavy traffic behind red lights causes psychological damage to the driver. Therefore, in order to intelligently control the intersections and examine this issue more closely, six real-world intersections were selected to be examined in this research.
The area between Imam Ali Bridge and Deh Bozorgi Bridge, according to the data received from the Transportation and Traffic Organization of Shiraz Municipality, consists of four Bridges, including six intersections, which have heavy traffic congestion due to their proximity to offices, parks, historical gardens, and tourist attractions. Thus, it requires efficient traffic control measures such as traffic signal controls. Therefore, this area of Shiraz has been chosen as the study area due to the fact that the traffic output of one intersection affects the traffic volume of another intersection.
In order to implement real-world intersections, an agent-based traffic simulator was used. Therefore, in this paper, the Simulation of Urban MObility (SUMO) was used for agent-based traffic simulation, which is an open-source, highly portable, microscopic, and ongoing multi-modal traffic simulation package that is built to handle massive networks.
In this section, different features of the traffic simulation are explained. Moreover, six real-world intersections of Shiraz City, along with the Reinforcement Learning control method that was applied for multi-agent Reinforcement Learning, are described. The goal is to design challenging and real-world traffic environments.
4.1. Environment, Agents and Traffic Demands
The environment is the traffic network, consisting of streets, vehicles, intersections, and agents interacting with each other. Agents are also considered in this article as signalized intersections or traffic signals.
The study area of Shiraz is shown in
Figure 1, and includes four bridges, two of which have north- and south-signalized intersections, as shown in
Figure 2.
The four Bridges are as follows:
Imam Ali Bridge: It has two intersections, north and south, both of them have two phases, and the duration of the green phase is 27 s in the existing static configuration. At the northern intersection, the first phase is N–S and S–N, which means that vehicles are allowed to pass from north to south and south to north. The second phase of the northern intersection is W–E. At the southern intersection, the first phase is N–S, the second phase is E–W and W–E.
Hejrat Bridge: It also has two intersections, and the duration of the green phase is 42 s for both of them. At the northern intersection, the first phase is N–S and S–E, and the second phase is W–E. At the southern intersection, the first phase is N–S, and the second phase is E–W.
Safa Garden Bridge: It has one intersection with two phases; the first phase is N–S, which, due to the congestion on this side, has a longer duration for the green phase of 70 s. The second phase is W–E, which lasts 24 s.
Deh Bozorgi Bridge: It has a three-phase intersection; the first phase is W–E for 30 s, the second phase is east to west for 29 s, and the third phase is N-S which lasts for 22 s.
We should mention that the yellow phase duration is considered 3 s for all of the intersections in existing static configuration. In addition, the phases are shown in
Figure 3.
To take into account all of the challenges of the real world, four groups of time-varying traffic flows were simulated. For all four groups, traffic flows that entered the intersections from arterial streets were also considered.
The first group had a traffic flow that covered the entire route. This stream entered the Imam Ali Bridge and exited the Deh Bozorgi Bridge through Hejrat and Safa Garden Bridges.
The second group of flows also covered the entire route, but unlike the first group, it entered from the Deh Bozorgi Bridge and passed through the Safa Garden and Hejrat Bridge, and exited the Imam Ali Bridge.
The third flow group entered from the north of the Deh Bozorgi Bridge and exited from its south, and vice versa. The same flow was defined for Safa Garden Bridge.
Finally, the fourth group included the flows from the north of the northern Hejrat intersection to the south of the southern intersection of Hejrat Bridge and vice versa, from the south to the north of the southern intersection to the northern intersection of the Hejrat Bridge. The same flow was defined for Imam Ali Bridge.
We defined these flow rates according to information from the Transportation and Traffic Organization of Shiraz Municipality at different times of the day.
4.2. IoT Agents as MA2C for Traffic Signal Control
In this paper, we employed RL, a data-driven method for adaptive traffic signal control in intricate urban traffic networks, and because our simulation scenario included four bridges with six interconnected intersections of Shiraz City, we could not use the centralized RL. Therefore, the multi-agent RL (MARL) was used, which can overcome the scalability issue. This means that more intersections could be controlled. Distributed MARL was installed in the traffic light system in such a way that an RL agent was located at each intersection to manage the local traffic lights for vehicles in all directions. Surveillance cameras, which were IoT sensors, were also placed on each side to capture the queue lengths of the vehicles. Additionally, the agent gathered local traffic data that was tracked by cameras and recorded the information in a local IoT database. As an IoT technique, agents also gathered information from neighbors by exchanging information across network connections. Neighbor data were also kept in the same database, as seen in
Figure 4. Based on the data in the database, the actor-critic algorithm, which is an RL type, selects the optimal control action from a list of predefined actions, and the IoT actuator, such as a traffic light, applies the selected action to the environment. Moreover, in order to better communicate and coordinate between intersections, the combination of MARL and A2C (MA2C) was used in this paper, which makes each intersection not only aware of its own policy but also aware of the policy of other intersections. Therefore, the traffic impact of neighboring intersections can be controlled at the desired intersection. Finally, in order to stabilize the learning procedure by improving the observability of each local agent, the fingerprint method of neighboring agents was incorporated. This means that observations and fingerprints of neighboring agents were included in the local agent state. In the fingerprint method, we incorporated the most recent neighborhood policies
,
in the DNN inputs, where
is the neighborhood of agent
. The local policy is calculated as:
Therefore, each local intersection contains more information on neighbors’ policies in addition to the regional traffic distribution and cooperation strategy.
The following is a definition of state, action, and reward based on [
22].
We define each state as follows:
where
l is the incoming lanes of the intersection
i. wait[s] is the cumulative delay of the first vehicle, and the wave[veh] measures the total number of vehicles entering the intersection lanes. Both wait and wave are measured as shown in
Figure 2 by using induction-loop detectors (ILD) marked in blue.
LaneAreaDetector in SUMO is also used to obtain this information.
In this paper, we define the action for an intersection of all possible phase combinations of traffic lights. This definition allows the agent to control the traffic signal more flexibly. In other words, a set of all existing static phases of Shiraz is defined for each intersection so that the RL agent chooses one of them that lasts for t at each step; in which t is the interaction period between each agent and the traffic environment.
The reward should be measurable and evaluable, directly dependent on the state and indicating to the agent whether the chosen action is good or bad. In this paper, the queue length at each incoming lane and the average waiting time of drivers are considered as a reward and measured at time
.
where
a[veh/s] is a tradeoff factor. This reward definition emphasizes traffic congestion and trip delay.
4.2.1. DNN Structure
The traffic flows are complicated spatial–temporal data, so MDP may become non-stationary if the agent only knows the current state. One direct strategy is to include all historical states as A2C input. However, this dramatically increases the dimension of the state and may reduce A2C’s attention to recent traffic conditions. Fortunately, long–short term memory (LSTM) is a potential DNN layer that keeps hidden states to memorize brief history. Therefore, we used LSTM as the last hidden layer to extract representations from various types of states. We defined the states for each input line as the input of neural networks, which include the number of input vehicles at the intersection and their waiting time within 50 m of the intersection. In addition to the wave and wait as state, we also included a neighbor policies node as input. Then we processed these values with a fully connected layer. Finally, we concatenated the output in an array and then normalized them. In order to select the best action from the available actions, we used the Softmax as an activation function, and for the critic we also used the linear relation as an activation function to return the reward [
22].
Figure 5 shows the DNN architecture in actor-network and critic-network.
DNN Training
In our method, agents learn their policy since each agent is distributed within the six intersections. Therefore, as shown in
Figure 5, each agent has its actor-network and critic-network. We train the actor and the critic of the DNN separately. The input of each actor-network is the local agent’s observations or state (Equation (8)). Based on the model that has been trained so far, it performs an action which has predefined phases that are mentioned in sub-
Section 4.1. These phases have been obtained from the Shiraz transportation organization and municipality. Based on the selected action, each agent obtains a reward from the environment (Equation (9)), and based on it, the critic-network examines whether the selected action is appropriate for the existing conditions or not, and then the weights are updated. During the training process, the input of each critic-network includes not only the global states of all intersections but also the global actions of all agents. After that, each agent receives a reward, then we enter the new state, and these steps are repeated again. The model is trained at any given time based on the data given to it. Moreover, due to intersection cooperation, neighboring information can be shared via network connections of IoT devices. By exchanging messages through the hop connections, information that is more than one hop away can also be propagated throughout the network to achieve an approximate global optimization. Additionally, by increasing the observability and lowering the learning difficulty of each local agent, the fingerprint method has been utilized to stabilize the learning process. In this way, we include the latest policies or the last actions of the neighboring agents in the state so that each local agent is better informed about the regional traffic distribution and the cooperation technique of the neighboring agent.
4.2.2. Normalization
Normalization is a crucial factor in DNN training. A greedy policy is applied for each wave and wait state to gather statistics relevant to a certain traffic environment and use them to produce an accurate normalization. To avoid the gradient explosion, all normalized states are clipped to [0, 2]. Similarly, to stabilize the mini-batch updating, we normalized the reward and clipped it to [−2, 2] Also, the wave and wait normalization factors are 5 veh and 100 s, respectively [
22].
5. Numerical Experiments and Evaluation Results
As mentioned in
Section 4, we executed our traffic signal control method in the SUMO [
23] simulation, which can model microscopic traffic conditions. We also used Traffic Control Interface (TraCI) (
https://sumo.dlr.de/docs/TraCI.html (accessed on 1 January 2022)) as an API, which gives online access from Python to traffic simulation to retrieve simulated objects’ values and manipulate their behavior. We train the MA2C algorithm over 1M steps, which is around 1400 episodes. Then we evaluate the obtained model over 10 episodes. Additionally, 10 distinct seeds are used to create various training and evaluation episodes. For MDP, we set
0.99. Four time-varying traffic flow groups are designed as unit flows of 325 veh/hr. This traffic flow strongly matches into the real world. The general configurations of the simulation are shown in
Table 1,
Table 2 and
Table 3.
We evaluated MA2C-based ATSC in two traffic environments: two synthetic traffic grids for evaluating the results with synthetic data and a real-world six-intersection traffic network extracted from Shiraz City for evaluation with real data.
5.1. Synthetic Traffic Grid
As illustrated in
Figure 6, two synthetic intersections are formed by three lanes with a speed limit of 8.89 m/s, and vehicles with 5 m length are defined for it. The action space for two intersections contains four possible phases: E–W straight phase, E–W left-turn phase, and N–S straight phase, N–S left-turn phase. Moreover, each vehicle’s route is generated randomly during run-time.
5.1.1. Training Results
As Reinforcement Learning learns from accumulated experience and eventually reaches the local optimum, the learning curve initially rises and then converges. Therefore, this algorithm has achieved a good result.
5.1.2. Evaluation Results
The network’s average queue length for each simulation step is shown in
Figure 8.
This figure represents the total number of vehicles at all junctions at red lights. First, because the number of vehicles is small, the queue length is also short, then as the number of vehicles in the middle of the simulation time increases, the queue length also increases and again decreases with the decreasing number of vehicles.
5.2. Shiraz Traffic Network with Real Data
We exported the map of Imam Ali Bridge to Deh Bozorgi Bridge in Shiraz City from Open Street Map (OSM) as shown in
Figure 9. The map was converted into SUMO- compatible topology by the netconvert tool as shown in
Figure 1.
After conversion, we applied the algorithm we used to each traffic light in
Figure 1. Along with the real-world map, we also used real-world traffic data, which are the implemented phases in Shiraz that are obtained from the transportation and municipality traffic organization of Shiraz City. In totally, there were six signalized intersections: five were two-phase, and the last one had three phases. In addition, as stated in
Section 4, four time-varying traffic flow groups were designed as unit flows of 325 veh/hr to simulate the peak-hour traffic and to take into account real-world challenges and evaluate the robustness and optimality of the algorithm.
5.2.1. Training Results
Figure 10 plots the training curves of the MA2C algorithm. It converges to reasonable policy and has a stable convergence.
5.2.2. Evaluation Results
Figure 11 and
Figure 12 represent the average queue length of vehicles and their waiting time (average intersection delay) over the simulation time, respectively, in which our proposed system was compared with the traditional traffic control system of Shiraz City using the fixed-time traffic signal control system.
As shown in
Figure 11, the MA2C algorithm performed better than Shiraz City’s traditional traffic control system, which uses fixed-time scheduling and does not consider environmental traffic conditions. The main reason for the efficiency of this system is that by taking into account the traffic volume on each side of the intersection using IoT technologies, it switches the traffic light into green for that side and into red if the traffic on the opposite road is less congested, or free of vehicles. Furthermore, the robustness of our system appears during peak hours because it works according to traffic conditions. Therefore, as seen in the figure, the MA2C algorithm can manage the queue length of vehicles more effectively.
As the results show in
Figure 12, our system could reduce and then maintain intersection delays by coordinating and distributing traffic homogeneously among neighboring intersections, specifically when the local traffic flow is maximized greedily in the middle of the simulation time. Therefore, the MA2C algorithm can significantly reduce vehicle waiting time compared to the fixed-time traffic signal control system of Shiraz City.