An Edge Based Multi-Agent Auto Communication Method for Traffic Light Control

With smart city infrastructures growing, the Internet of Things (IoT) has been widely used in intelligent transportation systems (ITS). The traditional adaptive traffic signal control method based on reinforcement learning (RL) has expanded from one intersection to multiple intersections. In this paper, we propose a multi-agent auto communication (MAAC) algorithm, an innovative adaptive global traffic light control method based on multi-agent reinforcement learning (MARL) and an auto communication protocol in an edge computing architecture. The MAAC algorithm combines a multi-agent auto communication protocol with MARL, allowing an agent to share its learned strategies with others to achieve global optimization in traffic signal control. In addition, we present a practicable edge computing architecture for industrial deployment on the IoT, considering the limitations of network transmission bandwidth. We demonstrate that our algorithm outperforms other methods by over 17% in experiments in a realistic traffic simulation environment.


Introduction
Traffic congestion has caused a series of severe negative impacts, such as longer waiting times, higher fuel costs, and severe air pollution. According to a 2014 report [1], the loss caused by traffic jams is up to US $124 billion a year in the US. The shortage of traffic infrastructure, the growing number of vehicles, and inefficient traffic signal control are key underlying causes of traffic congestion. Among these, the traffic light control problem seems the most tractable. However, the internal operation of a real urban transportation environment cannot be accurately calculated and analyzed mathematically due to its complexity and uncertainty. Reinforcement learning (RL), which is characterized by being data-driven, model-free, and self-learning, is well suited for research on adaptive traffic light control algorithms [2][3][4].
Traditional adaptive traffic light control methods [2,3] can achieve local optimization by adapting to a single intersection based on RL. Beyond that, global optimization is needed to achieve dynamic multi-intersection control in a large smart city infrastructure. Multi-agent reinforcement learning (MARL) is increasingly being used to study these more complex traffic light control problems [17][18][19].
Although existing methods have effectively improved the efficiency of traffic signal control, they still have the following problems: (1) a shortage of communication between a traffic light and other traffic lights; (2) insufficient consideration of the limitations of network transmission bandwidth. The contributions of this paper are summarized as follows:
• We present an auto communication protocol (ACP) between agents in MARL based on the attention mechanism;
• We propose a multi-agent auto communication (MAAC) algorithm based on MARL and ACP for traffic light control;
• We build a practicable edge computing architecture for industrial deployment on the Internet of Things (IoT), considering the limitations of network transmission bandwidth;
• The experiments show that the MAAC framework outperformed baseline models by over 17%.
The remainder of this paper is organized as follows: Section 2 introduces related works including multi-agent system, RL, IoT, edge computing, and the basic concept of communication theory. Section 3 formulates the definition of the traffic light control problem. Section 4 details the MAAC model and our edge computing architecture for IoT. Section 5 conducts the experiments in a traffic simulation environment and demonstrates the results of the experiments with a comparison between our methods and others. Section 6 concludes the paper and discusses future work.

Related Work
Urban traffic signal control theory has been continuously investigated and developed for nearly 70 years, since the 1950s. However, from theory to practice, the goal of alleviating urban traffic congestion through the optimization and control of urban traffic signals involves very complex control problems. Urban traffic signal control allocates the duration of a signal cycle and the ratio of red to green time within that cycle. The control methods include fixed time [20], vehicle detection [21], and automatic control [22]. The fixed-time and vehicle-detection methods cannot adapt to dynamic changes in traffic flow and complex road conditions. The automatic control methods are difficult to implement due to their high algorithmic complexity.
A multi-agent system is an important branch of distributed AI research, with the abilities of distribution, autonomy, coordination, learning, and reasoning [23]. In 1989, Durfee et al. [24] proposed the use of a negotiation mechanism to share tasks among multiple agents. In 2007, Marvin Minsky argued that human thought is constructed from multiple agents [25]. In 2016, Sukhbaatar et al. [18] and Hoshen et al. [19] observed all agents with a centrally controlled method in a local environment, and then output the probability distribution of the agents' joint actions. In 2017, Alibaba and University College London (UCL) proposed a bidirectional communication network between agents (BiNet) [26] and achieved good results in a StarCraft game mission. Following this trend in multi-agent systems, communication between agents has received increasing attention.
Adaptive traffic light control [2,3] is a relatively easy way to ease traffic congestion in a smart city. Although adaptive traffic light control methods have achieved local optimization by adapting to a single intersection based on RL (single-agent), a city has thousands of traffic lights. Thus, global traffic light optimization can be considered a multi-agent system, as studied by Chen et al. [27]. Moreover, the deployment structure must be taken into consideration for industrial deployment.
Here is a summary of the methods of traffic signal control (as shown in Table 1):

Table 1. The summary of the methods of traffic signal control.

Method: Fixed time [20].
Pros: Easy to deploy and implement; still the mainstream method today.
Cons: Unable to dynamically adapt to changes at the intersection.

Method: Optimize one traffic light [2,3].
Pros: The traffic signal can be adjusted according to dynamic changes in the intersection situation.
Cons: Urban traffic signals actually comprise multiple intersections, and the local optimization of a single intersection cannot represent the overall optimization of multiple intersections.

Method: Optimize multiple traffic lights [27].
Pros: Global optimization of traffic signals at multiple intersections in a city.
Cons: Difficult to implement and deploy, and the algorithm still has room for optimization, such as considering multi-agent communication.

Single-Agent Reinforcement Learning
Single-agent reinforcement learning trains one agent, which chooses a series of actions to gain reward while interacting with an environment. The aim of the algorithm is to learn an optimal policy that gains the agent maximal reward. At each time step t, the agent interacts with the environment to maximize the total reward R_T, where T is the total number of time steps in an episode. The rewards obtained after each action is performed are accumulated:

R_T = Σ_{t=1}^{T} R_t

The RL algorithm, characterized as data-driven, self-learning, and model-free, is considered a practical method for solving the traffic light control problem [28,29]. As shown in Figure 1, the traffic signal is regarded as an "agent" with decision-making ability at the intersection. By observing the real-time traffic flow, the agent obtains the current traffic state S_t and the reward R_t. According to the current state, the agent selects and executes the corresponding action (change the lights or keep them). Then, the agent observes the effect of the action on the intersection traffic to obtain the new traffic state S_t+1 and the new reward R_t+1. The agent evaluates the action it just executed and optimizes its strategy until converging to the optimal state-action mapping.
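The observe-act-learn loop above can be sketched as minimal tabular Q-learning (one of the baselines discussed later [46]). The integer state encoding and the `step_fn` environment callable are stand-ins for the real intersection sensors, not part of the paper:

```python
import random
from collections import defaultdict

def run_episode(q_table, n_actions, step_fn, epsilon=0.1,
                alpha=0.1, gamma=0.9, max_steps=100):
    """One episode of the observe -> act -> learn loop from Figure 1.

    step_fn(state, action) -> (next_state, reward, done) stands in for
    the intersection environment; the real S_t would come from traffic
    observations.
    """
    state, total_reward = 0, 0.0
    for _ in range(max_steps):
        # epsilon-greedy action selection: change the lights or keep them
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: q_table[(state, a)])
        next_state, reward, done = step_fn(state, action)
        total_reward += reward  # accumulate R_T over the episode
        # move the Q estimate toward the one-step TD target
        best_next = max(q_table[(next_state, a)] for a in range(n_actions))
        q_table[(state, action)] += alpha * (
            reward + gamma * best_next - q_table[(state, action)])
        state = next_state
        if done:
            break
    return total_reward
```

The `q_table` is any mapping from (state, action) pairs to values; a `defaultdict(float)` gives the usual all-zero initialization.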

Multi-Agent Reinforcement Learning
All agents apply their actions to the environment to obtain the overall reward. From this perspective, we define a multi-agent reinforcement learning (MARL) environment as a tuple (X_1, ..., X_n; A_1, ..., A_n), where X_m is any given agent and A_m is any given action; the new state of the environment is the result of the set of joint actions (A_1, A_2, ..., A_n). In other words, the complexity of MARL scenarios increases with the number of agents in the environment (as shown in Figure 2).
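A minimal sketch of one MARL step under these definitions; the `policies` and `transition` callables are hypothetical stand-ins for the agents and the environment:

```python
from typing import Callable, List, Tuple

def joint_step(state: Tuple[int, ...],
               policies: List[Callable[[Tuple[int, ...]], int]],
               transition: Callable[[Tuple[int, ...], Tuple[int, ...]],
                                    Tuple[Tuple[int, ...], List[float]]]):
    """One MARL step: every agent picks an action from the shared state,
    and the environment transitions on the *joint* action (a_1, ..., a_n),
    returning per-agent rewards."""
    joint_action = tuple(pi(state) for pi in policies)
    next_state, rewards = transition(state, joint_action)
    return joint_action, next_state, rewards
```

Note that the joint action space has size |A|^n for n agents with |A| actions each, which is one concrete sense in which complexity grows with the number of agents.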

Urban Traffic Environment
Urban traffic signal control can be seen as a typical multi-agent system. With the traditional adaptive traffic signal control method based on RL, signal control has been expanded from one intersection to multiple intersections.
MARL mainly studies the cooperative and coordinated control actions across the states of multiple intersections, extending the single-agent RL algorithm to multiple agents in the urban traffic environment. MARL-based methods in this field fall into three categories [30]: (1) completely independent MARL at each intersection; (2) MARL cooperating on some states from the intersections; (3) MARL over all states from the intersections.
The collaboration mechanism is an important part of MARL in traffic signal control at multiple intersections. Each agent could estimate the action probability model of other agents without real-time computing, but it was still difficult to update the estimation model in a dynamic environment [31].

Attention Mechanism
The attention mechanism has recently been widely used in various fields of deep learning (for example, image processing [32] and natural language processing [12,33]), achieving good results. Conceptually, attention imitates human cognition: it selectively filters out a small amount of important information and focuses on it, ignoring most of the unimportant information. This information selection process is reflected in the calculation of information weight coefficients.
The calculation is divided into three steps (as shown in Figure 3): (1) calculate the similarity or correlation between the Query and each Key:

Similarity(Query, Key_i) = Query · Key_i    (1)

(2) normalize the results of step (1) to obtain the weighting coefficients; (3) use the weighting coefficients to perform a weighted sum over the Values. Figure 3. The method of computing attention.
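The three steps can be sketched in plain Python, with dot-product similarity as in Equation (1):

```python
import math

def attention(query, keys, values):
    """Three-step attention: dot-product similarity, softmax
    normalization, weighted sum of the values (all plain lists)."""
    # Step 1: similarity between the Query and each Key (dot product)
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Step 2: normalize the scores into weights with softmax
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    weights = [e / sum(exps) for e in exps]
    # Step 3: weighted sum over the Values
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values))
            for d in range(dim)]
```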

Cloud and Edge Computing
Cloud computing is a paradigm for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services), which can be rapidly provisioned and released with minimal management efforts or service provider interaction.
Alongside cloud computing, edge computing has emerged as a novel computing paradigm, shifting from centralized to decentralized processing [40]. Compared with conventional cloud computing, it provides shorter response times and better reliability. To save bandwidth and reduce latency, more data is processed at the edge rather than uploaded to the cloud, so users' mobile devices can complete part of the workload at the edge of the network. Similarly, in modern transportation, edge devices can be deployed at roadsides and in vehicles for better communication and control between connected objects.


Multi-Agent Communication Model
The multi-agent communication model (as shown in Figure 5) follows the Shannon communication model [41]: the perception and behavior of agents can be modeled as information reception and transmission. Each agent acts as a communication transceiver, encoding and decoding its internal structural information, while the environment serves as the communication channel between the agents. In practical modeling, a continuous matrix is generally used for multi-agent communication [18,42].

Shannon Communication Model
The basic problem of communication is to reproduce a message sent from one point to another point. In 1948, Shannon proposed the Shannon communication model [41], which represented the beginning of modern communication theory. Shannon communication model is a linear communication model, consisting of six parts: sender, encoder, channel, noise, decoder, and receiver, as shown in Figure 6:

Communications Protocol
The communication protocol [43], also called the transmission protocol, defines the agreed rules by which both parties carry out end-to-end information transmission so that each party can understand the received information. A communication protocol is mainly composed of syntax, semantics, and timing. The syntax covers the data format, encoding, and signal levels; the semantics represent the data content, including control information; and the timing governs rate matching and the sequencing of communications.
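As an illustration only, the three components might map onto a message structure like the following (all field names are hypothetical, not from the paper): the type annotations fix the syntax, the payload carries the semantics, and the sequence number and timestamp provide the timing.

```python
from dataclasses import dataclass

@dataclass
class AgentMessage:
    """Illustrative agent-to-agent protocol message."""
    sender_id: int      # syntax: fixed field layout and types
    seq: int            # timing: sequencing of communications
    timestamp: float    # timing: rate matching
    payload: tuple      # semantics: control information (e.g., a message vector)
```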

Problem Definition
We consider the problem of multi-agent traffic signal control as a Markov Decision Process (MDP) <X, π, R, γ>, where X is the state of all intersections, π is the policy that generates actions, R is the reward from all intersections, and γ is the discount factor. Furthermore, we define: Agent_i (i ∈ N) is the agent that controls the change (duration) of one traffic light; π_i is the set of acceptable traffic light duration control strategies of Agent_i; R_i is the reward the environment gives Agent_i for the level of traffic congestion at its intersection, given the other agents (Agent_-i with policies π_-i); it can be calculated from specific indicators such as vehicle queue length, where a lower congestion level yields a greater reward. C = (c_1, ..., c_N) is the communication matrix between agents. The objective function of our multi-agent traffic signal control problem is then:

max_{π_i} E[ Σ_t γ^t R_i(X_t, π_i, π_-i, C) ]    (4)

In Equation (4), π_i is the policy of Agent_i, π_-i is the policy of Agent_-i, and t is the time step. The problem is to find a strategy that maximizes the value of this objective.
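The inner sum in Equation (4) is a discounted cumulative reward; as a small worked helper (assuming a finite reward stream for one agent):

```python
def discounted_return(rewards, gamma):
    """Discounted return sum_t gamma^t * R_t for one agent's
    finite reward stream (t starting at 0)."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g
```

For example, with rewards [1, 1, 1] and γ = 0.5, the return is 1 + 0.5 + 0.25 = 1.75.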

Methodology
In this section, we will detail the multi-agent auto communication (MAAC) model and an edge computing architecture for IoT.

MAAC Model
In the MAAC model (as shown in Figure 7), each agent can be modeled as a decentralized partially observable Markov decision process (Dec-POMDP). The strategy of each agent (π_θ) is generated by a neural network. At each time step, the agent observes the local environment x_t and the communication information sent by other agents (c_1, ..., c_i-1, c_i+1, ..., c_N). By combining this time-series information, the agent generates the next action a_t+1 and the next communication message c_t+1 through its internal processing mechanism (with parameters θ_i).
The joint actions of all agents (a_1, ..., a_N) interact with the environment so as to maximize the centralized value function V(θ) = E[R]. The MAAC algorithm improves the neural network parameter set θ_i of each agent by optimizing this central value function. The overall architecture of the MAAC model can be regarded as a distributed MARL model with automatic communication capabilities.

Internal Communication Module in Agent
The internal communication module (ICM) in an agent is an important part in MAAC model (as shown in Figure 8).
Each agent is divided into two sub-modules: a receiving end and a sending end. The receiving end receives the information of other agents, processes it with the attention mechanism, and then passes the processed information to the sending end; the sending end observes the external environment and uses the attention-processed information to generate new information with a neural network.

Receiving End
Agent_i uses the attention mechanism to filter the information received from the other agents (Agent_-i). After receiving the information of Agent_-i, it forms the combined message matrix C = (c_1, ..., c_N), then picks out the important messages and ignores the unimportant ones. Herein, we introduce the parameter matrices W_q, W_k, and W_v, which project each message into a query, key, and value separately (and can be computed in parallel):

q_i = W_q c_i,  k_i = W_k c_i,  v_i = W_v c_i

Then, we calculate the information weight α_i = softmax(q_i · k_i). Finally, we obtain the weighted information after information selection:

Ĉ = Σ_i α_i v_i
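A pure-Python sketch of the receiving end under these equations, with matrices as nested lists; the shapes and the plain linear projections are illustrative assumptions:

```python
import math

def matvec(W, x):
    """Matrix-vector product for nested-list matrices."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def receive(messages, Wq, Wk, Wv):
    """Receiving end of the ICM: project each incoming message c_i into
    a query, key, and value, score it, and return the weighted
    information C_hat = sum_i alpha_i * v_i."""
    qs = [matvec(Wq, c) for c in messages]
    ks = [matvec(Wk, c) for c in messages]
    vs = [matvec(Wv, c) for c in messages]
    # alpha_i = softmax(q_i . k_i): per-message importance weight
    scores = [sum(a * b for a, b in zip(q, k)) for q, k in zip(qs, ks)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    alphas = [e / sum(exps) for e in exps]
    dim = len(vs[0])
    return [sum(a * v[d] for a, v in zip(alphas, vs)) for d in range(dim)]
```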

Sending End
The sending end of the agent receives the information of other agents processed by the attention mechanism of the receiving end, Ĉ, and, together with the observed local environment x_t, generates the next action a_t+1 and the communication information c_t+1 through a neural network.

MAAC Algorithm
At time t in the MAAC model, the environment input is X_t = (x_t^1, ..., x_t^N) and the corresponding communication input is C_t = (c_t^1, ..., c_t^N). The agents (Agent_1, ..., Agent_N) interact with one another, each handling information internally with a receiver and a transmitter. The receiver takes the agent's own environmental information x_t and the communication information c_t, and generates the action and outgoing message pair (a_t+1, c_t+1) for time t+1. The MAAC model collects all agent actions to form a joint action (a_1, ..., a_N), interacts with the environment, and optimizes the objective strategy for each agent.
The calculation steps of MAAC at time t are shown in Figure 9. In the MAAC algorithm (as shown in Algorithm 1), the parameter set of each Agent_i is θ_i, which is divided into the sender parameters θ_i^Sender and the receiver parameters θ_i^Receiver. Both are optimized by the overall multi-agent objective function, iteratively updating the parameter sets of the receiver and the sender in the communication module of each agent:

1. Receiver of Agent_i: uses the attention mechanism to generate the communication matrix Ĉ_t.
2. Sender of Agent_i: chooses an action a_t+1^i from the policy selection network, or randomly chooses an action (e.g., ε-greedy exploration).
3. Sender of Agent_i: generates its own outgoing information through the receiver's communication matrix.
4. Collect the joint actions of all agents and execute a_t+1^1, ..., a_t+1^N; obtain the reward R_t+1 and the next state X_t+1 from the environment.
5. Update the strategic value function of each agent.
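The per-timestep control flow of this algorithm can be sketched with placeholder callables; `receiver`, `sender`, and `env_step` are hypothetical stand-ins for the neural networks and the simulator, not the paper's implementation:

```python
def maac_step(agents, X_t, C_t, env_step):
    """One MAAC timestep: each agent's receiver attends over the
    messages from the other agents, its sender emits (action, message),
    then the joint action is applied to the environment.

    agents: list of (receiver, sender) callables.
    env_step(joint_action) -> (X_next, R_next).
    """
    actions, messages = [], []
    for i, (receiver, sender) in enumerate(agents):
        others = C_t[:i] + C_t[i + 1:]          # messages from Agent_-i
        c_hat = receiver(others)                # attention over messages
        a_next, c_next = sender(X_t[i], c_hat)  # act and speak
        actions.append(a_next)
        messages.append(c_next)
    X_next, R_next = env_step(actions)          # joint action hits the env
    return X_next, messages, R_next
```

The parameter update (step 5) is omitted here, since it depends on the chosen value-function learner.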

Edge Computing Structure
In order to deploy the MAAC algorithm in an industrial-scale environment, we must take network delay into consideration. We propose an edge computing architecture with a node near every traffic light. An edge computing device needs to provide the following functions: (1) it can detect vehicles' information (location, direction, velocity) from the surveillance video of its intersection in real time and record this information; (2) it can run the traffic signal control algorithm to control the nearby traffic light (see Figure 10).
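A minimal sketch of one such edge node, assuming hypothetical `detect_fn` and `policy_fn` callables in place of the real video detector and control algorithm; the point is that only local records and control decisions are produced on-device, so raw video never crosses the network:

```python
class EdgeNode:
    """Sketch of one roadside edge device."""

    def __init__(self, detect_fn, policy_fn):
        self.detect_fn = detect_fn  # real-time vehicle detector (stand-in)
        self.policy_fn = policy_fn  # traffic signal control algorithm (stand-in)
        self.records = []           # local log of (location, direction, velocity)

    def tick(self, frame):
        # (1) detect vehicles from the surveillance video and record them
        vehicles = self.detect_fn(frame)
        self.records.append(vehicles)
        # (2) run the control algorithm locally and return the signal action
        return self.policy_fn(vehicles)
```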

Experiments
In this section, we first build the urban traffic simulator based on our edge computing architecture. Then, we apply the MAAC algorithm and the baseline algorithms to the simulation environment to compare the performance of all models.

Simulation Environment and Settings
We use CityFlow [44], an open-source traffic simulator, as our experiment environment. We assume that there are six traffic lights (intersection nodes, i.e., edge computing nodes) in one section of a city (as shown in Figure 11). We control the traffic lights dynamically through the CityFlow Python interface at runtime. The settings of our experiments are shown in Table 2.

Simulation Parameter           Value
Road length                    350 m
Vehicle speed limit            30 km/h
Traffic control timing cycle   g_t = 20, r_t = 20, y_t = 5 (s)
Episode length                 900 s
Vehicles per episode           approx. 400

• The directions: one traffic light at node_0 has four neighbor nodes (node_1, node_2, node_3, node_4), four entries (in), and four exits (out). The road length is set to 350 m and the vehicle speed limit to 30 km/h.
• Traffic light agent: we run the traffic signal control algorithm in a Docker container [45].
• Vehicle simulation setting: we assume vehicles arrive at the road entrances according to a Bernoulli process with probability P_in = 1/15 at each intersection. Every vehicle has a random destination node other than its entry node (we set random(seed) = 7). In one episode, there are approximately 400 vehicles.
• Hyper-parameter setting: the learning rate is set to 0.001; γ is set to 0.992; the reward is the average waiting time at the intersection.
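The Bernoulli arrival process above can be sketched as follows, under the assumption (ours, for illustration) that one arrival trial is made per second of the episode at each entrance:

```python
import random

def bernoulli_arrivals(episode_seconds=900, p_in=1.0 / 15.0, seed=7):
    """Arrival times at one entrance: each second, a vehicle arrives
    with probability p_in (a Bernoulli process)."""
    rng = random.Random(seed)
    return [t for t in range(episode_seconds) if rng.random() < p_in]
```

With P_in = 1/15 over a 900 s episode, each entrance sees about 900/15 = 60 arrivals in expectation; the roughly 400 vehicles per episode then accumulate over the several entrances of the network.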

Baseline Methods
• Fixed-time: all traffic light timings are set to the fixed traffic control timing cycle mentioned above.
• Q-learning (Center): the Q-learning algorithm [46] is deployed in a central (Docker) container to generate traffic light control actions. The delay from the traffic light agent to an intersection is set to 1.0 s.
• Q-learning (Edge): the Q-learning algorithm [46] is deployed on an edge device (Docker) to generate traffic light control actions. The delay from the traffic light agent to an intersection is set to 0.1 s.
• Nash Q-learning: Nash Q-learning [47] extends Q-learning to non-cooperative MARL. Each agent maintains Q-functions over joint actions and performs updates assuming Nash equilibrium behavior over the current Q-values.

Evaluation
The time from when a vehicle enters an entry of an intersection until it passes through is defined as t_m, where m = 1, ..., M indexes the vehicles. In the simulations, we record this time for all vehicles at every intersection in every episode. Finally, we accumulate all recorded times over all intersections and evaluate the traffic network by the average travel time

t̄ = (1 / (E · I · M)) Σ_e Σ_i Σ_m t_m

where E is the number of episodes, M is the number of vehicles, and I is the number of intersections that every vehicle passes through.
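The metric above can be computed as a flat average over all recorded travel times; the nested-list layout (episodes, then intersections, then per-vehicle times) is a representation we choose for illustration:

```python
def average_travel_time(records):
    """Average per-vehicle travel time over all episodes and
    intersections. records[e][i] is the list of travel times t_m
    recorded at intersection i in episode e."""
    total, count = 0.0, 0
    for episode in records:
        for intersection in episode:
            total += sum(intersection)
            count += len(intersection)
    return total / count
```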

Results
We applied five methods: the fixed-time method [20], Q-learning (Center) [46], Q-learning (Edge) [46], Nash Q-learning [47], and our MAAC method. All were trained for 1000 episodes in CityFlow [44] on the edge computing architecture we designed. As shown in Figure 12, the algorithms converged at around episode 600, and the MAAC method converged fastest in training compared with the other models. After training, we tested the algorithms for 500 episodes. As shown in Table 3, MAAC performed the best among the traffic signal control algorithms. As shown in Table 4, our method did not sacrifice the waiting time of some intersections to ensure overall performance; furthermore, the performance of every intersection was improved to a different degree. Comparing Q-learning (Center) with Q-learning (Edge), we can see that the edge computing structure reduces the network delay of the deployment environment.
As shown in Table 5, the delay time and delay rate of Q-learning (Center) are the highest, which indicates that the edge computing structure we propose is useful for reducing network delay. MAAC still outperforms the others when the network delay time is taken into account.

Conclusions
In this work, we proposed a multi-agent auto communication (MAAC) algorithm based on multi-agent reinforcement learning (MARL) and an auto communication protocol (ACP) between agents with the attention mechanism. We built a practicable edge computing structure for industrial deployment on the IoT, considering the limitations of network transmission bandwidth.
In the simulation environment, the experiments showed that the MAAC framework outperformed baseline models by over 17%. Moreover, the edge computing structure is useful for reducing network delay when deploying the algorithm at an industrial scale.
In future research, we will build a simulation environment much closer to the real world and take the communication from vehicle to traffic light into consideration to improve the MAAC method.