Article

Mean Field Multi-Agent Reinforcement Learning Method for Area Traffic Signal Control

Zundong Zhang, Wei Zhang, Yuke Liu and Gang Xiong
1 School of Electrical and Control Engineering, North China University of Technology, Beijing 100144, China
2 State Key Laboratory for Multimodal Artificial Intelligence Systems, Chinese Academy of Sciences, Beijing 100190, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(22), 4686; https://doi.org/10.3390/electronics12224686
Submission received: 15 October 2023 / Revised: 10 November 2023 / Accepted: 15 November 2023 / Published: 17 November 2023

Abstract:
Reinforcement learning is an effective method for adaptive traffic signal control in urban transportation networks. As training proceeds, it learns the optimal control strategy directly from interaction data and exploits the representational power of deep neural networks, thereby avoiding the limitations of traditional signal control methods. However, when applied to the sequential decision-making task of area-wide signal control, reinforcement learning suffers from the curse of dimensionality and the non-stationarity of the environment. To address these limitations of traditional reinforcement learning algorithms at multiple intersections, mean field theory is applied: the signal control problem over all intersections in a region is modeled as the interaction between each individual intersection and the average effect of its neighboring intersections. By decomposing the Q-function into bilateral estimates between an agent and its neighbors, this approach reduces the complexity of inter-agent interactions while preserving the global interactions between agents. A traffic signal control model based on Mean Field Multi-Agent Reinforcement Learning (MFMARL) was constructed, comprising two algorithms: Mean Field Q-Network Area Traffic Signal Control (MFQ-ATSC) and Mean Field Actor-Critic Network Area Traffic Signal Control (MFAC-ATSC). The model was validated on the SUMO simulation platform. The experimental results indicate that, across metrics such as average speed, the mean field reinforcement learning method outperforms classical signal control methods and several existing approaches.

1. Introduction

The issue of urban traffic congestion has become increasingly severe as urbanization accelerates, causing significant negative impacts on public transportation and society as a whole [1]. Reinforcement Learning (RL), as an adaptive control strategy, has the advantage of learning directly from observed data without making unrealistic assumptions about traffic models. Traffic signal control based on reinforcement learning first observes the traffic conditions, then generates and executes signal control actions, and finally adjusts its strategy based on feedback from the environment. Mikami and Kakazu [2] were the first to apply reinforcement learning to traffic signal control. However, reinforcement learning methods struggle with complex, continuous state spaces, which makes autonomous decision making difficult. With the development of reinforcement learning and deep learning techniques, researchers have combined the two into Deep Reinforcement Learning (DRL) methods [3,4]. Li et al. [5] used deep reinforcement learning to study the single-intersection control problem and made improvements in this area.
In regional coordinated control, the control scope extends to multiple intersections with close spatiotemporal correlations, and their signal control schemes interact with one another. The goal is to coordinate multiple intersection agents so that the network-wide strategy approaches the optimum [6,7]. However, as the number of intersections increases, the joint state-action space grows exponentially, and the cumulative noise introduced by the exploration of other agents makes learning the Q-value function extremely challenging; traditional reinforcement learning is therefore difficult to apply to multi-intersection signal control. One simple approach is to use a single global agent to control all intersections [8]. This method takes the global state as input and learns to select joint actions for all intersections, but it suffers from dimensionality explosion as the joint state-action space grows. Rasheed et al. [9] introduced a multi-agent DQN that learns the best joint actions through multi-agent cooperation and further mitigates the dimensionality problem in high-traffic and disturbed traffic network scenarios. Another approach adds a term to the loss function of each individual learner that minimizes the difference between the weighted sum of individual Q-values and the global Q-value, so that each agent accounts for the learning of the other agents. Tan et al. [10] achieved traffic signal control using multiple regional agents and a centralized global agent. Each regional agent learns its own policy and value function over a finite action set within its region, and the centralized global agent hierarchically aggregates the RL results from the regional agents to form the final Q-function over the entire large-scale traffic grid. A further line of research uses independent agents, each controlling a single intersection. Unlike joint-action control, each agent learns its strategy without access to the rewards of other agents. Independent learning can be divided into two categories: with and without communication among agents. Zheng et al. [11] had each agent in the road network observe only its own state and reward, with no communication among the agents. However, as the traffic network grows and the state becomes more complex, the non-stationarity of the environment becomes more pronounced, and learning without any communication or coordination mechanism can make it difficult to reach convergence. Methods that allow communication between agents enable them to act as a collective rather than as isolated individuals. Xu et al. [12] added traffic information from neighboring agents to each agent's state input, thereby facilitating communication among the agents.
However, the aforementioned methods are only suitable for regional signal control with a limited number of intersections. The Mean Field Multi-Agent Reinforcement Learning (MFMARL) [13] method used in this paper addresses these issues by applying a mean field approximation over the joint action space. Hu et al. [14] considered a multi-agent system with a nearly infinite number of agents, used mean field effects to approximate the influence of the other agents on a single agent, and derived a Fokker–Planck equation that describes the evolution of the distribution of Q-values in the agent population. The mean field approximation transforms the complex interactions among multiple intersections within a region into bilateral interactions between each intersection and its neighboring intersections. Each agent interacts directly with a finite set of neighbors, and any two agents interact indirectly through a finite chain of direct interactions. This reduces the complexity of inter-agent interactions while preserving the global interactions between any pair of agents. Consequently, the parameterization of the Q-value function becomes independent of the number of agents, effectively mitigating the exploration noise caused by other agents. The resulting method adapts the signal timing plans at intersections to the current traffic state, keeping traffic flowing smoothly and improving intersection capacity.

2. The Regional Traffic Signal Control Model Based on MFMARL

2.1. Stochastic Games

A multi-agent stochastic game $\Gamma$ is defined by the tuple $\Gamma \triangleq (S, A^1, \ldots, A^N, r^1, \ldots, r^N, p, \gamma)$, where $S$ is the state space and $A^j$ is the action space of agent $j \in \{1, \ldots, N\}$. The reward function of agent $j$ is $r^j: S \times A^1 \times \cdots \times A^N \to \mathbb{R}$. The transition probability $p: S \times A^1 \times \cdots \times A^N \to \Omega(S)$ describes the stochastic evolution of the state over time, where $\Omega(S)$ is the collection of probability distributions over the state space. The constant $\gamma \in [0, 1)$ is the reward discount factor. At each time step $t$, all agents take actions simultaneously, and the immediate reward $r^j_t$ received by each agent is based on the actions taken in the previous time step. Agents choose actions according to their policies. For agent $j$, the policy is defined as $\pi^j: S \to \Omega(A^j)$, where $\Omega(A^j)$ is the collection of probability distributions over agent $j$'s action space $A^j$. Let $\pi \triangleq [\pi^1, \ldots, \pi^N]$ denote the joint policy of all agents, where $\pi$ is assumed to be stationary (time-independent).
Starting from an initial state $s$, the value function of agent $j$ under the joint policy $\pi$ is the expected discounted return:
$$v^j_\pi(s) = v^j(s; \pi) = \sum_{t=0}^{\infty} \gamma^t\, \mathbb{E}_{\pi, p}\left[ r^j_t \mid s_0 = s, \pi \right]$$
Given the value function in Equation (1), the Q-function can be defined through the Bellman equation in the multi-agent setting. The Q-function $Q^j_\pi: S \times A^1 \times \cdots \times A^N \to \mathbb{R}$ of agent $j$ under the joint policy $\pi$ can then be written as:
$$Q^j_\pi(s, a) = r^j(s, a) + \gamma\, \mathbb{E}_{s' \sim p}\left[ v^j_\pi(s') \right]$$
where $s'$ is the state at the next time step. The value function $v^j_\pi$ can in turn be expressed through the Q-function in Equation (2):
$$v^j_\pi(s) = \mathbb{E}_{a \sim \pi}\left[ Q^j_\pi(s, a) \right]$$
The Q-function for multi-agent games in Equation (2) extends the single-agent formulation by considering the joint action of all agents, $a \triangleq [a^1, \ldots, a^N]$, and Equation (3) takes the expectation over this joint action. Representing Multi-Agent Reinforcement Learning (MARL) as a discrete, non-cooperative stochastic game, no agent knows the game rules or the reward definitions of the other agents; however, each agent can respond to the previous actions and immediate rewards of the others through observation.
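As a worked special case, assuming the joint policy factorizes into the individual policies defined above, the expectation over the joint action in Equation (3) for $N = 2$ agents expands as:
$$v^j_\pi(s) = \mathbb{E}_{a \sim \pi}\left[ Q^j_\pi(s, a) \right] = \sum_{a^1 \in A^1} \sum_{a^2 \in A^2} \pi^1\!\left(a^1 \mid s\right) \pi^2\!\left(a^2 \mid s\right) Q^j_\pi\!\left(s, a^1, a^2\right)$$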

2.2. Nash Q-Learning

In MARL, the goal of each agent is to find the optimal strategy that maximizes its value function, which depends on the joint policy of all agents. Nash equilibria of stochastic games are crucial in this context: they represent situations in which no agent can unilaterally improve its return by changing its strategy while the strategies of the other agents remain fixed. A joint strategy $\pi_* \triangleq [\pi^1_*, \ldots, \pi^N_*]$ is a Nash equilibrium if, for all $s \in S$, all $j \in \{1, \ldots, N\}$, and all valid $\pi^j$, the following holds:
$$v^j(s; \pi_*) = v^j\!\left(s; \pi^j_*, \pi^{-j}_*\right) \geq v^j\!\left(s; \pi^j, \pi^{-j}_*\right)$$
Here the compact notation $\pi^{-j}_* \triangleq [\pi^1_*, \ldots, \pi^{j-1}_*, \pi^{j+1}_*, \ldots, \pi^N_*]$ denotes the joint strategy of all agents except agent $j$.
In a Nash equilibrium, assuming the other agents adhere to $\pi^{-j}_*$, each agent plays the best response $\pi^j_*$ to the strategies of the others. For multi-agent stochastic games, at least one Nash equilibrium in stationary strategies always exists [13]. Given a Nash strategy $\pi_*$, the Nash value function $v^{\mathrm{Nash}}(s) \triangleq [v^1_{\pi_*}(s), \ldots, v^N_{\pi_*}(s)]$ gives the value obtained when all agents follow $\pi_*$ from the initial state.
Nash Q-learning is an iterative process that computes the Nash strategy by alternating two steps: (1) solve the Nash equilibrium of the current stage game defined by $Q_t$ using the Lemke–Howson algorithm [15]; (2) improve the estimate of the Q-function using the newly obtained Nash equilibrium value. The Nash Q-learning process can be written using the Nash operator:
$$\mathcal{H}^{\mathrm{Nash}} Q(s, a) = \mathbb{E}_{s' \sim p}\left[ r(s, a) + \gamma\, v^{\mathrm{Nash}}(s') \right]$$
where $r(s, a) \triangleq [r^1(s, a), \ldots, r^N(s, a)]$ and $Q \triangleq [Q^1, \ldots, Q^N]$. The Nash operator $\mathcal{H}^{\mathrm{Nash}}$ forms a contraction mapping.

2.3. Mean Field Estimation

The dimension of the joint action space grows exponentially with the number of agents. Because all agents act simultaneously and the value function is evaluated on the joint action, training a standard Q-function $Q^j(s, a)$ becomes extremely challenging. To address this issue, mean field multi-agent reinforcement learning decomposes the Q-function into local bilateral interactions:
$$Q^j(s, a) = \frac{1}{N^j} \sum_{k \in \mathcal{N}(j)} Q^j\!\left(s, a^j, a^k\right)$$
where $\mathcal{N}(j)$ is the index set of the neighbors of agent $j$, with size $N^j = |\mathcal{N}(j)|$. Decomposing the Q-function into bilateral estimates between an agent and its neighbors reduces the complexity of inter-agent interactions while still retaining the global interactions between any pair of agents. The bilateral interaction term $Q^j(s, a^j, a^k)$ in Equation (5) can be approximated using mean field theory. In a discrete action space, the action $a^j$ of agent $j$ is a one-hot encoding over $D$ possible actions, $a^j \triangleq [a^j_1, \ldots, a^j_D]$. Based on the neighbor set $\mathcal{N}(j)$ of agent $j$, the average action $\bar{a}^j$ is computed, and the one-hot action $a^k$ of each neighbor $k$ is expressed as the sum of $\bar{a}^j$ and a small perturbation $\delta a^{j,k}$:
$$a^k = \bar{a}^j + \delta a^{j,k}, \quad \text{where } \bar{a}^j = \frac{1}{N^j} \sum_k a^k$$
where $\bar{a}^j \triangleq [\bar{a}^j_1, \ldots, \bar{a}^j_D]$ is the empirical distribution of the actions taken by the neighbors of agent $j$. By Taylor's theorem, if the pairwise Q-function $Q^j(s, a^j, a^k)$ is twice differentiable with respect to the neighbor action $a^k$, it can be expanded as:
$$\begin{aligned} Q^j(s, a) &= \frac{1}{N^j} \sum_k Q^j\!\left(s, a^j, a^k\right) \\ &= \frac{1}{N^j} \sum_k \left[ Q^j\!\left(s, a^j, \bar{a}^j\right) + \nabla_{\bar{a}^j} Q^j\!\left(s, a^j, \bar{a}^j\right) \cdot \delta a^{j,k} + \frac{1}{2}\, \delta a^{j,k} \cdot \nabla^2_{\tilde{a}^{j,k}} Q^j\!\left(s, a^j, \tilde{a}^{j,k}\right) \cdot \delta a^{j,k} \right] \\ &= Q^j\!\left(s, a^j, \bar{a}^j\right) + \nabla_{\bar{a}^j} Q^j\!\left(s, a^j, \bar{a}^j\right) \cdot \left[ \frac{1}{N^j} \sum_k \delta a^{j,k} \right] + \frac{1}{2 N^j} \sum_k \delta a^{j,k} \cdot \nabla^2_{\tilde{a}^{j,k}} Q^j\!\left(s, a^j, \tilde{a}^{j,k}\right) \cdot \delta a^{j,k} \\ &= Q^j\!\left(s, a^j, \bar{a}^j\right) + \frac{1}{2 N^j} \sum_k R^j_{s, a^j}\!\left(a^k\right) \\ &\approx Q^j\!\left(s, a^j, \bar{a}^j\right) \end{aligned}$$
where $R^j_{s, a^j}(a^k) \triangleq \delta a^{j,k} \cdot \nabla^2_{\tilde{a}^{j,k}} Q^j(s, a^j, \tilde{a}^{j,k}) \cdot \delta a^{j,k}$ is the Taylor remainder, with $\tilde{a}^{j,k} = \bar{a}^j + \epsilon^{j,k}\, \delta a^{j,k}$ and $\epsilon^{j,k} \in [0, 1]$. In Equation (7), the first-order term vanishes because $\sum_k \delta a^{j,k} = 0$, which follows from Equation (6). Through the mean field approximation, the bilateral interactions $Q^j(s, a^j, a^k)$ between agent $j$ and each neighbor $k$ are reduced to an interaction between the central agent $j$ and a virtual average agent abstracted from the collective effect of all of agent $j$'s neighbors. The interaction is thus simplified to the mean field Q-function $Q^j(s, a^j, \bar{a}^j)$ expressed in Equation (8). During training, as experiences $e = (s, \{a^k\}, r^j, s')$ are collected, the mean field Q-function is updated iteratively as follows:
$$Q^j_{t+1}\!\left(s, a^j, \bar{a}^j\right) = (1 - \alpha)\, Q^j_t\!\left(s, a^j, \bar{a}^j\right) + \alpha \left[ r^j + \gamma\, v^j_t(s') \right]$$
where $\alpha$ is the learning rate and $\bar{a}^j$ is the average action of all neighbors of agent $j$ as defined in Equation (6). The mean field value function $v^j_t(s)$ for agent $j$ in Equation (9) is defined as:
$$v^j_t(s) = \sum_{a^j} \pi^j_t\!\left(a^j \mid s, \bar{a}^j\right) \, \mathbb{E}_{\bar{a}^j(a^{-j}) \sim \pi^{-j}_t}\!\left[ Q^j_t\!\left(s, a^j, \bar{a}^j\right) \right]$$
As seen from Equations (9) and (10), the mean field approximation transforms the multi-agent reinforcement learning problem into finding the best strategy $\pi^j_t$ for the central agent $j$ given the average action $\bar{a}^j$ of all its neighbors. MFMARL computes the best strategy for agent $j$ iteratively. In the stage game induced by $Q_t$, the average action $\bar{a}^j$ is obtained by averaging the actions $a^k$ of agent $j$'s $N^j$ neighbors, where each $a^k$ is sampled from the policy $\pi^k_t$ conditioned on the average action $\bar{a}^k_-$ of the previous time step:
$$\bar{a}^j = \frac{1}{N^j} \sum_k a^k, \quad a^k \sim \pi^k_t\!\left(\cdot \mid s, \bar{a}^k_-\right)$$
After the average action $\bar{a}^j$ has been computed according to Equation (11), the strategy $\pi^j_t$ is updated based on the current average action $\bar{a}^j$. The new strategy is obtained from the Boltzmann distribution:
$$\pi^j_t\!\left(a^j \mid s, \bar{a}^j\right) = \frac{\exp\!\left( \beta\, Q^j_t\!\left(s, a^j, \bar{a}^j\right) \right)}{\sum_{a^{j\prime} \in A^j} \exp\!\left( \beta\, Q^j_t\!\left(s, a^{j\prime}, \bar{a}^j\right) \right)}$$
Through the iterative updates described by Equation (11) and Equation (12), the average actions and strategies of all agents are progressively refined. In Equation (12), $\beta$ is the temperature parameter: the agent selects actions according to the learned Q-values through the Boltzmann distribution, with the degree of exploration controlled by $\beta$.
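As a concrete illustration of the mean field update, the following minimal NumPy sketch computes the average one-hot action of a hypothetical neighbor set and samples a new action from the resulting Boltzmann distribution. The array shapes, the Q-values, and the helper names (mean_neighbor_action, boltzmann_policy) are illustrative assumptions, not the networks used in the paper.

```python
import numpy as np

def mean_neighbor_action(neighbor_actions, n_actions):
    """Average of the neighbors' one-hot actions: a_bar^j = (1/N^j) * sum_k a^k."""
    one_hot = np.eye(n_actions)[neighbor_actions]   # shape (N^j, D)
    return one_hot.mean(axis=0)                     # empirical action distribution, shape (D,)

def boltzmann_policy(q_values, beta):
    """Boltzmann distribution over actions given Q^j(s, a^j, a_bar^j) for each a^j."""
    logits = beta * q_values
    logits -= logits.max()                          # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy example: 4 discrete phases (D = 4); three neighbors chose phases 0, 2, 2.
rng = np.random.default_rng(0)
a_bar = mean_neighbor_action([0, 2, 2], n_actions=4)   # -> [0.33, 0.0, 0.67, 0.0]
q_row = np.array([1.0, 0.2, 0.8, -0.5])                # assumed Q^j(s, ., a_bar) values
pi = boltzmann_policy(q_row, beta=2.0)
action = rng.choice(4, p=pi)                            # sampled action a^j
print(a_bar, pi.round(3), action)
```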

3. Introduction to the MFMARL-ATSC Model Algorithm

In the Mean Field Multi-Agent Reinforcement Learning Area Traffic Signal Control (MFMARL-ATSC) model, the intersection signal control problem is abstracted as a process where agents choose optimal strategies based on regional road network state information. Agents use both dynamic and static traffic information within the road network to select actions from the action space. They record the rewards obtained after executing these actions. Based on the rewards obtained for different actions, agents calculate Q-values, guiding the selection of actions in the next step to achieve the maximum expected outcome.
This model employs a neural network as a function approximator to realize the mean field Q-function in Equation (8), where the Q-function is parameterized by the neural network weights denoted as φ . The update rules in Equation (9) can also be interpreted as adjustments to the neural network weights. As illustrated in Figure 1, the modeling capacity of a convolutional neural network is utilized to compute Q-values for state-action pairs. When fitting Q-values using a convolutional neural network, the state information serves as the input to the network. After processing through the convolutional neural network, the Q-value for the action taken in that state is obtained.
MFMARL-ATSC uses the standard Q-function to solve the problem in a discrete action space; this variant is MFQ-ATSC, shown in Figure 2, where MF stands for mean field. In the MFQ-ATSC algorithm, the agent is trained by minimizing the loss function:
$$L(\phi^j) = \left( y^j - Q_{\phi^j}\!\left(s, a^j, \bar{a}^j\right) \right)^2$$
where $y^j = r^j + \gamma\, v^{\mathrm{MF}}_{\bar{\phi}^j}(s')$ is the target mean field value and $\bar{\phi}^j$ denotes the target network parameters. Differentiating the loss gives the parameter gradient:
$$\nabla_{\phi^j} L(\phi^j) = \left( y^j - Q_{\phi^j}\!\left(s, a^j, \bar{a}^j\right) \right) \nabla_{\phi^j} Q_{\phi^j}\!\left(s, a^j, \bar{a}^j\right)$$
The algorithm flow of MFQ-ATSC is as follows (Algorithm 1):
Algorithm 1: MFQ-ATSC
1: Initialize $Q_{\phi^j}$, $Q_{\bar{\phi}^j}$, and $\bar{a}^j$ for all $j \in \{1, 2, \ldots, N\}$.
2: while training is not finished yet do
3:  for m = 1, …, M do
4:   For each intersection agent j, sample an action $a^j$ according to Equation (13), based on the current average action $\bar{a}^j$ and the exploration probability;
5:   For each intersection agent j, compute the new average action $\bar{a}^j$ from the sampled actions of its neighbors;
6:  end
7:  Execute the joint action $a = [a^1, \ldots, a^N]$, receive the rewards $r = [r^1, \ldots, r^N]$, and observe the next traffic state $s'$;
8:  Store $(s, a, r, s', \bar{a})$ in the replay buffer $\mathcal{D}$, where $\bar{a} = [\bar{a}^1, \ldots, \bar{a}^N]$;
9:  for j = 1, …, N do
10:   Sample a minibatch of $K$ experiences $(s, a, r, s', \bar{a})$ from the replay buffer $\mathcal{D}$;
11:   Set $\bar{a}^j_- \leftarrow \bar{a}^j$ and select the action $a^j_-$ from the target network $Q_{\bar{\phi}^j}$;
12:   Set $y^j = r^j + \gamma\, v^{\mathrm{MF}}_{\bar{\phi}^j}(s')$ according to Equation (11);
13:   Update the Q-network by minimizing the loss $L(\phi^j) = \frac{1}{K} \sum \left( y^j - Q_{\phi^j}(s^j, a^j, \bar{a}^j) \right)^2$;
14:   end
15:  For each intersection agent j, update the target network parameters with learning rate $\tau$: $\bar{\phi}^j \leftarrow \tau \phi^j + (1 - \tau)\, \bar{\phi}^j$;
16: end
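To make the target computation in Algorithm 1 (steps 12–13) concrete, the following NumPy sketch evaluates the mean field target $y^j = r^j + \gamma\, v^{\mathrm{MF}}_{\bar{\phi}^j}(s')$ with a Boltzmann-weighted expectation over a tabular target Q-row standing in for the target network; all numerical values and the helper name mf_target are illustrative assumptions.

```python
import numpy as np

def boltzmann(q_values, beta=1.0):
    """Boltzmann probabilities over discrete actions."""
    z = beta * q_values - np.max(beta * q_values)
    p = np.exp(z)
    return p / p.sum()

def mf_target(reward, q_target_next, beta, gamma):
    """y^j = r^j + gamma * v^MF(s'), where v^MF is the Boltzmann-weighted
    expectation of the target Q-values at the next state (Algorithm 1, step 12)."""
    pi_next = boltzmann(q_target_next, beta)
    v_mf = np.dot(pi_next, q_target_next)
    return reward + gamma * v_mf

# Toy values for one agent with 4 discrete actions; all numbers are assumptions.
q_target_next = np.array([0.4, 0.1, 0.9, 0.3])    # target-network Q(s', ., a_bar') row
y = mf_target(reward=-2.0, q_target_next=q_target_next, beta=2.0, gamma=0.9)
q_current = 0.25                                   # Q_phi(s, a^j, a_bar^j) for the taken action
loss = (y - q_current) ** 2                        # squared TD error used in the MFQ-ATSC loss
print(round(y, 3), round(loss, 3))
```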
By parameterizing the policy explicitly with neural network weights $\theta$, the Boltzmann strategy used in MFQ-ATSC can be replaced with a learned policy network. This yields the online actor-critic model MFAC-ATSC, shown in Figure 3. The policy network (actor) is trained with the following policy gradient:
$$\nabla_{\theta^j} \mathcal{J}(\theta^j) \approx \nabla_{\theta^j} \log \pi_{\theta^j}(s)\, Q_{\phi^j}\!\left(s, a^j, \bar{a}^j\right) \Big|_{a = \pi_{\theta^j}(s)}$$
In MFAC-ATSC, the critic follows the same configuration as in MFQ-ATSC, as described in Equation (13). During the training of MFAC-ATSC, $\theta$ and $\phi$ are updated continuously until convergence is achieved.
The algorithm flow of MFAC-ATSC is as follows (Algorithm 2):
Algorithm 2: MFAC-ATSC
1: Initialize $Q_{\phi^j}$, $Q_{\bar{\phi}^j}$, $\pi_{\theta^j}$, $\pi_{\bar{\theta}^j}$, and $\bar{a}^j$ for all $j \in \{1, 2, \ldots, N\}$.
2: while training is not finished yet do
3:  For each intersection agent, select the action $a^j = \pi_{\theta^j}(s)$ and compute the new average action $\bar{a} = [\bar{a}^1, \ldots, \bar{a}^N]$;
4:  Execute the joint action $a = [a^1, \ldots, a^N]$, receive the rewards $r = [r^1, \ldots, r^N]$, and observe the next time step's traffic state $s'$;
5:  Store $(s, a, r, s', \bar{a})$ in the replay buffer $\mathcal{D}$;
6:  for j = 1, …, N do
7:   Sample a minibatch of $K$ experiences $(s, a, r, s', \bar{a})$ from the replay buffer $\mathcal{D}$;
8:   Set $y^j = r^j + \gamma\, v^{\mathrm{MF}}_{\bar{\phi}^j}(s')$ based on Equation (11);
9:   Update the critic by minimizing the loss $L(\phi^j) = \frac{1}{K} \sum \left( y^j - Q_{\phi^j}(s^j, a^j, \bar{a}^j) \right)^2$;
10:   Update the actor network with the policy gradient calculated as follows:
       $$\nabla_{\theta^j} \mathcal{J}(\theta^j) \approx \nabla_{\theta^j} \log \pi_{\theta^j}(s)\, Q_{\phi^j}\!\left(s, a^j_-, \bar{a}^j_-\right) \Big|_{a = \pi_{\bar{\theta}^j}(s)}$$
11:  end
12:  Update the target network parameters of each intersection agent j with learning rates $\tau_\phi$ and $\tau_\theta$:
          $$\bar{\phi}^j \leftarrow \tau_\phi \phi^j + (1 - \tau_\phi)\, \bar{\phi}^j$$
          $$\bar{\theta}^j \leftarrow \tau_\theta \theta^j + (1 - \tau_\theta)\, \bar{\theta}^j$$
13: end
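As a rough sketch of the actor update in Algorithm 2 (step 10), the snippet below estimates the policy gradient for a single agent using a softmax policy that is linear in the state features; the feature dimension, the weight matrix W, the critic value, and the learning rate are illustrative assumptions rather than the policy network used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n_features, n_actions = 6, 4             # assumed state-feature and action dimensions
W = rng.normal(scale=0.1, size=(n_actions, n_features))  # actor parameters theta^j

def softmax_policy(W, s):
    """pi_theta(a | s) for a softmax policy with logits W @ s."""
    logits = W @ s
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()

def actor_gradient(W, s, action, q_value):
    """Policy-gradient estimate: grad_theta log pi(a|s) * Q(s, a^j, a_bar^j).
    For a softmax policy, the gradient of log pi(a|s) w.r.t. the logits is onehot(a) - pi."""
    pi = softmax_policy(W, s)
    grad_logits = -pi
    grad_logits[action] += 1.0
    return np.outer(grad_logits, s) * q_value   # same shape as W

# One illustrative update step with made-up numbers.
s = rng.normal(size=n_features)           # current state features (assumed)
pi = softmax_policy(W, s)
a = rng.choice(n_actions, p=pi)           # a^j sampled from pi_theta(s)
q_sa = 0.7                                # assumed critic estimate Q_phi(s, a^j, a_bar^j)
lr = 1e-2
W += lr * actor_gradient(W, s, a, q_sa)   # gradient ascent on the actor objective
```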

4. Simulation Experiment Settings and Results Analysis

The experiments in this study were conducted using SUMO, an open-source microscopic and multi-modal traffic simulator. Real-time traffic state information was obtained through the Python API provided by SUMO, and the traffic signal actions were applied via the TraCI module.
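The following minimal sketch shows how such a SUMO/TraCI control loop can be set up; the configuration file name, the traffic-light ID, the decision interval, and the trivial phase choice are placeholders, not the MFQ-ATSC/MFAC-ATSC control logic.

```python
import traci  # SUMO's TraCI Python API (requires a local SUMO installation)

SUMO_CMD = ["sumo", "-c", "shijingshan.sumocfg"]   # hypothetical network configuration
TLS_ID = "J1"                                      # hypothetical signalized intersection ID

traci.start(SUMO_CMD)
try:
    for step in range(1200):                       # control horizon in simulation steps
        traci.simulationStep()                     # advance the simulation by one step
        if step % 15 == 0:                         # act on a fixed decision interval (assumed)
            # Observe: number of halting vehicles on each incoming lane of the intersection.
            lanes = traci.trafficlight.getControlledLanes(TLS_ID)
            queues = [traci.lane.getLastStepHaltingNumber(l) for l in set(lanes)]
            # Act: here we simply keep the current phase; an RL agent would choose instead.
            current_phase = traci.trafficlight.getPhase(TLS_ID)
            traci.trafficlight.setPhase(TLS_ID, current_phase)
finally:
    traci.close()
```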

4.1. Simulation Road Network

The research area selected was the urban road network in Shijingshan District, Beijing; its structure is shown in Figure 4a. The area extends north to Fushi Road, east to Bajiao East Street, south to Bajiao Road, and west to Yangzhuang East Street. It contains various road types, including main roads, secondary roads, and local roads, with road widths ranging from two to eight lanes in each direction. There are 22 intersections in the research road network, of which eight are signal-controlled. The subsequent optimization focuses exclusively on the signal-controlled intersections marked in Figure 4b.

4.2. Experimental Setup

4.2.1. State Definition

Decisions are made based on the current traffic state, so the design of the state is crucial [16]. The state designed in this study consists of two parts: local information and neighbor information. Local information includes the queue lengths of vehicles in each incoming lane of the current intersection, while neighbor information includes the queue lengths of vehicles in each incoming lane of all neighboring intersections. It is expressed in the following form:
$$s^j = \left[ q^j_1, \ldots, q^j_n, \; q^k_1, \ldots, q^k_n \right]$$
where $q^j_n$ and $q^k_n$ denote the queue length of vehicles in the $n$-th incoming lane of agent $j$ and of its neighboring agent $k$, respectively.
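A minimal sketch of assembling this state vector through TraCI is shown below; the intersection IDs, incoming-lane lists, and neighbor map are hypothetical.

```python
import traci

# Hypothetical topology: incoming lanes of each signalized intersection and its neighbors.
INCOMING_LANES = {"J1": ["J1_in_0", "J1_in_1"], "J2": ["J2_in_0", "J2_in_1"]}
NEIGHBORS = {"J1": ["J2"], "J2": ["J1"]}

def queue_length(lane_id):
    """Number of halting vehicles on a lane, used as the queue-length observation."""
    return traci.lane.getLastStepHaltingNumber(lane_id)

def build_state(agent_id):
    """s^j = [local queue lengths, queue lengths of all neighboring intersections]."""
    local = [queue_length(l) for l in INCOMING_LANES[agent_id]]
    neighbor = [queue_length(l)
                for k in NEIGHBORS[agent_id]
                for l in INCOMING_LANES[k]]
    return local + neighbor

# Example (inside a running TraCI session): state_j1 = build_state("J1")
```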

4.2.2. Action Definition

Agents need to select appropriate actions based on the traffic state and strategy to manage traffic flow. The MFQ-ATSC algorithm uses a standard Q-function to address problems in discrete action spaces, while the MFAC-ATSC algorithm utilizes an AC (Actor-Critic) model to handle continuous action spaces. Consequently, the two algorithms have different action space definitions.
1. Discrete Action Space
In the discrete action space, at each step the agent selects one phase and its duration from the following set of phase–duration combinations:
$$A = \left\{ NSS_l,\, NSS_s,\, NSL_l,\, NSL_s,\, WES_l,\, WES_s,\, WEL_l,\, WEL_s \right\}$$
They represent the following: South–North straight-through phase, lasting 30 s; South–North straight-through phase, lasting 15 s; South–North left-turn phase, lasting 25 s; South–North left-turn phase, lasting 15 s; East–West straight-through phase, lasting 30 s; East–West straight-through phase, lasting 15 s; East–West left-turn phase, lasting 25 s; East–West left-turn phase, lasting 15 s.
2. Continuous Action Space
In the continuous action space, at each step the agent decides whether to keep the current phase or switch to the next one. The action set is as follows:
$$A = \left\{ \mathrm{switch},\, \mathrm{keep} \right\}$$
They represent an immediate switch to the next phase and the maintenance of the current phase, respectively.
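A sketch of how such a switch/keep decision could be applied to a SUMO traffic light through TraCI is shown below; the cyclic phase indexing and the extension length are implementation assumptions.

```python
import traci

def apply_action(tls_id, action, extension=5.0):
    """Apply a 'switch'/'keep' decision to a SUMO traffic light via TraCI.
    'switch' advances immediately to the next phase of the signal program;
    'keep' extends the remaining duration of the current phase by `extension` seconds."""
    if action == "switch":
        program = traci.trafficlight.getAllProgramLogics(tls_id)[0]
        n_phases = len(program.phases)
        current = traci.trafficlight.getPhase(tls_id)
        traci.trafficlight.setPhase(tls_id, (current + 1) % n_phases)
    else:  # "keep"
        traci.trafficlight.setPhaseDuration(tls_id, extension)
```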

4.2.3. Reward Function

Agents adjust the probability distribution over actions based on the immediate rewards they receive, seeking to maximize the long-term reward [17]. The reward defined in this study is local to each intersection: each agent considers only the information within its own intersection. During training, although an agent knows the states of neighboring intersections and their average actions, it aims to maximize its own accumulated reward. The reward is defined as follows:
$$r = \sum_{n} \frac{1}{w_n}$$
where w n represents the waiting time for the nth incoming lane at the intersection.
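Assuming the reward takes the reciprocal-waiting-time form given above and that $w_n$ is read from TraCI's per-lane waiting-time query, a minimal sketch of the reward computation is shown below; the small epsilon guarding against zero waiting time is an implementation assumption, not part of the paper's definition.

```python
import traci

def intersection_reward(incoming_lanes, eps=1.0):
    """r = sum_n 1 / w_n over the incoming lanes of one intersection.
    `eps` avoids division by zero when a lane currently has no waiting vehicles (assumption)."""
    reward = 0.0
    for lane_id in incoming_lanes:
        w_n = traci.lane.getWaitingTime(lane_id)   # accumulated waiting time on this lane [s]
        reward += 1.0 / (w_n + eps)
    return reward
```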

4.3. Simulation Experiment Results Analysis

In this study, the MFMARL-based regional signal control method was compared with classical signal control methods. The selected baselines include fixed-timing control as well as the traditional deep reinforcement learning algorithms DQN [3] and A3C [18]. DQN and A3C differ significantly in approach and implementation: DQN is a value-based method that directly learns the value function of state-action pairs and typically relies on experience replay and a target network for stable training and convergence, whereas A3C combines policy and value learning and lets multiple workers interact with the environment in parallel, which improves training efficiency and introduces diversity. These methods served as control groups in the simulation experiments. Under fixed-timing control, the signal timing plans of the intersections in the region remained unchanged and were not adjusted in response to changes in the road network state. In contrast, under the mean field reinforcement learning and traditional deep reinforcement learning methods, the signal timing at each intersection was adjusted dynamically based on detectors placed in the road network. Each simulation iteration lasted 10,800 s, and 400 iterations were run in total. To mitigate potential biases introduced while the road network was loading, no algorithmic control was applied during the initial 1200 s. Figure 5 compares the reward values of the intersections in the region under the different control methods.
All the algorithms designed in this study utilize the same neural network structure. As depicted in Figure 5, it is evident that the performance significantly improved when employing the Mean Field Reinforcement Learning control method. Combining the performance of different algorithms across various intersections, it can be observed that the offline learning methods MFQ-ATSC and DQN exhibit faster convergence compared to the online learning methods MFAC-ATSC and A3C [19]. Transforming the multi-intersection traffic signal control problem within the region into interactions between individual intersections and neighboring intersections enhances the convergence speed of offline learning algorithms [20]. During the experience replay process, it allows for faster learning of the optimal policy from non-current strategy experiences [21] and leads to a more stable convergence of the optimal policy [22]. MFQ-ATSC starts converging after 100–150 iterations, while MFAC-ATSC still has a few intersections that have not reached convergence even after 400 iterations.
Based on Figure 4 and Figure 5 and Table 1, after 400 iterations the average loss time of MFAC-ATSC is reduced by 29.60% compared with fixed-timing control, and that of MFQ-ATSC by 34.50%. The large deviation in the J17 reward curve arises because J17 is only one agent among many, and its reward fluctuations cannot fully represent the signal-control performance of the entire area; judging from the simulation scenario, congestion, queuing, and sudden braking at the intersection controlled by J17 may have changed the local vehicle density of the road network. The same applies to J9. Furthermore, the control performance is better than that of the two traditional deep reinforcement learning methods. This demonstrates that using mean field reinforcement learning for regional traffic signal control can reduce the loss time of vehicles within the road network and thereby improve operational efficiency, as shown in Figure 6 and Figure 7.
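For reference, the reported loss-time reductions can be reproduced from the values in Table 1 (fixed-timing 60.21 s, MFQ-ATSC 39.44 s, MFAC-ATSC 42.39 s):

```python
# Loss-time reduction relative to fixed-timing control, using the values in Table 1.
fixed, mfq, mfac = 60.21, 39.44, 42.39
print(f"MFQ-ATSC:  {(fixed - mfq) / fixed:.2%}")   # -> 34.50%
print(f"MFAC-ATSC: {(fixed - mfac) / fixed:.2%}")  # -> 29.60%
```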

5. Conclusions

This article presents a regional traffic signal control method based on mean field reinforcement learning. It trains neural networks on real-time traffic state information to determine the optimal signal timing plan at each intersection. To address the challenges of extending existing multi-agent reinforcement learning algorithms to regional road networks, namely dimensionality explosion and environmental non-stationarity, mean field multi-agent reinforcement learning decomposes the Q-function through local bilateral interactions and transforms the regional signal control problem into finding the best strategy for the central agent under a mean field approximation. Two mean field multi-agent reinforcement learning algorithms, MFQ-ATSC and MFAC-ATSC, together with the existing DQN and A3C algorithms, were applied in simulation experiments on a real regional road network and compared with traditional fixed-timing control. The models were validated on the SUMO simulation platform. MFQ-ATSC and MFAC-ATSC improve travel time by 15.74% and 7.4%, respectively, compared with fixed-timing control; they also outperform the DQN algorithm by 53.22% and 40.32%, respectively, and the A3C algorithm by 85.1% and 52.63%, respectively. The experimental results show that the MFQ-ATSC and MFAC-ATSC algorithms converge faster, reach higher final stable reward values, and yield higher average vehicle speeds and lower loss times within the road network. This demonstrates the effectiveness of mean field multi-agent reinforcement learning for regional traffic signal control and its ability to significantly improve traffic efficiency within the road network.
Future research will further improve the convergence speed and adaptability of the algorithm and extend the model to more complex systems. A wider variety of experimental scenarios will also be considered, combining regional traffic signal control with connected and automated vehicles and using mean field theory to further optimize the traffic efficiency of the road network.

Author Contributions

Conceptualization, Z.Z.; Writing—original draft, W.Z.; Writing—review & editing, Y.L.; Supervision, G.X. All authors have read and agreed to the published version of the manuscript.

Funding

The research was supported by the following funding projects: National Natural Science Foundation Project (U1909204), China National Railway Group Co., Ltd. Science and Technology Research and Development Program Project (L2022X002), Open Topic of National Railway Intelligent Transportation System Engineering Technology Research Center (RITS2021KF03), and Guangdong Provincial Key Area Research and Development Program Project (2020B0909050001).

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wei, H.; Zheng, G. Recent Advances in Reinforcement Learning for Traffic Signal Control. ACM SIGKDD Explor. Newsl. 2020, 22, 12–18. [Google Scholar]
  2. Mikami, S.; Kakazu, Y. Genetic reinforcement learning for cooperative traffic signal control. In Proceedings of the First IEEE Conference on Evolutionary Computation, IEEE World Congress on Computational Intelligence, Orlando, FL, USA, 27–29 June 1994; Volume 1, pp. 223–228. [Google Scholar]
  3. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
  4. Shang, C.L.; Liu, X.M.; Tian, Y.L.; Dong, L.X. Priority of Dedicated Bus Arterial Control Based on Deep Reinforcement Learning. J. Transp. Syst. Eng. Inf. Technol. 2021, 21, 64–70. [Google Scholar]
  5. Li, L.; Lv, Y.S.; Wang, F.Y. Traffic signal timing via deep reinforcement learning. IEEE/CAA J. Autom. Sin. 2016, 3, 247–254. [Google Scholar]
  6. Chu, T.; Wang, J.; Codecà, L.; Li, Z. Multi-Agent Deep Reinforcement Learning for Large-Scale Traffic Signal Control. IEEE Trans. Intell. Transp. Syst. 2020, 21, 1086–1095. [Google Scholar] [CrossRef]
  7. Liang, X.; Du, X.; Wang, G.; Han, Z. A Deep Reinforcement Learning Network for Traffic Light Cycle Control. IEEE Trans. Veh. Technol. 2019, 68, 1243–1253. [Google Scholar] [CrossRef]
  8. Prashanth, L.A.; Bhatnagar, S. Reinforcement learning with average cost for adaptive control of traffic lights at intersections. In Proceedings of the 2011 14th International IEEE Conference on Intelligent Transportation Systems(ITSC), Washington, DC, USA, 5–7 October 2011; pp. 1640–1645. [Google Scholar]
  9. Rasheed, F.; Yau, K.; Low, Y.C. Deep reinforcement learning for traffic signal control under disturbances: A case study on Sunway city, Malaysia. Future Gener. Comput. Syst. 2020, 109, 431–445. [Google Scholar] [CrossRef]
  10. Tan, T.; Bao, F.; Deng, Y.; Jin, A.; Dai, Q.; Wang, J. Cooperative Deep Reinforcement Learning for Large-Scale Traffic Grid Signal Control. IEEE Trans. Cybern. 2020, 50, 2687–2700. [Google Scholar] [CrossRef] [PubMed]
  11. Zheng, G.; Zang, X.; Xu, N.; Wei, H.; Yu, Z.; Gayah, V.; Xu, K.; Li, Z. Diagnosing Reinforcement Learning for Traffic Signal Control. arXiv 2019, arXiv:abs/1905.04716. [Google Scholar]
  12. Xu, M.; Wu, J.; Huang, L.; Zhou, R.; Wang, T.; Hu, D. Network-wide traffic signal control based on the discovery of critical nodes and deep reinforcement learning. J. Intell. Transp. Syst. 2020, 24, 1–10. [Google Scholar] [CrossRef]
  13. Yang, Y.; Luo, R.; Li, M.; Zhou, M.; Zhang, W.; Wang, J. Mean Field Multi-Agent Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 5571–5580. [Google Scholar]
  14. Hu, S.; Leung, C.; Leung, H. Modelling the Dynamics of Multiagent Q-Learning in Repeated Symmetric Games: A Mean Field Theoretic Approach. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  15. Mangasarian, O.L. Equilibrium Points of Bimatrix Games. J. Soc. Ind. Appl. Math. 1964, 12, 778–780. [Google Scholar] [CrossRef]
  16. Wu, T.; Zhou, P.; Liu, K.; Yuan, Y.; Wang, X.; Huang, H.; Wu, D.O. Multi-Agent Deep Reinforcement Learning for Urban Traffic Light Control in Vehicular Networks. IEEE Trans. Veh. Technol. 2020, 69, 8243–8256. [Google Scholar] [CrossRef]
  17. Kumar, N.; Rahman, S.S.; Dhakad, N. Fuzzy Inference Enabled Deep Reinforcement Learning-Based Traffic Light Control for Intelligent Transportation System. IEEE Trans. Intell. Transp. Syst. 2020, 22, 4919–4928. [Google Scholar] [CrossRef]
  18. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 1928–1937. [Google Scholar]
  19. Wei, W.; Wu, Q.; Wu, J.Q.; Du, B.; Shen, J.; Li, T. Multi-agent deep reinforcement learning for traffic signal control with Nash Equilibrium. In Proceedings of the 2021 IEEE 23rd International Conference on High Performance Computing & Communications; 7th International Conference on Data Science & Systems; 19th International Conference on Smart City; 7th International Conference on Dependability in Sensor, Cloud & Big Data Systems & Application, Haikou, China, 20–22 December 2021; pp. 1435–1442. [Google Scholar]
  20. Zhang, X.; Xiong, G.; Ai, Y.; Liu, K.; Chen, L. Vehicle Dynamic Dispatching using Curriculum-Driven Reinforcement Learning. Mech. Syst. Signal Process. 2023, 204, 110698. [Google Scholar] [CrossRef]
  21. Wang, X.; Yang, Z.; Chen, G.; Liu, Y. A Reinforcement Learning Method of Solving Markov Decision Processes: An Adaptive Exploration Model Based on Temporal Difference Error. Electronics 2023, 12, 4176. [Google Scholar] [CrossRef]
  22. Wu, Y.; Wu, X.; Qiu, S.; Xiang, W. A Method for High-Value Driving Demonstration Data Generation Based on One-Dimensional Deep Convolutional Generative Adversarial Networks. Electronics 2022, 11, 3553. [Google Scholar] [CrossRef]
Figure 1. Convolutional neural network for fitting Q-values.
Figure 2. MFQ-ATSC framework.
Figure 3. MFAC-ATSC framework.
Figure 4. Simulation of the road network: (a) target regional traffic network; (b) simulated regional traffic network.
Figure 5. Comparison of the reward values at each intersection under different algorithms.
Figure 6. Vehicle average loss time under different algorithms.
Figure 7. Vehicle average speed under different algorithms.
Table 1. Simulation results of the different algorithms.

Evaluation Metrics            Fixed-Timing Control    DQN           MFQ-ATSC      A3C           MFAC-ATSC
Average Reward Value          -                       6 × 10−4      1.45 × 10−3   0.7 × 10−4    1.08 × 10−3
Number of Waiting Vehicles    1717                    1790          1694          1797          1701
Travel Time (s)               998                     993           977           995           986
Loss Time (s)                 60.21                   56.79         39.44         65.92         42.39
Speed (m/s)                   0.81                    0.62          0.95          0.57          0.87
