Article

A Multi-Agent Regional Traffic Signal Control System Integrating Traffic Flow Prediction and Graph Attention Networks

School of Automotive and Traffic Engineering, Jiangsu University, Zhenjiang 212013, China
* Author to whom correspondence should be addressed.
Systems 2026, 14(1), 47; https://doi.org/10.3390/systems14010047
Submission received: 30 November 2025 / Revised: 22 December 2025 / Accepted: 30 December 2025 / Published: 31 December 2025
(This article belongs to the Section Systems Engineering)

Abstract

Adaptive traffic signal control is a critical component of intelligent transportation systems, and multi-agent deep reinforcement learning (MARL) has attracted increasing interest due to its scalability and control efficiency. However, existing methods have two major drawbacks: (i) they are largely driven by current and historical traffic states, without explicit forecasting of upcoming traffic conditions, and (ii) their coordination mechanisms are often weak, making it difficult to model complex spatial dependencies in large-scale road networks and thereby limiting the benefits of coordinated control. To address these issues, we propose TG-MADDPG, which integrates short-term traffic prediction with a graph attention network (GAT) for regional signal control. A WT-GWO-CNN-LSTM traffic forecasting module predicts near-future states and injects them into the MARL framework to support anticipatory decision-making. Meanwhile, the GAT dynamically encodes road-network topology and adaptively captures inter-intersection spatial correlations. In addition, we design a reward based on normalized pressure difference to guide cooperative optimization of signal timing. Experiments on the SUMO simulator across synthetic and real-world networks under both off-peak and peak demands show that TG-MADDPG consistently achieves lower average waiting times, shorter queue lengths, and higher cumulative rewards than IQL, MADDPG, and GMADDPG, demonstrating strong effectiveness and generalization.

1. Introduction

With rapid urbanization and sustained growth in vehicle ownership, traffic congestion has become a major constraint on urban operating efficiency and residents’ quality of life [1,2]. As a core technical measure to alleviate congestion, traffic signal control (TSC) can effectively enhance the efficiency of urban transportation systems, playing a vital role in mitigating urban traffic congestion and improving road capacity [3].
TSC has progressed from traditional fixed-time control to adaptive strategies and, more recently, deep reinforcement learning (DRL). Fixed-time methods [4,5,6] are straightforward and reliable, but their offline-optimized signal plans (cycle length, splits, and offsets) remain static and therefore cannot respond to time-varying demand. Adaptive systems such as SCOOT [7] and SCATS [8] update signal operation using real-time detector measurements; while the phase sequence is typically constrained by safety requirements, green allocations—and in some implementations the cycle length and coordination offsets—can be adjusted online. Nevertheless, these rule-based adaptive schemes often exhibit limited flexibility and weak proactive optimization capability under complex, highly fluctuating traffic conditions.
In recent years, DRL has demonstrated significant potential in the TSC field due to its powerful environmental perception and decision-making optimization capabilities [9]. By modeling the TSC problem as a Markov Decision Process, DRL agents can autonomously learn efficient signal control strategies through interaction with the environment.
Nevertheless, applying DRL to regional cooperative traffic signal control still faces severe challenges. Firstly, most existing methods make decisions based on the instantaneous traffic state, lacking effective exploration of the temporal patterns of traffic flow [10], which makes it difficult to achieve proactive control and results in insufficient long-term benefits of the strategy in dynamic environments. Secondly, in multi-agent cooperative control, the communication mechanism among agents is crucial; without effective structured communication, agents making decisions based solely on local observations struggle to achieve global coordination, easily leading to policy conflicts or inefficiencies. Furthermore, the design of the reward function directly influences the learning direction of agents. A reward mechanism that accurately reflects the real operational state of intersections and guides cooperative behavior is key to the system’s success.
To address the above challenges, this paper develops a Graph Attention Multi-Agent Deep Deterministic Policy Gradient algorithm integrated with traffic flow prediction (TG-MADDPG). The main contributions of this study are threefold:
  • A prediction-guided MADDPG framework for regional signal control. We propose TG-MADDPG, which couples short-term traffic flow prediction with multi-agent actor–critic learning for regional signal control. The predicted future traffic state is incorporated into the target construction to guide value learning, enabling more forward-looking decision making beyond purely reactive control based on historical and current states.
  • Topology-aware coordination with graph attention. To capture complex spatial dependencies among intersections, we introduce a graph attention network (GAT) to dynamically encode the road network topology and learn adaptive influence weights over neighboring intersections, thereby enhancing coordination and improving cooperative control in large-scale networks.
  • Reward design and comprehensive evaluation. We design a reward derived from normalized pressure difference to better align local signal operations with regional coordination objectives. The proposed control scheme is evaluated on the SUMO simulation platform using both synthetic road networks with simulated traffic and real-world road networks with measured traffic flow. Multiple performance metrics are employed to assess control performance, demonstrating the effectiveness and generalization capability of the proposed approach.
The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 formulates the multi-intersection signal control problem. Section 4 presents the architecture of the proposed TG-MADDPG algorithm. Section 5 reports and analyzes the experimental results. Section 6 concludes the paper and discusses future research directions.

2. Related Work

Deep Reinforcement Learning (DRL) addresses the TSC problem by framing it as a Markov Decision Process (MDP) [11]. Algorithms developed within this framework do not necessitate a mathematical model of the external environment. Instead, DRL learns control policies through interaction with the environment, enabling it to manage high-dimensional state spaces and optimize long-term performance. Early research employed tabular Q-Learning to control traffic signals at isolated intersections [12]; however, this approach encountered computational challenges when confronted with high-dimensional state and action spaces. The integration of deep learning with reinforcement learning has led to the emergence of DRL, effectively overcoming these limitations. Deep Q-Networks (DQN) [13] and their variants (e.g., Double DQN, Dueling DQN) [14] have achieved remarkable success in single-intersection control. For more complex continuous control problems, policy gradient-based algorithms such as Deep Deterministic Policy Gradient (DDPG) [15] and Proximal Policy Optimization (PPO) [16] have been widely applied.
In real-world road networks, however, intersections are interconnected; traffic fluctuations at one intersection often propagate and influence its neighbors. Consequently, cooperative multi-intersection signal control is crucial for enhancing overall network efficiency. To address the challenges of multi-intersection coordination, Multi-Agent Reinforcement Learning (MARL), particularly the Centralized Training with Decentralized Execution (CTDE) framework, has become a research focus. Ref. [17] proposed a Multi-Agent Actor-Critic algorithm (MA2C) to address partial observability and non-stationarity in TSC, demonstrating its superiority. The study in [18] developed various MARL-based algorithms and compared their performance against fixed-time control. Qu et al. proposed a novel MARL method based on a regional mixed-strategy Nash equilibrium [19], validating its effectiveness. However, these studies exhibit limitations regarding efficient structured communication, which hinders the explicit modeling of topological correlations within the road network.
To effectively capture spatial dependencies, Graph Neural Networks (GNNs) have been introduced into MARL frameworks [20]. One line of work proposed a Multi-layer Graph Masked Q-learning (MGMQ) algorithm to address incomplete environment embedding and the difficulty of scaling to large networks [21]. Yang [22] utilized a restart random walk attention mechanism to enhance global graph-structure awareness. Another study integrated Graph Attention Networks (GAT) [23] and Graph Convolutional Networks (GCN) [24] within the Actor-Critic framework to alleviate difficulties in multi-agent coordination and decision bias caused by partial observability [25]. A further study proposed a Deep Graph Q-learning (DGQL) algorithm combining graph cooperative convolution kernels with a CMEM emission model, addressing both agent collaboration and eco-traffic objectives [26]. Meanwhile, another work developed a GA2 communication module that merges the RegionLight [27] and R-DRL frameworks to overcome communication deficiencies and the disconnect between macro and micro states in regional control. Despite these advancements in spatial modeling, most approaches primarily focus on static or current spatial relationships while giving insufficient consideration to the temporal dynamics of traffic flow. Consequently, agents' decision-making does not fully leverage historical traffic patterns to predict future states, which limits the proactivity of control policies.
Accurate short-term traffic flow prediction is the foundation for proactive control. In this domain, Variational Mode Decomposition (VMD) is widely used to preprocess traffic data and improve the accuracy of prediction models [28,29,30,31]. One study proposed a Spatio-temporal Graph Convolutional Network combining Information Geometry and Attention mechanisms (IGAGCN) [32], which effectively captures dynamic spatial correlations between sensors and multi-scale temporal features. Another study [33] comprehensively considered factors such as holidays and weather, demonstrating advantages in spatio-temporal feature fusion. Furthermore, to address the limitations of Recurrent Neural Networks in capturing long-term temporal dependencies, ref. [34] proposed a Temporal Convolutional Network (TCN) based on causal dilated convolutions to enhance short-term prediction performance.
In summary, motivated by the limitations of prior studies, we propose TG-MADDPG with targeted innovations for regional traffic signal control. Unlike works that apply GNNs only for state representation [20,21,22,23,24,25] or treat prediction results merely as additional inputs, our method embeds both prediction and graph attention into the core learning and coordination mechanisms. Specifically, the predicted future state is incorporated into the value backup target, shaping the critic’s training signal and encouraging more proactive policy learning. Meanwhile, graph attention learns adaptive neighborhood influence weights to support topology-aware cooperation in policy representation. Together with a pressure-aware reward design, these components form a tightly coupled framework that improves coordination efficiency and long-term control performance, rather than a modular “plug-in” enhancement.

3. Preliminaries

This section presents the formal definitions of the core components in deep reinforcement learning: the state space, action space and reward function.

3.1. State Space

The description of state-space inputs primarily relies on methods such as uniform segmentation, Discrete Traffic State Encoding (DTSE) [35], and feature-based value vectors. Among these approaches, the DTSE method partitions road segments non-uniformly based on the vehicle's distance to the stop line, employing shorter intervals near the stop line and longer intervals farther away. This is an advancement over uniform segmentation, enabling a more nuanced characterization of complex and dynamic vehicle behaviors near the stop line while simplifying the input representation for vehicles located farther upstream.
In this paper, we adopt the DTSE to represent the intersection state, using a non-uniform discretization scheme to construct the state vector. Specifically, vehicle position and speed on each lane are mapped to lane-wise occupancy and speed vectors along the road segment. As shown in Figure 1, the lane is discretized with higher resolution near the stop line and coarser resolution upstream to balance fidelity and dimensionality. For the 350 m west-to-east approach, the downstream section (0–28 m) is divided into four 7 m cells, and the remaining upstream segment is partitioned into progressively larger cells. This design preserves fine-grained information in the critical queuing region while keeping the overall state representation compact.
When a vehicle occupies a cell, the value of that cell is 1; otherwise, it is 0. Vehicle speeds are normalized. In a multi-intersection environment, the traffic state at intersection $i$ is defined as:
$$O_t^i = \left(P_l^i, V_l^i\right) \tag{1}$$
In Equation (1), $O_t^i$ denotes the local observation (local state feature) of intersection $i$ at time $t$, constructed from DTSE-encoded traffic information. Specifically, $P_l^i$ and $V_l^i$ represent the lane-wise position/occupancy and normalized speed matrices of intersection $i$, respectively.
In a multi-intersection setting with centralized training, we further define the neighborhood-augmented state of intersection $i$ as
$$S_t^i = \left(O_t^i, \{O_t^j\}_{j \in \mathcal{N}_i}\right) \tag{2}$$
where $\mathcal{N}_i$ denotes the set of neighboring intersections of $i$. Throughout this paper, $O_t^i$ represents the local observation of a single intersection, while $S_t^i$ represents the joint state that incorporates information from adjacent intersections for coordinated control.
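The non-uniform discretization described above can be sketched in a few lines of Python. The 1.5x cell-growth factor, the free-flow speed used for normalization, and the function names are illustrative assumptions, not the paper's exact implementation.

```python
def dtse_cells(length=350.0, near=4, near_len=7.0, growth=1.5):
    """Build non-uniform cell boundaries: fine 7 m cells near the stop line,
    progressively longer cells upstream (growth factor is an assumption)."""
    bounds, pos = [0.0], 0.0
    size = near_len
    for _ in range(near):            # four 7 m cells in the 0-28 m section
        pos += near_len
        bounds.append(pos)
    while pos < length:              # progressively larger upstream cells
        size *= growth
        pos = min(pos + size, length)
        bounds.append(pos)
    return bounds

def encode_lane(vehicles, bounds, v_max=13.89):
    """Map (distance_to_stop_line, speed) pairs to a 0/1 occupancy vector
    and a normalized-speed vector, one entry per cell."""
    n = len(bounds) - 1
    occ, spd = [0.0] * n, [0.0] * n
    for dist, speed in vehicles:
        for c in range(n):
            if bounds[c] <= dist < bounds[c + 1]:
                occ[c] = 1.0
                spd[c] = max(spd[c], min(speed / v_max, 1.0))
                break
    return occ, spd
```

For the 350 m approach of Figure 1, `dtse_cells()` yields boundaries 0, 7, 14, 21, 28 m and then widening cells up to 350 m, matching the fine-near/coarse-far design.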

3.2. Action Space

In TSC tasks, actions represent adjustments to traffic signals. To prevent excessive switching in practical applications, a control interval is typically enforced between consecutive actions. Three primary action schemes exist: phase transition, phase selection, and phase duration [36].
Each intelligent agent is assigned four distinct, predefined phase options, corresponding to the NS, NSL, EW, and EWL phases illustrated in Figure 2 below. Unlike most studies that enhance traffic conditions by controlling phase sequence transitions [37], our approach directly adjusts phase durations without altering the predefined phase order, enabling more flexible traffic light scheduling. The resulting action space A, defined by the fixed phase sequences at an intersection, is expressed as:
$$A = \left\{\, G_a^i \in \mathbb{Z} \;\middle|\; G_{min} \le G_a^i \le G_{max} \,\right\} \tag{3}$$
In this formula, $G_a^i$ is the green duration of each phase, $G_{min}$ is the minimum green time, and $G_{max}$ is the maximum green time.
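As a minimal illustration of this action space, a continuous actor output can be mapped into the bounded green-duration range. The affine mapping from [-1, 1] and the bound values below are assumptions for the sketch, not values taken from the paper.

```python
def apply_green_action(raw_action, g_min=10, g_max=60):
    """Map a continuous actor output in [-1, 1] to an integer green
    duration G_a in [G_min, G_max]; the affine mapping is an assumption."""
    g = g_min + (raw_action + 1.0) / 2.0 * (g_max - g_min)
    return int(round(min(max(g, g_min), g_max)))
```

For example, an actor output of 0.0 yields the midpoint duration of 35 s under these bounds.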

3.3. Reward Function

This subsection proposes a reward function based on the normalized pressure difference, which quantifies the traffic-pressure gap between an intersection and its adjacent intersections, guides the agent to optimize signal timing, and promotes a balanced distribution of traffic over the road network, thereby effectively alleviating local congestion. The lane pressure is defined as the vehicle queue length on the lane, normalized by a piecewise function and mapped to the interval [−1, 1]. The normalized pressure difference thus directly reflects the difference in traffic load between adjacent intersections.
The traffic-phase pressure difference is defined as the difference between the pressure on the entering lane and the pressure on the exiting lane. For a phase $(a, b)$ moving vehicles from lane $a$ to lane $b$, the pressure difference is:
$$p_{a,b} = p_a - p_b \tag{4}$$
where $p_a$ and $p_b$ are the normalized pressures on the entering lane $a$ and the exiting lane $b$, respectively. The basic form of the normalized pressure is:
$$P_a = k Q_a + \omega (Q_a)^m \tag{5}$$
$$\omega = \frac{1 - k C_a}{(C_a)^m} \tag{6}$$
Here, $Q_a$ is the vehicle queue length on the lane, $k$ and $m$ are adjustment parameters, and $\omega$ is determined by the maximum capacity $C_a$ of the road. This function ensures that pressure grows linearly while the queue is short and rises rapidly as the queue approaches the road's capacity, reflecting the severity of congestion more sensitively.
The total pressure of intersection $i$ is defined as the absolute value of the sum of all phase pressure differences:
$$P_i = \left| \sum_{(a,b) \in \text{phase}} p_{a,b} \right| \tag{7}$$
This value reflects the imbalance of the overall traffic-flow distribution at the intersection. Here, $(a, b) \in \text{phase}$ indicates that the pair $(a, b)$ ranges over the phase set consisting of NS, NSL, EW, and EWL. To guide agents toward minimizing the total pressure at the intersection, this paper defines the reward function $r_i$ as:
$$r_i = -P_i \tag{8}$$
The multi-agent reward is a weighted combination of the local reward and the neighborhood rewards. A spatial discount factor attenuates each neighbor's contribution according to the distance between agents: the closer the neighbor, the larger its effect. The reward function is:
$$r_t^i = r_t^{i,\mathrm{local}} + \sum_{j \in \mathcal{N}_i} \varphi_{i,j} \cdot r_t^j \tag{9}$$
where $r_t^{i,\mathrm{local}}$ is the reward obtained locally by agent $i$, $r_t^j$ is the reward received from neighboring agent $j$, and $\varphi_{i,j}$ is the spatial discount factor.
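The pressure-based reward in Equations (5)-(9) can be prototyped as follows. The values of k and m, the lane capacity, and the use of a single spatial discount factor for all neighbors are illustrative assumptions for this sketch.

```python
def normalized_pressure(q, capacity, k=0.01, m=3):
    """Eqs. (5)-(6): P = k*Q + w*Q^m with w = (1 - k*C)/C^m,
    so that P = 1 exactly when the queue reaches capacity.
    k and m are illustrative tuning values, not the paper's."""
    w = (1.0 - k * capacity) / capacity ** m
    return k * q + w * q ** m

def intersection_reward(phase_pairs, capacity=50.0):
    """Eqs. (7)-(8): total pressure is |sum of phase pressure
    differences|; reward is its negative.
    phase_pairs: (incoming_queue, outgoing_queue) per phase."""
    total = sum(normalized_pressure(qa, capacity) - normalized_pressure(qb, capacity)
                for qa, qb in phase_pairs)
    return -abs(total)

def combined_reward(local, neighbor_rewards, phi=0.8):
    """Eq. (9) with a single spatial discount factor phi (assumption)."""
    return local + phi * sum(neighbor_rewards)
```

Balanced phases (equal incoming and outgoing queues) yield zero pressure and hence the maximum reward of 0, matching the balancing objective described above.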

4. Methodology

The proposed TG-MADDPG model is composed of a traffic flow prediction module, a graph attention network, and a Multi-Agent Deep Deterministic Policy Gradient (MADDPG) [38] module. This section first introduces the WT-GWO-CNN-LSTM-based traffic flow prediction model, then elaborates on the fundamental principles of the Graph Attention Network, and finally provides a detailed analysis of the theoretical foundation and algorithmic workflow of the TG-MADDPG model.

4.1. Traffic Flow Prediction Model

In recent years, deep learning-based traffic flow prediction methods have demonstrated significant potential in mining complex spatiotemporal dependencies. However, short-term traffic flow data commonly exhibits non-stationary, multi-scale, and highly stochastic characteristics, making it difficult for a single model to fully capture its underlying dynamic patterns. To address this, this paper constructs a deep learning traffic flow prediction model named WT-GWO-CNN-LSTM, which integrates Wavelet Transform denoising [39], the Grey Wolf Optimizer [40], Convolutional Neural Networks (CNN) [41], and Long Short-Term Memory (LSTM) [42] networks. The structure of the model is illustrated in Figure 3.
The model employs wavelet transform to denoise the original traffic flow sequence, thereby enhancing its feature representation capability in the time-frequency domain. Through Formula (10), the wavelet decomposition process separates the original traffic flow sequence x(t) into multiple sub-sequences:
$$x(t) = \sum_k c_{J,k}\,\phi_{J,k}(t) + \sum_{j=1}^{J} \sum_k d_{j,k}\,\psi_{j,k}(t) \tag{10}$$
where $c_{J,k}$ are the approximation coefficients, $d_{j,k}$ are the detail coefficients, $\phi$ and $\psi$ are the scaling function and wavelet function, respectively, and $J$ is the decomposition level.
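For intuition, a single-level Haar decomposition, used here as a simple stand-in for whichever wavelet family the model actually employs (the text does not specify one), with basic detail-coefficient thresholding, looks like this:

```python
import math

def haar_dwt(x):
    """One level of the Haar wavelet transform: returns approximation
    coefficients c and detail coefficients d (x must have even length)."""
    s = math.sqrt(2.0)
    c = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    d = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return c, d

def haar_denoise(x, threshold):
    """Denoise by zeroing small detail coefficients, then invert the
    transform; the hard threshold is an illustrative choice."""
    c, d = haar_dwt(x)
    d = [0.0 if abs(v) < threshold else v for v in d]
    s = math.sqrt(2.0)
    out = []
    for ci, di in zip(c, d):
        out += [(ci + di) / s, (ci - di) / s]
    return out
```

With a large threshold, each pair of samples collapses to its average, which is the smoothing effect the denoising stage is after; in practice a multi-level decomposition and a softer thresholding rule would be used.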
Building upon this foundation, a combined CNN-LSTM prediction module is constructed to leverage the complementary strengths of CNNs in extracting spatially local features and LSTM units in modeling temporal dependencies.
$$h_i = \sigma\left(W \ast x_{i:i+k-1} + b\right) \tag{11}$$
$$\begin{aligned} f_t &= \sigma\left(W_f [h_{t-1}, x_t] + b_f\right) \\ i_t &= \sigma\left(W_i [h_{t-1}, x_t] + b_i\right) \\ o_t &= \sigma\left(W_o [h_{t-1}, x_t] + b_o\right) \end{aligned} \tag{12}$$
where $W$ denotes the convolution-kernel weights, $b$ the bias term, $\sigma$ the activation function, and $k$ the convolution kernel size; $f_t$, $i_t$, and $o_t$ denote the forget, input, and output gates, respectively.
To further enhance the model performance, the Grey Wolf Optimizer (GWO) is employed to automatically search for optimal hyperparameters, thereby overcoming the limitations of manual parameter tuning. GWO mimics the social hierarchy and cooperative hunting behavior of grey wolves to locate the global optimum.
Let $X_\alpha$, $X_\beta$, and $X_\delta$ denote the positions of the alpha, beta, and delta wolves (i.e., the three best solutions), and let $X$ be the position of a generic wolf. The hunting process in GWO is modeled by Equations (13)–(15). First, the distances from $X$ to the three leaders are computed as:
$$D_\alpha = |C_1 X_\alpha - X|, \quad D_\beta = |C_2 X_\beta - X|, \quad D_\delta = |C_3 X_\delta - X| \tag{13}$$
Then, three candidate positions estimating the prey location are obtained as:
$$X_1 = X_\alpha - A_1 D_\alpha, \quad X_2 = X_\beta - A_2 D_\beta, \quad X_3 = X_\delta - A_3 D_\delta \tag{14}$$
The position of the wolf is updated by averaging these three candidates:
$$X(t+1) = \frac{X_1 + X_2 + X_3}{3} \tag{15}$$
where $X_1$, $X_2$, and $X_3$ are the three prey-position estimates derived from the leaders, guiding the population toward the region of the optimal solution.
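One GWO iteration can be sketched as below for a one-dimensional search space (a simplification; hyperparameter search is normally multi-dimensional). The coefficient constructions A = 2a*r1 - a and C = 2*r2 follow the standard GWO formulation; the function names are our own.

```python
import random

def gwo_step(wolves, fitness, a):
    """One GWO iteration: rank wolves, then move each wolf toward the
    alpha/beta/delta leaders per Eqs. (13)-(15)."""
    ranked = sorted(wolves, key=fitness)
    alpha, beta, delta = ranked[0], ranked[1], ranked[2]
    new = []
    for x in wolves:
        cand = []
        for leader in (alpha, beta, delta):
            r1, r2 = random.random(), random.random()
            A = 2 * a * r1 - a           # shrinks as a decays: exploit late
            C = 2 * r2                   # random prey-position weighting
            D = abs(C * leader - x)      # Eq. (13): distance to the leader
            cand.append(leader - A * D)  # Eq. (14): candidate position
        new.append(sum(cand) / 3.0)      # Eq. (15): average the estimates
    return new
```

Running this repeatedly while decaying `a` from 2 to 0 drives the pack toward the best solution found, e.g., toward the minimum of a simple quadratic fitness function.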
Finally, the predictions of the individual subsequences are aggregated by linear weighting to produce the final output:
$$\hat{y} = \sum_{i=1}^{N} w_i\, \hat{y}_i \tag{16}$$
The WT-GWO-CNN-LSTM model effectively leverages the signal processing capability of wavelet decomposition, the spatiotemporal feature extraction strengths of CNN-LSTM, and the global optimization ability of GWO. This enables accurate prediction of traffic flow states at intersections within the road network, thereby providing reliable theoretical support for subsequent model improvements.

4.2. Graph Attention Network Model

In multi-intersection traffic signal control scenarios, each intelligent agent typically only accesses local observational information, such as the states of neighboring agents. However, achieving the global objective requires coordinated actions among all agents. The traditional MADDPG algorithm simply concatenates the information of all agents into a joint state, failing to capture the spatial relationships and topological structure between agents. To address this limitation, as shown in Figure 4, we introduce Graph Attention Networks (GAT).
Graph Attention Feature Extraction Module: This network operates without requiring complex matrix decomposition and can assign differentiated attention weights to individual nodes within the graph. Given the state information of each intersection, the m-dimensional observations are embedded into an F-dimensional feature space:
$$h_i = E\left(O_t^i\right) = \sigma\left(O_t^i W + b\right) \tag{17}$$
Here, $W \in \mathbb{R}^{m \times F}$ denotes the weight matrix, $b \in \mathbb{R}^{F}$ the bias vector, and $\sigma$ the activation function; the resulting hidden state $h_i \in \mathbb{R}^{F}$ represents the node features of intersection $i$. The features of all nodes form the set $h = \{h_1, h_2, \dots, h_N\}$, where $N$ is the total number of intersections and $F$ is the feature dimension.
To evaluate the importance of a source node $j$ to the decision-making of a target node $i$, the features of the two nodes are jointly modeled. The node features are first linearly transformed using learnable weight matrices, then fused and mapped to an attention score $e_{ij}$ that characterizes the influence of node $j$ on node $i$. The LeakyReLU activation is applied, and the attention coefficients $\alpha_{ij}$ are obtained by normalizing the scores across all neighbors with the softmax function:
$$e_{ij} = \mathrm{LeakyReLU}\left( (W_i h_i)(W_j h_j)^{T} \right) \tag{18}$$
$$\alpha_{ij} = \mathrm{softmax}_j\left(e_{ij}\right) = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})} \tag{19}$$
where $W_i$ and $W_j$ are the weight matrices for the target and source nodes, respectively. By performing a weighted summation of the neighborhood node features, the attention mechanism, extended with multiple heads, yields the enhanced features of the target node:
$$h_i' = \sigma\left( \frac{1}{K} \sum_{k=1}^{K} \sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k}\, W^{k} h_j \right) \tag{20}$$
where $K$ is the number of attention heads. The enhanced feature $h_i'$ not only incorporates the local state of intersection $i$ but also integrates the importance-weighted information from its neighboring intersections, providing the basis for the subsequent Actor network to generate heterogeneous policies. For instance, when severe queuing occurs at an upstream intersection $j$ within the neighborhood of downstream intersection $i$, the attention coefficient $\alpha_{ij}$ increases. Consequently, $h_i'$ accentuates the congestion information from $j$, guiding intersection $i$ to extend its green duration to facilitate queue dissipation, thereby avoiding a homogeneous "one-size-fits-all" control strategy.
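For scalar node features and a single attention head, Equations (18) and (19) reduce to the sketch below; the scalar weights w_t and w_s and the LeakyReLU slope are illustrative stand-ins for the learned matrices.

```python
import math

def attention_coeffs(h_i, neighbors, w_t=1.0, w_s=1.0, slope=0.2):
    """Scalar sketch of Eqs. (18)-(19): score each neighbor j with
    LeakyReLU((w_t*h_i)*(w_s*h_j)), then softmax over the neighborhood."""
    def leaky_relu(v):
        return v if v > 0 else slope * v
    scores = [leaky_relu((w_t * h_i) * (w_s * h_j)) for h_j in neighbors]
    mx = max(scores)                       # subtract max for stability
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def aggregate(neighbors, alphas):
    """Weighted sum of neighbor features (single head, identity sigma)."""
    return sum(a * h for a, h in zip(alphas, neighbors))
```

A more congested neighbor (larger feature value) receives a larger coefficient, so its state dominates the aggregated feature, which is exactly the queuing-propagation behavior described above.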

4.3. TG-MADDPG Framework

TG-MADDPG extends MADDPG in two ways. First, the traffic flow at the next time step is predicted, and the prediction is used when computing the target Q-value. Second, a graph attention network is embedded into the MADDPG framework to capture the spatial relationships and topology among agents. Together, these components realize cooperative optimization of regional traffic signal control.
Figure 5 illustrates the architecture of the TG-MADDPG algorithm. In this framework, each intersection within the region is treated as an independent agent, equipped with its own Actor and Critic networks, and operates under the Centralized Training with Decentralized Execution (CTDE) paradigm. The decision-making process follows the Markov Decision Process (MDP). The Actor network generates actions based on the current local observation, while the Critic network evaluates the action’s value using global state-action information. The parameters of the Actor and Critic networks for agent i are denoted as θ i μ and θ i Q , respectively.
The Actor network takes the current state S t i as input. As defined in Section 3.1, S t i is constructed by combining the local observation of intersection i with information from its neighboring intersections according to the road-network topology (see Equation (2) and Figure 1). Consequently, the dimensionality of S t i is topology-dependent rather than uniform across agents: it varies with the number of controlled approaches/lanes at the intersection and its neighborhood configuration in the network. For example, for the intersection illustrated in Figure 1, the resulting state tensor has a length of 160. Therefore, different intersections may have different state lengths, and each agent uses a topology-consistent state representation when feeding S t i into the actor–critic networks.
The corresponding target Actor network receives the next state S t + 1 i and outputs the action for that subsequent state. The Critic network takes the current state S t i together with the executed action and outputs the corresponding Q-value. The target Critic network, in contrast, uses the next state S t + 1 i and the action produced by the target Actor network as input. Crucially, it outputs not only the Q-value for the next state but also an additional predicted Q-value based on the predicted future state S p .
During training, agents continuously interact with the environment. The resulting experience tuples $\langle S_t, A_t, R_t, S_{t+1}, S_p \rangle$ are stored in a shared replay buffer for subsequent network training.
The Actor network updates its parameters using the policy gradient ascent method. The gradient is calculated as:
$$\nabla_{\theta_i^{\mu}} J \approx \mathbb{E}_{S \sim D}\left[ \nabla_{a_i} Q\left(S, a_1, \dots, a_N; \theta_i^{Q}\right) \nabla_{\theta_i^{\mu}} \mu\left(h_i; \theta_i^{\mu}\right) \right] \tag{21}$$
The Critic network utilizes global information (including the states and actions of all agents) to compute Q-values, and its parameters are updated by minimizing the mean-squared Bellman error:
$$L\left(\theta_i^{Q}\right) = \mathbb{E}_{(S, a, r, S', S_p) \sim D}\left[ \left( Q_i\left(S, a_1, \dots, a_N; \theta_i^{Q}\right) - y_i \right)^2 \right] \tag{22}$$
$$y_i = r_t^i + \gamma\left[ (1 - \beta)\, Q_i^{target}\left(S_{t+1}, A_{t+1}^{target}\right) + \beta\, Q_i^{target}\left(S_p, A_p^{target}\right) \right], \quad \gamma \in (0, 1),\; \beta \in [0, 1] \tag{23}$$
with $S_p = f_\phi(S_t)$, $A_{t+1}^{target} = \mu^{target}(S_{t+1})$, and $A_p^{target} = \mu^{target}(S_p)$.
Equation (23) preserves the standard Bellman target form, an immediate reward plus a discounted bootstrapped value, while incorporating the predicted future state through a convex combination. Here, $r_t^i$ denotes the immediate reward, $S_p = f_\phi(S_t)$ the predicted state, $\gamma \in (0, 1)$ the discount factor, and $\beta \in [0, 1]$ the weight of the prediction-guided component. Accordingly, the next-state and predicted-state Q-values are weighted by $\gamma(1 - \beta)$ and $\gamma\beta$, respectively; with $\gamma = 0.99$ and $\beta = 0.7$, the effective coefficients are approximately 0.3 and 0.7.
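The prediction-guided target of Equation (23) is a one-liner; the default values below use the quoted example settings of gamma = 0.99 and beta = 0.7.

```python
def prediction_guided_target(r, q_next, q_pred, gamma=0.99, beta=0.7):
    """Eq. (23): y = r + gamma * ((1-beta)*Q_next + beta*Q_pred).
    With beta = 0, this reduces to the standard MADDPG Bellman target."""
    return r + gamma * ((1.0 - beta) * q_next + beta * q_pred)
```

Setting beta = 0 recovers the purely reactive target, which makes the role of the prediction module explicit in ablations.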
Finally, the algorithm employs a soft update mechanism to refresh the parameters of the target policy network $\theta_i^{\mu,target}$ and the target value network $\theta_i^{Q,target}$, as shown in Equation (24), where $\tau$ defaults to 0.001:
$$\theta_i^{\mu,target} \leftarrow \tau\, \theta_i^{\mu} + (1 - \tau)\, \theta_i^{\mu,target}, \qquad \theta_i^{Q,target} \leftarrow \tau\, \theta_i^{Q} + (1 - \tau)\, \theta_i^{Q,target} \tag{24}$$
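Applied to flat parameter lists rather than network tensors, the soft update of Equation (24) is simply:

```python
def soft_update(online, target, tau=0.001):
    """Eq. (24): theta_target <- tau*theta + (1-tau)*theta_target,
    applied element-wise; tau = 0.001 follows the paper's default."""
    return [tau * o + (1.0 - tau) * t for o, t in zip(online, target)]
```

The small tau makes the target networks trail the online networks slowly, which stabilizes the bootstrapped targets in Equations (22) and (23).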
Algorithm Workflow
Step 1:
Construct an Actor network and a Critic network for each agent, initializing their respective parameters $\theta_i^{\mu}$ and $\theta_i^{Q}$, and synchronize the parameters of the corresponding target networks. Initialize hyperparameters including the experience replay buffer, learning rate, and discount factors. Simultaneously, initialize the Graph Attention Network weights and the attention-coefficient calculation function.
Step 2:
The agent obtains the initial environmental state of the intersection and its neighborhood observation information, and forms a joint state representation after fusion.
Step 3:
Based on the current joint state, compute the enhanced node features and normalized attention coefficients. Subsequently, generate the action strategy for the current time step. Execute the selected action, observe the environmental feedback, and obtain the next state and the corresponding reward value.
Step 4:
Store the obtained experience tuple in the experience replay buffer. During the training phase, randomly sample a mini-batch of experiences from the buffer and update the Critic network parameters by minimizing the loss function.
Step 5:
Update the online Actor network parameters using the policy gradient method. Then, synchronize the parameters of the target policy network and the target value network via a soft update mechanism.
We present the pseudocode of the TG-MADDPG in detail in Algorithm 1.
Algorithm 1: TG-MADDPG Pseudocode
1      Initialize Actor network $\mu(S_t^i; \theta_i^{\mu})$ and Critic network $Q(S_t^i; \theta_i^{Q})$ with parameters $\theta_i^{\mu}$, $\theta_i^{Q}$ for each agent $i$.
2      Initialize target network weights: $\theta_i^{\mu,target} \leftarrow \theta_i^{\mu}$, $\theta_i^{Q,target} \leftarrow \theta_i^{Q}$.
3      Initialize experience replay buffer $D$, learning rates, discount factor, and other hyperparameters.
4      Initialize Graph Attention Network weights and the attention-coefficient function.
5      For episode = 1 to M do
6          Initialize the environment and obtain the initial intersection state $O_t^i$ and neighborhood observations $O_t^j$.
7          Compute the joint state representation $S_t^i$ using Equation (2).
8          For t = 1 to T do
9              Compute enhanced features $h_t^i$ and normalized attention coefficients $\alpha_{ij}$ using Equations (17) and (19).
10             Select action $a_t^i = \mu_i(h_t^i; \theta_i^{\mu})$ based on the current state $S_t^i$.
11             Execute $a_t^i$, observe the next state $S_{t+1}^i$, and compute the reward $r_t^i$ using Equation (9).
12             Obtain the traffic-flow prediction state $S_p^i$ from the prediction module.
13             Store the experience tuple $\langle S_t, A_t, R_t, S_{t+1}, S_p \rangle$ in replay buffer $D$.
14             Sample a random minibatch from $D$.
15             For each sampled transition, compute target actions with the target actors: $A_{t+1} = \left( \mu_1^{target}(S_{t+1}^1), \dots, \mu_N^{target}(S_{t+1}^N) \right)$.
16             Similarly, compute predicted-state actions with the target actors: $A_p = \left( \mu_1^{target}(S_p^1), \dots, \mu_N^{target}(S_p^N) \right)$.
17             Compute the target Q-value $y_i$ using Equation (23).
18             Update the Critic network by minimizing the loss $L(\theta_i^{Q})$ using Equation (22).
19             Update the Actor network via the policy gradient in Equation (21).
20             Soft-update the target network parameters via Equation (24).
21         End for
22   End for
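Steps 17 and 20 of Algorithm 1 can be sketched in Python. This is a minimal illustration, not the authors' code: it assumes the prediction-augmented target of Equation (23) takes a convex-mixing form consistent with the description in Section 5.3.3 (β and γ follow the paper's notation; `soft_update` and `mixed_td_target` are our illustrative helper names).

```python
def soft_update(target_params, online_params, tau=0.005):
    """Polyak soft update (step 20): theta_target <- tau*theta + (1 - tau)*theta_target."""
    return [tau * p + (1.0 - tau) * tp
            for tp, p in zip(target_params, online_params)]

def mixed_td_target(r, q_next, q_pred, gamma=0.99, beta=0.3):
    """Assumed prediction-augmented TD target (step 17): a convex mix of the
    target-critic values at the observed next state S_{t+1} and the predicted
    state S_p, so prediction error enters the update scaled by beta*gamma."""
    return r + gamma * ((1.0 - beta) * q_next + beta * q_pred)
```

With β = 0, the target reduces to the standard MADDPG target, which makes the prediction module an additive refinement rather than a structural change to the learner.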

5. Experiment

5.1. Experiment Environment

SUMO (Simulation of Urban Mobility) [43], an open-source microscopic traffic simulation platform, serves as a highly scalable environment for modeling individual vehicle movements and interactions, such as car-following and lane-changing, making it well-suited for reproducing realistic traffic phenomena like intersection queuing and congestion dissipation. As shown in Figure 6, in this study, the simulation framework employs the Traffic Control Interface (TraCI) to establish interactive communication between SUMO and a Python (3.13)-based deep reinforcement learning (DRL) agent. At each time step, the DRL agent retrieves real-time intersection states—such as queue lengths and waiting times—via TraCI, computes optimal signal control actions, and applies them back to the simulation, thereby creating a closed-loop control system consistent with the reinforcement learning paradigm. The road network is constructed from OpenStreetMap data, which is converted into a SUMO-compatible format using netconvert, while traffic demand is defined through route files based on historical data. A configuration file integrates all components and sets key simulation parameters, enabling the execution of experiments and output of performance metrics for analysis.
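The closed loop described above can be sketched with the actual TraCI calls. The configuration path, lane IDs, and the greedy phase rule below are illustrative placeholders only; in the paper the action comes from the learned TG-MADDPG policy, not this heuristic.

```python
# Sketch of the SUMO/TraCI closed control loop (requires a local SUMO install).
try:
    import traci  # Python bindings shipped with SUMO
except ImportError:
    traci = None

def choose_phase(ns_queue, ew_queue):
    """Toy stand-in for the DRL policy: serve the direction with the longer
    queue (phase 0 = NS green, phase 2 = EW green in a 4-phase program)."""
    return 0 if ns_queue >= ew_queue else 2

def run(cfg="network.sumocfg", tls_id="J0", steps=3600):
    traci.start(["sumo", "-c", cfg])
    for _ in range(steps):
        # Retrieve real-time state via TraCI (queues as halting-vehicle counts).
        ns = sum(traci.lane.getLastStepHaltingNumber(l) for l in ("N_in_0", "S_in_0"))
        ew = sum(traci.lane.getLastStepHaltingNumber(l) for l in ("E_in_0", "W_in_0"))
        traci.trafficlight.setPhase(tls_id, choose_phase(ns, ew))  # apply action
        traci.simulationStep()  # advance SUMO by one time step
    traci.close()
```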

5.2. Datasets

This study conducts experiments based on a real-world road network in the Weiyang District of Xi’an, China. The topology of this road network is illustrated in Figure 7. The experimental network comprises 9 signalized intersections. Each intersection is configured with bidirectional lanes, where the north–south (NS) and east–west (EW) directions have lane lengths of 840 m and 1280 m, respectively. The traffic flow dataset, covering the period from 1 April to 15 April 2021, was provided by the local transportation authority.
For the traffic signal control (TSC) experiments, both simulated and real-world data are utilized. To evaluate the adaptability of the proposed method under different traffic loads, simulations are conducted for both off-peak and peak-hour traffic conditions. In the simulated experiments, the total vehicle inflow rates for off-peak and peak hours are set to 7200 vehicles/h and 10,800 vehicles/h, respectively. Furthermore, to comprehensively assess the performance of the TG-MADDPG method, two representative time periods (10:00 a.m. and 6:00 p.m.) on 5 April 2021 are selected from the real-world dataset for testing.
The parameters in Table 1 were chosen to ensure stable training, computational efficiency, and realistic signal-timing constraints. Learning rates and the discount factor follow standard MADDPG practice, while the batch size and replay buffer size balance gradient variance and memory cost. Signal timing bounds and a 3 s yellow interval enforce operational safety and avoid excessive switching, and vehicle-dynamics settings use standard microscopic simulation configurations. For fairness, SUMO uses no pre-defined green-wave progression or manual offsets: all intersections share identical phase definitions and safety constraints, and green durations are selected online by the learned policy, so any coordination emerges from TG-MADDPG.
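The signal-timing bounds in Table 1 imply a bounded action mapping. The exact mapping is not specified in this section; one common realization, shown as an assumption below, affinely maps a continuous actor output in [−1, 1] onto [G_min, G_max]:

```python
def green_duration(action, g_min=10.0, g_max=50.0):
    """Map a continuous actor output in [-1, 1] to a green time in
    [G_min, G_max] seconds (Table 1 values); out-of-range outputs are clipped."""
    a = max(-1.0, min(1.0, action))
    return g_min + 0.5 * (a + 1.0) * (g_max - g_min)
```

Together with the fixed 3 s yellow interval, this guarantees every selected timing is operationally safe regardless of the raw policy output.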
For the traffic flow prediction experiments, data is selected from the intersection of Fengcheng 7th Road and Mingguang Road within the aforementioned road network. The dataset comprises 15 consecutive days of traffic flow records from 1 April to 15 April 2021. The first 14 days of data are used as the training set for model learning, while the data from the final day serves as the test set to evaluate the predictive performance of the model.

5.3. Experimental Results

5.3.1. Traffic Signal Control Experiment

To assess the effectiveness and generalization capability of the proposed TG-MADDPG algorithm for regional traffic signal control, we conducted a comparative analysis of four algorithms: IQL (Independent Q-Learning) [44], MADDPG, GMADDPG, and TG-MADDPG. The details of these algorithms are as follows: (1) IQL serves as an independent multi-agent learning benchmark, where each intersection makes decisions autonomously without any collaborative mechanism; (2) MADDPG represents a classical centralized training-distributed execution framework, incorporating basic multi-agent collaboration capabilities; (3) GMADDPG introduces a graph attention network to examine the impact of the graph attention mechanism; (4) TG-MADDPG is the novel algorithm proposed in this study.
To comprehensively evaluate the performance of these traffic signal control algorithms, three key indicators were selected: average cumulative reward, average queue length, and average waiting time.
Figure 8 and Figure 9 illustrate the convergence of average cumulative rewards for each algorithm during off-peak and peak periods, respectively, to assess learning stability. The results indicate that TG-MADDPG achieves the fastest convergence rate, stabilizing after approximately 60 episodes, with its steady-state reward value significantly surpassing those of the other algorithms. This finding suggests that incorporating traffic flow prediction information effectively mitigates exploration uncertainty, thereby enhancing both learning efficiency and convergence stability.
Table 2 summarizes the performance of each model across core metrics. Notably, TG-MADDPG outperforms all baseline models in every metric assessed, with its advantages being particularly pronounced during peak periods. This underscores its superior adaptability to complex traffic scenarios. Specifically, compared to GMADDPG, TG-MADDPG reduces average waiting time by 7.36% and 10.40% during off-peak and peak periods, respectively; it also decreases average queue length by 10.86% and 8.06%, respectively. These results suggest that the short-term traffic flow prediction module enhances the algorithm’s forward-looking capabilities, enabling it to proactively respond to dynamic changes in traffic flow.
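The reported reductions follow directly from the Table 2 values; as a quick arithmetic check:

```python
def pct_reduction(baseline, value):
    """Relative reduction of `value` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline

# GMADDPG vs. TG-MADDPG, values from Table 2:
print(pct_reduction(88.48, 81.97))   # off-peak waiting time, ~7.36%
print(pct_reduction(111.52, 99.92))  # peak waiting time, ~10.40%
print(pct_reduction(95.31, 84.96))   # off-peak queue length, ~10.86%
```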
Compared with MADDPG, the performance improvement of GMADDPG verifies the effectiveness of the graph attention network in capturing spatial dependencies across the road network. IQL performed the worst, indicating that in highly coupled regional traffic scenarios, independent decision-making without collaboration makes it difficult to achieve global optimization.
Figure 10 and Figure 11 illustrate the convergence of average cumulative rewards for each algorithm on the real-world Weiyang District road network during off-peak and peak periods, respectively. Table 3 presents the performance comparison of each algorithm within the Weiyang District. TG-MADDPG consistently outperforms all other models across all metrics, confirming its strong generalization ability. Although the performance of all models slightly decreases in real-world environments due to random disturbances and road network heterogeneity, TG-MADDPG still significantly outperforms the baseline methods. Specifically, during peak periods, TG-MADDPG reduces the average waiting time by 16.55%, 13.79%, and 6.22% compared to IQL, MADDPG, and GMADDPG, respectively. Average queue length is reduced by 17.52%, 14.82%, and 8.12%, respectively.
Further analysis of the training curves in Figure 12 and Figure 13 reveals that TG-MADDPG rapidly converges to lower levels in both waiting time and queue length. Additionally, during peak periods, the increase in queue length during sudden traffic flow surges is significantly smaller compared to the other algorithms. This advantage is attributed to the early perception of future traffic flow afforded by the predictive information, enabling dynamic adjustments to the signal control system, such as extending the green light duration for critical phases, thus alleviating vehicle congestion. Simultaneously, the collaborative mechanism of the graph attention network ensures the overall balance of the regional road network, thereby improving traffic efficiency comprehensively.
The exceptional performance of TG-MADDPG is attributed to its robust adaptability to real-world traffic characteristics. The graph attention network dynamically allocates control priorities among intersections, effectively mitigating local congestion that could result from static strategies. Additionally, the traffic flow prediction module identifies spatiotemporal patterns, such as tidal flows, facilitating proactive adjustments to signal timing. Although its convergence speed is slightly slower in real-world road networks, TG-MADDPG’s final performance remains markedly superior to other algorithms, reflecting its robustness.
Experimental Discussion
  • Comparison between GMADDPG and TG-MADDPG: The integration of predictive state information leads to a 5–12% improvement in average cumulative rewards, waiting time, and queue length, with the effect being more pronounced during peak periods. This validates the rationale behind reconstructing the Q-value target. By introducing predictive information, the agent transcends the limitations of real-time observations, acquiring the capacity for “forward-looking decision-making.” This enables the agent to better handle traffic flow peaks and prevent congestion propagation.
  • Comparison between MADDPG and GMADDPG: The introduction of the graph attention network results in a 3–8% improvement in performance, indicating that modeling spatial dependencies between intersections optimizes cooperative effects. TG-MADDPG further demonstrates that dynamic weight allocation, compared to static allocation, is more adaptable to the spatiotemporal dynamics of traffic flow, effectively addressing the issue of decision homogenization in regional traffic control.

5.3.2. Traffic Flow Prediction Experiment

To assess the effectiveness of the proposed WT-GWO-CNN-LSTM model for short-term traffic flow prediction, this study compares it against representative benchmark models: CNN-GRU and VMD-CNN-LSTM [30]. The evaluation metrics used for performance analysis include Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and the coefficient of determination (R2).
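These three metrics have standard definitions, sketched below with NumPy (note that the paper reports R² on a percentage scale, e.g., 94.50):

```python
import numpy as np

def rmse(y, yhat):
    """Root Mean Square Error."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

def mae(y, yhat):
    """Mean Absolute Error."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return float(np.mean(np.abs(y - yhat)))

def r2(y, yhat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)
```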
As shown in Figure 14, the WT-GWO-CNN-LSTM model applies wavelet transform to decompose the original traffic flow data into an approximation component and several detail components (D1, D2, D3, D4). These components effectively capture the traffic flow characteristics at various frequency scales, thus providing the model with refined multi-scale inputs. Figure 15 illustrates a comparison of the predicted traffic flow curves from all models against the actual traffic data. It is evident that the WT-GWO-CNN-LSTM model demonstrates superior dynamic response, especially during low-traffic periods and peak-valley intervals with abrupt fluctuations in traffic volume. As shown in Figure 16, a closer inspection of the critical peak period between time steps 216 and 236 further shows that the model's predicted trajectory aligns closely with the actual curve, highlighting its exceptional ability to capture complex traffic patterns.
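As an illustration of the multi-scale split in Figure 14 (one approximation band plus details D1–D4), the sketch below implements a four-level wavelet decomposition in NumPy. The mother wavelet is not stated in this section, so the Haar basis is an assumption chosen for simplicity; being orthonormal, it preserves signal energy across the bands.

```python
import numpy as np

def haar_dwt(x):
    """One level of the Haar DWT: pairwise averages (approximation)
    and pairwise differences (detail), each scaled by 1/sqrt(2)."""
    x = np.asarray(x, float)
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return a, d

def wavedec(x, levels=4):
    """Decompose a series (length divisible by 2**levels) into one
    approximation band and `levels` detail bands, D1 = finest scale."""
    details = []
    a = np.asarray(x, float)
    for _ in range(levels):
        a, d = haar_dwt(a)
        details.append(d)
    return a, details  # approximation A4 and [D1, D2, D3, D4]
```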
The quantitative results shown in Table 4 clearly indicate that the WT-GWO-CNN-LSTM model outperforms all other models in terms of the evaluation metrics. Specifically, it achieves RMSE and MAE values of 1.19 and 0.84, respectively—the lowest among all models—signifying minimal prediction error and optimal fitting performance. Compared to the baseline CNN-GRU model, the proposed model reduces RMSE and MAE by 35% and 37.3%, respectively. Moreover, even when compared to the VMD-CNN-LSTM model, which also incorporates signal decomposition, the proposed approach improves RMSE by 22.8% and MAE by 27.5%.
To quantify the contribution of each component, as shown in Figure 17, we performed an ablation study with four variants: CNN–LSTM, WT–CNN–LSTM, GWO–CNN–LSTM, and the complete WT–GWO–CNN–LSTM.
As summarized in Table 5, the proposed full model delivers the best overall performance (RMSE = 1.19, MAE = 0.84, R2 = 94.50). Relative to the baseline CNN–LSTM, it reduces RMSE and MAE by 41.4% and 47.2%, respectively, underscoring the efficacy of the integrated design. Introducing WT alone yields most of the improvement (RMSE: 2.03 → 1.36; MAE: 1.59 → 0.99), indicating that multi-scale wavelet decomposition is the primary contributor by mitigating non-stationarity and enhancing the informativeness of the learned features. By contrast, GWO in isolation provides only modest gains, suggesting that global hyperparameter search is more impactful when the input representation has already been strengthened. Finally, combining WT with GWO further improves upon WT–CNN–LSTM, implying that GWO offers additional robustness and fine-grained calibration on top of the WT-enhanced representation.
Experimental Discussion
The superiority of the proposed WT–GWO–CNN–LSTM is reflected not only in its overall accuracy but also in the ablation evidence, which isolates the role of each component. In particular, WT-driven multi-scale decomposition contributes the largest share of the improvement: adding WT alone markedly reduces the prediction error, implying that it alleviates non-stationarity, attenuates high-frequency noise, and retains predictive temporal structure. By comparison, GWO brings a smaller yet reliable gain. When deployed on top of the WT-enhanced inputs, it further boosts performance, indicating that global hyperparameter search primarily strengthens robustness and parameter calibration rather than serving as the main source of accuracy. Hence, the best results arise from the synergy between representation stabilization (WT) and parameter refinement (GWO).
Methodologically, this also sheds light on why other decomposition-based baselines (e.g., VMD–CNN–LSTM) can display less consistent performance: their effectiveness often depends on the choice of decomposition settings, whereas the proposed pipeline delivers more stable and generalizable improvements through its complementary design.

5.3.3. Prediction Errors on Training Stability

Although the predicted state S p may contain forecasting errors, the training curves in Section 5.3.1 remain stable and converge smoothly, suggesting that the prediction-guided target does not introduce harmful oscillations in practice. This can be explained by the fact that the prediction-based component is explicitly bounded in influence by the convex mixing weight β and the discount factor γ , such that any deviation induced by prediction inaccuracies is scaled by β γ rather than dominating the update. Moreover, target networks with soft updates further mitigate target-value fluctuations during learning. In addition, the prediction module achieves high accuracy (Table 4), which further limits the magnitude of the induced target bias. Together, these results indicate that the augmented target is well-behaved and the overall learning process remains stable under practical prediction uncertainty.
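Since Equation (23) is not reproduced in this excerpt, the bounded-influence argument can be made concrete under an assumed convex-mixing target (an illustrative reconstruction consistent with the description above, not the paper's exact equation):

```latex
y_i = r_t + \gamma\left[(1-\beta)\,Q^{\mathrm{target}}\!\left(S_{t+1}, A_{t+1}\right)
      + \beta\,Q^{\mathrm{target}}\!\left(S_p, A_p\right)\right]
```

Under this form, if the target critic is $L_Q$-Lipschitz in the state, a prediction error $\lVert S_p - S_p^{\ast}\rVert$ perturbs the target by at most $\beta\gamma\,L_Q\,\lVert S_p - S_p^{\ast}\rVert$, which matches the $\beta\gamma$ scaling argument above.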

6. Conclusions

This paper develops TG-MADDPG for regional traffic signal coordination by coupling short-term traffic flow forecasting with graph-attention-based inter-intersection communication. The WT-GWO-CNN-LSTM module provides near-future demand cues so that control decisions are informed not only by the current traffic state but also by expected short-term trends. Meanwhile, the GAT mechanism assigns adaptive attention weights to neighboring intersections according to network topology, helping agents capture critical spatial interactions during cooperative decision-making. Extensive experiments across both synthetic and real-world networks (including peak and off-peak periods) show that TG-MADDPG consistently outperforms representative baselines such as IQL, MADDPG, and GMADDPG in terms of cumulative reward, average waiting time, and queue length, while maintaining stable performance under disturbances and heterogeneous conditions.
Despite these gains, the current implementation assumes high-fidelity, lane-level observations, which may be costly to obtain at scale in practical deployments. Future work will investigate robust multi-agent learning under partial observability by adopting more realistic sensing assumptions and reduced state information. Finally, we plan to extend the framework toward multi-objective and priority-aware control, enabling explicit trade-offs between efficiency and sustainability while supporting public-transport priority [45,46].

Author Contributions

Conceptualization by C.S.; Methodology by C.S. and Y.Y.; Software and Data curation by Y.Y. and J.L.; Formal analysis by Y.Y. and W.F.; Writing—original draft by Y.Y.; Writing—review and editing by C.S.; Supervision by C.S. and P.Z.; All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Humanities and Social Sciences Foundation of the Ministry of Education of China (22YJCZH153).

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Haddad, T.A.; Hedjazi, D.; Aouag, S. A Deep Reinforcement Learning-Based Cooperative Approach for Multi-Intersection Traffic Signal Control. Eng. Appl. Artif. Intell. 2022, 114, 105019. [Google Scholar] [CrossRef]
  2. Hussain, Z.; Kaleem Khan, M.; Xia, Z. Investigating the Role of Green Transport, Environmental Taxes and Expenditures in Mitigating the Transport CO2 Emissions. Transp. Lett. 2023, 15, 439–449. [Google Scholar] [CrossRef]
  3. Ye, B.-L.; Wu, W.; Ruan, K.; Li, L.; Chen, T.; Gao, H.; Chen, Y. A Survey of Model Predictive Control Methods for Traffic Signal Control. IEEE/CAA J. Autom. Sin. 2019, 6, 623–640. [Google Scholar] [CrossRef]
  4. Nagatani, T. Effect of Bypasses on Vehicular Traffic through a Series of Signals. Phys. A Stat. Mech. Its Appl. 2018, 506, 229–236. [Google Scholar] [CrossRef]
  5. Chiou, S.-W. A Two-Stage Model for Period-Dependent Traffic Signal Control in a Road Networked System with Stochastic Travel Demand. Inf. Sci. 2019, 476, 256–273. [Google Scholar] [CrossRef]
  6. Kong, X.; Shen, G.; Xia, F.; Lin, C. Urban Arterial Traffic Two-Direction Green Wave Intelligent Coordination Control Technique and Its Application. Int. J. Control Autom. Syst. 2011, 9, 60–68. [Google Scholar] [CrossRef]
  7. Liu, K. Design and Application of Real-Time Traffic Simulation Platform Based on UTC/SCOOT and VISSIM. J. Simul. 2023, 18, 539–556. [Google Scholar] [CrossRef]
  8. Chen, R.; Fang, F.; Sadeh, N. The Real Deal: A Review of Challenges and Opportunities in Moving Reinforcement Learning-Based Traffic Signal Control Systems Towards Reality. arXiv 2022, arXiv:2206.11996. [Google Scholar] [CrossRef]
  9. Zhao, H.; Dong, C.; Cao, J.; Chen, Q. A Survey on Deep Reinforcement Learning Approaches for Traffic Signal Control. Eng. Appl. Artif. Intell. 2024, 133, 108100. [Google Scholar] [CrossRef]
  10. Solaiappan, S.; Kumar, B.R.; Anbazhagan, N.; Song, Y.; Joshi, G.P.; Cho, W. Vehicular Traffic Flow Analysis and Minimize the Vehicle Queue Waiting Time Using Signal Distribution Control Algorithm. Sensors 2023, 23, 6819. [Google Scholar] [CrossRef]
  11. Zhao, R.; Hu, H.; Li, Y.; Fan, Y.; Gao, F.; Gao, Z. Sequence Decision Transformer for Adaptive Traffic Signal Control. Sensors 2024, 24, 6202. [Google Scholar] [CrossRef]
  12. Joo, H.; Ahmed, S.H.; Lim, Y. Traffic Signal Control for Smart Cities Using Reinforcement Learning. Comput. Commun. 2020, 154, 324–330. [Google Scholar] [CrossRef]
  13. Huang, Z. Reinforcement Learning Based Adaptive Control Method for Traffic Lights in Intelligent Transportation. Alex. Eng. J. 2024, 106, 381–391. [Google Scholar] [CrossRef]
  14. Zheng, Y.; Luo, J.; Gao, H.; Zhou, Y.; Li, K. Pri-DDQN: Learning Adaptive Traffic Signal Control Strategy through a Hybrid Agent. Complex Intell. Syst. 2025, 11, 47. [Google Scholar] [CrossRef]
  15. Li, Z.; Yu, H.; Zhang, G.; Dong, S.; Xu, C.-Z. Network-Wide Traffic Signal Control Optimization Using a Multi-Agent Deep Reinforcement Learning. Transp. Res. Part C Emerg. Technol. 2021, 125, 103059. [Google Scholar] [CrossRef]
  16. Wang, L.; Zhang, G.; Yang, Q.; Han, T. An Adaptive Traffic Signal Control Scheme with Proximal Policy Optimization Based on Deep Reinforcement Learning for a Single Intersection. Eng. Appl. Artif. Intell. 2025, 149, 110440. [Google Scholar] [CrossRef]
  17. Chu, T.; Wang, J.; Codecà, L.; Li, Z. Multi-Agent Deep Reinforcement Learning for Large-Scale Traffic Signal Control. IEEE Trans. Intell. Transp. Syst. 2019, 21, 1086–1095. [Google Scholar] [CrossRef]
  18. Mannion, P.; Duggan, J.; Howley, E. An Experimental Review of Reinforcement Learning Algorithms for Adaptive Traffic Signal Control. In Autonomic Road Transport Support Systems; McCluskey, T.L., Kotsialos, A., Müller, J.P., Klügl, F., Rana, O., Schumann, R., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 47–66. [Google Scholar]
  19. Qu, Z.; Pan, Z.; Chen, Y.; Wang, X.; Li, H. A Distributed Control Method for Urban Networks Using Multi-Agent Reinforcement Learning Based on Regional Mixed Strategy Nash-Equilibrium. IEEE Access 2020, 8, 19750–19766. [Google Scholar] [CrossRef]
  20. Sun, Y.; Lin, K.; Bashir, A.K. KeyLight: Intelligent Traffic Signal Control Method Based on Improved Graph Neural Network. IEEE Trans. Consum. Electron. 2024, 70, 2861–2871. [Google Scholar] [CrossRef]
  21. Wang, T.; Zhu, Z.; Zhang, J.; Tian, J.; Zhang, W. A Large-Scale Traffic Signal Control Algorithm Based on Multi-Layer Graph Deep Reinforcement Learning. Transp. Res. Part C Emerg. Technol. 2024, 162, 104582. [Google Scholar] [CrossRef]
  22. Yang, G.; Wen, X.; Chen, F. Multi-Agent Deep Reinforcement Learning with Graph Attention Network for Traffic Signal Control in Multiple-Intersection Urban Areas. Transp. Res. Rec. 2025, 2679, 880–898. [Google Scholar] [CrossRef]
  23. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; Bengio, Y. Graph Attention Networks. arXiv 2017, arXiv:1710.10903. [Google Scholar]
  24. Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
  25. Ma, T.; Peng, K. AGRCNet: Communicate by Attentional Graph Relations in Multi-Agent Reinforcement Learning for Traffic Signal Control. Neural Comput. Appl. 2023, 35, 21007–21022. [Google Scholar] [CrossRef]
  26. Yan, L.; Zhu, L.; Song, K.; Yuan, Z.; Yan, Y.; Tang, Y.; Peng, C. Graph Cooperation Deep Reinforcement Learning for Ecological Urban Traffic Signal Control. Appl. Intell. 2023, 53, 6248–6265. [Google Scholar] [CrossRef]
  27. Gu, H.; Wang, S.; Jia, D.; Zhang, Y.; Luo, Y.; Mao, G.; Wang, J.; Gee Lim, E. Communication Strategy on Macro-and-Micro Traffic State in Cooperative Deep Reinforcement Learning for Regional Traffic Signal Control. IEEE Trans. Intell. Transport. Syst. 2025, 26, 12183–12196. [Google Scholar] [CrossRef]
  28. Li, G.; Deng, H.; Yang, H. Traffic Flow Prediction Model Based on Improved Variational Mode Decomposition and Error Correction. Alex. Eng. J. 2023, 76, 361–389. [Google Scholar] [CrossRef]
  29. Zhao, K.; Guo, D.; Sun, M.; Zhao, C.; Shuai, H. Short-Term Traffic Flow Prediction Based on VMD and IDBO-LSTM. IEEE Access 2023, 11, 97072–97088. [Google Scholar] [CrossRef]
  30. Ren, C.; Fu, F.; Yin, C.; Lu, L.; Cheng, L. A Combined Model for Short-Term Traffic Flow Prediction Based on Variational Modal Decomposition and Deep Learning. Sci. Rep. 2025, 15, 17142. [Google Scholar] [CrossRef]
  31. Li, J.; Zhang, Z.; Meng, F.; Zhu, W. Short-Term Traffic Flow Prediction via Improved Mode Decomposition and Self-Attention Mechanism Based Deep Learning Approach. IEEE Sens. J. 2022, 22, 14356–14365. [Google Scholar] [CrossRef]
  32. An, J.; Guo, L.; Liu, W.; Fu, Z.; Ren, P.; Liu, X.; Li, T. IGAGCN: Information Geometry and Attention-Based Spatiotemporal Graph Convolutional Networks for Traffic Flow Prediction. Neural Netw. 2021, 143, 355–367. [Google Scholar] [CrossRef] [PubMed]
  33. Zhang, J.; Sha, J.; Zhang, C.; Zhang, Y. A CNN-LSTM-GRU Hybrid Model for Spatiotemporal Highway Traffic Flow Prediction. Systems 2025, 13, 765. [Google Scholar] [CrossRef]
  34. Zhao, W.; Gao, Y.; Ji, T.; Wan, X.; Ye, F.; Bai, G. Deep Temporal Convolutional Networks for Short-Term Traffic Flow Forecasting. IEEE Access 2019, 7, 114496–114507. [Google Scholar] [CrossRef]
  35. Han, G.; Zheng, Q.; Liao, L.; Tang, P.; Li, Z.; Zhu, Y. Deep Reinforcement Learning for Intersection Signal Control Considering Pedestrian Behavior. Electronics 2022, 11, 3519. [Google Scholar] [CrossRef]
  36. Mao, F.; Li, Z.; Li, L. A Comparison of Deep Reinforcement Learning Models for Isolated Traffic Signal Control. IEEE Intell. Transport. Syst. Mag. 2023, 15, 160–180. [Google Scholar] [CrossRef]
  37. Oroojlooy, A.; Nazari, M.; Hajinezhad, D.; Silva, J. AttendLight: Universal Attention-Based Reinforcement Learning Model for Traffic Signal Control. Adv. Neural Inf. Process. Syst. 2020, 33, 4079–4090. [Google Scholar]
  38. Kamal, H.; Yánez, W.; Hassan, S.; Sobhy, D. Digital-Twin-Based Deep Reinforcement Learning Approach for Adaptive Traffic Signal Control. IEEE Internet Things J. 2024, 11, 21946–21953. [Google Scholar] [CrossRef]
  39. Xi, Q.; Chen, Q.M.; Ahmad, W.; Pan, J.; Zhao, S.; Xia, Y.; Ouyang, Q.; Chen, Q.S. Quantitative analysis and visualization of chemical compositions during shrimp flesh deterioration using hyperspectral imaging: A comparative study of machine learning and deep learning models. Food Chemistry 2025, 481, 143997. [Google Scholar] [CrossRef]
  40. Kohli, M.; Arora, S. Chaotic Grey Wolf Optimization Algorithm for Constrained Optimization Problems. J. Comput. Des. Eng. 2018, 5, 458–472. [Google Scholar] [CrossRef]
  41. Zhao, X.; Wang, L.; Zhang, Y.; Han, X.; Deveci, M.; Parmar, M. A Review of Convolutional Neural Networks in Computer Vision. Artif. Intell. Rev. 2024, 57, 99. [Google Scholar] [CrossRef]
  42. Xia, Y.; Xiao, X.; Adade, S.Y.-S.S.; Xi, Q.; Wu, J.; Xu, Y.; Chen, Q.M.; Chen, Q.S. Physicochemical properties and gel quality monitoring of surimi during thermal processing using hyperspectral imaging combined with deep learning. Food Control 2025, 175, 111258. [Google Scholar] [CrossRef]
  43. Lopez, P.A.; Wießner, E.; Behrisch, M.; Bieker-Walz, L.; Erdmann, J.; Flötteröd, Y.-P.; Hilbrich, R.; Lücken, L.; Rummel, J.; Wagner, P. Microscopic Traffic Simulation Using SUMO. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; IEEE: New York, NY, USA, 2018; pp. 2575–2582. [Google Scholar]
  44. Foerster, J.; Nardelli, N.; Farquhar, G.; Afouras, T.; Torr, P.H.S.; Kohli, P.; Whiteson, S. Stabilising Experience Replay for Deep Multi-Agent Reinforcement Learning. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), Sydney, Australia, 6–11 August 2017; pp. 1146–1155. [Google Scholar]
  45. Bilotta, S.; Fereidooni, Z.; Ipsaro Palesi, L.A.; Nesi, P. Macroscopic GA-Based Multi-Objective Traffic Light Optimization Prioritizing Tramways. Appl. Soft Comput. 2025, 178, 113269. [Google Scholar] [CrossRef]
  46. Kwesiga, D.K.; Vishnoi, S.C.; Guin, A.; Hunter, M. Integrating Transit Signal Priority into Multi-Agent Reinforcement Learning Based Traffic Signal Control. arXiv 2024, arXiv:2411.19359. [Google Scholar] [CrossRef]
Figure 1. Intersection approach lane discrete model.
Figure 2. The relationship diagram of traffic flow and phase in each direction of intersection.
Figure 3. Overall Framework for Traffic Flow Prediction Model.
Figure 4. Graph Attention Network Model.
Figure 5. TG-MADDPG Algorithm Framework.
Figure 6. SUMO+Python co-simulation framework.
Figure 7. Road network used in experiments: (a) real-world road network topology in Weiyang District, Xi’an, China; (b) corresponding SUMO simulation network constructed from OpenStreetMap data.
Figure 8. Reward Values of the Simulated Road Network During Off-Peak Periods.
Figure 9. Reward Values of the Simulated Road Network During Peak Periods.
Figure 10. Reward Values of the Weiyang Road Network During Off-Peak Periods.
Figure 11. Reward Values of the Weiyang Road Network During Peak Periods.
Figure 12. Average Waiting Times in Weiyang District Across Different Periods.
Figure 13. Average queue length in Weiyang District Across Different Periods.
Figure 14. Wavelet Decomposition Time-Domain Plot.
Figure 15. Comparison Plots of Traffic Flow Prediction Among Different Models.
Figure 16. Traffic Flow Prediction Performance of Different Models (216–236).
Figure 17. Ablation Study Model Comparison Plot.
Table 1. Experimental parameter settings.

| TG-MADDPG Parameter | Value | Simulation Environment Parameter | Value |
|---|---|---|---|
| Batch size | 64 | Intersections | 9 |
| Actor learning rate | 1 × 10⁻⁴ | G_min | 10 s |
| Critic learning rate | 1 × 10⁻³ | G_max | 50 s |
| Discount factor | 0.99 | Yellow light | 3 s |
| Episode | 100 | Acceleration | 2 m/s² |
| Replay buffer | 10,000 | Probability of turning left | 15% |
| Simulation duration | 3600 s | Probability of going straight | 60% |
| Optimizer | Adam | Probability of turning right | 25% |
| Number of GAT attention heads | 4 | Maximum speed | 13.89 m/s |
| GAT hidden layer dimension | 64 | Minimum vehicle distance | 2.5 m |
Table 2. Performance Metrics of Different Algorithms Based on Simulated Road Networks.

| Model | Avg. Cumulative Reward (Off-Peak) | Avg. Cumulative Reward (Peak) | Avg. Waiting Time (Off-Peak) | Avg. Waiting Time (Peak) | Avg. Queue Length (Off-Peak) | Avg. Queue Length (Peak) |
|---|---|---|---|---|---|---|
| IQL | −31.31 | −39.12 | 94.57 | 129.63 | 106.12 | 131.62 |
| MADDPG | −29.28 | −36.52 | 92.64 | 120.97 | 103.98 | 122.83 |
| GMADDPG | −27.73 | −33.34 | 88.48 | 111.52 | 95.31 | 118.64 |
| TG-MADDPG | −24.59 | −29.29 | 81.97 | 99.92 | 84.96 | 109.09 |
Table 3. Performance Metrics of Different Algorithms Based on Weiyang District.

| Model | Avg. Cumulative Reward (Off-Peak) | Avg. Cumulative Reward (Peak) | Avg. Waiting Time (Off-Peak) | Avg. Waiting Time (Peak) | Avg. Queue Length (Off-Peak) | Avg. Queue Length (Peak) |
|---|---|---|---|---|---|---|
| IQL | −37.07 | −40.46 | 88.12 | 135.51 | 100.37 | 151.32 |
| MADDPG | −36.46 | −39.34 | 86.31 | 131.20 | 98.69 | 146.53 |
| GMADDPG | −32.77 | −36.42 | 80.26 | 120.58 | 88.47 | 135.84 |
| TG-MADDPG | −30.58 | −33.46 | 73.39 | 113.08 | 82.78 | 124.81 |
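The relative gains implied by Tables 2 and 3 can be checked directly. The snippet below computes the percentage reduction in peak-period average waiting time of TG-MADDPG over each baseline, using the values copied from the two tables; the dictionary layout is illustrative.

```python
# Peak-period average waiting times from Tables 2 and 3.
peak_wait = {
    "simulated": {"IQL": 129.63, "MADDPG": 120.97, "GMADDPG": 111.52, "TG-MADDPG": 99.92},
    "weiyang":   {"IQL": 135.51, "MADDPG": 131.20, "GMADDPG": 120.58, "TG-MADDPG": 113.08},
}

def improvement(baseline: float, ours: float) -> float:
    """Percentage reduction relative to the baseline value."""
    return 100.0 * (baseline - ours) / baseline

for net, waits in peak_wait.items():
    ours = waits["TG-MADDPG"]
    for model, w in waits.items():
        if model != "TG-MADDPG":
            print(f"{net}: {model} -> TG-MADDPG: {improvement(w, ours):.1f}% lower waiting time")
```

For example, TG-MADDPG reduces peak waiting time by about 22.9% versus IQL on the simulated network and about 16.6% versus IQL in Weiyang District.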
Table 4. Comparison of Prediction Performance Indicators for Different Models.

| Model | RMSE | MAE | R² (%) |
|---|---|---|---|
| WT-GWO-CNN-LSTM | 1.19 | 0.84 | 94.50 |
| VMD-CNN-LSTM | 1.54 | 1.15 | 90.78 |
| CNN-GRU | 1.83 | 1.34 | 88.06 |
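The three indicators in Tables 4 and 5 follow their standard definitions. A minimal stdlib implementation is sketched below on toy data (not the paper's series), so the reported RMSE, MAE, and R² values can be interpreted unambiguously.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination (fraction; multiply by 100 for %)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy example:
y_true = [10.0, 12.0, 14.0, 16.0]
y_pred = [11.0, 12.0, 13.0, 17.0]
print(rmse(y_true, y_pred), mae(y_true, y_pred), r2(y_true, y_pred))
```

Lower RMSE and MAE and higher R² indicate a better fit, which is the ordering used to rank the models in both tables.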
Table 5. Comparison of Performance Metrics Across Different Models in the Ablation Study.

| Model | RMSE | MAE | R² (%) |
|---|---|---|---|
| WT-CNN-LSTM | 1.36 | 0.99 | 92.41 |
| GWO-CNN-LSTM | 1.92 | 1.46 | 87.59 |
| CNN-LSTM | 2.03 | 1.59 | 85.98 |
| WT-GWO-CNN-LSTM | 1.19 | 0.84 | 94.50 |

Share and Cite

MDPI and ACS Style

Sun, C.; Yang, Y.; Li, J.; Fang, W.; Zhang, P. A Multi-Agent Regional Traffic Signal Control System Integrating Traffic Flow Prediction and Graph Attention Networks. Systems 2026, 14, 47. https://doi.org/10.3390/systems14010047

