Article

HAPS-PPO: A Multi-Agent Reinforcement Learning Architecture for Coordinated Regional Control of Traffic Signals in Heterogeneous Road Networks

1 School of Traffic and Transportation Engineering, Xinjiang University, Urumqi 830017, China
2 Xinjiang Key Laboratory of Green Construction and Smart Traffic Control of Transportation Infrastructure, Xinjiang University, Urumqi 830017, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(20), 10945; https://doi.org/10.3390/app152010945
Submission received: 7 September 2025 / Revised: 27 September 2025 / Accepted: 30 September 2025 / Published: 12 October 2025
(This article belongs to the Special Issue Advances in Intelligent Transportation and Its Applications)

Abstract

The increasing complexity of urban traffic networks has highlighted the potential of Multi-Agent Reinforcement Learning (MARL) for Traffic Signal Control (TSC). However, most existing MARL methods assume homogeneous observation and action spaces among agents, ignoring the inherent heterogeneity of real-world intersections in topology and signal phasing, which limits their practical applicability. To address this gap, we propose HAPS-PPO (Heterogeneity-Aware Policy Sharing Proximal Policy Optimization), a novel MARL framework for coordinated signal control in heterogeneous road networks. HAPS-PPO integrates two key mechanisms: an Observation Padding Wrapper (OPW) that standardizes varying observation dimensions, and a Dynamic Multi-Strategy Grouping Learning (DMSGL) mechanism that trains dedicated policy heads for agent groups with distinct action spaces, enabling adequate knowledge sharing while maintaining structural correctness. Comprehensive experiments in a high-fidelity simulation environment based on a real-world road network demonstrate that HAPS-PPO significantly outperforms Fixed-time control and mainstream MARL baselines (e.g., MADQN, FMA2C), reducing average delay time by up to 44.74% and average waiting time by 59.60%. This work provides a scalable and plug-and-play solution for deploying MARL in realistic, heterogeneous traffic networks.

1. Introduction

With the acceleration of global urbanization and the surge in vehicle ownership, traffic congestion has become a primary bottleneck restricting sustainable urban development. In the United States, traffic congestion resulted in economic losses of $179 billion in 2021 [1]. Such congestion imposes immense economic and time costs and causes significant environmental pollution. Among the numerous mitigation strategies, TSC aims to alleviate urban congestion, enhance traffic efficiency, and reduce carbon emissions by optimizing traffic signal timings [2]. However, the escalating complexity of urban traffic conditions has rendered traditional Fixed-time [3] and actuated control [4] methods inadequate for dynamic traffic demands. Fixed-time control cannot adapt to real-time fluctuations in traffic flow. In contrast, despite its local adaptability, actuated control suffers from a myopic decision-making mechanism that hinders network-level coordination, resulting in performance degradation, especially under saturated traffic conditions [5]. Adaptive control systems, such as SCATS [6] and SCOOT [7], can pursue network-level optimization by incorporating traditional optimization algorithms or simple machine learning models [8]. Nevertheless, their centralized architecture often encounters computational bottlenecks and communication latency when managing large-scale urban networks and requires expensive infrastructure [9]. A multi-dimensional comparison of these three signal control methods is presented in Table 1. These inherent complexities and costs have paved the way for data-driven, model-free reinforcement learning methods.
In recent years, the rapid advancement of artificial intelligence has introduced new paradigms for solving complex TSC problems, with data-driven methods, such as Reinforcement Learning (RL), being particularly prominent [10]. RL enables an agent to learn an optimal control policy through direct interaction and “trial-and-error” with the traffic environment. This process does not rely on precise traffic models and thus shows immense potential [11].
Urban traffic networks are inherently distributed, requiring coordinated decision-making among signalized intersections to optimize network-wide efficiency. MARL models each intersection or signal controller as an agent, achieving collective optimization of the entire network through collaborative learning [12,13,14]. This approach effectively addresses the complexity of large-scale networks, overcoming the challenges centralized RL faces in handling high-dimensional joint action spaces [15,16]. However, most state-of-the-art MARL algorithms’ architectures and training paradigms depend on a critical yet often overlooked prerequisite: agent homogeneity. This assumption mandates that all agents possess identical observation and action space dimensions, which provides the foundation for efficient training mechanisms like parameter sharing.
Nevertheless, real-world urban road networks are highly heterogeneous [17], comprising various types and complexities of intersections, such as four-way, multi-way, and complex junctions with ramps and roundabouts [18]. The idealized premise of applying existing MARL methods to real traffic networks is untenable in the intricate reality of urban road systems [19]. This inherent network heterogeneity constitutes a significant barrier to transitioning MARL algorithms from theory to practical deployment, manifesting at two levels:
  • Observation Space Heterogeneity: The physical topology of urban intersections varies greatly, including standard four-way crossroads, three-way T-junctions, and irregular intersections with asymmetric lane counts. Different intersections, or agents, may have access to different traffic information. In MARL, this is often linked to the “partial observability” problem [20], where agents possess distinct observation spaces due to their physical location, sensor configuration, or role. For instance, some may receive detailed vehicle position and speed data, while others only obtain macroscopic information like queue length or vehicle counts [21]. This structural variance results in local state information (e.g., lane occupancy, queue length) that is naturally inconsistent in dimension, leading to variable-length observation vectors.
  • Action Space Heterogeneity: Corresponding to the observation space, the set of legal signal phases (actions) for different intersections also varies in size and composition due to their unique geometric and traffic regulations. Agents may execute various types or ranges of actions, resulting in heterogeneous action spaces. This discrepancy is fatal for parameter-sharing MARL algorithms that pursue high efficiency and scalability. A unified policy network designed for a complex intersection (e.g., four phases) will likely generate invalid actions beyond the legal range for a simpler intersection (e.g., three phases), severely hindering effective policy learning and leading to training collapse.
These problems pose several obstacles for existing MARL frameworks. First, a policy trained for a specific type of intersection lacks compatibility and transferability due to dimensional mismatch. Second, training a separate model for each intersection type would lead to a proliferation of models and inefficient training; more importantly, it would disregard the universal traffic flow knowledge shared across different intersection types, failing to achieve knowledge sharing. Finally, for rare topologies within the network (e.g., a five-way intersection), data sparsity would make it difficult for the model to be adequately trained and to converge.
To overcome these obstacles, this paper proposes a learning framework named Heterogeneity-Aware Policy Sharing Proximal Policy Optimization (HAPS-PPO). This framework ensures training robustness and policy effectiveness in complex heterogeneous environments through built-in heterogeneity-aware and adaptive mechanisms, making it feasible to apply advanced MARL algorithms to highly heterogeneous real-world urban traffic networks.
The main contributions of this paper can be summarized as follows:
  • Identified and formally defined the heterogeneity problem—the heterogeneity of observation and action spaces—that obstructs the application of MARL in real-world TSC, and clarified its fundamental constraints on existing mainstream algorithms, especially parameter-sharing ones.
  • Proposed the HAPS-PPO framework to address the heterogeneity challenge systematically. The framework normalizes heterogeneous observations via an OPW and trains dedicated policies for agent groups with different action spaces through a DMSGL mechanism, achieving compatibility with heterogeneous agents within a unified training process.
  • Conducted comprehensive empirical evaluations in a high-fidelity heterogeneous simulation environment based on a real-world road network. The results demonstrate that HAPS-PPO significantly outperforms Fixed-time control and various mainstream MARL baselines in improving traffic efficiency.
  • Provided a systematic, scalable, plug-and-play solution paradigm for seamlessly migrating parameter-sharing MARL algorithms from idealized homogeneous environments to complex, real-world heterogeneous networks.
The remainder of this paper is organized as follows: Section 2 reviews the state-of-the-art in traffic signal control and related MARL techniques. Section 3 elaborates on the theoretical modeling, core module design, and implementation details of our proposed HAPS-PPO method. Section 4 introduces the experimental setup and provides an in-depth analysis of the results. Section 5 concludes the paper and discusses future research directions.

2. Related Work

This section reviews two core areas directly related to this study: traditional traffic signal control methodologies and the technological evolution of applying reinforcement learning to this problem. We focus on analyzing the inherent limitations of existing research in addressing traffic network heterogeneity. The relevant terms and definitions used throughout this study are presented in Table 2.

2.1. Reinforcement Learning in Traffic Signal Control

The traditional TSC methods rely heavily on precise, predefined models, which prove inadequate when addressing traffic flow’s complex and dynamic spatiotemporal characteristics. In contrast, the Deep Reinforcement Learning (DRL) paradigm, by integrating the powerful nonlinear perception capabilities of Deep Learning with the strong learning abilities of RL for sequential decision-making problems, provides a data-driven, end-to-end solution to tackle the complex TSC problem without the need for predefined models [22].

2.1.1. Early Exploration: Limitations of Single-Agent and Independent-Learner Approaches

Early research first explored modeling a single intersection as a single-agent problem, later extending to multi-agent scenarios [23]. Researchers applied DRL-based algorithms like DQN [24] and PPO [25] within this framework. However, simplifying a network problem into isolated single-agent problems inherently ignores the dynamic coupling relationships between intersections. When this idea was directly extended to multi-intersection settings, where each intersection acts as an Independent Learner (IL), its core deficiency became apparent. For any given agent in the network, other agents’ policies constantly change during training, making the environment appear non-stationary from that agent’s perspective. This non-stationarity violates the stationarity assumption underlying the Markov Decision Process, making policies difficult to converge and significantly limiting model performance.

2.1.2. The Mainstream Paradigm: CTDE

To address the non-stationarity challenge, the Centralized Training with Decentralized Execution (CTDE) [26] architecture emerged and quickly became the mainstream paradigm in the MARL field. The core idea of CTDE is to allow the algorithm access to global information (such as all agents’ observations and actions) during the training phase, thereby providing a stable, global frame of reference for each agent’s learning. During the execution phase, each agent makes decisions based solely on local observations, ensuring the system’s scalability and real-time responsiveness [27].
Within the CTDE framework, two main categories of methods exist:
  • Value Function Decomposition Methods: Algorithms like VDN [28] and QMIX [29] achieve coordination by decomposing a centralized global value function into the sum of individual agents’ local value functions. However, their strict constraints on value decomposition (e.g., monotonicity) may limit their expressive power in complex TSC scenarios with high conflict and nonlinear coordination.
  • Actor-Critic Methods: Paradigms such as MADDPG [30] and MAPPO [31] maintain an independent policy network (Actor) for each agent while utilizing one or more centralized value networks (Critics) that have access to global information to guide the training of all Actors. The presence of the Critic allows each Actor to receive stable and information-rich gradient signals. Due to its excellent stability and performance, the multi-agent version of PPO, MAPPO, has become a robust baseline for cooperative MARL tasks.

2.1.3. Current Bottleneck: The Neglected Problem of Heterogeneity

Despite the tremendous success of the CTDE framework, the vast majority of MARL research applied to TSC has been conducted under an idealized but unrealistic assumption: the homogeneity of the environment and agents. When these advanced algorithms are directly confronted with the inherent heterogeneity of real-world urban road networks, their effectiveness is compromised. To date, exploratory work on this bottleneck is still nascent.
Regarding observation heterogeneity, existing MARL algorithms are deficient in fusing heterogeneous information from traffic networks, as they tend to use homogeneous graph neural networks [32]. Some studies have attempted to use Graph Neural Networks (GNNs) [33] to encode road network topology. While GNNs can theoretically handle heterogeneous observation inputs, they do not address the more intractable, structural problem of action space heterogeneity. A unified GNN policy head still cannot output a correctly dimensioned and logically valid action distribution for a three-phase and four-phase intersection. Bie et al. [34] designed heterogeneous correlation metrics and reward functions and used a spatio-temporal graph attention network to process complex traffic flow features. Yang et al. [35] used an inductive heterogeneous graph neural network for representation learning to handle unseen nodes and new traffic networks, encoding heterogeneous features and structural information combined with a multi-agent Actor-Critic framework for policy learning. Zhang et al. [36] introduced a General Feature Extraction module and an Intersection-Specific Information Extraction module to address the limited representation capability of existing MARL models in handling intersections with different traffic flows and topologies. The scalability of fully centralized MARL is limited for large-scale, heterogeneous traffic networks. Some researchers have adopted a regional control strategy, dividing the entire network into multiple regions and applying centralized RL within each area to mitigate scalability and non-stationarity issues [37]. Heterogeneous observation spaces can lead to information inconsistency among agents, making coordinated decision-making more difficult. To solve this, methods for learning belief states of the underlying system have been proposed to help agents train decentrally in partially observable environments [38]. Some researchers proposed using multi-agent transfer reinforcement learning combined with a multi-view encoder to help agents process observation information from different sources or angles, thereby better understanding complex traffic conditions [39].
Regarding action heterogeneity, in TSC tasks, actions refer to the agent’s adjustments to the traffic signals. Commonly, action settings include three methods: phase selection [40,41,42,43], phase switching [44], and phase duration adjustment [45,46]. Phase selection offers flexibility in phase combinations at the cost of relatively frequent signal changes [34]. Luo et al. [47] addressed the inherent defects of purely discrete or purely continuous action spaces by better balancing the trade-off between frequent switching and unsaturated release in a unified hybrid action space. Agents must make decisions based on local observations and capabilities in a heterogeneous environment while coordinating with other agents. This has driven the development of decentralized MARL frameworks to balance local decision-making and global optimization. Currently, there are no mature solutions. A makeshift solution is Action Masking, which forcibly sets the probabilities of illegal actions to zero at the output layer of the policy network. However, this is a superficial fix rather than a fundamental solution. This masking operation corrupts the policy gradient calculation. Since the probability of a masked action is zero, the gradient of its log-probability will be undefined or zero, preventing the policy network from receiving an effective learning signal from the “attempting an incorrect action” behavior. This severely suppresses exploration efficiency, especially in scenarios with complex action spaces. GNN-based methods focus more on feature representation at the input layer, while Action Masking is a post-processing step at the output layer. The “policy grouping” of HAPS-PPO is a structured solution at the policy level.
In summary, existing work either avoids the heterogeneity of traffic networks or proposes solutions (like GNNs or Action Masking) with fundamental flaws, failing to provide a unified framework that can natively and efficiently handle the heterogeneity of observations and actions. Therefore, designing such a framework has become a key scientific problem that must be solved to advance MARL technology from simulation to real-world applications. The HAPS-PPO framework proposed in this paper is intended to fill this critical research gap.

3. Methodology: An Adaptive Traffic Signal Control Method for Heterogeneous Urban Road Networks

To address the challenges posed by the diversity of intersections in real-world urban traffic networks, we propose a novel multi-agent reinforcement learning framework—HAPS-PPO. This framework is designed to overcome the limitations of traditional MARL algorithms in handling agents with different observation dimensions and action sets, thereby learning efficient and coordinated traffic signal control policies for the entire heterogeneous network.

3.1. Problem Formulation: TSC as a Dec-POMDP

We model the cooperative control of multiple traffic signals as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP). This model provides a rigorous mathematical outline for a group of autonomous agents making sequential decisions in a shared environment based on local information. A 9-tuple formally defines a Dec-POMDP:
$\langle I, S, \{A_i\}_{i \in I}, T, \{R_i\}_{i \in I}, \{\Omega_i\}_{i \in I}, O, \gamma, b_0 \rangle$
Each component is detailed below:
1. Set of Agents (I): A finite set of agents, I = {1, 2, …, n}. In the TSC scenario, each traffic signal controller is defined as an independent agent.
2. Set of States (S): A set of global states describing all aspects of the environment, even those not directly perceptible to the agents. A state contains precise information about all vehicles in the network (e.g., position, speed, waiting time), the current phase of all signals, and key traffic parameters for each approach lane, such as queue length, vehicle count, and average speed. These features have been widely shown to be directly related to intersection congestion and provide a sufficient basis for agent decision-making. This global state is partially or fully invisible to any single agent.
3. Set of Action Spaces ({A_i}_{i∈I}): A local action set A_i is defined for each agent i. An action a_i ∈ A_i represents a control decision, such as “switch to the next predefined phase” or “maintain the current phase.” Due to differences in intersection geometry and phase design (e.g., a T-junction vs. a four-way intersection), the cardinality |A_i| of each agent’s action space is heterogeneous. The joint action space is A = ×_{i∈I} A_i.
4. State Transition Function (T): A probabilistic function that defines the system’s dynamics. It gives the probability P(s′ | s, a) of the environment transitioning to the next state s′ given the current state s and the joint action a ∈ A. In our work, this function is implicitly modeled by the high-fidelity traffic simulator SUMO, either deterministically or stochastically, depending on the simulation configuration.
5. Set of Reward Functions ({R_i}_{i∈I}): In the RL framework, the design of the reward function is crucial, as it directly defines the agent’s optimization objective and guides its learning of the optimal control policy. We define a reward function R_i : S × A → ℝ for each agent i. It returns a scalar reward r_i = R_i(s, a) to evaluate the immediate effect of executing the joint action a in state s. To achieve effective traffic signal control, this paper’s core objective is to minimize vehicles’ unproductive waiting time at intersections, thereby improving traffic efficiency and alleviating congestion.
Based on this objective, this study adopts the widely used Cumulative Waiting Time Difference as the reward function. This function provides immediate and effective feedback for the agent’s actions by quantifying the change in intersection congestion before and after a decision.
For any traffic signal agent i at a decision time step t, its reward R_{i,t} is defined as the difference in the cumulative waiting time of all vehicles on its approach lanes between the previous and current decision cycles. Its mathematical formulation is as follows:
$R_{i,t} = W_{i,t-1} - W_{i,t}$
where
$W_{i,t}$ represents the total cumulative waiting time of vehicles on all approach lanes of the intersection controlled by agent i during the current decision cycle (from time step t−1 to t). Specifically, $W_{i,t} = \sum_{v \in V_i} w_{v,t}$, where $V_i$ is the set of all vehicles approaching or at the intersection during the current cycle, and $w_{v,t}$ is the waiting time of vehicle v in that cycle.
$W_{i,t-1}$ represents the total cumulative waiting time of all vehicles at the same intersection during the previous decision cycle.
The intrinsic logic of this reward mechanism is as follows:
Positive Reward ($R_{i,t} > 0$): When an agent’s action (e.g., switching to a specific phase) results in a lower total waiting time in the current cycle than in the previous one, the agent receives a positive reward. This indicates that the decision effectively alleviates congestion, and the model will be inclined to repeat such actions in similar future states.
Negative Reward ($R_{i,t} < 0$): When a decision leads to an increase in total waiting time ($W_{i,t} > W_{i,t-1}$), the agent is penalized with a negative reward. This signals that the action exacerbated congestion, and the model will learn to avoid such behaviors.
By maximizing the cumulative reward, the RL algorithm drives the agents to learn a policy that implicitly minimizes the long-term cumulative waiting time. Because it is based on the immediate change in congestion, this reward design provides clearer learning signals and more stable gradients than directly using the negative of the waiting time as the reward, and it has been shown to have good convergence and effectiveness in multi-agent traffic control tasks. (A minimal computation sketch of this reward is provided at the end of this subsection.)
6. Set of Observation Spaces ({Ω_i}_{i∈I}): A local observation set Ω_i is defined for each agent i. An observation o_i ∈ Ω_i is the local, possibly noisy, perception that an agent receives from the environment. In our setting, o_i is a real-valued vector containing local traffic indicators such as queue length and total vehicle count for each incoming lane. Due to differences in the physical structure of intersections (e.g., number of lanes), the dimension of the observation vector dim(o_i) is heterogeneous. The joint observation space is Ω = ×_{i∈I} Ω_i.
7. Joint Observation Function (O): O : S × A × Ω → [0, 1] is a probability function that couples states with observations. It defines the probability P(o | s′, a) that all agents receive a joint observation o ∈ Ω after the system has transitioned to a new state s′ and the joint action a has just been executed. This formalizes the core concept of “partial observability”.
8. Discount Factor (γ): The discount factor γ ∈ [0, 1) is used to exponentially decay future rewards when calculating cumulative returns. It balances the relative importance of immediate benefits and long-term goals. In this study, γ = 0.999, indicating that the model focuses on optimizing long-term traffic efficiency.
Under the above Dec-POMDP framework, each agent aims to learn a policy to maximize the long-term return. The ultimate goal of the joint policy is to maximize the expected global discounted return J(π):
$J(\pi) = \mathbb{E}_{a_t \sim \pi(\cdot \mid o_t),\; s_{t+1} \sim P(\cdot \mid s_t, a_t)} \left[ \sum_{t=0}^{T} \gamma^{t} \sum_{i \in I} r_t^{i} \right]$
where $o_t = (o_t^{i})_{i \in I}$ is the joint observation at time t. The HAPS-PPO framework proposed in this paper addresses the heterogeneity challenge of observation and action spaces and effectively solves this optimization problem through the PPO algorithm.
9. Initial State Distribution (b_0): A probability distribution over the initial state space, b_0 : S → [0, 1], describing the probability that the system is in each possible state at time t = 0.
Core Challenges: Under this Dec-POMDP framework, each agent aims to learn a local policy π_i : Ω_i → A_i that maps local observations to actions so as to maximize its expected long-term discounted return J(π). The core technical challenge of this study is to design a unified reinforcement learning algorithm that can learn such a set of efficient heterogeneous policies while effectively coping with the inconsistent model inputs and outputs caused by the heterogeneity of {Ω_i} and {A_i}.
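To make the reward definition above concrete, the following minimal Python sketch computes the cumulative waiting time difference R_{i,t} = W_{i,t-1} − W_{i,t} from per-vehicle waiting times. The agent and vehicle identifiers are hypothetical, and in practice the per-vehicle waiting times would be queried from SUMO (e.g., via TraCI calls such as trafficlight.getControlledLanes, lane.getLastStepVehicleIDs, and vehicle.getAccumulatedWaitingTime); this is an illustrative sketch, not our exact implementation.

```python
class WaitingTimeDifferenceReward:
    """Per-agent reward R_{i,t} = W_{i,t-1} - W_{i,t} (Section 3.1, item 5)."""

    def __init__(self, agent_ids):
        # W_{i,0} is initialized to zero for every agent before the first decision cycle.
        self.prev_total_wait = {agent: 0.0 for agent in agent_ids}

    def __call__(self, waiting_times: dict[str, dict[str, float]]) -> dict[str, float]:
        """`waiting_times[agent][vehicle]` is that vehicle's waiting time (s) in the current cycle."""
        rewards = {}
        for agent, per_vehicle in waiting_times.items():
            w_t = sum(per_vehicle.values())                      # W_{i,t}
            rewards[agent] = self.prev_total_wait[agent] - w_t   # W_{i,t-1} - W_{i,t}
            self.prev_total_wait[agent] = w_t                    # roll the cycle forward
        return rewards


# Hypothetical two-intersection example: congestion eases at "t1" and worsens at "t2"
# in the second cycle, yielding a positive and a negative reward, respectively.
reward_fn = WaitingTimeDifferenceReward(["t1", "t2"])
reward_fn({"t1": {"veh_a": 12.0, "veh_b": 3.0}, "t2": {"veh_c": 5.0}})        # first cycle (W_{i,0} = 0)
print(reward_fn({"t1": {"veh_a": 6.0}, "t2": {"veh_c": 9.0, "veh_d": 4.0}}))  # {'t1': 9.0, 't2': -8.0}
```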

3.2. HAPS-PPO Framework Design

HAPS-PPO resolves the heterogeneity problem by introducing two key adapter modules, enabling a single, unified training pipeline to empower all agents. Its technical roadmap is illustrated in Figure 1.

3.2.1. Unified Observation Space Representation: Observation Padding Wrapper

We designed an Online Zero-Padding mechanism to enable a model with shared network weights to process inputs of varying dimensions. This is implemented through the OPW, whose workflow is as follows:
  • Max-Dimension Identification: During environment initialization, the system iterates through all potential agents and identifies the maximum observation vector dimension, denoted as $D_{\max} = \max_{i \in I} \dim(o_i)$, across the entire network.
  • Real-time Padding: At each simulation time step, when the environment returns the raw observation vector o_i for each agent, the wrapper dynamically applies post-padding with zeros to each vector, extending its dimension from the original dim(o_i) to the unified D_max. The mathematical expression for this mechanism is:
$\tilde{o}_i = \mathrm{pad}\big(o_i,\, (0,\, D_{\max} - \dim(o_i))\big)$
where $\tilde{o}_i$ is the padded observation vector. Its pseudo-code is shown in Algorithm 1:
Algorithm 1: Observation Padding Wrapper
Input: Raw observation dictionary O_raw = {(a_i, o_i) | a_i ∈ A}
Output: Standardized observation dictionary O_std
1:  // Step 1: Identify the maximum observation dimension
2:  d_max ← 0
3:  for each agent i with observation o_i ∈ O_raw do
4:      d_i ← dimension of o_i
5:      d_max ← max(d_max, d_i)
6:  end for
7:  // Step 2: Perform post-padding with zeros on each observation vector
8:  O_std ← {}
9:  for each agent i with observation o_i ∈ O_raw do
10:     d_i ← dimension of o_i
11:     padding_width ← d_max − d_i
12:     õ_i ← PostPad(o_i, padding_width, value = 0)
13:     O_std[i] ← õ_i
14: end for
15: return O_std
Through this method, all observation tensors input to the policy network have the same dimension (D_max). This allows us to employ a shared feature extraction backbone to process observations from all agents, significantly improving sample efficiency and model generalization, as similar traffic patterns (e.g., “congestion”) can be encoded similarly by the network, regardless of the intersection at which they occur.
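A minimal NumPy sketch of the padding step is given below. In line with the description above, D_max is determined once at environment initialization (rather than recomputed per call as in the pseudo-code), and the agent IDs and observation dimensions are illustrative only.

```python
import numpy as np


def max_observation_dim(observation_dims: dict[str, int]) -> int:
    """D_max = max_i dim(o_i), identified once during environment initialization."""
    return max(observation_dims.values())


def pad_observations(raw_obs: dict[str, np.ndarray], d_max: int) -> dict[str, np.ndarray]:
    """Post-pad each agent's observation vector with zeros up to d_max (Algorithm 1, Step 2)."""
    return {
        agent_id: np.pad(obs, (0, d_max - obs.shape[0]), mode="constant", constant_values=0.0)
        for agent_id, obs in raw_obs.items()
    }


# Hypothetical agents: a T-junction with a 9-dimensional observation and a
# four-way intersection with a 13-dimensional observation.
raw = {"t_junction": np.zeros(9, dtype=np.float32), "crossroad": np.zeros(13, dtype=np.float32)}
d_max = max_observation_dim({aid: o.shape[0] for aid, o in raw.items()})
padded = pad_observations(raw, d_max)
print({aid: o.shape for aid, o in padded.items()})  # both observations now have shape (13,)
```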

3.2.2. Policy Network Architecture: Dynamic Multi-Strategy Grouping Learning Based on Action Space Dimension

Unlike observation spaces, the heterogeneity of action spaces cannot be resolved by simple padding, as it directly relates to the policy network’s output layer structure. To address this issue, we adopt the Dynamic Multi-Strategy Grouping Learning (DMSGL) mechanism, which strikes a balance between fully independent policies and a fully shared policy.
Agent Clustering is performed before training begins. We cluster agents based on the cardinality of their action spaces |A_i|. All agents with the same number of actions are placed in the same group. Subsequently, we create an independent PPO policy, denoted as π_k, for each group k of agents with a unique action space dimension. These policies (π_1, π_2, …) share a common feature extraction backbone (e.g., MLP layers of width 256), while each group is equipped with a separate policy output head that matches the action space dimension of that group. During training, gradients from all policy heads are back-propagated to update the shared backbone jointly. This “Shared-Backbone, Separate-Heads” design allows different types of agents to share knowledge of general traffic state representations while ensuring the structural correctness of the output actions.
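The “Shared-Backbone, Separate-Heads” design can be sketched as a small PyTorch module: a shared 256 × 256 MLP encodes the padded observation, and one output head per action-space cardinality produces logits of the matching size. This is an illustrative stand-in for the policy models built inside RLlib, with assumed input dimensions and head names, not our exact implementation.

```python
import torch
import torch.nn as nn


class SharedBackboneMultiHeadPolicy(nn.Module):
    """Shared feature backbone with one policy head per action-space cardinality (DMSGL)."""

    def __init__(self, obs_dim: int, action_dims: list[int], hidden: int = 256):
        super().__init__()
        # Shared feature-extraction backbone: two hidden layers of width 256 with ReLU.
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One output head per group, keyed by the group's action-space cardinality.
        # Gradients from every head flow back into the shared backbone.
        self.heads = nn.ModuleDict({str(k): nn.Linear(hidden, k) for k in action_dims})

    def forward(self, padded_obs: torch.Tensor, n_actions: int) -> torch.Tensor:
        """Return action logits with the correct dimension for the agent's group."""
        features = self.backbone(padded_obs)
        return self.heads[str(n_actions)](features)


# Hypothetical usage with D_max = 13 and three-phase / four-phase intersection groups.
policy = SharedBackboneMultiHeadPolicy(obs_dim=13, action_dims=[3, 4])
print(policy(torch.zeros(1, 13), n_actions=3).shape)  # torch.Size([1, 3])
print(policy(torch.zeros(1, 13), n_actions=4).shape)  # torch.Size([1, 4])
```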
We implement Dynamic Policy Mapping using an efficient mapping function, ‘policy_mapping_fn’. At each step of training and inference, when the system needs to select a policy for a specific ‘agent_id’, this function instantly maps the agent to the policy of its group (e.g., ‘agent_k’ is mapped to the policy of the group whose action space cardinality is |A_k|). The pseudo-code is shown in Algorithm 2:
Algorithm 2: Dynamic Multi-Strategy Grouping Learning
Input: Agent ID agent_id, pre-computed agent groups Groups = {G_k}
Output: Policy network policy corresponding to the agent
1:  // Agent clustering is completed before training starts
2:  // Example Groups structure: {3: [agent_1, agent_5], 4: [agent_2, agent_3, agent_4, agent_6]},
3:  // where the key is the action space cardinality k and the value is the list of agents in group G_k
4:  k ← GetActionSpaceSize(agent_id)        // obtain the action space size of the agent
5:  group_id ← k
6:  policy ← PolicyHeadRegistry[group_id]   // find the corresponding policy head in the registry
7:  return policy                           // return the agent’s policy (shared backbone + specific head)
This method of “parameter sharing within groups, structural separation between groups” improves learning efficiency while ensuring structural correctness: agents of the same type (same action space) share gradient information and learn and optimize their policy together, thereby accelerating convergence, while independent policy heads for agent groups with different action spaces fundamentally ensure the logical correctness of action selection.
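Agent clustering and the policy lookup of Algorithm 2 reduce to a few lines of Python, as sketched below. The agent IDs and policy names are placeholders; the resulting function plays the role of the ‘policy_mapping_fn’ registered with RLlib’s multi-agent configuration.

```python
from collections import defaultdict


def build_action_groups(action_space_sizes: dict[str, int]) -> dict[int, list[str]]:
    """Cluster agents by action-space cardinality (pre-computed before training starts)."""
    groups: dict[int, list[str]] = defaultdict(list)
    for agent_id, n_actions in action_space_sizes.items():
        groups[n_actions].append(agent_id)
    return dict(groups)


# Hypothetical network: two three-phase and four four-phase intersections.
action_space_sizes = {"agent_1": 3, "agent_5": 3, "agent_2": 4, "agent_3": 4, "agent_4": 4, "agent_6": 4}
groups = build_action_groups(action_space_sizes)   # {3: ['agent_1', 'agent_5'], 4: ['agent_2', ...]}
agent_to_policy = {aid: f"policy_{k}_phases" for k, members in groups.items() for aid in members}


def policy_mapping_fn(agent_id: str, *args, **kwargs) -> str:
    """Algorithm 2: map an agent ID to the dedicated policy of its action-space group."""
    return agent_to_policy[agent_id]


print(policy_mapping_fn("agent_5"))  # -> 'policy_3_phases'
```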

3.3. Algorithm Implementation and Distributed Training

The HAPS-PPO framework uses the Ray RLlib distributed reinforcement learning library and the SUMO traffic simulator. We chose PPO as the underlying algorithm. PPO’s signature Clipped Surrogate Objective effectively constrains the magnitude of policy updates, which is crucial for the dynamic and high-variance traffic simulation environment, ensuring the stability and robustness of the training process. To fully leverage the potential of modern computing hardware and accelerate experiments, we designed a highly parallel training architecture with the following key configurations (a configuration sketch is provided after the list):
  • Saturated Data Collection: Ray’s parallel workers (‘num_workers’) are close to the total number of physical CPU cores to maximize data collection throughput, eliminate CPU bottlenecks, and ensure GPU resources are consistently highly utilized.
  • Large-Scale Batch Training: The ‘train_batch_size’ is dynamically set as the product of ‘num_workers’ and ‘rollout_fragment_length’ to ensure each gradient update is based on a large and diverse set of experiences. Concurrently, we significantly increase the ‘sgd_minibatch_size’ (e.g., to 4096) to leverage the parallel computing power of the GPU and improve the efficiency of a single training operation.
  • Asymmetric Resource Allocation: A fractional GPU allocation strategy is adopted, assigning almost all GPU computing resources to the primary training process and a nominal, tiny GPU share (e.g., 0.001) to each CPU-intensive data collection worker. This asymmetric allocation model resolves scheduling challenges in RLlib, ensuring the GPU is dedicated to the computationally intensive task of model parameter updates.
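The parallel-training setup can be sketched with RLlib’s legacy dictionary-style configuration, which uses the same option names quoted above (‘num_workers’, ‘rollout_fragment_length’, ‘train_batch_size’, ‘sgd_minibatch_size’). The environment name, policy specifications, and several numeric values are placeholders, and the exact keys accepted depend on the installed Ray/RLlib version; this is a sketch of the resource layout, not our verbatim configuration.

```python
import multiprocessing

# Placeholders: in practice these come from the DMSGL grouping (Section 3.2.2) and
# from proper policy specifications carrying each group's observation/action spaces.
policies = {"policy_3_phases": None, "policy_4_phases": None}


def policy_mapping_fn(agent_id, *args, **kwargs):
    # Placeholder mapping consistent with the DMSGL grouping sketch in Section 3.2.2.
    return "policy_3_phases" if agent_id in ("agent_1", "agent_5") else "policy_4_phases"


num_workers = max(1, multiprocessing.cpu_count() - 1)   # saturated data collection
rollout_fragment_length = 512                           # illustrative value

config = {
    "env": "heterogeneous_tsc_env",                     # placeholder name of the registered SUMO env
    "framework": "torch",
    "multiagent": {
        "policies": policies,
        "policy_mapping_fn": policy_mapping_fn,
    },
    "num_workers": num_workers,                                  # roughly one rollout worker per CPU core
    "rollout_fragment_length": rollout_fragment_length,
    "train_batch_size": num_workers * rollout_fragment_length,   # large-scale batch training
    "sgd_minibatch_size": 4096,
    "num_gpus": 0.99,                                            # nearly all GPU to the trainer process
    "num_gpus_per_worker": 0.001,                                # nominal share per data-collection worker
    "gamma": 0.999,
}
```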
Through this systematic engineering design and implementation, HAPS-PPO theoretically solves the core challenges of heterogeneous MARL and practically constructs an efficient, scalable, and highly reproducible training framework for advanced traffic signal control.

4. Experiments

This section aims to comprehensively evaluate the performance of the HAPS-PPO framework through a series of experiments in a high-fidelity simulation environment. We first introduce the experimental environment, key parameter configurations, and evaluation metrics. Then, we present and analyze the convergence process of HAPS-PPO and quantitatively compare it with traditional Fixed-time control methods.

4.1. Experimental Setup

4.1.1. Simulation Environment and Scenario

Our experiments are based on two major open-source platforms: the microscopic traffic simulator SUMO and the distributed reinforcement learning library Ray RLlib. The specific parameters for the experimental platform are listed in Table 3. The experimental road network is a 2 × 3 grid area based on the real urban road network surrounding Fujian University of Technology in Fujian Province, China. It includes six major signalized intersections, each modeled as an independent agent. The experimental data consists of traffic flow information collected from six intersections between 17:30 and 19:30 on a specific day, with a data logging interval of 5 min. The core feature of this scenario is its heterogeneity (see Figure 2 and Figure 3), designed to test the algorithm’s adaptability rigorously:
Topological Heterogeneity: The network mixes standard four-way intersections with T-junctions.
Action Space Heterogeneity: Different signal phasing plans (e.g., three-phase and four-phase systems) are employed based on the intersections’ geometry and traffic flow patterns (see Figure 4). This directly leads to inconsistent action space dimensions among agents, highlighting the necessity of the HAPS-PPO framework.
To connect SUMO and RLlib, we utilized the “sumo-rl” library and implemented our custom OPW to handle the heterogeneous observation dimensions. The simulation duration is 7200 s, using real-world traffic flow data. The yellow light duration is 3 s, the minimum green duration is 15 s, and the maximum green duration is 60 s. Random perturbations were introduced during training to enhance the model’s generalization capability.
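A minimal sketch of how such a scenario can be instantiated through the sumo-rl PettingZoo interface and standardized by the OPW is shown below. The network and route file paths are placeholders, the decision interval is an assumed value, and the keyword names follow sumo-rl’s documented environment options; the signal-timing bounds match the values listed above.

```python
import numpy as np
import sumo_rl  # pip install sumo-rl; requires a local SUMO installation

# Placeholder files for the heterogeneous network and its measured traffic demand.
env = sumo_rl.parallel_env(
    net_file="nets/hetero_grid.net.xml",
    route_file="nets/hetero_grid.rou.xml",
    use_gui=False,
    num_seconds=7200,   # simulation duration (s)
    delta_time=5,       # decision interval (s); assumed value
    yellow_time=3,      # yellow light duration (s)
    min_green=15,       # minimum green duration (s)
    max_green=60,       # maximum green duration (s)
)

reset_out = env.reset()
observations = reset_out[0] if isinstance(reset_out, tuple) else reset_out  # PettingZoo API differences

# OPW: pad every agent's observation to the network-wide maximum dimension D_max.
d_max = max(o.shape[0] for o in observations.values())
padded = {aid: np.pad(o, (0, d_max - o.shape[0])) for aid, o in observations.items()}
print({aid: o.shape for aid, o in padded.items()})  # all observation vectors share shape (D_max,)
```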

4.1.2. Algorithm Configuration and Baselines

HAPS-PPO was configured using PPO as its core algorithm with a total training step count of 5,000,000. The Adam optimizer and ReLU activation function were employed with a 256 × 256 neural network structure. Key hyperparameter settings are shown in Table 4. This configuration aims to balance exploration efficiency and convergence stability.

4.1.3. Model Evaluation

After training has converged, we load the optimal model checkpoint and conduct a performance evaluation in a visualized SUMO environment with the GUI enabled. The evaluation process precisely reproduces the dynamic multi-policy mapping mechanism of HAPS-PPO, ensuring each agent calls its dedicated policy. We record and analyze the evaluation metrics for qualitative observation and quantitative comparison.

4.2. Experimental Results and Analysis

4.2.1. Training Convergence Analysis

HAPS-PPO exhibits strong learning efficiency and stability. As shown in Figure 5, the reward statistics (maximum, minimum, and mean) increase monotonically and eventually converge, indicating that the agents have learned coordinated policies that consistently reduce vehicle waiting time. The comparison of mean rewards across the three grouped policies in Figure 6 further shows that all groups learn effectively and converge, validating the feasibility of our approach.
Figure 7 and Figure 8 depict the evolution of the policy loss. For all three groups, the loss first rises rapidly and then gradually settles to a stable level; the aggregate convergence curve (computed as the average across groups) confirms the same trend. This behavior suggests the policy network receives stable and informative gradient signals during optimization. The highly synchronized trajectories across groups are attributable to the gradient-aggregation effect of the shared backbone in the DMSGL mechanism, through which local optimization experience from each group is propagated and shared globally, thereby accelerating overall convergence.
Figure 9 and Figure 10 focus on the value loss, which reflects the critic’s state-value estimation error. The loss decreases over time and stabilizes around 1.0, indicating steadily improving value estimation. The three groups follow nearly identical trajectories, corroborating the effectiveness of the shared backbone for value estimation: despite heterogeneous action spaces, standardized observations, and shared feature extraction enable learning of generalizable value-assessment criteria that provide consistent optimization signals to all groups.
Figure 11 and Figure 12 present the evolution of the entropy loss. The entropy decays gradually and stabilizes at approximately 0.2, indicating high exploration early in training (e.g., trying diverse phase-switching strategies) followed by reduced exploration as the policies mature—balancing the exploration–exploitation trade-off in line with PPO’s design. By grouping policies, HAPS-PPO adapts exploration to the complexity of each group’s action space, further enhancing training stability.

4.2.2. Evaluation Metrics

We evaluate the performance of the HAPS-PPO framework in a typical heterogeneous traffic scenario and compare it with the existing fixed-phase scheme. The analysis is conducted based on traffic efficiency and capacity, using the following evaluation metrics:
(1)
Average Speed:
Average speed is a traffic efficiency indicator that measures the average speed of all vehicles traveling in the control area of a particular intersection throughout the simulation period. It reflects the smoothness of traffic flow at that intersection. Its calculation proceeds in two steps:
First, at each simulation time step, the instantaneous average speed of all vehicles in the controlled lanes of the intersection is calculated:
$S_{i,t} = \frac{1}{|V_{i,t}|} \sum_{v \in V_{i,t}} s_v(t)$
where
$V_{i,t}$ is the set of vehicles in the controlled lanes of the intersection at time step t, $|V_{i,t}|$ is the number of such vehicles, and $s_v(t)$ is the instantaneous speed of vehicle v at that moment.
The instantaneous average speeds over all time steps are then averaged to obtain the final average speed of the intersection throughout the simulation:
$\bar{S}_i = \frac{1}{N_{\text{steps}}} \sum_{t=1}^{N_{\text{steps}}} S_{i,t}$
where
$N_{\text{steps}}$ is the total number of simulation steps.
Average speed is the most intuitive indicator of traffic operations and congestion efficiency. A high average speed usually means that traffic flows smoothly, vehicles can travel close to their desired speeds, delays are low, and intersections are efficient. Low average speeds indicate the presence of traffic congestion, queuing, and frequent starts and stops. This leads to increased journey times and a poorer driving experience, often accompanied by higher fuel consumption and pollutant emissions.
(2)
Average CO2 Emissions:
CO2 emissions are a key environmental impact indicator that quantifies the carbon dioxide emissions produced by an average vehicle travelling through the intersection. This indicator directly correlates traffic efficiency with ecological costs. The formula for its calculation is as follows:
$E_{\mathrm{CO}_2,\text{per vehicle}} = \frac{\sum_{t=1}^{N_{\text{steps}}} \Delta E_{\mathrm{CO}_2,i}(t)}{N_{\text{throughput},i}}$
where
$\sum_{t=1}^{N_{\text{steps}}} \Delta E_{\mathrm{CO}_2,i}(t)$ is the total incremental CO2 emissions accumulated at the intersection over all simulation steps.
$N_{\text{throughput},i}$ is the total number of vehicles passing through the intersection during the entire simulation period.
This indicator reflects the traffic control strategy’s environmental friendliness and energy efficiency. A low value indicates that the control strategy is efficient and environmentally friendly. It allows vehicles to pass through intersections smoothly and quickly, reducing unnecessary emissions from inefficient driving conditions such as idling, frequent acceleration, and deceleration. High values mean the traffic flow is poorly organized and vehicles experience frequent waiting, starting, and stopping. These conditions correspond to the operating regimes with the lowest engine fuel efficiency and the highest emissions. Therefore, high values point directly to higher environmental pollution and energy waste.
(3)
Average Delay Time (ADT):
It is a key performance indicator for measuring traffic system efficiency, especially for evaluating the severity of traffic congestion and the service level of intersections. Unlike the “average travel time,” which macroscopically reflects the total travel time, the average delay time focuses more on the additional time loss caused by various “non-ideal” conditions in the traffic flow. It accurately quantifies the extra time spent by vehicles compared to driving under ideal (free-flow) conditions due to traffic control (e.g., signal waiting), queuing, and mutual interference between other cars.
This indicator can directly and sensitively reflect the bottlenecks of the traffic network. A traffic optimization strategy (such as an advanced signal control algorithm) may improve the total travel time, but its core success lies in significantly reducing delays, especially at key intersections. Therefore, the average delay time is a more targeted micro indicator for diagnosing problems and evaluating the effectiveness of specific intervention measures (such as signal timing optimization, ramp control, etc.).
Its calculation formula can usually be expressed as:
$ADT = \frac{1}{N} \sum_{i=1}^{N} D_i$
where
ADT: Average delay time (unit: seconds/vehicle)
N: Total number of vehicles passing through the road network during the observation period
$D_i$: Total delay time experienced by the i-th vehicle
(4)
Average Travel Time (ATT):
It is a core performance indicator for measuring the traffic network’s overall operational efficiency and road users’ service quality experience. The advantage of this indicator is its comprehensiveness: it does not measure a specific congestion phenomenon in isolation but integrally summarizes the cumulative impact of all traffic impedance factors (such as intersection delays, queuing, slow driving, etc.) on the final travel. Therefore, regardless of the specific mechanism of the optimization strategy, the significant reduction in average travel time is the final and most direct macro basis for judging whether the approach is effective and successful. Its calculation formula is as follows:
$ATT = \frac{1}{N} \sum_{i=1}^{N} T_i$
where
ATT: Average travel time (unit: seconds/vehicle)
N: Total number of vehicles passing through the road network during the observation period
$T_i$: Actual total travel time of the i-th vehicle in the road network (including driving time + waiting time)
(5)
Average Waiting Time (AWT):
It is a key diagnostic indicator for in-depth analysis of travel time. If the average travel time is a macro “result” for measuring network performance, then the average waiting time accurately reveals the micro core “cause” leading to this result. It quantifies the cumulative duration during which vehicles are completely stationary due to traffic signal control or congestion queuing. A high proportion of waiting time in the total travel time indicates serious bottlenecks at intersections and severe friction in road network operation. Therefore, this indicator is an accurate tool for diagnosing network operation bottlenecks and a benchmark for evaluating the pros and cons of traffic signal control strategies. Its calculation formula is as follows:
$AWT = \frac{1}{N} \sum_{i=1}^{N} W_i$
where
AWT: Average waiting time (unit: seconds/vehicle)
N: Total number of vehicles passing through the target area during the observation period
$W_i$: Complete stationary waiting time of the i-th vehicle at the intersection/bottleneck (excluding slow creeping time)
(6)
Completed Trips:
The number of completed vehicle trips is a macro-indicator of the entire road network’s service capacity and operational efficiency. It counts the number of vehicles that successfully travel from their origin to their destination within the specified simulation time. It is a direct counting statistic obtained from the SUMO simulation output (a computation sketch based on SUMO’s tripinfo output is given after this list).
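As a reference, the trip-level metrics above (ATT, ADT, AWT, and completed trips) can be aggregated directly from SUMO’s tripinfo output, whose standard attributes include ‘duration’, ‘timeLoss’, and ‘waitingTime’. The sketch below computes network-wide averages; the file path is a placeholder, and per-intersection breakdowns, average speed, and CO2 statistics would require additional detector or emission outputs.

```python
import xml.etree.ElementTree as ET


def summarize_tripinfo(path: str) -> dict[str, float]:
    """Aggregate ATT, ADT, AWT, and completed trips from a SUMO tripinfo file."""
    trips = ET.parse(path).getroot().findall("tripinfo")
    n = len(trips)  # completed trips: vehicles that reached their destination in time
    return {
        "completed_trips": n,
        "ATT": sum(float(t.get("duration")) for t in trips) / n,      # average travel time (s/veh)
        "ADT": sum(float(t.get("timeLoss")) for t in trips) / n,      # average delay time (s/veh)
        "AWT": sum(float(t.get("waitingTime")) for t in trips) / n,   # average waiting time (s/veh)
    }


print(summarize_tripinfo("tripinfo.xml"))  # placeholder path to the SUMO tripinfo output file
```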

4.2.3. Comparative Analysis

We compare HAPS-PPO with the following baseline methods, including traditional and deep reinforcement learning approaches.
Fixed-time:
Signal controllers operate continuously according to a pre-set fixed timing plan. This is a commonly used control scheme due to its low cost and ease of maintenance. We use the actual Fixed-time plan from the real-world network for a more authentic and representative comparison. The phase and timing settings for fixed-time signal control are shown in Figure 13.
MADQN:
The Multi-Agent Deep Q-Network (MADQN) adopts a decentralized architecture, deploying independent Deep Q-Network (DQN) agents at each traffic intersection within the urban road network. Each agent makes decisions based solely on the local traffic conditions at its assigned intersection (such as queue length and vehicle speed), without relying on global information. This model learns through real-time iterative interaction between agents and the simulation within the SUMO microscopic traffic simulation environment. It autonomously generates optimal signal switching schemes adapted to dynamic traffic flows, enhancing intersection throughput and reducing vehicle delays. Based on MADQN’s algorithmic principles, hyperparameters were tuned to ensure optimal performance. The Adam optimizer and ReLU activation function were employed with a 256 × 256 neural network structure. Training comprised 5,000,000 steps, initiated with a minimum replay buffer sample size of 20,000 to prevent training instability caused by insufficient initial samples. SUMO simulation parameters were configured identically to the HAPS-PPO model. The corresponding hyperparameter table is shown in Table 5:
We compared and analyzed the traffic conditions at the six intersections in the road network, specifically examining the average speed per vehicle through the intersection (m/s), the average carbon dioxide emissions (g/car), the average delay time (s), the average travel time (s), the average waiting time (s), and the total number of completed vehicle trips (Trips). The comparison results of the three methods are shown in Table 6:
The experimental results show that the proposed HAPS-PPO algorithm significantly improves traffic network performance compared to the traditional Fixed-time scheme. The algorithm reduced the average delay time by 44.74%, average travel time by 23.47%, and average waiting time by 59.60%. Compared to the MADQN algorithm, HAPS-PPO reduced average delay time by 22.59%, average travel time by 9.88%, and average waiting time by 32.46%. Improvements of varying degrees were also achieved in average speed and CO2 emissions. This demonstrates the strong potential of the HAPS-PPO algorithm in reducing traffic delays and enhancing road network efficiency.
In this network scenario, each evaluation period includes 28,615 vehicle demands. Our performance metrics are calculated based on all vehicles that completed their predefined trips. Due to the varying efficiency of different control algorithms, the number of completed trips differs slightly in a single simulation run. HAPS-PPO enabled 27,575 vehicles to complete their trips in a typical run, whereas the Fixed-Time baseline saw approximately 27,189 completions.
To further test the model’s adaptability and robustness, we switched to a different heterogeneous road network, selecting an 8-intersection network from the Cologne region in Germany, as shown in Figure 14 and Figure 15.
We used the same evaluation metrics to compare HAPS-PPO with the MPLight and FMA2C models [48], with the results shown in Table 7.
As can be seen, HAPS-PPO achieved the best results. Compared to the Fixed-time plan, it reduced average delay time by 39.23%, average travel time by 16.49%, and average waiting time by 52.43%. Compared to the FMA2C baseline, it achieved further reductions of 9.86% in average delay time, 1.95% in average travel time, and 1.55% in average waiting time.

5. Conclusions and Future Work

5.1. Conclusions

This paper addressed a critical yet often overlooked challenge in applying MARL to real-world TSC: the inherent heterogeneity of observation and action spaces across different intersections in urban road networks. We demonstrated that this heterogeneity poses an obstacle to parameter-sharing MARL algorithms, which typically assume agent homogeneity.
To overcome this bottleneck, we proposed the HAPS-PPO framework, which innovatively integrates two core mechanisms: the OPW for handling varying observation dimensions and the DMSGL for managing diverse action spaces. This design allows effective knowledge sharing through a standard feature extraction backbone while ensuring structural correctness via dedicated policy heads for different agent groups.
Comprehensive experimental evaluations conducted on high-fidelity simulation environments based on real-world road networks robustly validate the superiority of HAPS-PPO. The results demonstrate that HAPS-PPO significantly outperforms traditional Fixed-time control and mainstream MARL baselines (MADQN, FMA2C). Key improvements include a substantial reduction in average delay time (up to 44.74%), average travel time (up to 23.47%), and most notably, average waiting time (up to 59.60%) compared to Fixed-time control. Furthermore, HAPS-PPO achieved superior performance over MADQN (e.g., 22.59% reduction in delay) and FMA2C, confirming its effectiveness in learning coordinated policies for heterogeneous networks. These quantitative results support the conclusion that HAPS-PPO successfully mitigates the heterogeneity problem and enhances overall traffic efficiency.
In summary, this research formally defines a pivotal problem hindering the practical deployment of MARL in TSC and provides a scalable, plug-and-play solution. The HAPS-PPO framework offers a valuable pathway for developing intelligent traffic control systems operating in complex, real-world heterogeneous environments.
While HAPS-PPO demonstrates strong performance, its design requires a separate policy head for each unique action space type. This could increase model parameters in scenarios with an extremely high variety of intersection types. Additionally, this study primarily focuses on heterogeneity in observations and actions, leaving heterogeneous reward functions or potential agent conflicts for future exploration.

5.2. Future Work

The HAPS-PPO framework presents a robust solution for MARL in heterogeneous traffic networks. Building upon this foundation, several promising directions emerge for future research to enhance the framework’s practicality, scalability, and intelligence:
  • Advanced Policy Architecture for Scalability: While DMSGL effectively handles a moderate number of intersection types, employing separate policy heads may face scalability challenges in metropolitan-scale networks with dozens of unique intersection topologies. Future work will explore more parameter-efficient fine-tuning (PEFT) techniques, such as integrating Adapter modules into the DMSGL mechanism. This would allow for fine-tuning a small set of parameters for each action group while keeping the vast majority of the shared backbone fixed, thus managing a wider variety of action spaces without a linear increase in trainable parameters.
  • Multi-Objective Optimization and Heterogeneous Rewards: This study primarily optimized for traffic efficiency (e.g., minimizing waiting time). Future research will extend the HAPS framework to multi-objective optimization, incorporating heterogeneous reward functions that balance global efficiency with local constraints or conflicting goals (e.g., prioritizing public transport, ensuring pedestrian safety, minimizing emissions for specific sensitive areas). Investigating reward-shaping techniques and multi-objective MARL algorithms within the HAPS architecture will be crucial for developing more comprehensive and equitable traffic control policies.
  • Enhanced Generalization and Robustness via Meta-Learning: The current model is trained and evaluated on specific traffic patterns. To improve generalization to unseen network topologies, fluctuating demand patterns, and unexpected events (e.g., incidents), we plan to incorporate Meta-Reinforcement Learning (Meta-RL) and Domain Randomization strategies. The goal is to train a meta-policy that can quickly adapt to new heterogeneous intersections or regional control scenarios with minimal fine-tuning, significantly accelerating deployment in novel environments.
  • Integrated Perception and Control in V2X Environments: With Vehicle-to-Everything (V2X) communication advancement, future traffic systems will access rich, real-time vehicle-level data. A key direction is expanding the OPW mechanism to fuse this new data modality (e.g., vehicle trajectories, intentions) with infrastructure-based observations. This will enable a shift from reactive control to predictive and cooperative decision-making, where signal controllers can anticipate traffic flow and optimize phases proactively based on a more complete picture of the network state.

Author Contributions

Conceptualization, Q.L. and H.F.; methodology, H.F.; software, Q.L.; validation, H.F., Z.Y. and G.Z.; formal analysis, H.F.; investigation, H.F.; resources, Q.L.; data curation, H.F.; writing—original draft preparation, H.F.; writing—review and editing, H.F.; visualization, G.Z.; supervision, Q.L.; project administration, Q.L.; funding acquisition, Q.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Xinjiang Uygur Autonomous Region “Tianchi Talent” Introduction Program for Young PhDs.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Schrank, D.; Eisele, B.; Lomax, T. 2021 Urban Mobility Report; Texas A&M Transportation Institute: Bryan, TX, USA, 2021. [Google Scholar]
  2. Wang, Z.; Xu, L.; Ma, J. Carbon Dioxide Emission Reduction-Oriented Optimal Control of Traffic Signals in Mixed Traffic Flow Based on Deep Reinforcement Learning. Sustainability 2023, 15, 16564. [Google Scholar] [CrossRef]
  3. Ouyang, Y.; Jain, R.; Varaiya, P. On the Existence of Near-Optimal Fixed Time Control of Traffic Intersection Signals. In Proceedings of the Fifty-Fourth Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 27–30 September 2016. [Google Scholar]
  4. Furth, P.G.; Cesme, B. Lost Time and Cycle Length for Actuated Traffic Signal. Transp. Res. Rec. J. Transp. Res. Board 2009, 2128, 152–160. [Google Scholar] [CrossRef]
  5. Zhao, Z.; Wang, K.; Wang, Y.; Liang, X. Enhancing traffic signal control with composite deep intelligence. Expert Syst. Appl. 2024, 244, 123020. [Google Scholar] [CrossRef]
  6. Sims, A.G.; Dobinson, K.W. The Sydney coordinated adaptive traffic (SCAT) system philosophy and benefits. IEEE Trans. Veh. Technol. 1980, 29, 130–137. [Google Scholar] [CrossRef]
  7. Hunt, P.B.; Robertson, D.I.; Bretherton, R.D.; Winton, R.I. SCOOT—A Traffic Responsive Method of Coordinating Signals; Urban Networks Division, Traffic Engineering Department, Transport and Road Research Laboratory: London, UK, 1981. [Google Scholar]
  8. Manandhar, B.; Joshi, B. Adaptive traffic light control with statistical multiplexing technique and particle swarm optimization in smart cities. In Proceedings of the 2018 IEEE 3rd International Conference on Computing, Communication and Security (ICCCS), Kathmandu, Nepal, 25–27 October 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 210–217. [Google Scholar]
  9. Haddad, T.A.; Hedjazi, D.; Aouag, S. An IoT-based adaptive traffic light control algorithm for isolated intersection. In Advances in Computing Systems and Applications, Proceedings of the 4th Conference on Computing Systems and Applications, Algiers, Algeria, 20–21 April 2020; Springer International Publishing: Cham, Switzerland, 2021; pp. 107–117. [Google Scholar]
  10. Yau, K.-L.A.; Qadir, J.; Khoo, H.L.; Ling, M.H.; Komisarczuk, P. A Survey on Reinforcement Learning Models and Algorithms for Traffic Signal Control. ACM Comput. Surv. 2017, 50, 1–38. [Google Scholar] [CrossRef]
  11. Haydari, A.; Yılmaz, Y. Deep Reinforcement Learning for Intelligent Transportation Systems: A Survey. IEEE Trans. Intell. Transp. Syst. 2022, 23, 11–32. [Google Scholar]
  12. Prabuchandran, K.J.; Hemanth Kumar, A.N.; Bhatnagar, S. Multi-agent reinforcement learning for traffic signal control. In Proceedings of the 17th International IEEE Conference on Intelligent Transportation Systems (ITSC), Qingdao, China, 8–11 October 2014; pp. 2529–2534. [Google Scholar]
  13. Zhang, Z.; Yang, J.; Zha, H. Integrating independent and centralized multi-agent reinforcement learning for traffic signal network optimization. arXiv 2019, arXiv:1909.10651. [Google Scholar] [CrossRef]
  14. Ouyang, C.; Zhan, Z.; Lv, F. A Comparative Study of Traffic Signal Control Based on Reinforcement Learning Algorithms. World Electr. Veh. J. 2024, 15, 246. [Google Scholar] [CrossRef]
  15. Zhao, Y.; Hu, J.-M.; Gao, M.-Y.; Zhang, Z. Multi-Agent Deep Reinforcement Learning for Decentralized Cooperative Traffic Signal Control. CICTP 2020, 2020, 458–470. [Google Scholar]
  16. Chen, Y.; Li, C.; Yue, W.; Zhang, H.; Mao, G. Engineering A Large-Scale Traffic Signal Control: A Multi-Agent Reinforcement Learning Approach. In Proceedings of the IEEE INFOCOM 2021—IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), Virtual, 9–12 May 2021; pp. 1–6. [Google Scholar]
  17. Zhang, Y.; Su, R.; Zhang, Y.; Sun, C. Modelling and Traffic Signal Control of Heterogeneous Traffic Systems. arXiv 2017, arXiv:1705.03713. [Google Scholar] [CrossRef]
  18. Chen, B.; Tan, K.; Li, J. HiLight: Heterogeneous Traffic Signal Control for Automatic Drive Guidance Based on Multi-Agent Reinforcement Learning. In Proceedings of the 2024 8th CAA International Conference on Vehicular Control and Intelligence (CVCI), Chongqing, China, 25–27 October 2024; pp. 1–6. [Google Scholar]
  19. Zhao, H.; Dong, C.; Cao, J.; Chen, Q. A survey on deep reinforcement learning approaches for traffic signal control. Eng. Appl. Artif. Intell. 2024, 133, 108100. [Google Scholar] [CrossRef]
  20. Do, H.K.; Quynh Dinh, T.; Nguyen, M.D.; Hoa Nguyen, T. Semantic Communication for Partial Observation Multi-Agent Reinforcement Learning. In Proceedings of the 2023 IEEE Statistical Signal Processing Workshop (SSP), Hanoi, Vietnam, 2–5 July 2023; pp. 319–323. [Google Scholar]
  21. Li, D.; Zhu, F.; Wu, J.; Wong, Y.D.; Chen, T. Managing mixed traffic at signalized intersections: An adaptive signal control and CAV coordination system based on deep reinforcement learning. Expert Syst. Appl. 2024, 238, 121959. [Google Scholar] [CrossRef]
  22. Yin, X.; Wu, G.; Wei, J.; Shen, Y.; Qi, H.; Yin, B. Deep Learning on Traffic Prediction: Methods, Analysis, and Future Directions. IEEE Trans. Intell. Transp. Syst. 2022, 23, 4927–4943. [Google Scholar] [CrossRef]
  23. Wei, H.; Zheng, G.; Gayah, V.; Li, Z. A Survey on Traffic Signal Control Methods. arXiv 2019, arXiv:1904.08117. [Google Scholar]
  24. Jamil, Q.U.; Kallu, K.D.; Khan, M.J.; Safdar, M.; Zafar, A.; Ali, M.U. Urban traffic signal control optimization through Deep Q Learning and double Deep Q Learning: A novel approach for efficient traffic management. Multimed. Tools Appl. 2024, 84, 24933–24956. [Google Scholar] [CrossRef]
  25. Zhu, Y.; Cai, M.; Schwarz, C.W.; Li, J.; Xiao, S. Intelligent Traffic Light via Policy-based Deep Reinforcement Learning. Int. J. Intell. Transp. Syst. Res. 2022, 20, 734–744. [Google Scholar] [CrossRef]
  26. Amato, C. An Introduction to Centralized Training for Decentralized Execution in Cooperative Multi-Agent Reinforcement Learning. arXiv 2024, arXiv:2409.03052. [Google Scholar] [CrossRef]
  27. Zhou, Y.; Liu, S.; Qing, Y.; Chen, K.; Zheng, T.; Song, J.; Song, M. Is Centralized Training with Decentralized Execution Framework Centralized Enough for MARL? arXiv 2023, arXiv:2305.17352. [Google Scholar] [CrossRef]
  28. Sunehag, P.; Lever, G.; Gruslys, A.; Czarnecki, W.M.; Zambaldi, V.; Jaderberg, M.; Lanctot, M.; Sonnerat, N.; Leibo, J.Z.; Tuyls, K.; et al. Value-Decomposition Networks for Cooperative Multi-Agent Learning Based on Team Reward. In Proceedings of the 17th International Conference on Autonomous Agents and Multi Agent Systems (AAMAS’18), Stockholm, Sweden, 10–15 July 2018; International Foundation for Autonomous Agents and Multiagent Systems: Richland, SC, USA, 2018; pp. 2085–2087. [Google Scholar]
  29. Rashid, T.; Samvelyan, M.; De Witt, C.S.; Farquhar, G.; Foerster, J.; Whiteson, S. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. J. Mach. Learn. Res. 2018, 21, 1–51. [Google Scholar]
  30. Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. arXiv 2017, arXiv:1706.02275. [Google Scholar]
  31. Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; Wu, Y. The surprising effectiveness of PPO in cooperative multi-agent games. In Proceedings of the 36th International Conference on Neural Information Processing Systems (NIPS’22), New Orleans, LA, USA, 28 November–9 December 2022; Curran Associates Inc.: Red Hook, NY, USA, 2022; pp. 24611–24624. [Google Scholar]
  32. Yang, S.; Yang, B. An inductive heterogeneous graph attention-based multi-agent deep graph infomax algorithm for adaptive traffic signal control. Inf. Fusion 2022, 88, 249–262. [Google Scholar] [CrossRef]
  33. Wei, H.; Zheng, G.; Yao, H.; Li, Z. IntelliLight: A Reinforcement Learning Approach for Intelligent Traffic Light Control. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD’18), London, UK, 19–23 August 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 2496–2505. [Google Scholar]
  34. Bie, Y.; Ji, Y.; Ma, D. Multi-agent Deep Reinforcement Learning collaborative Traffic Signal Control method considering intersection heterogeneity. Transp. Res. Part C Emerg. Technol. 2024, 164, 104663. [Google Scholar] [CrossRef]
  35. Yang, S.; Yang, B.; Kang, Z.; Deng, L. IHG-MA: Inductive heterogeneous graph multi-agent reinforcement learning for multi-intersection traffic signal control. Neural Netw. 2021, 139, 265–277. [Google Scholar] [CrossRef]
  36. Zhang, Y.; Li, P.; Fan, M.; Sartoretti, G. HeteroLight: A General and Efficient Learning Approach for Heterogeneous Traffic Signal Control. In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates, 14–18 October 2024; pp. 1010–1017. [Google Scholar]
  37. Gu, H.; Wang, S.; Ma, X.; Jia, D.; Mao, G.; Lim, E.G.; Wong, C.P.R. Large-Scale Traffic Signal Control Using Constrained Network Partition and Adaptive Deep Reinforcement Learning. IEEE Trans. Intell. Transp. Syst. 2024, 25, 7619–7632. [Google Scholar] [CrossRef]
  38. Pritz, P.J.; Leung, K.K. Belief States for Cooperative Multi-Agent Reinforcement Learning Under Partial Observability. arXiv 2025, arXiv:2504.08417. [Google Scholar] [CrossRef]
  39. Ge, H.; Gao, D.; Sun, L.; Hou, Y.; Yu, C.; Wang, Y.; Tan, G. Multi-Agent Transfer Reinforcement Learning with Multi-View Encoder for Adaptive Traffic Signal Control. IEEE Trans. Intell. Transp. Syst. 2022, 23, 12572–12587. [Google Scholar] [CrossRef]
  40. He, L.N.; Fu, S.; Zhang, X.; Hu, Q.; Du, W.; Li, H.; Chen, T.; Chen, C.; Jiang, Y.; Zhou, Y.; et al. Baseline and early changes in circulating Serum Amyloid A (SAA) predict survival outcomes in advanced non-small cell lung cancer patients treated with Anti-PD-1/PD-L1 monotherapy. Lung Cancer 2021, 158, 1–8. [Google Scholar] [CrossRef]
  41. Wang, M.; Wu, L.; Li, M.; Wu, D.; Shi, X.; Ma, C. Meta-learning based spatial-temporal graph attention network for traffic signal control. Knowl.-Based Syst. 2022, 250, 109166. [Google Scholar] [CrossRef]
  42. Kong, A.Y.; Lu, B.X.; Yang, C.Z.; Zhang, D.M. A Deep Reinforcement Learning Framework with Memory Network to Coordinate Traffic Signal Control. In Proceedings of the 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), Macau, China, 8–12 October 2022; pp. 3825–3830. [Google Scholar]
  43. Bokade, R.; Jin, X.; Amato, C. Multi-Agent Reinforcement Learning Based on Representational Communication for Large-Scale Traffic Signal Control. IEEE Access 2023, 11, 47646–47658. [Google Scholar] [CrossRef]
  44. Liu, J.; Qin, S.; Su, M.; Luo, Y.; Wang, Y.; Yang, S. Multiple intersections traffic signal control based on cooperative multi-agent reinforcement learning. Inf. Sci. 2023, 647, 119484. [Google Scholar] [CrossRef]
  45. Li, Z.; Yu, H.; Zhang, G.; Dong, S.; Xu, C.-Z. Network-wide traffic signal control optimization using a multi-agent deep reinforcement learning. Transp. Res. Part C Emerg. Technol. 2021, 125, 103059. [Google Scholar] [CrossRef]
  46. Yoon, J.; Ahn, K.; Park, J.; Yeo, H. Transferable traffic signal control: Reinforcement learning with graph-centric state representation. Transp. Res. Part C Emerg. Technol. 2021, 130, 103321. [Google Scholar] [CrossRef]
  47. Luo, H.; Bie, Y.; Jin, S. Reinforcement Learning for Traffic Signal Control in Hybrid Action Space. IEEE Trans. Intell. Transp. Syst. 2024, 25, 5225–5241. [Google Scholar] [CrossRef]
  48. Ault, J.; Sharon, G. Reinforcement Learning Benchmarks for Traffic Signal Control. In Proceedings of the Thirty-Fifth Conference on Neural Information Processing Systems (NeurIPS 2021) Datasets and Benchmarks Track, Virtual, 7–10 December 2021. [Google Scholar]
Figure 1. Technical Roadmap of HAPS-PPO.
Figure 2. Real Road Network.
Figure 3. SUMO Road Network.
Figure 4. Traffic Flow Diagram.
Figure 5. Training Performance Metrics Over Iterations.
Figure 6. Average Rewards Comparison for Different Policies.
Figure 7. Comparison of Policy Loss Convergence for Different Policies.
Figure 8. Average Policy Loss Convergence.
Figure 9. Comparison of Value Loss Convergence for Different Policies.
Figure 10. Average Value Loss Convergence.
Figure 11. Comparison of Entropy Convergence for Different Policies.
Figure 12. Average Entropy Convergence.
Figure 13. Phase and Timing Settings for Fixed-Time Signal Control.
Figure 14. Cologne Regional Road Network.
Figure 15. Cologne Regional Road Network.
Table 1. Multi-dimensional Comparison of Three Traffic Signal Control Methods.

| | Fixed-Time Control | Actuated Control | Adaptive Control |
|---|---|---|---|
| Core Principle | Operates in a fixed pattern based on historical traffic data, with preset signal cycle, timing, and green splits. | Dynamically adjusts green light duration based on real-time vehicle detection and predefined logic. | Models and analyzes traffic flow in real time using sensors and algorithms to predict future states and automatically adjust control parameters. |
| Applicable Scenarios | Scenarios with stable and predictable traffic flow patterns. | Intersections with significant flow disparities between major and minor roads that require local responsiveness. | Complex and variable traffic scenarios that require global optimization. |
| Advantages | 1. Simple equipment, stable and reliable operation. 2. Low operational and maintenance costs. 3. Easy to deploy and scale. | 1. Efficiently utilizes green time, reducing vehicle stops. 2. Responds to local traffic fluctuations, ensuring smooth flow on main arteries. | 1. High flexibility and adaptability, superior control performance. 2. Can integrate intelligent algorithms for complex problems. |
| Limitations | 1. Cannot respond to real-time traffic flow changes. 2. Fixed timing plans. | 1. “Reactive” decision-making, lacking predictive capability. 2. Optimizes individual intersections, potentially shifting congestion elsewhere. | 1. Relies on complex traffic models and expensive infrastructure. 2. Centralized architecture may suffer from computational bottlenecks and latency. |
| Typical Systems | Webster’s method, Wattleworth’s ramp metering. | ALINEA algorithm, Furth’s dual-ring actuated control. | SCATS system, SCOOT system, AD-ALINEA, PI-ALINEA. |
| Optimization Goal | Minimize average vehicle delay, multi-objective optimization. | Reduce vehicle stops and increase the capacity of a single intersection. | Global traffic efficiency optimization. |
Table 2. Glossary of Key Terms and Abbreviations.

| Notation | Meaning |
|---|---|
| TSC | Traffic Signal Control |
| RL | Reinforcement Learning |
| DRL | Deep Reinforcement Learning |
| MARL | Multi-Agent Reinforcement Learning |
| HAPS-PPO | Heterogeneity-Aware Policy Sharing PPO |
| MDP | Markov Decision Process |
| Dec-POMDP | Decentralized Partially Observable Markov Decision Process |
| OPW | Observation Padding Wrapper |
| DMSGL | Dynamic Multi-Strategy Grouping Learning |
| AC | Actor-Critic |
| PPO | Proximal Policy Optimization |
| CTDE | Centralized Training with Decentralized Execution |
| IL | Independent Learner |
| SUMO | Simulation of Urban Mobility |
| MADQN | Multi-Agent Deep Q Network |
Table 3. Specific parameters for the experimental platform.

| Placement | Name | Parameters |
|---|---|---|
| Hardware environment | CPU | Intel Core Ultra7 155H |
| | RAM | 32 GB |
| | GPU | NVIDIA GeForce 4060 |
| Software environment | Operating system | Windows 11 |
| | CUDA | 11.8 |
| | Python | 3.8.10 |
| | PyTorch | 2.4.1 |
| | Ray RLlib | 2.8.0 |
| | SUMO | 1.21.0 |
| | SUMO-RL | 1.4.5 |
| | TraCI | 1.23.1 |
| | gym | 0.26.2 |
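For readers assembling this stack, the snippet below is a minimal sketch of how a multi-agent SUMO-RL environment might be instantiated with the software listed in Table 3. It is not the authors' code: the network and route file names are placeholders, running it requires a local SUMO installation with SUMO_HOME set, and keyword arguments may differ slightly between SUMO-RL releases.

```python
# Minimal sketch (not the authors' code): a multi-agent SUMO-RL environment using
# the Table 3 software stack. File names below are placeholders.
import sumo_rl  # SUMO-RL wraps SUMO via TraCI and exposes a PettingZoo-style API

env = sumo_rl.parallel_env(
    net_file="network.net.xml",   # placeholder SUMO network file
    route_file="demand.rou.xml",  # placeholder route/demand file
    use_gui=False,                # headless simulation for training
    num_seconds=3600,             # length of one simulated episode in seconds
)

reset_out = env.reset()           # observations (plus infos in newer PettingZoo releases)
print(env.possible_agents)        # one agent id per signalized intersection
```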
Table 4. Key Hyperparameter Configuration for the HAPS-PPO Algorithm.

| Hyperparameter | Symbol | Value | Description |
|---|---|---|---|
| Learning rate | η | 3 × 10⁻⁵ | Learning rate of the Actor and Critic networks. |
| Discount factor | γ | 0.999 | Discount rate for future rewards; values close to 1 place more weight on long-term returns. |
| GAE parameter | λ | 0.95 | Smoothing parameter in Generalized Advantage Estimation. |
| PPO clipping coefficient | – | 0.1 | Clipping range in the PPO objective function, used to limit the magnitude of policy updates. |
| Rollout fragment length | N | 2048 | Number of experience steps collected by each Worker before synchronizing back to the Driver. |
| Training batch size | – | 40,960 | Total sample size used for a single gradient update (20 × 2048). |
| SGD minibatch size | – | 4096 | Sample size fed into the GPU in a single SGD round. |
| SGD iteration count | K | 10 | Number of iterative rounds of policy updates on the same batch of data. |
| Value function coefficient | c1 | 1.0 | Weight of the value function loss in the total loss function. |
| Entropy coefficient | c2 | 0.01 | Weight of the entropy bonus in the total loss function, encouraging policy exploration. |
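The hyperparameters in Table 4 map roughly onto Ray RLlib's PPOConfig (the library and version listed in Table 3). The sketch below is an illustrative mapping rather than the authors' training script; argument names follow the RLlib 2.x public API and may vary slightly across versions, and the HAPS-specific observation padding and policy grouping are omitted.

```python
# Illustrative mapping of the Table 4 hyperparameters onto Ray RLlib's PPOConfig
# (RLlib 2.x API); not the authors' training script.
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .training(
        lr=3e-5,                   # learning rate (eta)
        gamma=0.999,               # discount factor
        lambda_=0.95,              # GAE parameter
        clip_param=0.1,            # PPO clipping coefficient
        train_batch_size=40_960,   # 20 workers x 2048 steps per update
        sgd_minibatch_size=4_096,  # SGD minibatch size
        num_sgd_iter=10,           # SGD iterations per training batch
        vf_loss_coeff=1.0,         # value function loss weight (c1)
        entropy_coeff=0.01,        # entropy bonus weight (c2)
    )
    .rollouts(rollout_fragment_length=2048)  # steps collected per worker rollout
)
```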
Table 5. Key Hyperparameter Configuration for the MADQN Algorithm.

| Hyperparameter | Symbol | Value | Description |
|---|---|---|---|
| Learning rate | η | 5 × 10⁻⁵ | Adam optimizer learning rate. |
| GAMMA | γ | 0.99 | Discount rate for future rewards; values close to 1 place more weight on long-term returns. |
| BUFFER_CAPACITY | C_buffer | 100,000 | Experience replay buffer capacity; stores samples of states, actions, rewards, next states, and termination flags. |
| BATCH_SIZE | – | 1024 | Total number of samples used per gradient update. |
| TAU | τ | 0.001 | Target-network soft-update coefficient. |
| EPSILON_START | ε0 | 1.0 | Initial exploration rate; 100% random action selection during early training. |
| EPSILON_MIN | ε_min | 0.05 | Minimum exploration rate; the lowest exploration proportion in the late training phase (to avoid falling into local optima through purely greedy behavior). |
| EPSILON_DECAY | ε_dec | 0.999995 | Multiplicative decay factor applied to the exploration rate after each training step. |
| Gradient clipping maximum norm | λ_clip | 1.0 | Gradient clipping threshold to prevent gradient explosion. |
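Under the usual reading of the exploration and target-update hyperparameters in Table 5, the per-step updates take the form sketched below. This is a generic sketch of the standard DQN conventions these symbols denote, not the authors' implementation.

```python
# Generic sketch (not the authors' implementation) of how the Table 5 exploration
# and target-network hyperparameters are typically applied at each training step.
import torch.nn as nn

EPSILON_MIN, EPSILON_DECAY, TAU = 0.05, 0.999995, 0.001

def decay_epsilon(epsilon: float) -> float:
    """Multiplicative epsilon decay, floored at EPSILON_MIN."""
    return max(EPSILON_MIN, epsilon * EPSILON_DECAY)

def soft_update(target_net: nn.Module, online_net: nn.Module) -> None:
    """Polyak averaging: theta_target <- TAU * theta_online + (1 - TAU) * theta_target."""
    for t_param, param in zip(target_net.parameters(), online_net.parameters()):
        t_param.data.copy_(TAU * param.data + (1.0 - TAU) * t_param.data)

# Example usage with placeholder networks.
online, target = nn.Linear(4, 2), nn.Linear(4, 2)
epsilon = 1.0                     # EPSILON_START
epsilon = decay_epsilon(epsilon)  # applied once per training step
soft_update(target, online)       # applied once per training step
```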
Table 6. Comparison of Experimental Results.

| Method | Intersection | Average Speed (m/s) | Average CO2 Emissions (g/car) | Average Delay Time (s) | Average Travel Time (s) | Average Waiting Time (s) | Completed Trips |
|---|---|---|---|---|---|---|---|
| Fixed-time | NO. 1 | 0.9 | 5.7 | 78.99 | 153.76 | 53.32 | 27,189 |
| | NO. 2 | 1.1 | 4.2 | | | | |
| | NO. 3 | 0.9 | 9.3 | | | | |
| | NO. 4 | 1.5 | 0.3 | | | | |
| | NO. 5 | 13.3 | 7 | | | | |
| | NO. 6 | 1.5 | 5.9 | | | | |
| MADQN | NO. 1 | 1.2 | 5.4 | 56.43 (−28.56%) | 130.58 (−15.08%) | 31.89 (−40.19%) | 27,458 |
| | NO. 2 | 1.2 | 4.3 | | | | |
| | NO. 3 | 1 | 9.1 | | | | |
| | NO. 4 | 1.8 | 0.3 | | | | |
| | NO. 5 | 13.8 | 6.9 | | | | |
| | NO. 6 | 1.5 | 5.7 | | | | |
| HAPS-PPO | NO. 1 | 1.2 | 5.3 | 43.65 (−44.74%) | 117.68 (−23.47%) | 21.54 (−59.60%) | 27,575 |
| | NO. 2 | 1.1 | 4.2 | | | | |
| | NO. 3 | 1 | 8 | | | | |
| | NO. 4 | 1.9 | 0.2 | | | | |
| | NO. 5 | 14 | 6.9 | | | | |
| | NO. 6 | 1.5 | 5.5 | | | | |

Average speed and CO2 emissions are reported per intersection (NO. 1–NO. 6); delay, travel time, waiting time, and completed trips are network-wide values reported once per control method.
Table 7. Comparison of Experimental Results (Cologne Network).

| Method | Average Delay Time (s) | Average Travel Time (s) | Average Waiting Time (s) |
|---|---|---|---|
| MPLight | 60.42 (+22.38%) | 123.93 (+8.22%) | 30.34 (+3.30%) |
| Fixed-time | 49.37 | 114.51 | 29.37 |
| FMA2C | 33.28 (−32.59%) | 97.53 (−14.83%) | 14.19 (−51.69%) |
| HAPS-PPO | 30 (−39.23%) | 95.63 (−16.49%) | 13.97 (−52.43%) |
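The parenthesized percentages in Tables 6 and 7 are relative changes with respect to the Fixed-time baseline:

```latex
% Relative change with respect to the Fixed-time baseline
\Delta(\%) = \frac{x_{\text{method}} - x_{\text{Fixed-time}}}{x_{\text{Fixed-time}}} \times 100,
\qquad \text{e.g. } \frac{43.65 - 78.99}{78.99} \times 100 \approx -44.74\%.
```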