Multi-Agent Deep Reinforcement Learning with Contrastive Policy Diversification and Hierarchical Graph Networks for Urban Traffic Signal Control

Yan, Liping; Jia, Haojie; Wang, Shaofeng; Wu, Peiran; Zhao, Wenzhi

doi:10.3390/ijgi15060229


Liping Yan ¹,*, Haojie Jia ¹, Shaofeng Wang ², Peiran Wu ¹ and Wenzhi Zhao ¹

¹ School of Information and Software Engineering, East China Jiaotong University, Nanchang 330013, China

² MOE Engineering Research Center of Railway Environmental Vibration and Noise, East China Jiaotong University, Nanchang 330013, China

* Author to whom correspondence should be addressed.

ISPRS Int. J. Geo-Inf. 2026, 15(6), 229; https://doi.org/10.3390/ijgi15060229

Submission received: 17 March 2026 / Revised: 5 May 2026 / Accepted: 18 May 2026 / Published: 22 May 2026

(This article belongs to the Topic Applications of Intelligent Technologies in the Life Cycle of Transportation Infrastructure)


Abstract

Multi-Agent Reinforcement Learning (MARL) provides an effective approach for urban multi-intersection traffic signal control. However, existing methods face two fundamental challenges: policy homogenization and inefficient credit assignment. The former leads to convergent agent policies that fail to adapt to heterogeneous traffic patterns, while the latter prevents agents from accurately evaluating their individual contributions to system performance. To address these issues, this paper proposes a Multi-Agent Hierarchical Contrastive Learning Traffic Signal Control (MAHCL-TSC) model. The model incorporates an unsupervised contrastive learning module that enhances the discriminative power of state representations, thereby alleviating policy homogenization. Additionally, it includes a hierarchical graph convolutional credit allocation network that leverages road network topology and functional characteristics to enable structure-aware collaborative value estimation, significantly improving the precision of credit assignment. Based on these components, a Contrastive QTRAN with Hierarchical Graph Convolution (CQTRAN-HGC) algorithm is proposed, which jointly optimizes the contrastive learning loss and the QTRAN constraint loss. Experiments conducted in the Simulation of Urban Mobility (SUMO) environment on 4 × 4 and 6 × 6 synthetic grid networks demonstrate that the proposed model improves traffic signal control performance under the tested structured simulation settings and shows potential scalability as the network size increases.

Keywords: multi-agent reinforcement learning; traffic signal control; unsupervised contrastive learning; credit assignment; graph convolutional network

1. Introduction

Traffic congestion represents a persistent problem in modern urban development, leading to significant time loss, economic costs, and increased energy consumption and environmental pollution [1]. In this context, traffic signal control plays a crucial role in improving road network efficiency. With advances in artificial intelligence, reinforcement learning-based methods have shown considerable promise for signal control. However, in multi-intersection coordination scenarios, challenges such as policy convergence among agents and inefficient global reward allocation continue to limit their effectiveness in large-scale networks [2].

As a fundamental component of urban intelligent transportation systems, traffic signal control is inherently a sequential decision-making problem with high-dimensional state spaces and complex dynamics. Reinforcement learning (RL) has consequently emerged as a prominent approach to traffic signal control [3,4], given its capacity to learn optimal policies through environmental interaction. Initial research mainly focused on single-agent RL architectures [5,6], wherein a centralized controller makes unified decisions for all intersections in a region. Although conceptually straightforward, this centralized paradigm encounters substantial computational complexity and scalability limitations in practical deployment [7,8]. With growing traffic network complexity, multi-agent reinforcement learning (MARL) has become the predominant research direction [9,10]. Under this paradigm, each intersection operates as an autonomous agent that makes decentralized decisions based on local observations. However, this distributed approach introduces two critical challenges. First, the credit assignment problem arises because all agents share a global reward signal, making it difficult for individual agents to evaluate their specific contribution to system performance. Second, policy homogenization limits the system’s adaptability. Conventional MARL methods typically employ identical network architectures and learning rules across all agents, resulting in convergent behavior. Such homogeneous policies cannot adequately accommodate the heterogeneity of intersection topologies, traffic patterns, and functional requirements, ultimately compromising control efficiency.

To address these challenges, this study proposes a Multi-Agent Hierarchical Contrastive Learning Traffic Signal Control (MAHCL-TSC) model. The regional traffic network is modeled as a graph, where intersections are nodes and road segments are edges. To capture the complex spatial dependencies within the network, we introduce a hierarchical Graph Convolutional Network (GCN). By stacking multiple GCN layers, each intersection agent can aggregate information from its multi-hop neighbors, thereby extending its perceptual field and capturing non-local traffic dynamics. This architecture not only extracts relational representations between nodes but also reduces the communication load between agents through hierarchical aggregation. Furthermore, we incorporate an unsupervised contrastive learning mechanism. By clustering the agents' state representations and refining them with a contrastive objective, the model's ability to discriminate between diverse traffic patterns is enhanced. This mechanism mitigates policy homogenization and fosters the development of efficient, coordinated strategies among agents. The main contributions of this work are as follows.

(1) A MAHCL-TSC model is proposed under the Centralized Training with Decentralized Execution (CTDE) paradigm. It integrates four core modules (control environment, data acquisition, network architecture, and contrastive learning) into an intelligent closed-loop system. This design addresses policy coordination and asynchronous decision-making in multi-intersection control.
(2) A policy diversification mechanism using unsupervised contrastive learning is designed. It generates regional pseudo-labels via multimodal feature fusion and K-means clustering, then refines agent representations with a supervised contrastive loss. This enhances the discrimination of heterogeneous traffic patterns, mitigates policy homogenization, and provides potential benefits for adapting to varying traffic demand patterns.
(3) A hierarchical graph convolutional credit assignment network is developed. It partitions the road network into functional regions via clustering, while GCNs hierarchically extract intra-region and global features. This explicitly models agent interactions, optimizes credit assignment in QTRAN, and strengthens the global–local reward association, boosting collaborative efficiency and scalability in large networks.

The remainder of this paper is organized as follows. Section 2 reviews related work on multi-agent reinforcement learning for traffic signal control. Section 3 introduces the MAHCL-TSC model and provides its formal formulation based on the Decentralized Partially Observable Markov Decision Process (Dec-POMDP). Section 4 details the CQTRAN-HGC algorithm, with emphasis on its two core innovative components, the Contrastive Policy Diversification Module and the Hierarchical Credit Assignment Network. Section 5 presents the experimental setup and discusses the results. Finally, Section 6 concludes the paper and outlines potential directions for future research.

2. Related Work

The evolution of traffic signal control technology has undergone a paradigm shift from static planning to dynamic optimization, and from isolated single-point intelligence to multi-agent collaboration. Research in this field aims to overcome the limitations of traditional methods—namely, poor adaptability—and the scalability constraints of single-agent approaches. The developmental trajectory can be broadly categorized into several key directions.

Early research established the foundation for applying reinforcement learning (RL) in this domain. As fixed-time plans became increasingly inadequate for dynamic traffic flows, researchers turned to RL techniques. Beginning with Mikami’s [11] pioneering work that demonstrated the feasibility of RL, the subsequent deep integration of deep learning led to the emergence of deep reinforcement learning (DRL) methods [12,13], which effectively addressed the challenges of high-dimensional state spaces. Initial efforts primarily focused on optimizing control for single intersections. However, when extended to multi-intersection networks, single-agent RL methods suffered from severely degraded performance due to the curse of dimensionality in the joint action space, prompting the adoption of multi-agent deep reinforcement learning (MARL) as a necessary evolution.

To achieve effective coordination, subsequent work primarily followed two paths: information sharing and network-aware modeling. Van der Pol [14] pioneered the application of DRL to multi-intersection coordination, introducing a novel reward function that integrated multiple metrics. Subsequently, Casas [15] formulated a continuous control model using deep deterministic policy gradients, while Balaji [16] proposed a distributed Q-learning model enabling real-time congestion information sharing among neighboring agents. Liang et al. [17] approached the problem from an action representation perspective, modeling it as a phase duration optimization problem. While these studies improved system scalability, they remained constrained by the limited observational capabilities of individual agents.

To address the partial observability problem, researchers developed various enhanced architectures and training schemes to improve cooperative efficiency. Chu et al. [18] extended the independent A2C algorithm by incorporating neighbor policy fingerprinting and spatial discount factors. Building on this, Lin et al. [19] designed a model accommodating heterogeneous state observations and employed generalized advantage estimation to enhance policy diversity. Ge [20] proposed a Q-value transfer mechanism facilitating value function sharing among agents. Concurrently, graph neural networks (GNNs) emerged as a powerful tool for structured modeling. Xu [21] identified pivotal nodes using the CRRank algorithm, constructed a bidirectional tripartite graph model, and implemented adaptive control. Recent studies have also investigated intersection operations from the perspective of simulation-based evaluation and data-driven delay estimation. For example, Owais et al. [22] evaluated when roundabouts should be converted into signalized intersections through simulation case studies in Jeddah and Al-Madinah, highlighting the importance of intersection-level operational assessment under changing traffic demand.

Recent studies have further extended intelligent transportation research by considering more complex traffic scenarios, heterogeneous traffic participants, and data-driven traffic evaluation. For example, Luo et al. [23] developed a real-time early warning framework for multi-dimensional driving risk of heavy-duty trucks using trajectory data, while Zhai et al. [24] proposed a throttle-based self-stabilizing control scheme integrated into an anisotropic continuum model to improve traffic system resilience under cyber-attacks in connected vehicle scenarios. These studies indicate that intelligent transportation systems are moving toward more realistic, heterogeneous, and safety-aware traffic modeling. However, they mainly focus on driving risk assessment, traffic flow stability, and connected-vehicle control, rather than multi-intersection traffic signal coordination. In contrast, this study focuses on MARL-based urban traffic signal control and addresses two specific challenges: policy homogenization among intersection agents and inefficient credit assignment under global reward feedback.

3. Problem Definition

3.1. MAHCL-TSC Model

Multi-intersection traffic signal control can be naturally formulated as a cooperative multi-agent decision-making problem. In this setting, each intersection controller is treated as an agent that makes signal phase decisions based on its own local traffic observations, while the overall traffic performance depends on the coordinated behavior of all agents. Since each agent has access only to partial local information, whereas the control objective is defined at the network level, the problem is modeled as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) under the Centralized Training with Decentralized Execution (CTDE) paradigm.

Formally, the multi-intersection traffic signal control problem can be characterized by a Dec-POMDP tuple $\langle S, A, P, R, \Omega, O, \gamma, N \rangle$, where $S$ denotes the global state space of the traffic network, $A = \prod_{i=1}^{N} A_i$ denotes the joint action space of all agents, $P(s' \mid s, a)$ denotes the state transition probability, $R(s, a)$ denotes the global reward function, $\Omega$ denotes the local observation space, $O(o_i \mid s, i)$ denotes the observation function of agent $i$, $\gamma$ is the discount factor, and $N$ is the number of controlled intersections. Under this formulation, each agent $i$ selects its action $a_i \in A_i$ according to its local observation $o_i \in \Omega$, while the environment evolves according to the joint action $a = (a_1, \dots, a_N)$.

Figure 1 illustrates the overall framework of the proposed MAHCL-TSC model. Following the CTDE paradigm [25], each agent executes its policy independently based on local observations during deployment, whereas centralized information is utilized during training to improve coordination and learning efficiency. Specifically, each intersection agent perceives local traffic features, including queue lengths at incoming lanes, current signal phases, and traffic flow conditions, and then selects the corresponding signal phase action. Meanwhile, the interactions between agents and the traffic environment generate transition trajectories represented as:

$$\tau = (s_t, a_t, r_t, s_{t+1}) \tag{1}$$

where $s_t$ denotes the global traffic state at time step $t$, $a_t$ denotes the joint action of all agents, $r_t$ denotes the global reward, and $s_{t+1}$ denotes the next global state. These trajectories are stored in the replay buffer $D$ and used for offline training of the proposed CQTRAN-HGC algorithm.
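As a rough illustration of how transitions of the form in Equation (1) might be stored and sampled, the following Python sketch implements a small replay buffer. The class name `ReplayBuffer`, the capacity, and the uniform sampling rule are assumptions made for this example; the paper does not describe the buffer implementation.

```python
import random
from collections import deque, namedtuple

# One transition tau = (s_t, a_t, r_t, s_{t+1}) as in Equation (1).
Transition = namedtuple("Transition", ["state", "joint_action", "reward", "next_state"])


class ReplayBuffer:
    """Fixed-capacity FIFO buffer D (capacity and uniform sampling are assumptions)."""

    def __init__(self, capacity=10000, seed=0):
        self._data = deque(maxlen=capacity)  # oldest transitions are dropped first
        self._rng = random.Random(seed)

    def push(self, state, joint_action, reward, next_state):
        self._data.append(Transition(state, joint_action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement from the stored transitions.
        return self._rng.sample(list(self._data), batch_size)

    def __len__(self):
        return len(self._data)


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    # Toy data: the global state is a list, the joint action holds one phase index per agent.
    buffer.push([t, t], (0, 1), -float(t), [t + 1, t + 1])

batch = buffer.sample(2)
```

With a capacity of 3, only the three most recent transitions remain after five insertions, which mirrors the usual sliding-window behavior of experience replay.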

Under the CTDE paradigm, the execution and training stages use different information scopes. During decentralized execution, each intersection agent selects its signal phase only according to its local observation, including queue length, current phase, lane occupancy, and traffic flow information around the controlled intersection. During centralized training, however, the learning process can access the global traffic state, the joint actions of all agents, and the global reward generated by the entire road network. These centralized signals are stored in the replay buffer and used to train the joint value network in the CQTRAN-HGC algorithm.

The overall workflow of the MAHCL-TSC model forms a closed-loop optimization process. First, the traffic simulator generates dynamic traffic states according to road network topology and traffic flow patterns. Then, each agent collects its local observations and interacts with the environment to produce state–action–reward–state transition data. Based on these collected experiences, the proposed framework integrates three major components for cooperative signal control: a contrastive policy diversification module for enhancing representation discrimination, a policy network for local decision-making, and a joint value network for global credit assignment. Through the coordinated optimization of these components, the proposed method effectively alleviates policy homogenization and improves credit assignment in large-scale multi-intersection traffic signal control scenarios.

3.2. RL Parameter Configuration

In the Dec-POMDP formulation, the global state represents the overall traffic condition of the entire road network and is mainly used during centralized training. In contrast, each intersection agent can only access its local observation during decentralized execution. Therefore, the practical input of each agent is not the full global state, but a local observation vector composed of traffic features collected from its controlled intersection. To avoid ambiguity, this section uses local observation to describe the agent-level traffic information used for decision-making, while the term global state refers to the network-level state used in the centralized training process.

The MAHCL-TSC model centers on the formulation of three fundamental components in traffic signal control: local observation, action space, and reward function. The sections below elaborate on the representation of local traffic observations and the design of the action space and reward mechanism.

Local Observation Space ($\Omega$): The local observation space provides each agent with a multi-dimensional representation of the local traffic environment, which serves as the basis for decentralized decision-making. In this study, the local observation of an intersection consists of four components that characterize its real-time traffic conditions.

Phase State ($s_{\mathrm{phase}} \in \{0, 1\}^{k}$): The current traffic signal phase is represented by a one-hot encoded vector, where $k$ denotes the number of feasible signal phases at the intersection. In this vector, the active phase is set to 1, while all other elements are set to 0.

Lane Space Occupancy ($s_{\mathrm{lane}} \in \mathbb{R}^{M \times H}$): This component describes the fine-grained spatial occupancy of approach lanes. Here, $M$ denotes the number of approach lanes, and $H$ denotes the number of discretized grid cells along each lane. Each element $s_{\mathrm{lane}}[m, h]$ represents the occupancy status of the $h$-th grid cell on lane $m$, indicating the spatial distribution of vehicles.

Lane Traffic Flow ($s_{\mathrm{volume}} \in \mathbb{R}^{M}$): This vector represents the traffic flow intensity of each inbound lane. Each element $s_{\mathrm{volume}}[m]$ denotes the traffic volume or flow rate observed on lane $m$ within the most recent observation interval.

Lane Queue Length ($s_{\mathrm{queue}} \in \mathbb{N}^{M}$): This non-negative integer vector of dimension $M$ quantifies the queuing conditions on each approach lane. An element $s_{\mathrm{queue}}[m]$ indicates the number of vehicles in a queued state on lane $m$, where a queued vehicle is typically defined as one whose instantaneous speed is below a minimum threshold such as 0.1 m/s.
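To make the four components above concrete, the following Python sketch assembles them into a single flat observation vector. The function name `build_local_observation`, the flattening order, and the toy values are assumptions made for illustration; the paper does not specify how the components are laid out in memory.

```python
def one_hot(index, size):
    """Return a one-hot list with a 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]


def build_local_observation(phase_index, num_phases, occupancy, volume, queue):
    """Flatten the four components into one observation vector.

    occupancy: M lists of H cell values (s_lane)
    volume:    M flow values (s_volume)
    queue:     M non-negative integers (s_queue)
    """
    num_lanes = len(occupancy)
    if len(volume) != num_lanes or len(queue) != num_lanes:
        raise ValueError("occupancy, volume, and queue must cover the same M lanes")
    s_phase = one_hot(phase_index, num_phases)
    s_lane = [cell for lane in occupancy for cell in lane]  # row-major flatten of M x H
    return s_phase + s_lane + list(volume) + list(queue)


# Toy intersection: k = 4 phases, M = 2 approach lanes, H = 3 cells per lane.
obs = build_local_observation(
    phase_index=2,
    num_phases=4,
    occupancy=[[1, 0, 1], [0, 0, 1]],
    volume=[0.5, 0.2],
    queue=[3, 1],
)
```

The resulting vector has k + M·H + 2M entries, which is 4 + 6 + 4 = 14 for the toy intersection.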

Action Space ($A$): The action space is defined as a finite set of feasible signal phases that can be selected by each agent. For the synthetic four-phase intersection considered in this study, the action space is defined as

$$A = \{a_1, a_2, a_3, a_4\} \tag{2}$$

where each action corresponds to a specific signal phase configuration. As illustrated in Figure 2, these four actions represent the candidate traffic signal phases for intersection control. At each decision step, agent $i$ evaluates the action-value $Q_i(o_i, a_i)$ for all $a_i \in A$, and selects the action with the maximum estimated value for execution at the next time step. For more complex real-world road networks, the number and ordering of feasible phases may vary according to intersection geometry and traffic control requirements. In such cases, the discrete action set can be generalized as

$$A = \{a_1, a_2, \dots, a_K\} \tag{3}$$

where $K$ denotes the number of feasible signal phase configurations.

In this study, the control influence of each agent is reflected in the selection of the next signal phase rather than the direct optimization of variable green time or cycle length. At each decision step, agent $i$ observes the local traffic state and selects one feasible phase from the discrete action set. Once a phase is selected, it is executed for a fixed signal decision interval, while the minimum green time, yellow time, and all-red time are predefined in the SUMO simulation settings. Therefore, the proposed model focuses on adaptive phase selection under a fixed-time execution interval.

For example, consider an intersection agent at one decision step. Suppose the current local observation indicates that the east–west incoming lanes have longer queues and higher traffic flow than the north–south incoming lanes. The agent evaluates all feasible signal phases according to the local action-value function $Q_i(o_i, a_i)$, where $o_i$ includes the current signal phase, lane occupancy, lane traffic flow, and queue length. If selecting the east–west through phase is expected to reduce the queue length, waiting time, and intersection pressure more effectively, this phase will obtain a higher action value and is more likely to be selected. After the selected phase is executed for the fixed decision interval, the environment returns the updated traffic state and the reward computed from queue length, waiting time, and pressure. This example illustrates how local traffic parameters influence phase-selection decisions in one control cycle.
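The phase-selection step in this example reduces to an argmax over the local action values. The short Python sketch below shows that step in isolation; the Q-values are made-up numbers, and the helper name `select_phase` is hypothetical rather than taken from the paper.

```python
def select_phase(q_values):
    """Return the index of the phase with the largest estimated action value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical Q_i(o_i, a) for the four phases a_1..a_4 (values are made up).
q_values = [-12.4, -7.9, -15.0, -9.3]
chosen = select_phase(q_values)
```

Because rewards are negative costs, the least negative value wins; here the second phase (index 1) is selected. Ties are resolved in favor of the lowest phase index, which is a property of this sketch and not something the paper specifies.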

Reward Function ($R$):

The reward function is designed to reflect both local congestion and network-level coordination. Queue length is included because it directly represents the congestion level at each intersection. Waiting time is used to measure the travel efficiency experienced by individual vehicles. Intersection pressure is introduced to capture the imbalance between incoming and outgoing traffic flows, which is commonly related to network-level traffic coordination. Therefore, the three terms jointly describe local queue accumulation, vehicle delay, and flow balance in the road network.

At each discrete time step $t$, agent $i$ receives an immediate reward $r_t$ from the traffic environment. To accommodate multiple traffic control objectives, we design the reward as a weighted multi-objective combination of three key factors: queue length, vehicle waiting time, and intersection pressure. Accordingly, the reward function is defined as

$$r_t = -\left( \alpha \sum_{i=1}^{M} q_i + \beta \sum_{j=1}^{V} w_j + \gamma \, p_t \right) \tag{4}$$

where $q_i$ denotes the queue length of inbound lane $i$, representing the number of vehicles waiting behind the stop line; $w_j$ denotes the incremental waiting time of vehicle $j$ at the current time step, with the sum taken over the $V$ vehicles considered; and $p_t$ denotes the intersection pressure at time step $t$, defined as the difference between the total queue length of inbound lanes and that of outbound lanes. Within Equation (4), the summation index $i$ runs over inbound lanes rather than agents, and $\gamma$ is a reward weight that is distinct from the discount factor $\gamma$ of the Dec-POMDP tuple.

The coefficients $\alpha$, $\beta$, and $\gamma$ are weighting parameters used to balance queue length, waiting time, and intersection pressure. In this study, the reward weights are set to $\alpha = 0.4$, $\beta = 0.3$, and $\gamma = 0.3$, respectively, based on preliminary experiments. A slightly larger weight is assigned to queue length because queue accumulation directly reflects local congestion at intersections, while waiting time and intersection pressure are also considered to balance individual travel efficiency and network-level traffic coordination.
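A small worked instance of Equation (4) with the stated weights may help fix the sign and scale of the reward. The Python sketch below is illustrative only: the function names and the toy lane and vehicle values are assumptions, while the weights 0.4, 0.3, and 0.3 and the definition of pressure come from the text.

```python
def intersection_pressure(inbound_queues, outbound_queues):
    """Pressure p_t: total inbound queue length minus total outbound queue length."""
    return sum(inbound_queues) - sum(outbound_queues)


def reward(inbound_queues, waiting_increments, outbound_queues,
           alpha=0.4, beta=0.3, gamma=0.3):
    """Equation (4): negative weighted sum of queue length, waiting time, and pressure."""
    pressure = intersection_pressure(inbound_queues, outbound_queues)
    return -(alpha * sum(inbound_queues)
             + beta * sum(waiting_increments)
             + gamma * pressure)


# Toy values: 4 inbound lanes, 5 vehicles, 4 outbound lanes.
r_t = reward(
    inbound_queues=[3, 1, 4, 2],                   # total queue = 10
    waiting_increments=[1.0, 1.0, 0.5, 0.0, 1.0],  # total waiting increment = 3.5
    outbound_queues=[1, 0, 2, 1],                  # total outbound queue = 4, so p_t = 6
)
```

For these values the reward is −(0.4·10 + 0.3·3.5 + 0.3·6) = −6.85, so longer queues, longer waits, or higher pressure all push the reward further below zero.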

4. CQTRAN-HGC Algorithm for Multi-Intersection Traffic Signal Control

Building upon the MAHCL-TSC model established in Section 3, this section introduces the core CQTRAN-HGC algorithm, which integrates two fundamental components: the agent network architecture and unsupervised contrastive learning for enhanced collaborative decision-making. The agent network architecture acts as the computational core of the system, implementing hierarchical graph convolutions alongside policy-value dual networks to enable distributed decision-making and collaborative value modeling across agents. Simultaneously, the unsupervised contrastive learning module provides dynamic environment perception and feature optimization capabilities. Section 4.1 describes the policy network structure and the contrastive learning strategy. Section 4.2 presents the hierarchical graph convolutional architecture and joint value network design. Finally, Section 4.3 describes the parameter update procedure for the joint value network within the CQTRAN-HGC algorithm.

The proposed CQTRAN-HGC algorithm combines the contrastive policy diversification module and the Hierarchical Credit Assignment Network (HCAN, detailed in Section 4.2) within a unified training framework. First, each agent encodes its local observation and generates a latent representation for action selection. Then, K-means clustering and pseudo-label-guided contrastive learning are applied to these representations to enhance their discriminability. The refined representations and clustering results are further used by HCAN to construct regional graph features and estimate the global joint value. Finally, the QTRAN loss and the contrastive learning loss are jointly optimized, enabling the policy network and the hierarchical value network to be updated collaboratively.

4.1. Contrastive Policy Diversification Module

In conventional QTRAN-based methods, each agent typically learns its policy mainly from local observations, which often leads to highly similar latent representations across agents. As a result, different intersections may converge to homogeneous policies, even when they exhibit distinct spatial structures and traffic dynamics. To alleviate this problem, this study introduces a contrastive policy diversification module, which enhances the discriminability of agent representations through multimodal feature fusion and contrastive learning. The objective of this module is to enable different agents to learn region-sensitive policy representations, thereby encouraging more discriminative agent representations and potentially reducing policy homogenization in heterogeneous traffic environments.

As shown in Figure 3a, the proposed local policy network adopts a multimodal encoding architecture to extract heterogeneous traffic information from each intersection. Specifically, three complementary feature sources are considered: lane queue observations, bird’s-eye-view spatial topology, and temporal delay information. These features are first encoded separately and then fused into a unified latent representation, which is further modeled by a recurrent unit to capture temporal dependencies. Based on the resulting latent embedding, each agent outputs its action policy, while contrastive learning is introduced to enhance the separability of latent representations corresponding to different regional traffic patterns.

The lane queue encoder processes the queue observation vector $o_{\mathrm{queue}} \in \mathbb{R}^{M}$ through a two-layer fully connected network:

$$h_{\mathrm{queue}} = \mathrm{ReLU}(W_{q1} \, o_{\mathrm{queue}} + b_{q1}), \quad W_{q1} \in \mathbb{R}^{128 \times M} \tag{5}$$

$$e_{\mathrm{queue}} = \mathrm{ReLU}(W_{q2} \, h_{\mathrm{queue}} + b_{q2}), \quad W_{q2} \in \mathbb{R}^{64 \times 128} \tag{6}$$

where $o_{\mathrm{queue}}$ denotes the lane queue observation vector of an intersection, $W_{q1}$ and $W_{q2}$ are learnable weight matrices acting on column vectors, $b_{q1}$ and $b_{q2}$ are bias terms, and $\mathrm{ReLU}(\cdot)$ is the nonlinear activation function. The output $e_{\mathrm{queue}} \in \mathbb{R}^{64}$ represents the encoded queue feature, which is used as one of the multimodal inputs for subsequent feature fusion.

The Bird’s Eye View (BEV) encoder utilizes a ResNet-18 architecture to process spatial features $o_{\mathrm{bev}} \in \mathbb{R}^{D \times H \times W}$ and extract topological relationships through convolutional layers:

$$e_{\mathrm{bev}} = \mathrm{ResNet}_{\theta_2}(o_{\mathrm{bev}}), \quad e_{\mathrm{bev}} \in \mathbb{R}^{256} \tag{7}$$

The temporal delay encoder maps temporal statistics $o_{\mathrm{time}} \in \mathbb{R}^{2}$ into a compact temporal embedding:

$$e_{\mathrm{time}} = \mathrm{MLP}_{\theta_3}(o_{\mathrm{time}}), \quad e_{\mathrm{time}} \in \mathbb{R}^{64} \tag{8}$$

To obtain a unified latent representation, the encoded features are fused by linear projection followed by element-wise addition:

$$\hat{e}_{\mathrm{bev}} = W_b \, e_{\mathrm{bev}}, \quad \hat{e}_{\mathrm{time}} = W_t \, e_{\mathrm{time}}, \quad h_i = e_{\mathrm{queue}} + \hat{e}_{\mathrm{bev}} + \hat{e}_{\mathrm{time}} \tag{9}$$

where $W_b$ and $W_t$ are learnable projection matrices used to align the feature dimensions, so that all three terms have the same dimension as $e_{\mathrm{queue}}$ before they are added. The fusion is therefore an element-wise sum rather than a concatenation. The fused latent representation is then fed into a gated recurrent unit (GRU) to model temporal dependencies:

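$$z_i^{t} = \sigma\left(W_z \, [h_i^{t}, \bar{h}_i^{t-1}] + b_z\right) \tag{10}$$

$$\rho_i^{t} = \sigma\left(W_r \, [h_i^{t}, \bar{h}_i^{t-1}] + b_r\right) \tag{11}$$

$$\tilde{h}_i^{t} = \tanh\left(W_h \, [\rho_i^{t} \odot \bar{h}_i^{t-1}, h_i^{t}] + b_h\right) \tag{12}$$

$$\bar{h}_i^{t} = (1 - z_i^{t}) \odot \bar{h}_i^{t-1} + z_i^{t} \odot \tilde{h}_i^{t} \tag{13}$$

where the subscript $i$ indexes the agent and the superscript $t$ indexes the time step. $z_i^{t}$, $\rho_i^{t}$, and $\tilde{h}_i^{t}$ denote the update gate, reset gate, and candidate hidden state of the GRU, respectively. $h_i^{t}$ is the fused representation of agent $i$ at time step $t$, $\bar{h}_i^{t}$ is the hidden state at time step $t$, and $\bar{h}_i^{t-1}$ is the hidden state from the previous time step. $W_z$, $W_r$, and $W_h$ are learnable weight matrices applied to the concatenated vectors $[\cdot, \cdot]$, which is equivalent to using separate input and recurrent weight matrices (the latter commonly written $U_z$, $U_r$, and $U_h$), while $b_z$, $b_r$, and $b_h$ are bias terms. $\sigma(\cdot)$ denotes the sigmoid activation function, $\tanh(\cdot)$ denotes the hyperbolic tangent function, and $\odot$ represents element-wise multiplication. For brevity, the time superscript is omitted below, and $\bar{h}_i$ denotes the current recurrent representation of agent $i$.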
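The path from encoded features to the recurrent representation (Equations (5), (6), (9), and (10)–(13)) can be sketched in plain Python as follows. This is a minimal illustration under several assumptions: weights are random rather than trained, the ResNet-18 and MLP outputs of Equations (7) and (8) are replaced by random stand-ins, the projection shapes are inferred from the stated embedding sizes, and all function and dictionary key names are hypothetical.

```python
import math
import random


def matvec(weights, vec):
    """Multiply a (rows x cols) nested-list matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def affine(weights, bias, vec):
    return [x + b for x, b in zip(matvec(weights, vec), bias)]


def sigmoid(x):
    # Equivalent to 1 / (1 + exp(-x)), written with tanh so it cannot overflow.
    return 0.5 * (1.0 + math.tanh(0.5 * x))


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def encode_queue(o_queue, p):
    """Equations (5)-(6): two fully connected ReLU layers, M -> 128 -> 64."""
    hidden = [max(0.0, x) for x in affine(p["W_q1"], p["b_q1"], o_queue)]
    return [max(0.0, x) for x in affine(p["W_q2"], p["b_q2"], hidden)]


def fuse(e_queue, e_bev, e_time, p):
    """Equation (9): project e_bev and e_time to 64 dims, then add element-wise."""
    e_bev_hat, e_time_hat = matvec(p["W_b"], e_bev), matvec(p["W_t"], e_time)
    return [q + b + t for q, b, t in zip(e_queue, e_bev_hat, e_time_hat)]


def gru_step(h_in, h_prev, p):
    """Equations (10)-(13): one GRU update with concatenated inputs."""
    concat = h_in + h_prev
    z = [sigmoid(x) for x in affine(p["W_z"], p["b_z"], concat)]      # update gate
    rho = [sigmoid(x) for x in affine(p["W_r"], p["b_r"], concat)]    # reset gate
    gated = [r * h for r, h in zip(rho, h_prev)] + h_in               # [rho * h_prev, h_in]
    cand = [math.tanh(x) for x in affine(p["W_h"], p["b_h"], gated)]  # candidate state
    return [(1.0 - zi) * hp + zi * c for zi, hp, c in zip(z, h_prev, cand)]


rng = random.Random(0)
M, D = 8, 64  # M lanes is a toy value; D = 64 matches the stated embedding size
p = {
    "W_q1": rand_matrix(128, M, rng), "b_q1": [0.0] * 128,
    "W_q2": rand_matrix(D, 128, rng), "b_q2": [0.0] * D,
    "W_b": rand_matrix(D, 256, rng), "W_t": rand_matrix(D, 64, rng),
    "W_z": rand_matrix(D, 2 * D, rng), "b_z": [0.0] * D,
    "W_r": rand_matrix(D, 2 * D, rng), "b_r": [0.0] * D,
    "W_h": rand_matrix(D, 2 * D, rng), "b_h": [0.0] * D,
}

h_bar = [0.0] * D  # initial hidden state
for _ in range(3):  # three consecutive decision steps
    o_queue = [float(rng.randint(0, 5)) for _ in range(M)]
    e_bev = [rng.uniform(-1, 1) for _ in range(256)]  # stand-in for the ResNet-18 output
    e_time = [rng.uniform(-1, 1) for _ in range(64)]  # stand-in for the MLP output
    h_in = fuse(encode_queue(o_queue, p), e_bev, e_time, p)
    h_bar = gru_step(h_in, h_bar, p)
```

Because the new hidden state is a gate-weighted average of the previous state and a tanh output, every entry of `h_bar` stays within [−1, 1] when the initial state does. In practice a deep learning framework would replace these list-based helpers.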

Based on the recurrent representation $\bar{h}_i$, the policy network outputs the action probability distribution of agent $i$ through a fully connected layer with softmax activation:

$$\pi_i(a_i \mid o_i) = \mathrm{softmax}(W_\pi \, \bar{h}_i + b_\pi) \tag{14}$$
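Equation (14) is a single affine layer followed by a softmax. The sketch below shows it on a toy 3-dimensional recurrent state with four candidate phases; the weights, the state, and the function names are invented for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy(h_bar, W_pi, b_pi):
    """Equation (14): pi_i(. | o_i) = softmax(W_pi h_bar + b_pi)."""
    logits = [sum(w * h for w, h in zip(row, h_bar)) + b for row, b in zip(W_pi, b_pi)]
    return softmax(logits)


# Toy example: 3-dimensional recurrent state, 4 candidate phases (weights are made up).
h_bar = [0.2, -0.1, 0.4]
W_pi = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.5, 0.5, 0.5]]
b_pi = [0.0, 0.0, 0.0, 0.0]
probs = policy(h_bar, W_pi, b_pi)
```

The logits here are 0.2, −0.1, 0.4, and 0.25, so the third phase receives the largest probability and the four probabilities sum to one.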

To address policy homogenization among agents, we incorporate an unsupervised contrastive learning mechanism [26]. The proposed mechanism does not rely on manually annotated class labels. Instead, it first groups the agents' latent representations by K-means clustering and then uses the resulting cluster assignments as pseudo-labels to construct the contrastive learning objective.

To further alleviate policy homogenization, a contrastive learning objective (Figure 3b) is imposed on the latent representations of all agents. Specifically, the set of latent embeddings $\{\bar{h}_1, \bar{h}_2, \dots, \bar{h}_N\}$ is clustered by K-means to generate pseudo-labels $y_i \in \{1, \dots, K\}$, where agents assigned to the same cluster are regarded as sharing similar regional traffic patterns. Based on these pseudo-labels, a supervised contrastive objective is used to improve the separability of latent representations:

$$L_{\mathrm{cont}} = -\sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(\bar{h}_i, \bar{h}_i^{+}) / \tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(\bar{h}_i, \bar{h}_j) / \tau\right)} \tag{15}$$

where $\mathrm{sim}(u, v) = \frac{u^{\top} v}{\lVert u \rVert \, \lVert v \rVert}$ denotes cosine similarity, $\bar{h}_i^{+}$ denotes a positive sample belonging to the same cluster as $\bar{h}_i$, and $\tau$ is the temperature parameter. Through this pseudo-label-guided contrastive learning process, intersections with different regional traffic characteristics are encouraged to learn more discriminative latent representations, thereby improving policy diversity and adaptation capability in heterogeneous urban traffic scenarios.
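A loss in the spirit of Equation (15) can be sketched in a few lines of Python. Two choices below are assumptions of this sketch and are not stated in the paper: the anchor itself is excluded from the denominator, and when an anchor has several same-cluster positives the log terms are averaged over them. The temperature value and the toy embeddings are also made up.

```python
import math


def cosine(u, v):
    """Cosine similarity sim(u, v) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def contrastive_loss(embeddings, labels, temperature=0.5):
    """Sum over anchors of the mean negative log-softmax score of same-cluster positives."""
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]  # assumption: anchor excluded
        positives = [j for j in others if labels[j] == labels[i]]
        if not positives:
            continue  # singleton cluster: no positive pair for this anchor
        scores = {j: math.exp(cosine(embeddings[i], embeddings[j]) / temperature)
                  for j in others}
        denom = sum(scores.values())
        total += -sum(math.log(scores[j] / denom) for j in positives) / len(positives)
    return total


# Four toy agent embeddings forming two directions in a 2-D latent space.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_aligned = contrastive_loss(embeddings, labels=[0, 0, 1, 1])   # labels match geometry
loss_shuffled = contrastive_loss(embeddings, labels=[0, 1, 0, 1])  # labels contradict it
```

When the pseudo-labels agree with the geometry of the embeddings the loss is small, and when they contradict it the loss is larger. Minimizing the loss therefore pulls same-cluster agents together and pushes other agents apart.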

In this study, the number of clusters $K$ is set according to the network scale and the number of controlled agents. Specifically, $K = 3$ is used for the 4 × 4 network with 10 controlled agents, and $K = 6$ is used for the 6 × 6 network with 20 controlled agents. This setting aims to group intersections with similar spatial and traffic characteristics into functional regions while avoiding overly fragmented clusters. During training, K-means clustering is applied to the latent representations of agents to generate pseudo-labels, which are then used to construct the contrastive learning objective. Since the latent representations evolve during training, the pseudo-labels may also change accordingly. This dynamic update allows the contrastive module to adapt to the learned representation space. However, a systematic sensitivity analysis of $K$, cluster stability, and pseudo-label drift is not included in the current study and will be further investigated in future work.

The proposed pseudo-label-guided contrastive learning module is designed to encourage agents under different regional traffic patterns to learn more discriminative latent representations, thereby providing a potential mechanism for alleviating policy homogenization. In the current study, its effect is mainly reflected in the overall traffic performance improvement under the tested simulation settings. In addition, the clustering results are reused in the subsequent joint value learning process to organize intersections into functional regional groups. This provides a structural basis for hierarchical feature aggregation and refined credit assignment in the joint value network, which will be described in Section 4.2.

4.2. Credit Allocation Network

While QTRAN provides a theoretically grounded framework for cooperative credit assignment through the optimization terms

L_{opt}

and

L_{nopt}

, its joint state–action value estimation still faces substantial challenges in large-scale urban road networks. In particular, the fully connected structure commonly used in joint value modeling becomes computationally expensive as the number of intersections increases, and it does not explicitly exploit the heterogeneous spatial interactions induced by road topology and functional differences across intersections. To address these limitations, this study introduces a Hierarchical Credit Assignment Network (HCAN), which performs structure-aware joint value estimation through regional clustering and hierarchical graph convolution.

The proposed HCAN serves as the credit assignment component of the CQTRAN-HGC framework. Rather than relying directly on a dense global mixing structure, it organizes the traffic network into a hierarchy of regional subgraphs and learns joint value representations in a node-to-cluster-to-global manner. In this way, the network improves both the scalability of joint value learning and the precision of global-to-local credit attribution. Figure 4 illustrates the overall architecture of the proposed joint value network.

To improve computational efficiency, the global road network is first decomposed into

K

functional clusters. This regional decomposition reduces the complexity of global interaction modeling from

O (N^{2})

to

O (\sum_{k = 1}^{K} {| C_{k} |}^{2})

, where

C_{k}

denotes the set of intersections in the

k

-th cluster. Since graph operations are then performed within each cluster independently, the proposed architecture supports parallel processing over regional subgraphs and is therefore more suitable for large-scale urban traffic control. The clustering process is performed according to road topology and traffic flow characteristics. For each intersection

i

, a regional feature vector is defined as

x_{i} = [p_{i}, f_{i}]

(16)

where

p_{i} \in R^{2}

denotes the spatial coordinates of intersection

i

, and

f_{i} \in R^{3}

contains statistical traffic features such as mean flow, flow variance, and peak flow. Based on these feature vectors, the K-means algorithm partitions the

N

intersections into

K

functional clusters

{C_{1}, \dots, C_{K}}

by minimizing the intra-cluster variance:

\min \sum_{k = 1}^{K} \sum_{x_{i} \in C_{k}} {∥ x_{i} - μ_{k} ∥}^{2}

(17)
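The clustering of Equations (16) and (17), together with the interaction-count reduction from O(N^2) to O(\sum_k |C_k|^2) discussed above, can be illustrated with a plain Lloyd's-algorithm sketch. The coordinates and flow statistics are randomly generated placeholders, and the z-score scaling of the features is an assumption, since the text does not say how the spatial and traffic features are balanced.

```python
import numpy as np

def kmeans(x, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; finds a local minimizer of the objective in Eq. (17)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # squared distance from every point to every center
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = x[labels == c]
            if len(members) > 0:              # keep the old center if a cluster empties
                centers[c] = members.mean(axis=0)
    dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1), centers

rng = np.random.default_rng(1)
N, K = 10, 3                                  # agent and cluster counts of the 4 x 4 setting
p = rng.uniform(0.0, 3.0, size=(N, 2))        # hypothetical coordinates p_i
f = rng.uniform(100.0, 900.0, size=(N, 3))    # hypothetical mean flow, variance, peak flow
x = np.hstack([p, f])                         # x_i = [p_i, f_i], Eq. (16)
x = (x - x.mean(axis=0)) / x.std(axis=0)      # assumption: z-score each feature

labels, centers = kmeans(x, K)
sizes = np.bincount(labels, minlength=K)
dense_cost = N ** 2                           # pairwise interactions without clustering
clustered_cost = int((sizes ** 2).sum())      # pairwise interactions within clusters only
```

Because the sum of squared cluster sizes never exceeds the square of their sum, `clustered_cost` is at most `dense_cost`, with the largest saving when clusters are balanced.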

This clustering step groups intersections with similar geographical and traffic characteristics, thereby providing a structural basis for region-aware value learning. After clustering, each functional cluster

C_{k}

is treated as a sub-regional graph:

G_{k} = (V_{k}, E_{k})

(18)

where

V_{k}

denotes the set of intersections within cluster

k

, and

E_{k}

denotes the set of intra-cluster connections. Different from conventional graph-based traffic models, the node features are not raw observations, but the multimodal latent representations

{\overline{h}}_{i}

generated by the local policy network in Section 4.1. Since these features have already been refined by contrastive learning, they contain richer semantic information about regional traffic patterns. To model both spatial proximity and traffic interaction intensity, the weighted edge between intersections

i

and

j

is defined as

w_{i j} = e^{- d_{i j}} c_{i j}

(19)

where

d_{i j}

denotes the Euclidean distance between intersections

i

and

j

, and

c_{i j}

denotes the corresponding road capacity. This formulation assigns larger edge weights to nearby intersections with stronger traffic connectivity, thereby improving the structural fidelity of the subgraph representation.
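A small sketch of the edge weighting in Equation (19) is given below. It assumes coordinates expressed in normalized units (for example, grid spacings), because with raw distances in meters the factor exp(-d_ij) would be vanishingly small; the capacities and the link mask are hypothetical.

```python
import numpy as np

def edge_weights(coords, capacity, adjacency):
    """Eq. (19): w_ij = exp(-d_ij) * c_ij on existing intra-cluster links."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=2)           # Euclidean distance d_ij
    return np.exp(-dist) * capacity * adjacency   # zero where no link exists

# Three intersections on a line, one grid unit apart (hypothetical layout).
coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
adjacency = np.array([[0, 1, 0],
                      [1, 0, 1],
                      [0, 1, 0]], dtype=float)    # 1 where a road link exists
capacity = np.ones((3, 3))                        # assumed normalized road capacity c_ij

w = edge_weights(coords, capacity, adjacency)
```

Directly linked neighbours one unit apart receive weight exp(-1), while unlinked pairs keep weight zero regardless of distance.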

A two-layer graph convolutional network is then applied to each subgraph

G_{k}

:

H_{k}^{(1)} = ReLU ({\tilde{D}}_{k}^{- 1 / 2} {\tilde{A}}_{k} {\tilde{D}}_{k}^{- 1 / 2} H_{k}^{(0)} W^{(0)})

(20)

H_{k}^{(2)} = {\tilde{D}}_{k}^{- 1 / 2} {\tilde{A}}_{k} {\tilde{D}}_{k}^{- 1 / 2} H_{k}^{(1)} W^{(1)}

(21)

where

H_{k}^{(0)} \in R^{| V_{k} | \times d}

is the input node feature matrix,

{\tilde{A}}_{k} = A_{k} + I

is the adjacency matrix with self-loops, and

{\tilde{D}}_{k}

is the corresponding degree matrix. The output

H_{k}^{(2)}

represents the region-aware node embeddings within cluster

k

. To obtain a compact representation for each cluster, max pooling is applied over the node embeddings:

g_{k} = \max_{i \in V_{k}} H_{k, i}^{(2)}

(22)
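Equations (20) to (22) can be sketched for a single cluster as follows. The adjacency weights, feature dimensions, and random parameters are illustrative assumptions; the sketch also assumes nonnegative edge weights so that the degree normalization is well defined.

```python
import numpy as np

def normalized_adjacency(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} used in Eqs. (20)-(21)."""
    A_tilde = A + np.eye(A.shape[0])              # add self-loops
    deg = A_tilde.sum(axis=1)                     # >= 1 for nonnegative weights
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def cluster_embedding(H0, A, W0, W1):
    """Two-layer GCN followed by max pooling over the nodes of one cluster."""
    A_hat = normalized_adjacency(A)
    H1 = np.maximum(A_hat @ H0 @ W0, 0.0)         # Eq. (20), ReLU activation
    H2 = A_hat @ H1 @ W1                          # Eq. (21), linear output layer
    return H2.max(axis=0)                         # Eq. (22), element-wise max over nodes

rng = np.random.default_rng(0)
A = np.array([[0.0, 0.4, 0.0],
              [0.4, 0.0, 0.2],
              [0.0, 0.2, 0.0]])                   # hypothetical weighted adjacency A_k
d_in, d_hidden, d_out = 4, 8, 8                   # small sizes chosen for illustration
H0 = rng.normal(size=(3, d_in))                   # stand-in for the latent node features
W0 = rng.normal(size=(d_in, d_hidden))
W1 = rng.normal(size=(d_hidden, d_out))

g_k = cluster_embedding(H0, A, W0, W1)
```

Since each cluster is processed independently, the same function can be applied to every subgraph in parallel.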

The resulting cluster-level features

{g_{1}, \dots, g_{K}}

are then fused to construct the joint value estimation heads. Specifically, the joint action-value function and the joint state-value function are defined as

Q_{joint} (s, a) = W_{Q} (⨁_{k = 1}^{K} g_{k}) + b_{Q}

(23)

V_{joint} (s) = W_{V} (\frac{1}{K} \sum_{k = 1}^{K} g_{k}) + b_{V}

(24)

where

⨁

denotes vector concatenation. In this formulation, the concatenated cluster representations are used to model the joint action-value function, while the averaged cluster representation is used to estimate the global state value. Together, these two outputs provide the structural basis for enforcing the QTRAN consistency constraints in the subsequent optimization stage.
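The two value heads in Equations (23) and (24) reduce to linear maps over the concatenated and the averaged cluster features. The sketch below takes the cluster features as given; how the joint action enters these features is not specified in this section, so it is left outside the example. All shapes and parameter values are hypothetical.

```python
import numpy as np

def joint_value_heads(g, W_Q, b_Q, W_V, b_V):
    """g: (K, d) cluster features. Returns scalar Q_joint and V_joint."""
    q_in = g.reshape(-1)              # concatenation of g_1..g_K, Eq. (23)
    v_in = g.mean(axis=0)             # average of g_1..g_K, Eq. (24)
    q_joint = float(W_Q @ q_in + b_Q)
    v_joint = float(W_V @ v_in + b_V)
    return q_joint, v_joint

K, d = 3, 2                           # hypothetical cluster count and feature size
g = np.ones((K, d))
W_Q, b_Q = np.ones(K * d), 0.0        # Q head sees K * d inputs
W_V, b_V = np.ones(d), 0.0            # V head sees d inputs

q_joint, v_joint = joint_value_heads(g, W_Q, b_Q, W_V, b_V)
```

Note the difference in input size: the Q head grows with the number of clusters, whereas the V head does not.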

Through this hierarchical feature extraction process from the node level to the cluster level and finally to the global level, the proposed HCAN facilitates structure-aware credit assignment within the QTRAN framework. By explicitly modeling dependencies among geographically adjacent intersections with similar traffic patterns, HCAN generates regional representations that help the joint value estimator capture the influence of local intersections and regional substructures on the global traffic state. Compared with a dense global value estimation structure, this hierarchical design provides a more informative basis for associating global value estimation with local agent behaviors. Therefore, the proposed HCAN strengthens cooperative value modeling and improves the coordination capability of multi-intersection traffic signal control.

4.3. CQTRAN-HGC Algorithm

Before defining the optimization objective, the training process of CQTRAN-HGC is briefly summarized. At each decision step, all agents first encode their local observations and select signal phase actions according to their local policy networks. The resulting transition, including the global state, joint action, global reward, next state, and local observations, is stored in the replay buffer. During training, mini-batches are sampled from the replay buffer to compute agent latent representations. These representations are used for both pseudo-label-guided contrastive learning and hierarchical graph-based joint value estimation. The QTRAN loss and the contrastive learning loss are then jointly optimized to update the local policy networks and the hierarchical joint value network.

Building upon the contrastive policy diversification module and the hierarchical credit assignment network introduced in Section 4.1 and Section 4.2, this section presents the overall optimization procedure of the proposed CQTRAN-HGC algorithm. Within the proposed framework, the local policy network generates agent-specific action decisions from local observations, while the hierarchical joint value network estimates the global cooperative value through region-aware graph representations. The interactions between agents and the traffic environment are stored in an experience replay buffer and used to jointly optimize the QTRAN [27] objective and the contrastive learning objective. Let a transition sample be denoted by

(s, u, r, s^{'})

, where

s

and

s^{'}

are the current and next global states,

u = (u_{1}, \dots, u_{N})

is the joint action of all agents, and

r

is the global reward. Based on the QTRAN framework, the overall objective of CQTRAN-HGC is formulated as

L_{total} = L_{QTRAN} (s, u, r, s^{'}; θ) + λ_{cont} L_{cont}

(25)

where

L_{QTRAN}

denotes the value decomposition loss derived from QTRAN,

L_{cont}

denotes the contrastive learning loss introduced in Section 4.1, and

λ_{cont}

is the balancing coefficient. The QTRAN loss consists of three components:

L_{QTRAN} (s, u, r, s^{'}; θ) = L_{td} + λ_{opt} L_{opt} + λ_{nopt} L_{nopt}

(26)

where

L_{td}

is the temporal-difference loss,

L_{opt}

is the optimality constraint loss, and

L_{nopt}

is the non-optimality constraint loss. The temporal-difference loss is used to train the joint action-value network:

L_{td} = {(Q_{joint} (s, u) - y^{dqn})}^{2}

(27)

where the target value is defined as:

y^{dqn} = r + γ Q_{joint} (s^{'}, \overline{u}; θ^{-})

(28)

and

\overline{u}

denotes the greedy action selected by the target network. The optimality constraint loss ensures the consistency between the transformed joint action-value and the jointly optimal value:

L_{opt} = {({\hat{Q}}_{joint} (s, u) - {\overline{Q}}_{joint} (s, \overline{u}) + V_{joint} (s))}^{2}

(29)

where

{\overline{Q}}_{joint} (s, \overline{u})

is the fixed target estimate of the joint action-value, and

V_{joint} (s)

is the global state-value function produced by the hierarchical credit assignment network.

The non-optimality constraint loss is defined as

L_{nopt} = {(\min [Q_{joint} (s, u) - {\hat{Q}}_{joint} (s, u) + V_{joint} (s), 0])}^{2}

(30)

This term constrains non-optimal joint actions to satisfy the decomposition condition required by QTRAN, thereby improving the consistency between local greedy actions and the global cooperative objective. To further enhance representation learning, each agent produces a latent embedding

{\overline{h}}_{i}

through the local policy network. The set of latent representations

{{\overline{h}}_{1}, {\overline{h}}_{2}, \dots, {\overline{h}}_{N}}

is clustered by K-means to generate pseudo-labels

y_{i}

, and the contrastive learning loss is computed as

L_{cont} = - \sum_{i = 1}^{N} \log \frac{\exp (sim ({\overline{h}}_{i}, {\overline{h}}_{i}^{+}) / τ)}{\sum_{j = 1}^{N} \exp (sim ({\overline{h}}_{i}, {\overline{h}}_{j}) / τ)}

(31)

where

sim (u, v) = \frac{u^{⊤} v}{∥ u ∥ ∥ v ∥}

denotes cosine similarity,

{\overline{h}}_{i}^{+}

denotes a positive sample belonging to the same cluster as

{\overline{h}}_{i}

, and

τ

is the temperature parameter. By minimizing

L_{cont}

, the policy network learns more discriminative latent representations, which improves policy diversity and stabilizes collaborative learning in heterogeneous traffic environments.
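To make the composition of Equations (25) to (30) concrete, the sketch below assembles the loss terms from already-computed value estimates. It mirrors the equations term by term and treats every value estimate as a given input; the argument names, the averaging over a mini-batch, and the numerical values (including the loss weights) are assumptions made for illustration.

```python
import numpy as np

def td_target(r, gamma, q_target_next):
    """Eq. (28): y = r + gamma * Q_joint(s', u_bar; theta^-)."""
    return r + gamma * q_target_next

def cqtran_losses(q_joint, q_hat, q_bar_greedy, v_joint, y_target,
                  l_cont, lam_opt, lam_nopt, lam_cont):
    """Combine the loss terms of Eqs. (25)-(30); inputs are scalars or batch arrays."""
    l_td = np.mean((q_joint - y_target) ** 2)                             # Eq. (27)
    l_opt = np.mean((q_hat - q_bar_greedy + v_joint) ** 2)                # Eq. (29)
    l_nopt = np.mean(np.minimum(q_joint - q_hat + v_joint, 0.0) ** 2)     # Eq. (30)
    l_qtran = l_td + lam_opt * l_opt + lam_nopt * l_nopt                  # Eq. (26)
    total = l_qtran + lam_cont * l_cont                                   # Eq. (25)
    return {"td": l_td, "opt": l_opt, "nopt": l_nopt, "total": total}

# Hypothetical values for a single transition.
y = td_target(r=1.0, gamma=0.99, q_target_next=1.0)
losses = cqtran_losses(q_joint=2.0, q_hat=1.5, q_bar_greedy=2.5, v_joint=0.5,
                       y_target=y, l_cont=2.0,
                       lam_opt=1.0, lam_nopt=1.0, lam_cont=0.1)
```

In this example the non-optimality term vanishes because the bracketed quantity in Equation (30) is positive, so only the temporal-difference, optimality, and contrastive terms contribute to the total.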

The training procedure of CQTRAN-HGC is summarized as follows. First, the replay memory

D

is initialized to store interaction trajectories, and the parameters of the local policy network, the hierarchical joint value network, and the target network are randomly initialized. During each episode, each agent selects its action according to an

ϵ

-greedy strategy. After executing the joint action, the environment returns the next state and the global reward, and the transition

(s, u, r, s^{'})

is stored in the replay buffer. Mini-batches are then sampled from

D

to compute

L_{td}

,

L_{opt}

,

L_{nopt}

, and

L_{cont}

, after which

L_{total}

is minimized to update the network parameters. To improve training stability, the target network parameters

θ^{-}

are periodically synchronized with the online network parameters.
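The interaction side of this procedure (ε-greedy selection, replay storage, and target synchronization) can be sketched as below. Several points are assumptions: exploration is drawn independently for each agent, the target update is a hard parameter copy, and parameters are held in a plain dictionary of arrays; the class and function names are hypothetical.

```python
import random
from collections import deque
import numpy as np

def epsilon_greedy(probs, epsilon, rng):
    """probs: (N, n_actions) per-agent action distributions."""
    n_agents, n_actions = probs.shape
    greedy = probs.argmax(axis=1)
    random_actions = rng.integers(n_actions, size=n_agents)
    explore = rng.random(n_agents) < epsilon      # assumption: independent draw per agent
    return np.where(explore, random_actions, greedy)

class ReplayBuffer:
    """Fixed-capacity store of (s, u, r, s') transitions."""
    def __init__(self, capacity, seed=0):
        self.data = deque(maxlen=capacity)        # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def store(self, s, u, r, s_next):
        self.data.append((s, u, r, s_next))

    def sample(self, batch_size):
        return self.rng.sample(list(self.data), min(batch_size, len(self.data)))

def sync_target(theta):
    """Hard copy of the online parameters into the target network."""
    return {name: value.copy() for name, value in theta.items()}

rng = np.random.default_rng(0)
probs = np.array([[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]])
greedy_actions = epsilon_greedy(probs, epsilon=0.0, rng=rng)   # pure exploitation
buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.store(s=t, u=tuple(greedy_actions), r=-1.0, s_next=t + 1)
batch = buffer.sample(batch_size=2)
theta = {"W": np.ones((2, 2))}
theta_target = sync_target(theta)
```

A soft (Polyak-style) target update would be an alternative to the hard copy shown here; the sketch does not take a position on which variant was used.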

The training procedure of the CQTRAN-HGC algorithm is summarized in Algorithm 1.

Algorithm 1: Training Procedure of CQTRAN-HGC

Input: agent set N, replay buffer D, discount factor γ, batch size B, exploration rate ε, loss weights λ_{opt}, λ_{nopt}, and λ_{cont}.

Output: trained network parameters θ.

1. Initialize replay memory D.

2. Initialize online network parameters θ.

3. Initialize target network parameters θ^{-} = θ.

4. For episode = 1 to M do:

5. Observe the initial state s and local observations {o_{i}} for all agents.

6. For t = 1 to T do:

7. With probability ε, each agent selects a random action u_{i}.

8. Otherwise, each agent selects the greedy action according to its policy.

9. Execute the joint action u = (u_{1}, u_{2}, \dots, u_{N}).

10. Observe the reward r, next state s^{'}, and next observations {o^{'}_{i}}.

11. Store transition (s, u, r, s^{'}) in D.

12. Sample a mini-batch of transitions from D.

13. Compute latent representations {{\overline{h}}_{i}} for all agents.

14. Apply K-means to {{\overline{h}}_{i}} and generate pseudo-labels y_{i}.

15. Compute the contrastive loss L_{cont}.

16. Compute the target value y^{dqn}.

17. Compute the temporal-difference loss L_{td}.

18. Compute the optimality loss L_{opt}.

19. Compute the non-optimality loss L_{nopt}.

20. Compute the total loss: L_{total} = L_{td} + λ_{opt} L_{opt} + λ_{nopt} L_{nopt} + λ_{cont} L_{cont}.

21. Update θ by minimizing L_{total}.

22. Periodically update the target network parameters θ^{-}.

23. Set s = s^{'}.

24. End for

25. End for

5. Experiments

To evaluate the effectiveness of the proposed MAHCL-TSC model and CQTRAN-HGC algorithm, we established a high-fidelity traffic simulation environment using Simulation of Urban Mobility (SUMO) [28] and developed a systematic experimental protocol. The experimental design uses a classic Manhattan grid network topology in two configurations: a 4 × 4 grid layout comprising 16 intersections (with 10 agent-controlled nodes) for medium-scale network validation, and a 6 × 6 grid layout containing 36 intersections (with 20 agent-controlled nodes) for large-scale network scalability assessment. This two-scale experimental structure enables evaluation of the model’s control performance, coordination efficiency, and system stability across varying traffic density conditions, from conventional to high-density scenarios. In the SUMO simulations, we monitored abnormal simulation events, including emergency braking warnings, collision warnings, vehicle teleportation, and vehicle removal caused by excessive waiting or severe congestion. These events were checked to ensure that the reported results were not dominated by abnormal simulation failures. In the controlled synthetic grid experiments, the traffic demand was generated within the feasible capacity range of the networks, and no abnormal episodes with severe simulation failure were included in the final reported results.

5.1. Comparative Benchmarks

To comprehensively evaluate the performance of the proposed MAHCL-TSC model and CQTRAN-HGC algorithm, we selected three representative multi-agent reinforcement learning algorithms as baseline comparisons: QTRAN, Multi-Agent Deep Deterministic Policy Gradient (MADDPG) [29], and Multi-Agent Proximal Policy Optimization (MAPPO) [30]. All baseline methods were trained under identical experimental conditions, including the same state spaces and reward functions, to ensure a fair comparison. The QTRAN algorithm follows a value function decomposition approach, whose core strength lies in its ability to accurately decompose the joint action-value function into individual agents’ local value functions under provable constraints, thereby theoretically ensuring consistency between individual and global optima. MADDPG builds upon the actor–critic model and is distinguished by its centralized-training-with-decentralized-execution architecture. The algorithm uses a centralized critic network that accesses state–action information from all agents for global value estimation, while each agent maintains its own actor network for distributed decision-making. This design enables full utilization of global information during training while preserving execution-time independence. MAPPO extends proximal policy optimization to multi-agent settings, combining a centralized value function with trust region optimization. It coordinates policy updates across agents via the centralized value function, while the clipping mechanism from proximal policy optimization ensures training stability and mitigates policy degradation in multi-agent coordination. The selected baselines are intended to evaluate the proposed method from the perspective of representative multi-agent reinforcement learning frameworks. 
Specifically, QTRAN is included because the proposed CQTRAN-HGC algorithm is developed upon the QTRAN value decomposition framework, making it a direct baseline for assessing the effect of the proposed hierarchical graph convolution and contrastive representation learning modules. MADDPG is selected as a representative actor–critic method under the centralized training and decentralized execution paradigm. Specifically, the actor network outputs a probability distribution over all feasible signal phases through a softmax layer, and the phase with the highest probability is selected during execution. During training, the centralized critic evaluates the joint state and the one-hot encoded joint actions of all agents. In this way, MADDPG can be used as a CTDE actor–critic baseline under the same discrete action space as the proposed method. MAPPO is included as a widely used policy-gradient-based multi-agent reinforcement learning method with stable policy optimization.

It should be noted that the current comparison mainly focuses on MARL-based baselines. Classical traffic signal control methods, such as fixed-time control and max-pressure control, as well as more recent graph-based RL-TSC methods, are not included in the current experimental comparison. This is a limitation of the present study. Future work will further extend the experimental evaluation by incorporating these traffic-domain-specific baselines to provide a more comprehensive assessment of the proposed method.

This study adopts a systematic hyperparameter configuration to ensure model stability and reproducibility. Key hyperparameters include a policy network hidden dimension of 128 and a value network hidden dimension of 256, chosen to balance representational capacity and computational efficiency; a learning rate of 3 × 10⁻⁴; a discount factor γ = 0.99; a target network update parameter τ = 0.01 (distinct from the contrastive temperature τ in Equation (15)); and a two-layer GCN structure with 128 hidden units per layer to capture spatial dependencies in the road network. Table 1 summarizes the main algorithm, simulation, and training settings, which follow common practice in deep reinforcement learning and preliminary experimental validation. To ensure a fair comparison, all methods, including the QTRAN, MADDPG, and MAPPO baselines, were implemented with the same state space, action space, reward function, traffic scenarios, traffic demand settings, and evaluation metrics as the proposed MAHCL-TSC model. Each model was trained for 200,000 steps.

All quantitative results in Table 2 and Table 3 are reported as mean ± standard deviation over independent runs with different random seeds. The standard deviation values are used to reflect the variability of training and evaluation results caused by random initialization and stochastic traffic simulation. Although formal statistical significance tests are not included in the current version, the reported mean and standard deviation values provide an initial indication of the stability of the proposed method under the tested simulation settings. More rigorous statistical significance analysis will be incorporated in future work.
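The reporting convention described above can be illustrated with a short helper. The per-seed values are placeholders and are not results from this study; the use of the sample standard deviation (ddof = 1) is also an assumption, since the text does not state which estimator was used.

```python
import numpy as np

def summarize(values, ddof=1):
    """Format per-seed results as 'mean ± standard deviation'."""
    values = np.asarray(values, dtype=float)
    return f"{values.mean():.2f} ± {values.std(ddof=ddof):.2f}"

# Placeholder metric values from three hypothetical random seeds.
per_seed_metric = [10.0, 12.0, 14.0]
report = summarize(per_seed_metric)
```

With only a handful of seeds, the choice between the sample and population estimators changes the reported spread noticeably, so it is worth stating alongside the number of runs.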

5.2. The 4 × 4 Synthetic Grid Network

Figure 5a illustrates the 4 × 4 grid traffic network used in this study, consisting of 16 signal-controlled intersections. Each intersection is configured with six approach lanes: the east–west arterial roads are four-lane bidirectional roadways with a speed limit of 70 km/h, while the north–south roads are two-lane bidirectional roadways with a speed limit of 40 km/h. To simulate realistic traffic flow patterns, four main vehicle routes were established: Path 1 (F1) carries traffic from E16 to E6 (blue lines in Figure 5a), Path 2 (f1) from E16 to E7 (light blue lines), Path 3 (F2) from E16 to E8 (orange lines), and Path 4 (f2) from E16 to E10 (light green lines). Fifteen minutes after the simulation begins, the traffic volumes on Paths 1 and 2 gradually decrease, while Paths 3 and 4 begin to generate traffic flow. Figure 5b details the dynamic evolution of these four traffic flows throughout the simulation cycle, capturing the generation, dissipation, and temporal variation patterns of each route. This configuration provides a controlled simulation setting for evaluating the response of different methods to time-varying traffic demand.

In the training performance analysis, Figure 6 presents the training curves of three multi-agent reinforcement learning algorithms and the MAHCL-TSC model, all trained for 200,000 steps on the same 4 × 4 grid network. The solid lines indicate the moving average of the mean reward, while the shaded areas represent the standard deviation ranges. Generally, as training progresses, the agents gradually improve their policies through accumulated experience, reflected in a consistent increase in mean reward values. Specifically, the MAPPO algorithm demonstrates rapid convergence during the initial training phase but reaches a performance plateau after approximately 80,000 steps. Its final performance is limited by the fully connected network architecture’s inability to effectively model complex spatial relationships. The MADDPG algorithm shows substantial training instability due to its inherent challenges in adapting to discrete action spaces, exhibiting notably higher variance in training rewards compared to other methods. The QTRAN algorithm maintains relatively stable training progress through its constrained optimization mechanism, though with a comparatively slower convergence rate. In contrast, the proposed MAHCL-TSC model demonstrates rapid reward improvement during early training and sustains a stable upward trajectory throughout the entire training process.

The average queue length is calculated by dividing the total number of queuing vehicles across all intersection approaches by the number of intersections, providing an intuitive measure of overall network congestion. Figure 7 shows the evolution of average queue lengths over the simulation period for four control methods in the 4 × 4 grid network: the three baseline algorithms (QTRAN, MADDPG, and MAPPO) and the proposed MAHCL-TSC model. As the simulation progresses, the network load increases significantly once the traffic flows on Paths 3 and 4 begin operating after 15 min. Under these conditions, all three baseline reinforcement learning methods exhibit continuously growing queue lengths, indicating their limited adaptability to dynamically changing traffic demand. By contrast, the proposed MAHCL-TSC model maintains the lowest queue levels throughout the simulation period and is particularly effective at stabilizing queue lengths during high-load phases.
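The average queue length metric defined above amounts to a simple ratio, sketched here for a single time step. The intersection identifiers and vehicle counts are hypothetical; in a SUMO setup such counts could, for instance, be read from per-lane halting-vehicle numbers, although the text does not state how queues were measured.

```python
def average_queue_length(queued_per_approach):
    """Total queued vehicles over all approaches divided by the number of intersections.

    queued_per_approach: dict mapping intersection id -> list of queued-vehicle
    counts, one entry per approach.
    """
    total = sum(sum(counts) for counts in queued_per_approach.values())
    return total / len(queued_per_approach)

# Hypothetical snapshot with two intersections and four approaches each.
snapshot = {"J1": [2, 0, 1, 3], "J2": [0, 0, 4, 0]}
avg_queue = average_queue_length(snapshot)
```

Evaluating this ratio at every simulation step yields a queue-length time series of the kind plotted for each method.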

To comprehensively evaluate the overall performance of various signal control methods in practical traffic efficiency, a multidimensional analysis of four approaches is performed using vehicle trajectory data. Table 2 summarizes the values of key metrics—including average delay (seconds), average waiting time (seconds), and intersection pressure—for each method in the 4 × 4 synthetic road network scenario.

The experimental results indicate that the MADDPG algorithm fails to effectively capture complex spatial dependencies with its fully connected network structure, leading to moderate performance across all evaluation metrics. The MAPPO algorithm achieves reasonable performance in average waiting time; however, its adaptation mechanism for discrete action spaces results in decision instability, causing considerable fluctuations in average queue length. While the QTRAN algorithm attains certain advantages in average delay through its stable policy optimization process, it still exhibits deficiencies in handling dynamic traffic flow variations during peak periods. In comparison, the proposed MAHCL-TSC model achieves the best results across all evaluation metrics, with particularly large improvements in average waiting time and intersection pressure. This performance improvement may be attributed to the combined design of the contrastive representation learning mechanism and the hierarchical graph-based value estimation structure. However, the individual contribution of each component requires further verification through dedicated ablation studies.

5.3. The 6 × 6 Synthetic Road Network

To further evaluate the scalability of the proposed model as the network size increases, extended experiments were conducted on a 6 × 6 Manhattan-style synthetic grid network, as shown in Figure 8. The network contains 36 signal-controlled intersections, each using a standard four-phase scheme. Both east–west and north–south directions are configured with dual four-lane roads with speed limits of 70 km/h and 50 km/h, respectively. Through this carefully designed extension experiment, we primarily assess the control performance and stability of the MAHCL-TSC model under conditions of higher intersection density and more complex traffic flow interactions. Twenty minutes after the simulation begins, the traffic flow in each path undergoes dynamic adjustments according to preset patterns, simulating the spatiotemporal evolution of travel demand during the evening commute peak.

In the 6 × 6 large-scale network environment, the training processes and convergence characteristics of each algorithm are clearly reflected in their reward function curves, as shown in Figure 9. It can be observed that with the significant expansion of state and action spaces, all baseline algorithms face varying degrees of training challenges. The QTRAN algorithm exhibits considerable fluctuations in its reward curve during later stages, primarily due to imprecise credit assignment; MADDPG shows the slowest convergence speed, affected by environmental non-stationarity; while MAPPO maintains better stability, though its final converged reward value remains at a relatively low level. In stark contrast, the proposed MAHCL-TSC model demonstrates remarkable scalability and learning efficiency. Its reward curve not only rises at a significantly faster rate during early training—indicating the algorithm’s capability to quickly identify high-performance policy directions—but also maintains steady growth throughout the entire training cycle, eventually stabilizing at a reward level substantially higher than other benchmark methods.

In the 6 × 6 road network environment, the scalability and stability of the MAHCL-TSC model were further validated. Figure 10 shows the trends in average queue length for each method under these complex conditions. As network scale increases and traffic patterns become more complex, the control performance of baseline methods—including QTRAN, MADDPG, and MAPPO—degrades to varying degrees. Limited by its fully connected structure in capturing complex spatial dependencies, MADDPG shows a rapid rise in queue length during mid-simulation. MAPPO displays considerable fluctuations in control performance due to its challenges in adapting to discrete action spaces. Although QTRAN maintains relative stability, it still underperforms when handling dynamic traffic flows during peak periods. In comparison, the MAHCL-TSC model demonstrates excellent scalability, maintaining the lowest queue levels throughout the simulation while showing consistent advantages in handling peak traffic conditions.

The quantitative results in Table 3 further confirm these advantages. In the 6 × 6 network, the MAHCL-TSC model outperformed all baseline methods across key metrics, including average queue length and intersection pressure. Specifically, it reduced average waiting time by 18.1% compared to MAPPO and by 28.7% compared to QTRAN. These results indicate that the combination of hierarchical graph-based value estimation and contrastive policy diversification contributes to improved coordination performance in the tested 6 × 6 synthetic grid network. However, the individual contribution of each component still requires further verification through systematic ablation studies.

The MAHCL-TSC model also shows a scalability advantage in terms of control effectiveness. As presented in Table 2 and Table 3, it maintains superior performance over the other multi-agent reinforcement learning methods on key metrics such as average queue length and intersection pressure as the road network scales from a 4 × 4 grid (16 intersections) to a 6 × 6 grid (36 intersections), with the performance advantage becoming more pronounced as the network size increases.

6. Conclusions and Outlook

To address the core challenges in multi-intersection cooperative control, particularly policy homogenization and inefficient credit assignment, this paper proposes a Multi-Agent Hierarchical Contrastive Learning Traffic Signal Control (MAHCL-TSC) model. It incorporates an unsupervised contrastive learning mechanism to enhance the discriminability of agent state representations, thereby providing a potential mechanism for alleviating policy homogenization. A hierarchical graph convolutional credit assignment network is designed to model road network topology and support structure-aware joint value estimation. Based on these components, a CQTRAN-HGC algorithm is proposed, which jointly optimizes the contrastive learning loss and the QTRAN constraint loss. Experiments conducted on 4 × 4 and 6 × 6 synthetic grid networks demonstrated that the proposed MAHCL-TSC model achieved better performance than the baseline methods, including QTRAN, MADDPG, and MAPPO, in terms of average delay, average waiting time, and intersection pressure. These results indicate the effectiveness of the proposed method under the tested structured simulation settings and show its potential scalability as the network size increases.

Nevertheless, the current experimental evaluation is still limited to synthetic grid networks with predefined traffic demand patterns. Although these settings allow controlled and reproducible comparisons among different methods, they cannot fully represent more complex real-world traffic conditions, such as oversaturated flows, random demand fluctuations, incident-induced congestion, unbalanced turning movements, and non-periodic congestion patterns. Therefore, further validation under more diverse traffic scenarios and irregular real-world road networks will be an important direction of our future work.

Beyond the work presented in this paper, future research could proceed in the following directions. Firstly, more advanced learning mechanisms could be introduced, together with progressive training environments and tasks that escalate from simple to complex scenarios. Secondly, incorporating external disturbance factors such as weather conditions, pedestrian activity, and special events into the state space would help establish a more comprehensive environment perception and decision-making system, which is expected to enhance model robustness and practical applicability in real-world traffic environments. Finally, systematic ablation studies will be conducted to further quantify the contribution of each component.

Author Contributions

Conceptualization, Liping Yan and Haojie Jia; methodology, Liping Yan; software, Wenzhi Zhao; validation, Peiran Wu and Shaofeng Wang; resources, Liping Yan; data curation, Haojie Jia; writing—original draft preparation, Haojie Jia; writing—review and editing, Haojie Jia; funding acquisition, Liping Yan. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant Nos. 62362031, 52268066 and 62262022) and the Jiangxi Provincial Natural Science Foundation (Grant No. 20252BAC240353).

Data Availability Statement

The simulation data used in this study were generated based on synthetic 4 × 4 and 6 × 6 grid networks in the SUMO environment. The main network configurations, traffic flow settings, and experimental parameters have been described in the revised manuscript to improve reproducibility. The SUMO network files, route files, and traffic demand configurations used in the experiments will be made available by the corresponding author upon reasonable request. In future work, we will further organize and release the simulation files through a public repository to facilitate reproducibility and comparison by other researchers.

Acknowledgments

The authors gratefully acknowledge the helpful comments and suggestions of the reviewers, which have improved the article’s presentation.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

MAHCL-TSC	Multi-Agent Hierarchical Contrastive Learning Traffic Signal Control
CQTRAN-HGC	Contrastive QTRAN with Hierarchical Graph Convolution
SUMO	Simulation of Urban Mobility
GCN	Graph Convolutional Network
MARL	Multi-Agent Reinforcement Learning
Dec-POMDP	Decentralized Partially Observable Markov Decision Process
CTDE	Centralized Training with Decentralized Execution
DRL	Deep Reinforcement Learning

Figure 1. Framework of MAHCL-TSC Model.

Figure 2. Schematic Diagram of a Four-Phase Intersection.

Figure 3. Contrastive Policy Diversification Module of the MAHCL-TSC Model. (a) Policy Network Architecture. (b) Contrastive Learning Module Architecture.

Figure 4. Joint Action Value Network Architecture.

Figure 5. The 4 × 4 simulation network experiment setup. (a) The 4 × 4 simulated road network structure. (b) Traffic flow in the simulated road network.

Figure 6. Training Rewards on 4 × 4 Simulated Road Network.

Figure 7. Average Queue Length Variation.

Figure 8. The 6 × 6 Simulation Network.

Figure 9. Training Rewards in 6 × 6 Simulated Road Network.

Figure 10. Average Queue Length in 6 × 6 Simulation Network.

Table 1. Algorithm, Simulation, and Training Parameter Settings.

Parameter	Value (CQTRAN-HGC)
Policy Network Hidden Dimension	128
Value Network Hidden Dimension	256
Learning Rate	3 × 10⁻⁴
Discount Factor (γ)	0.99
Target Network Update (τ)	0.01
Replay Buffer Size	1 × 10⁶
Batch Size	512
Exploration Noise Variance	0.1
Number of Clusters (K)	3 (10 agents)/6 (20 agents)
Contrastive Loss Weight (λ)	0.3
GCN Layers	2
GCN Hidden Units	128
Training Steps	200,000
Signal Decision Interval	15 s
Yellow Time	3 s
All-Red Time	0 s
Initial Exploration Rate	1.0
Number of Independent Runs	5
Random Seeds	[0, 1, 2, 3, 4]
Reward Weights (α, β, γ)	0.4, 0.3, 0.3
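For reproducibility, the settings in Table 1 can be collected into a single configuration object. The sketch below is illustrative only: the class name `CQTranHGCConfig`, the field names, and the helper `soft_update` are hypothetical, and reading τ as a soft (Polyak) target-network update coefficient is an assumption rather than something stated in the table. The number of clusters K is left out because Table 1 gives it per agent count.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class CQTranHGCConfig:
    """Hyperparameters transcribed from Table 1 (field names are hypothetical)."""
    policy_hidden_dim: int = 128
    value_hidden_dim: int = 256
    learning_rate: float = 3e-4
    discount: float = 0.99
    target_update_tau: float = 0.01
    replay_buffer_size: int = 1_000_000
    batch_size: int = 512
    exploration_noise_variance: float = 0.1
    contrastive_loss_weight: float = 0.3
    gcn_layers: int = 2
    gcn_hidden_units: int = 128
    training_steps: int = 200_000
    decision_interval_s: int = 15
    yellow_time_s: int = 3
    all_red_time_s: int = 0
    initial_exploration_rate: float = 1.0
    seeds: tuple = (0, 1, 2, 3, 4)
    reward_weights: tuple = (0.4, 0.3, 0.3)


def soft_update(target: dict, online: dict, tau: float) -> dict:
    """Assumed Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return {name: tau * online[name] + (1.0 - tau) * target[name] for name in target}


if __name__ == "__main__":
    cfg = CQTranHGCConfig()
    target = {"w": np.zeros(3)}
    online = {"w": np.ones(3)}
    target = soft_update(target, online, cfg.target_update_tau)
    print(cfg.learning_rate, target["w"])
```

Freezing the dataclass keeps the settings immutable across the five seeded runs, so each run differs only in its random seed.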

Table 2. Experimental Results for 4 × 4 Network.

Metric	MADDPG	MAPPO	QTRAN	MAHCL-TSC
Average Delay (s)	109.0 ± 3.10	97.50 ± 1.30	85.40 ± 2.10	62.30 ± 1.30
Average Wait (s)	3.79 ± 0.11	2.93 ± 0.22	2.69 ± 0.18	2.11 ± 0.09
Intersection Pressure	8.85 ± 0.41	7.19 ± 0.33	5.35 ± 0.12	3.87 ± 0.07

Table 3. Experimental Results on 6 × 6 Simulation Network.

Metric	MADDPG	MAPPO	QTRAN	MAHCL-TSC
Average Delay (s)	189.9 ± 4.30	108.30 ± 2.70	116.90 ± 3.50	74.30 ± 1.80
Average Wait (s)	6.51 ± 0.50	3.33 ± 0.28	3.81 ± 0.32	2.72 ± 0.11
Intersection Pressure	12.11 ± 0.32	6.56 ± 0.31	7.16 ± 0.20	4.86 ± 0.13

© 2026 by the authors. Published by MDPI on behalf of the International Society for Photogrammetry and Remote Sensing. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
