Multi-Agent Deep Reinforcement Learning with Contrastive Policy Diversification and Hierarchical Graph Networks for Urban Traffic Signal Control
Abstract
1. Introduction
- (1) A MAHCL-TSC model is proposed under the Centralized Training with Decentralized Execution (CTDE) paradigm. It integrates four core modules (control environment, data acquisition, network architecture, and contrastive learning) into a closed-loop system. This design addresses policy coordination and asynchronous decision-making in multi-intersection control.
- (2) A policy diversification mechanism based on unsupervised contrastive learning is designed. It needs no manual labels: regional pseudo-labels are generated via multimodal feature fusion and K-means clustering, and agent representations are then refined with a supervised contrastive loss computed on those pseudo-labels. This sharpens the discrimination of heterogeneous traffic patterns, mitigates policy homogenization, and may help agents adapt to varying traffic demand patterns.
- (3) A hierarchical graph convolutional credit assignment network is developed. It partitions the road network into functional regions via clustering, and GCNs then extract intra-region and global features hierarchically. This explicitly models agent interactions, improves credit assignment in QTRAN, and strengthens the global–local reward association, which improves collaborative efficiency and scalability in large networks.
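The pseudo-labelling step in contribution (2) can be illustrated with a minimal sketch: plain K-means (Lloyd's algorithm) over per-intersection feature vectors, returning one region label per agent. The feature layout (queue length, mean speed), the function name `kmeans_labels`, and the toy values are illustrative assumptions; the source does not state which features are fused.

```python
import math
import random


def kmeans_labels(points, k, iters=50, seed=0):
    """Plain Lloyd's K-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels


# Hypothetical features per intersection: (queue length, mean speed).
features = [(2.0, 12.0), (3.0, 11.0), (2.5, 12.5),
            (14.0, 4.0), (15.0, 3.5), (13.5, 4.5)]
labels = kmeans_labels(features, k=2)
print(labels)  # one pseudo-label (region index) per agent
```

The cluster indices themselves are arbitrary; only the grouping matters, since the labels are used solely to decide which agents count as positives for each other.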
2. Related Work
3. Problem Definition
3.1. MAHCL-TSC Model
3.2. RL Parameter Configuration
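The parameter table lists a target network update coefficient τ = 0.01. Assuming this denotes the usual soft (Polyak) update, which the source does not spell out, the rule can be sketched as follows. The function name `soft_update` and the flat list of parameters are illustrative assumptions.

```python
def soft_update(target_params, online_params, tau=0.01):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * t
            for t, w in zip(target_params, online_params)]


# Toy check: a single parameter, online network fixed at 1.0.
target, online = [0.0], [1.0]
for _ in range(100):
    target = soft_update(target, online)
# Each step shrinks the gap to the online value by a factor (1 - tau),
# so after n steps a fraction (1 - tau) ** n of the initial gap remains.
print(target[0])
```

With τ = 0.01 the target network lags the online network by roughly a hundred updates, which is the usual motivation for a small coefficient: the bootstrap targets change slowly.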
4. CQTRAN-HGC Algorithm for Multi-Intersection Traffic Signal Control
4.1. Contrastive Policy Diversification Module
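As a hedged illustration of the supervised contrastive loss applied to agent representations, the sketch below implements the standard formulation (each anchor is pulled toward embeddings sharing its pseudo-label and pushed from all others). The temperature value, the function names, and the toy embeddings are assumptions; the exact loss used by the authors is not given in the source.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss averaged over anchors that have positives."""
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an agent alone in its region contributes no term
        # Scaled cosine similarity between the anchor and every other agent.
        sims = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in range(n) if j != i}
        # Log of the softmax denominator, computed stably.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy agent embeddings: agents 0-1 and agents 2-3 behave alike.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
matching = supcon_loss(embeddings, [0, 0, 1, 1])     # labels agree with geometry
mismatched = supcon_loss(embeddings, [0, 1, 0, 1])   # labels cut across it
print(matching, mismatched)
```

The loss is small when agents with the same pseudo-label already have similar embeddings and large otherwise, which is the pressure that keeps representations of different regions apart.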
4.2. Credit Allocation Network
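The two-level feature extraction described in contribution (3) can be sketched as one graph convolution restricted to intra-region edges, mean pooling per region, and a second convolution over the region graph. This is a minimal sketch under stated assumptions: the standard GCN propagation rule ReLU(D^-1/2 (A + I) D^-1/2 H W), mean pooling, and all function names are illustrative choices, not the authors' architecture.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(adj):
    """Symmetric normalisation D^-1/2 (A + I) D^-1/2 with self-loops."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(adj, feats, weight):
    """One graph convolution: ReLU(A_norm @ H @ W)."""
    out = matmul(matmul(normalized_adjacency(adj), feats), weight)
    return [[max(0.0, v) for v in row] for row in out]


def crosses(adj, regions, r, s):
    """True if any road links a node of region r to a node of region s."""
    n = len(adj)
    return any(adj[i][j] for i in range(n) for j in range(n)
               if regions[i] == r and regions[j] == s)


def hierarchical_features(adj, feats, regions, w_local, w_global):
    """Assumes region ids are 0..k-1 and every region has at least one node."""
    n, k = len(adj), max(regions) + 1
    # Level 1: keep only edges inside a region, then convolve.
    intra = [[adj[i][j] if regions[i] == regions[j] else 0.0 for j in range(n)]
             for i in range(n)]
    local = gcn_layer(intra, feats, w_local)
    # Pool: mean of member node features per region.
    pooled = []
    for r in range(k):
        members = [local[i] for i in range(n) if regions[i] == r]
        pooled.append([sum(col) / len(members) for col in zip(*members)])
    # Level 2: convolve over the graph whose nodes are regions.
    region_adj = [[1.0 if r != s and crosses(adj, regions, r, s) else 0.0
                   for s in range(k)] for r in range(k)]
    return local, gcn_layer(region_adj, pooled, w_global)


# Four intersections on a line 0-1-2-3, split into two regions.
adj = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
feats = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
local, global_feats = hierarchical_features(adj, feats, [0, 0, 1, 1], identity, identity)
print(local, global_feats)
```

In a credit assignment network, the local rows would feed per-agent terms and the global rows a network-level term; how the two are combined inside QTRAN is not recoverable from the source.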
4.3. CQTRAN-HGC Algorithm
| Algorithm 1: Training Procedure of CQTRAN-HGC |
|---|
| … |
| … do: |
| … for all agents. |
| … do: |
| … |
| 8. Otherwise, each agent selects the greedy action according to its policy. |
| … |
| … for all agents. |
| … |
| 24. End for |
| 25. End for |
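Step 8 of Algorithm 1 describes the greedy branch of action selection. Assuming the lost preceding step is the usual random-exploration branch of an ε-greedy rule (consistent with the initial exploration rate of 1.0 in the parameter table, but not confirmed by the source), decentralized action selection can be sketched as follows. The function name `select_actions` and the Q-value layout are illustrative assumptions.

```python
import random


def select_actions(q_values_per_agent, epsilon, rng=random):
    """Epsilon-greedy choice made independently by every agent."""
    actions = []
    for q_values in q_values_per_agent:
        if rng.random() < epsilon:
            # Explore: uniformly random signal phase.
            actions.append(rng.randrange(len(q_values)))
        else:
            # Exploit: phase with the highest local Q-value.
            actions.append(max(range(len(q_values)), key=q_values.__getitem__))
    return actions


# Two agents with different numbers of phases; epsilon = 0 is purely greedy.
q_values = [[0.1, 0.9], [0.7, 0.2, 0.3]]
print(select_actions(q_values, epsilon=0.0))
```

Each agent uses only its own Q-values here, which matches the decentralized-execution half of CTDE: no joint value is needed once training is over.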
5. Experiments
5.1. Comparative Benchmarks
5.2. The 4 × 4 Synthetic Grid Network
5.3. The 6 × 6 Synthetic Road Network
6. Conclusions and Outlook
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| MAHCL-TSC | Multi-Agent Hierarchical Contrastive Learning Traffic Signal Control |
| CQTRAN-HGC | Contrastive QTRAN with Hierarchical Graph Convolution |
| SUMO | Simulation of Urban Mobility |
| GCN | Graph Convolutional Network |
| MARL | Multi-Agent Reinforcement Learning |
| Dec-POMDP | Decentralized Partially Observable Markov Decision Process |
| CTDE | Centralized Training with Decentralized Execution |
| DRL | Deep Reinforcement Learning |










| Parameter Category | CQTRAN-HGC Parameter Value |
|---|---|
| Policy Network Hidden Dimension | 128 |
| Value Network Hidden Dimension | 256 |
| Learning Rate | 3 × 10⁻⁴ |
| Discount Factor (γ) | 0.99 |
| Target Network Update (τ) | 0.01 |
| Replay Buffer Size | 1 × 10⁶ |
| Batch Size | 512 |
| Exploration Noise Variance | 0.1 |
| Number of Clusters (K) | 3 (10 agents)/6 (20 agents) |
| Contrastive Loss Weight (λ) | 0.3 |
| GCN Layers | 2 |
| GCN Hidden Units | 128 |
| Training Steps | 200,000 |
| Signal Decision Interval | 15 s |
| Yellow Time | 3 s |
| All-Red Time | 0 s |
| Initial Exploration Rate | 1.0 |
| Number of Independent Runs | 5 |
| Random Seeds | [0, 1, 2, 3, 4] |
| Reward Weights (α, β, γ) | 0.4, 0.3, 0.3 |

| Metric | MADDPG | MAPPO | QTRAN | MAHCL-TSC |
|---|---|---|---|---|
| Average Delay (s) | 109.0 ± 3.10 | 97.50 ± 1.30 | 85.40 ± 2.10 | 62.30 ± 1.30 |
| Average Wait (s) | 3.79 ± 0.11 | 2.93 ± 0.22 | 2.69 ± 0.18 | 2.11 ± 0.09 |
| Intersection Pressure | 8.85 ± 0.41 | 7.19 ± 0.33 | 5.35 ± 0.12 | 3.87 ± 0.07 |

| Metric | MADDPG | MAPPO | QTRAN | MAHCL-TSC |
|---|---|---|---|---|
| Average Delay (s) | 189.9 ± 4.30 | 108.30 ± 2.70 | 116.90 ± 3.50 | 74.30 ± 1.80 |
| Average Wait (s) | 6.51 ± 0.50 | 3.33 ± 0.28 | 3.81 ± 0.32 | 2.72 ± 0.11 |
| Intersection Pressure | 12.11 ± 0.32 | 6.56 ± 0.31 | 7.16 ± 0.20 | 4.86 ± 0.13 |
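To make the size of the gaps in the two results tables easier to read, the short script below converts the mean average delays into relative reductions achieved by MAHCL-TSC over each baseline. The delay values are copied from the tables; the helper name `relative_reduction` and the "first"/"second" labels are ours, and the ± spreads are ignored.

```python
def relative_reduction(baseline, proposed):
    """Percentage by which `proposed` is lower than `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


# Mean average delay (s), copied from the two results tables.
first_table = {"MADDPG": 109.0, "MAPPO": 97.50, "QTRAN": 85.40, "MAHCL-TSC": 62.30}
second_table = {"MADDPG": 189.9, "MAPPO": 108.30, "QTRAN": 116.90, "MAHCL-TSC": 74.30}

for name, table in (("first", first_table), ("second", second_table)):
    ours = table["MAHCL-TSC"]
    for method, delay in table.items():
        if method != "MAHCL-TSC":
            pct = relative_reduction(delay, ours)
            print(f"{name} table, vs {method}: {pct:.1f}% lower delay")
```

Against the strongest baseline in each table, this works out to about 27.0% lower delay than QTRAN in the first table and about 31.4% lower delay than MAPPO in the second.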
© 2026 by the authors. Published by MDPI on behalf of the International Society for Photogrammetry and Remote Sensing. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Yan, L.; Jia, H.; Wang, S.; Wu, P.; Zhao, W. Multi-Agent Deep Reinforcement Learning with Contrastive Policy Diversification and Hierarchical Graph Networks for Urban Traffic Signal Control. ISPRS Int. J. Geo-Inf. 2026, 15, 229. https://doi.org/10.3390/ijgi15060229

