Fuzzy-Guided Exploration for Multi-Agent Reinforcement Learning in Traffic Signal Control
Abstract
1. Introduction
2. Background
2.1. Reinforcement Learning and Multi-Agent Foundations
2.2. Centralized Training with Decentralized Execution
2.3. Exploration in Reinforcement Learning
3. Proposed Method
3.1. Pressure-Based Action Prioritization
3.2. Multi-Criteria Fuzzy Rule System
3.3. Fuzzy-Guided Exploration Policy
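| Algorithm 1. Fuzzy-guided action selection, executed by each agent i at every training step t |
|---|
| Input: action-observation history, value function, and exploration rate (all current at step t); temperature |
| Output: action |
| 1: for each candidate phase do |
| 2: Compute pressure, queue length, and waiting time ▹ Equations (7)–(9) |
| 3: end for |
| 4: Normalize pressure, queue length, and waiting time across candidate phases ▹ Equation (13) |
| 5: for each candidate phase do |
| 6: Fuzzify normalized values ▹ Equations (10)–(12) |
| 7: Compute rule activations for each rule ▹ Equation (14) |
| 8: Defuzzify using the rule activations and rule outputs (Table 1) to obtain the phase score ▹ Equation (15) |
| 9: end for |
| 10: Compute the exploration distribution over candidate phases based on the phase scores and the temperature ▹ Equation (16) |
| 11: if exploring (with probability given by the exploration rate) then ▹ Equation (17) |
| 12: return a phase sampled from the exploration distribution |
| 13: else |
| 14: return the greedy phase under the value function |
| 15: end if |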
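
The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the min-max normalization, the triangular H/M/L membership shapes, the product rule activation, the activation-weighted average used for defuzzification, and the softmax over scores are all assumed forms standing in for Equations (10)–(17). The rule outputs are copied from Table 1, but the mapping of its column groups to pressure and of its inner columns to waiting time is also an assumption. All function names are hypothetical.

```python
import math
import random

LEVELS = ("H", "M", "L")

# Rule outputs from Table 1: one row per queue level, nine columns each.
# ASSUMPTION: column index = 3 * pressure_level_index + waiting_level_index.
RULE_OUTPUT = {
    "H": (0.95, 0.85, 0.75, 0.75, 0.65, 0.55, 0.55, 0.45, 0.35),
    "M": (0.85, 0.75, 0.65, 0.65, 0.50, 0.40, 0.45, 0.30, 0.20),
    "L": (0.75, 0.65, 0.55, 0.55, 0.40, 0.30, 0.35, 0.20, 0.05),
}


def normalize(values):
    """Min-max scale to [0, 1] across candidate phases (assumed form)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)  # all phases equal: use the midpoint
    return [(v - lo) / (hi - lo) for v in values]


def fuzzify(x):
    """Triangular H/M/L memberships on [0, 1] that sum to 1 (assumed shapes)."""
    return {
        "H": max(0.0, 2 * x - 1),
        "M": 1 - abs(2 * x - 1),
        "L": max(0.0, 1 - 2 * x),
    }


def fuzzy_score(pressure, queue, waiting):
    """Product rule activations, then activation-weighted mean of rule outputs."""
    mp, mq, mw = fuzzify(pressure), fuzzify(queue), fuzzify(waiting)
    num = den = 0.0
    for q_level in LEVELS:
        for p_idx, p_level in enumerate(LEVELS):
            for w_idx, w_level in enumerate(LEVELS):
                activation = mp[p_level] * mq[q_level] * mw[w_level]
                num += activation * RULE_OUTPUT[q_level][3 * p_idx + w_idx]
                den += activation
    return num / den


def softmax(scores, temperature):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_phase(q_values, pressures, queues, waits, epsilon, temperature, rng=random):
    """Explore via the fuzzy distribution with probability epsilon, else act greedily."""
    norm = [normalize(v) for v in (pressures, queues, waits)]
    scores = [fuzzy_score(p, q, w) for p, q, w in zip(*norm)]
    probs = softmax(scores, temperature)
    if rng.random() < epsilon:
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return max(range(len(q_values)), key=q_values.__getitem__)
```

With `epsilon = 0` the sketch reduces to greedy selection on the value estimates. With `epsilon = 1` every action is drawn from the fuzzy distribution, and a lower temperature concentrates that distribution on the highest-scoring phases.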
4. Experiments
4.1. Experimental Setup
4.2. Results
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| TSC | Traffic Signal Control |
| RL | Reinforcement Learning |
| MARL | Multi-Agent Reinforcement Learning |
| CTDE | Centralized Training with Decentralized Execution |
| IQL | Independent Q-Learning |
| VDN | Value Decomposition Networks |
| MDP | Markov Decision Process |
| POMDP | Partially Observable MDP |
| Dec-POMDP | Decentralized Partially Observable MDP |
| IGM | Individual-Global-Max |
| DQN | Deep Q-Network |
| UCB | Upper Confidence Bound |
| GRU | Gated Recurrent Unit |
| SUMO | Simulation of Urban MObility |

Table 1. Rule outputs of the fuzzy rule system for each combination of input levels (H, M, L).

| | | H | M | L | H | M | L | H | M | L |
|---|---|---|---|---|---|---|---|---|---|---|
| Q | H | 0.95 | 0.85 | 0.75 | 0.75 | 0.65 | 0.55 | 0.55 | 0.45 | 0.35 |
| | M | 0.85 | 0.75 | 0.65 | 0.65 | 0.50 | 0.40 | 0.45 | 0.30 | 0.20 |
| | L | 0.75 | 0.65 | 0.55 | 0.55 | 0.40 | 0.30 | 0.35 | 0.20 | 0.05 |

| Environment | Algorithm | Metric | Epsilon–Greedy | Fuzzy-Guided | Improv. (%) |
|---|---|---|---|---|---|
| SUMO 33 | MaxPressure * | Wait Time (s) | 23.05 ± 0.33 | | |
| | | Stopped | 4.12 ± 0.04 | | |
| | | Speed (m/s) | 9.73 ± 0.05 | | |
| | IQL | Reward | −0.27 ± 0.02 | −0.27 ± 0.02 | +0.6% |
| | | Wait Time (s) | 12.83 ± 0.70 | 10.72 ± 0.31 | +16.4% |
| | | Stopped | 3.89 ± 0.11 | 3.55 ± 0.06 | +8.7% |
| | | Speed (m/s) | 9.54 ± 0.07 | 9.64 ± 0.05 | +1.1% |
| | VDN | Reward | −0.30 ± 0.03 | −0.27 ± 0.03 | +11.5% |
| | | Wait Time (s) | 12.29 ± 0.62 | 10.82 ± 0.20 | +11.9% |
| | | Stopped | 3.89 ± 0.12 | 3.62 ± 0.06 | +6.9% |
| | | Speed (m/s) | 9.47 ± 0.08 | 9.61 ± 0.03 | +1.5% |
| | QMIX | Reward | −0.28 ± 0.03 | −0.27 ± 0.02 | +5.5% |
| | | Wait Time (s) | 11.89 ± 0.38 | 10.65 ± 0.11 | +10.5% |
| | | Stopped | 3.79 ± 0.09 | 3.56 ± 0.03 | +5.8% |
| | | Speed (m/s) | 9.54 ± 0.05 | 9.67 ± 0.07 | +1.4% |
| | QPLEX | Reward | −0.31 ± 0.01 | −0.28 ± 0.02 | +10.1% |
| | | Wait Time (s) | 13.73 ± 1.03 | 11.54 ± 0.35 | +15.9% |
| | | Stopped | 4.09 ± 0.11 | 3.79 ± 0.09 | +7.4% |
| | | Speed (m/s) | 9.40 ± 0.06 | 9.52 ± 0.05 | +1.3% |
| Cologne 8 | MaxPressure * | Wait Time (s) | 54.72 ± 5.62 | | |
| | | Stopped | 4.08 ± 0.30 | | |
| | | Speed (m/s) | 8.60 ± 0.05 | | |
| | IQL | Reward | −0.76 ± 0.14 | −0.44 ± 0.04 | +41.8% |
| | | Wait Time (s) | 20.29 ± 2.14 | 15.28 ± 1.95 | +24.7% |
| | | Stopped | 4.30 ± 0.10 | 3.67 ± 0.19 | +14.6% |
| | | Speed (m/s) | 8.19 ± 0.09 | 8.38 ± 0.05 | +2.4% |
| | VDN | Reward | −0.75 ± 0.18 | −0.43 ± 0.12 | +43.5% |
| | | Wait Time (s) | 20.92 ± 2.35 | 15.97 ± 2.89 | +23.7% |
| | | Stopped | 4.42 ± 0.25 | 3.81 ± 0.43 | +13.7% |
| | | Speed (m/s) | 8.18 ± 0.10 | 8.27 ± 0.14 | +1.1% |
| | QMIX | Reward | −0.58 ± 0.13 | −0.42 ± 0.05 | +28.2% |
| | | Wait Time (s) | 22.63 ± 4.39 | 14.88 ± 1.68 | +34.2% |
| | | Stopped | 4.45 ± 0.42 | 3.56 ± 0.30 | +20.1% |
| | | Speed (m/s) | 8.20 ± 0.08 | 8.45 ± 0.04 | +3.0% |
| | QPLEX | Reward | −1.63 ± 0.91 | −0.57 ± 0.14 | +64.7% |
| | | Wait Time (s) | 57.59 ± 20.37 | 17.85 ± 1.67 | +69.0% |
| | | Stopped | 6.22 ± 0.56 | 4.06 ± 0.33 | +34.7% |
| | | Speed (m/s) | 7.84 ± 0.11 | 8.31 ± 0.04 | +6.0% |
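
The improvement column can be read as a relative change against the epsilon-greedy baseline, signed so that a positive value means the fuzzy-guided variant did better. The short sketch below assumes that definition, which is consistent with the waiting-time rows but is not stated in the table itself. The function name `improvement_pct` is hypothetical. Because the table lists rounded means, recomputing other rows from the printed values may not reproduce the listed percentage exactly.

```python
def improvement_pct(baseline, fuzzy, lower_is_better):
    """Relative change versus the baseline, signed so that positive means better."""
    change = (baseline - fuzzy) if lower_is_better else (fuzzy - baseline)
    return 100.0 * change / abs(baseline)


# IQL waiting time in the first environment: 12.83 s (baseline) vs 10.72 s (fuzzy-guided).
print(round(improvement_pct(12.83, 10.72, lower_is_better=True), 1))  # prints 16.4
```

Dividing by the absolute value of the baseline keeps the sign convention intact for the reward rows, where both values are negative and a value closer to zero is better.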
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Ćiprovski, D.; Ilić, N.; Božilović, B.; Vučetić, M. Fuzzy-Guided Exploration for Multi-Agent Reinforcement Learning in Traffic Signal Control. Mathematics 2026, 14, 1942. https://doi.org/10.3390/math14111942

