TSS GAZ PTP: Towards Improving Gumbel AlphaZero with Two-Stage Self-Play for Multi-Constrained Electric Vehicle Routing Problems
Highlights
- We propose a Two-Stage Self-Play strategy that fosters more robust and effective policy improvement.
- We successfully address a key limitation of the original GAZ PTP framework, enabling its application to a wider range of real-world routing challenges.
- As a key challenge for smart cities, the Electric Vehicle Routing Problem (EVRP) is addressed using the proposed deep reinforcement learning (DRL) framework with the explicit goal of minimizing energy consumption.
Abstract
1. Introduction
- We propose a Two-Stage Self-Play strategy that resolves the unbalanced competition issue in GAZ PTP. The strategy makes the learning agent consistently compete against an opponent of comparable strength, which fosters more robust and effective policy improvement.
- We successfully extend our method from fixed-step problems to complex variable-step problems like the multi-constrained EVRP. This addresses a key limitation of the original GAZ PTP framework, enabling its application to a wider range of real-world routing challenges.
- Our proposed TSS GAZ PTP algorithm achieves state-of-the-art performance not only on fixed-step benchmarks but also on both multi-constrained DM-EVRP and EM-EVRP. It demonstrates significant advantages over traditional heuristics and other learning-based methods, particularly on large-scale instances.
2. Related Work
2.1. Games with Deep Reinforcement Learning
2.2. Combinatorial Optimization with Deep Reinforcement Learning
2.3. AlphaGo Zero’s Inspiration for Combinatorial Optimization
3. EVRP
3.1. Problem Formulation

- s.t.
3.2. Energy Consumption
3.3. Markov Decision Process of Multi-Constrained EVRP
- State: The state space. In the two-player game, each player starts from the depot in its own initial state, and each player has its own state at time step t. A state consists of the graph node state and the electric vehicle state. For each node i, the node state holds the static and the dynamic information of that node. The static information comprises the two-dimensional coordinates of the node and the demand of each customer. The vehicle state comprises the remaining battery of the electric vehicle, the current travel time, and the remaining capacity of the electric vehicle.
- Action: The action space. In the two-player game, each player's action at time step t is the action that the player has chosen at that step.
- Reward: Unlike a single-player task, where the reward is set to the minimized or maximized objective function, the reward here is reshaped into a binary outcome based on self-competition: the trajectory of one player at time step t is compared with the trajectory of the other player.
- Transition: The state transitions deterministically to the next state under the deterministic state transition function.
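
The MDP components above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class and function names, the `+1`/`-1` encoding of the binary reward, the tie rule, and the convention that a lower cost wins are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VehicleState:
    battery: float   # remaining battery of the electric vehicle
    time: float      # current travel time
    capacity: float  # remaining capacity of the electric vehicle


@dataclass(frozen=True)
class PlayerState:
    node: int              # node the player currently occupies (0 = depot, assumed)
    vehicle: VehicleState


def step(state: PlayerState, action: int, demand: float,
         energy: float, travel_time: float) -> PlayerState:
    """Deterministic transition: the same (state, action) always gives the same next state."""
    v = state.vehicle
    return PlayerState(
        node=action,
        vehicle=VehicleState(
            battery=v.battery - energy,
            time=v.time + travel_time,
            capacity=v.capacity - demand,
        ),
    )


def self_competition_reward(own_cost: float, opponent_cost: float) -> int:
    """Binary reward from comparing two trajectories (assumed: lower cost wins, ties count as a win)."""
    return 1 if own_cost <= opponent_cost else -1


if __name__ == "__main__":
    start = PlayerState(node=0, vehicle=VehicleState(battery=80.0, time=0.0, capacity=4000.0))
    after = step(start, action=3, demand=500.0, energy=2.5, travel_time=0.4)
    print(after)
    print(self_competition_reward(own_cost=100.0, opponent_cost=120.0))
```

Freezing the dataclasses keeps states immutable, so both players can branch from a shared state during search without affecting each other.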
4. Methodology
4.1. Two-Stage Self-Play
4.2. Algorithm
Algorithm 1 Gumbel AlphaZero Play-to-Plan with Two-Stage Self-Play
- Input: initial state distribution; set of initial states sampled from it
- Input: self-play parameter
- Init policy replay buffer and value replay buffer
- Init parameters for the policy net and the value net
- Init 'best' parameters
- Init stage
4.3. Network Architecture
5. Experiments
5.1. Validation on TSP
5.2. Extension to EVRP
5.3. Baselines
- Gurobi: A commercial optimization solver.
- ACO: An improved ant colony optimization metaheuristic for EVRP [22].
- ALNS: An adaptive large neighborhood search algorithm for EVRP, enhanced by a local search for intensification [19].
- AM: A reinforcement learning method based on the attention mechanism [23].
- DRL: A DRL method with a Transformer, designed specifically for EVRP [27].
- GAZ PTP: A reinforcement learning method based on self-competition [26].
- GAZ PTP (fine-tuned): The same framework as GAZ PTP, with parameters fine-tuned for the multi-constrained EVRP.
5.4. Results on EVRP
5.5. Visualization Analysis on EVRP
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Bonfiglio, A.; Minetti, M.; Procopio, R. Vehicle-to-home service via electric vehicle energy storage virtual partitioning. IEEE Trans. Ind. Appl. 2025, 61, 7790–7802.
2. Van Fan, Y.; Perry, S.; Klemeš, J.J.; Lee, C.T. A review on air emissions assessment: Transportation. J. Clean. Prod. 2018, 194, 673–684.
3. Kucukoglu, I.; Dewil, R.; Cattrysse, D. The electric vehicle routing problem and its variations: A literature review. Comput. Ind. Eng. 2021, 161, 107650.
4. Wang, G.; Qin, Z.; Wang, S.; Sun, H.; Dong, Z.; Zhang, D. Towards accessible shared autonomous electric mobility with dynamic deadlines. IEEE Trans. Mob. Comput. 2024, 23, 925–940.
5. Fan, G.; Yang, Z.; Jin, H.; Gan, X.; Wang, X. Enabling optimal control under demand elasticity for electric vehicle charging systems. IEEE Trans. Mob. Comput. 2022, 21, 955–970.
6. Yuan, W.; Huang, J.; Jun, Y. Competitive charging station pricing for plug-in electric vehicles. IEEE Trans. Smart Grid 2017, 8, 627–639.
7. Xu, Y.; Fang, M.; Chen, L.; Xu, G.; Du, Y.; Zhang, C. Reinforcement learning with multiple relational attention for solving vehicle routing problems. IEEE Trans. Cybern. 2022, 52, 11107–11120.
8. Li, X.; Luo, W.; Yuan, M.; Wang, J.; Lu, J.; Wang, J.; Lü, J.; Zeng, J. Learning to optimize industry-scale dynamic pickup and delivery problems. In Proceedings of the 2021 IEEE 37th International Conference on Data Engineering (ICDE), Chania, Greece, 19–22 April 2021; pp. 2511–2522.
9. Imran, N.M.; Won, M. Smartpathfinder: Pushing the limits of heuristic solutions for vehicle routing problem with drones using reinforcement learning. In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates, 14–18 October 2024.
10. Li, J.; Wang, P.; Ma, S. The impact of different transportation infrastructures on urban carbon emissions: Evidence from China. Energy 2024, 295, 131041.
11. Roselli, S.F.; Götvall, P.-L.; Fabian, M.; Åkesson, K. A compositional algorithm for the conflict-free electric vehicle routing problem. IEEE Trans. Autom. Sci. Eng. 2022, 19, 1405–1421.
12. Ding, Z.; Teng, F.; Sarikprueck, P.; Hu, Z. Technical review on advanced approaches for electric vehicle charging demand management, part II: Applications in transportation system coordination and infrastructure planning. IEEE Trans. Ind. Appl. 2020, 56, 5695–5703.
13. Pan, Y.A.; Song, Y.; Yang, T.; Ding, Y.; Hu, X. Equitable urban electric vehicle charging: Feasibility and benefits of streetlight charging in Kansas City right-of-way. J. Urban Plan. Dev. 2025, 151, 04025066.
14. Li, J.; Tian, S.; Zhang, N.; Liu, G.; Wu, Z.; Li, W. Optimization strategy for electric vehicle routing under traffic impedance guidance. Appl. Sci. 2023, 13, 11474.
15. Verma, A. Electric vehicle routing problem with time windows, recharging stations and battery swapping stations. EURO J. Transp. Logist. 2018, 7, 415–451.
16. Zhang, W.; Fang, X.; Sun, C. The alternative path for fossil oil: Electric vehicles or hydrogen fuel cell vehicles? J. Environ. Manag. 2023, 341, 118019.
17. Yang, B.; Ren, T.; Yu, H.; Chen, J.; Wang, Y. An evolutionary algorithm driving by dimensionality reduction operator and knowledge model for the electric vehicle routing problem with flexible charging strategy. Swarm Evol. Comput. 2025, 92, 101814.
18. Moradi, N.; Boroujeni, N.M. Prize-collecting electric vehicle routing model for parcel delivery problem. Expert Syst. Appl. 2025, 259, 125183.
19. Goeke, D.; Schneider, M. Routing a mixed fleet of electric and conventional vehicles. Eur. J. Oper. Res. 2015, 245, 81–99.
20. Sistig, H.M.; Sauer, D.U. Metaheuristic for the integrated electric vehicle and crew scheduling problem. Appl. Energy 2023, 339, 120915.
21. Mao, H.; Shi, J.; Zhou, Y.; Zhang, G. The electric vehicle routing problem with time windows and multiple recharging options. IEEE Access 2020, 8, 114864–114875.
22. Zhang, S.; Gajpal, Y.; Appadoo, S.S.; Abdulkader, M.M.S. Electric vehicle routing problem with recharging stations for minimizing energy consumption. Int. J. Prod. Econ. 2018, 203, 404–413.
23. Kool, W.; van Hoof, H.; Welling, M. Attention, learn to solve routing problems! In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
24. Wang, M.; Wei, Y.; Huang, X.; Gao, S. An end-to-end deep reinforcement learning framework for electric vehicle routing problem. IEEE Internet Things J. 2024, 11, 33671–33682.
25. Danihelka, I.; Guez, A.; Schrittwieser, J.; Silver, D. Policy improvement by planning with Gumbel. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022.
26. Pirnay, J.; Göttl, Q.; Burger, J.; Grimm, D.G. Policy-based self-competition for planning problems. In Proceedings of the Eleventh International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023.
27. Tang, M.; Zhuang, W.; Li, B.; Liu, H.; Song, Z.; Yin, G. Energy-optimal routing for electric vehicles using deep reinforcement learning with transformer. Appl. Energy 2023, 350, 121711.
28. Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv 2017, arXiv:1712.01815.
29. Guez, A.; Weber, T.; Antonoglou, I.; Simonyan, K.; Vinyals, O.; Wierstra, D.; Munos, R.; Silver, D. Learning to search with MCTSnets. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018.
30. Drori, I.; Kharkar, A.; Sickinger, R.; Kates, B.; Ma, Q.; Ge, S.; Dolev, E.; Dietrich, B.; Williamson, D.P.; Udell, M. Learning to solve combinatorial optimization problems on real-world graphs in linear time. In Proceedings of the 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA), Miami, FL, USA, 14–17 December 2020.
31. Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 2018, 362, 1140–1144.
32. Jaderberg, M.; Czarnecki, W.M.; Dunning, I.; Marris, L.; Lever, G.; Castañeda, A.G.; Beattie, C.; Rabinowitz, N.C.; Morcos, A.S.; Ruderman, A.; et al. Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science 2019, 364, 859–865.
33. Vinyals, O.; Babuschkin, I.; Czarnecki, W.M.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D.H.; Powell, R.; Ewalds, T.; Georgiev, P.; et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 2019, 575, 350–354.
34. Zhang, Y.; Li, M.; Chen, Y.; Chiang, Y.-Y.; Hua, Y. A constraint-based routing and charging methodology for battery electric vehicles with deep reinforcement learning. IEEE Trans. Smart Grid 2023, 14, 2446–2459.
35. Wang, H.; Preuss, M.; Plaat, A. Adaptive warm-start MCTS in AlphaZero-like deep reinforcement learning. In Proceedings of the Pacific Rim International Conference on Artificial Intelligence, Hanoi, Vietnam, 8–12 November 2021.
36. Bansal, T.; Pachocki, W.; Sidor, S.; Sutskever, I.; Mordatch, I. Emergent complexity via multi-agent competition. arXiv 2017, arXiv:1710.03748.
37. Laterre, A.; Fu, Y.; Jabri, M.K.; Cohen, A.-S.; Kas, D.; Hajjar, K.; Dahl, T.S.; Kerkeni, A.; Beguir, K. Ranked reward: Enabling self-play reinforcement learning for combinatorial optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
38. Wang, H.; Preuss, M.; Emmerich, M.; Plaat, A. Tackling Morpion Solitaire with AlphaZero-like ranked reward reinforcement learning. In Proceedings of the 2020 22nd International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), Timisoara, Romania, 1–4 September 2020; pp. 149–152.
39. Wang, Q.; Hao, Y.; Cao, J. Learning to traverse over graphs with a Monte Carlo tree search-based self-play framework. Eng. Appl. Artif. Intell. 2021, 105, 104422.
40. Hao, X.; Hao, J.; Xiao, C.; Li, K.; Li, D.; Zheng, Y. Multiagent Gumbel MuZero: Efficient planning in combinatorial action spaces. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024.






| Parameter | Description |
|---|---|
| L | Capacity of the vehicle |
| | Driver’s maximum serving time |
| Q | Battery capacity of the vehicle |
| | Demand of the customer i |
| | Remaining capacity while reaching node i |
| | Travel time while reaching node i |
| | Serving time while reaching node i |
| | Remaining battery capacity while reaching node i |
| | Distance between node i and node j |
| | Travel time of the node i from the node j |
| | Energy consumption from the node i to j |
| | Cargo load from the node i to the node j |
| | Average speed from the node i to the node j |

| Method | n = 20 Obj. | n = 20 Gap | n = 50 Obj. | n = 50 Gap | n = 100 Obj. | n = 100 Gap |
|---|---|---|---|---|---|---|
| Optimal Solver (Concorde) | 3.84 | 0.00% | 5.70 | 0.00% | 7.76 | 0.00% |
| Kool (Attention) | 3.85 | 0.34% | 5.80 | 1.76% | 8.12 | 4.53% |
| GAZ PTP | 3.84 | 0.17% | 5.78 | 1.55% | 8.01 | 3.16% |
| TSS GAZ PTP (ours) | 3.84 | 0.15% | 5.76 | 1.23% | 7.97 | 2.71% |
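
The Gap columns can be read as a relative difference to a reference objective. The sketch below assumes the usual definition, `100 * (obj - ref) / ref`; the function name is hypothetical and the text does not state the exact formula. Recomputing from the rounded averages in the table gives values close to the reported ones but not identical (for example about 1.75% against the reported 1.76% for the attention model at n = 50), presumably because the reported gaps come from unrounded or per-instance values.

```python
def optimality_gap(objective: float, reference: float) -> float:
    """Relative gap in percent between a method's objective and a reference objective."""
    if reference <= 0:
        raise ValueError("reference objective must be positive")
    return 100.0 * (objective - reference) / reference


if __name__ == "__main__":
    # Attention model vs. Concorde at n = 50, using the rounded table entries.
    print(f"{optimality_gap(5.80, 5.70):.2f}%")
```
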

| Parameter | Description | Value |
|---|---|---|
| L | Capacity of the vehicle | 4000 kg |
| | Unladen load | 4100 kg |
| Q | Battery capacity of the vehicle | 80 kWh |
| A | Frontal surface area | m² |
| | Atmospheric density | kg/m³ |
| g | Gravitational constant | m/s² |
| | Resistance coefficient | 0.01 |
| | Aerodynamic drag coefficient | 0.7 |
| | Propulsion efficiency | 1.18 |
| | Regenerative braking efficiency | 0.85 |
| | Charging efficiency | 1.11 |
| | Discharging efficiency | 0.93 |
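
To illustrate how the parameters in this table can enter an energy estimate, the following sketch uses a common constant-speed, flat-road model (rolling resistance plus aerodynamic drag). The model itself is an assumption and may differ from the formulation used in the paper; it ignores acceleration, road gradient, and regenerative braking. The function name is hypothetical, treating the propulsion efficiency of 1.18 as a multiplier on mechanical energy is an assumption, and the frontal area, air density, and gravity values in the usage lines are illustrative placeholders rather than values from the table.

```python
def energy_kwh(distance_m: float, speed_mps: float, load_kg: float, *,
               frontal_area_m2: float, air_density: float, gravity: float,
               unladen_kg: float = 4100.0, rolling_coeff: float = 0.01,
               drag_coeff: float = 0.7, propulsion_factor: float = 1.18) -> float:
    """Battery energy (kWh) to drive distance_m at constant speed on a flat road (assumed model)."""
    mass = unladen_kg + load_kg
    rolling_force = mass * gravity * rolling_coeff                                   # N
    drag_force = 0.5 * air_density * drag_coeff * frontal_area_m2 * speed_mps ** 2   # N
    mechanical_joules = (rolling_force + drag_force) * distance_m
    # Assumed: battery energy = propulsion factor * mechanical energy; 1 kWh = 3.6e6 J.
    return propulsion_factor * mechanical_joules / 3.6e6


if __name__ == "__main__":
    # Placeholder physical constants, chosen only for illustration.
    constants = dict(frontal_area_m2=4.0, air_density=1.2, gravity=9.81)
    empty = energy_kwh(10_000, 15.0, 0.0, **constants)
    full = energy_kwh(10_000, 15.0, 4000.0, **constants)
    print(f"empty: {empty:.3f} kWh, fully loaded: {full:.3f} kWh")
```

Under this model the energy grows linearly with both distance and cargo load, which is why an energy-minimizing objective can prefer different routes than a distance-minimizing one.
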

| Method | EM-EVRP10 Obj. (kWh) | EM-EVRP10 Gap | EM-EVRP20 Obj. (kWh) | EM-EVRP20 Gap | EM-EVRP50 Obj. (kWh) | EM-EVRP50 Gap |
|---|---|---|---|---|---|---|
| Gurobi | 145.25 | 0% | 225.52 | 0% | T/O | T/O |
| ALNS | 151.15 | 4.04% | 241.54 | 7.19% | 480.64 | 3.61% |
| ACO | 151.98 | 4.61% | 240.15 | 6.49% | 476.46 | 2.71% |
| AM (Greedy) | 152.87 | 5.22% | 247.65 | 9.81% | 486.54 | 4.88% |
| AM (Sample1280) | 149.62 | 2.99% | 238.63 | 5.81% | 471.54 | 1.65% |
| AM (Sample12800) | 149.28 | 2.75% | 237.56 | 5.34% | 468.46 | 0.99% |
| DRL (Greedy) | 151.35 | 4.18% | 244.54 | 8.43% | 482.27 | 3.96% |
| DRL (Sample1280) | 148.77 | 2.40% | 237.43 | 5.28% | 466.70 | 0.61% |
| DRL (Sample12800) | 148.43 | 2.17% | 236.33 | 4.79% | 464.69 | 0.18% |
| GAZ PTP | 160.14 | 10.25% | 252.27 | 11.86% | 570.18 | 22.92% |
| GAZ PTP (fine-tuned) | 155.23 | 6.84% | 245.85 | 9.01% | 550.75 | 18.73% |
| TSS GAZ PTP | 147.79 | 1.72% | 234.68 | 4.06% | 463.85 | 0.00% |
| Gurobi | 346.59 | 0% | 542.55 | 0% | T/O | T/O |
| ALNS | 355.01 | 2.43% | 572.21 | 5.47% | 1147.31 | 4.26% |
| ACO | 353.85 | 2.09% | 568.15 | 4.72% | 1140.74 | 3.66% |
| AM (Greedy) | 361.76 | 4.37% | 584.14 | 7.67% | 1153.05 | 4.78% |
| AM (Sample1280) | 357.23 | 3.07% | 571.45 | 5.33% | 1118.94 | 1.68% |
| AM (Sample12800) | 355.01 | 2.43% | 568.34 | 4.75% | 1112.04 | 1.05% |
| DRL (Greedy) | 357.24 | 3.07% | 577.38 | 6.42% | 1139.56 | 3.55% |
| DRL (Sample1280) | 352.72 | 1.77% | 565.17 | 4.17% | 1108.74 | 0.75% |
| DRL (Sample12800) | 352.35 | 1.66% | 562.16 | 3.61% | 1104.03 | 0.32% |
| GAZ PTP | 380.33 | 9.83% | 599.84 | 10.56% | 1328.14 | 20.69% |
| GAZ PTP (fine-tuned) | 366.07 | 5.62% | 582.05 | 7.28% | 1283.49 | 16.63% |
| TSS GAZ PTP | 351.84 | 1.51% | 560.49 | 3.30% | 1100.47 | 0.00% |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Wang, H.; Zhang, X.; Mu, C. TSS GAZ PTP: Towards Improving Gumbel AlphaZero with Two-Stage Self-Play for Multi-Constrained Electric Vehicle Routing Problems. Smart Cities 2026, 9, 21. https://doi.org/10.3390/smartcities9020021

