Article

TSS GAZ PTP: Towards Improving Gumbel AlphaZero with Two-Stage Self-Play for Multi-Constrained Electric Vehicle Routing Problems

1 School of Artificial Intelligence, Anhui University, Jiulong Road, Hefei 230601, China
2 Anhui Provincial Key Laboratory of Security Artificial Intelligence, Anhui University, Jiulong Road, Hefei 230601, China
* Author to whom correspondence should be addressed.
Smart Cities 2026, 9(2), 21; https://doi.org/10.3390/smartcities9020021
Submission received: 3 December 2025 / Revised: 13 January 2026 / Accepted: 19 January 2026 / Published: 23 January 2026

Highlights

What are the main findings?
  • We propose a Two-Stage Self-Play strategy that fosters more robust and effective policy improvement.
  • We successfully address a key limitation of the original GAZ PTP framework, enabling its application to a wider range of real-world routing challenges.
What is the implication of the main finding?
  • As a key challenge for smart cities, the Electric Vehicle Routing Problem (EVRP) is addressed using the proposed deep reinforcement learning (DRL) framework with the explicit goal of minimizing energy consumption.

Abstract

Deep reinforcement learning (DRL) with self-play has emerged as a promising paradigm for solving combinatorial optimization (CO) problems. The recently proposed Gumbel AlphaZero Play-to-Plan (GAZ PTP) framework adopts a competitive training setup between a learning agent and an opponent to tackle classical CO tasks such as the Traveling Salesman Problem (TSP). However, in complex and multi-constrained environments like the Electric Vehicle Routing Problem (EVRP), standard self-play often suffers from opponent mismatch: when the opponent is either too weak or too strong, the resulting learning signal becomes ineffective. To address this challenge, we introduce Two-Stage Self-Play GAZ PTP (TSS GAZ PTP), a novel DRL method designed to maintain adaptive and effective learning pressure throughout the training process. In the first stage, the learning agent, guided by Gumbel Monte Carlo Tree Search (MCTS), competes against a greedy opponent that follows the best historical policy. As training progresses, the framework transitions to a second stage in which both agents employ Gumbel MCTS, thereby establishing a dynamically balanced competitive environment that encourages continuous strategy refinement. The primary objective of this work is to develop a robust self-play mechanism capable of handling the high-dimensional constraints inherent in real-world routing problems. We first validate our approach on the TSP, a benchmark used in the original GAZ PTP study, and then extend it to the multi-constrained EVRP, which incorporates practical limitations including battery capacity, time windows, vehicle load limits, and charging infrastructure availability. The experimental results show that TSS GAZ PTP consistently outperforms existing DRL methods, with particularly notable improvements on large-scale instances.

1. Introduction

The expansion of transportation infrastructure, while vital for socioeconomic growth, has created significant environmental challenges, primarily due to carbon emissions [1,2,3]. Recent studies have explored time-constrained shared autonomous electric vehicle (SAEV) scheduling [4], demand-responsive charging control [5], and competitive charging pricing [6]. While these studies provide valuable insights at the level of the electric mobility ecosystem, path planning for electric vehicles (EVs) at the routing layer is equally critical. Path-planning models like the Traveling Salesman Problem (TSP) and the Vehicle Routing Problem (VRP) offer foundational strategies for mitigating these emissions by optimizing routes to minimize travel distance [7,8,9]. However, these classic combinatorial optimization problems and their variants were predominantly designed for traditional fuel vehicles, which remain a major source of carbon pollution [10]. The rapid development of EV technology and its supporting charging infrastructure now presents a promising and sustainable alternative [11,12,13]. Moreover, the integration of EVs is intrinsically linked to the development of Smart Cities, which fundamentally reshapes the objectives and constraints of path-planning algorithms [14,15].
Despite their environmental benefits, EVs face operational hurdles, most notably their limited driving range (500–700 km) and the relative scarcity of charging stations compared to ubiquitous gas stations [16]. This limitation complicates logistics planning, giving rise to a specific and more complex variant known as the Electric Vehicle Routing Problem (EVRP) [17,18]. Conventional approaches to solving the EVRP, such as adaptive large neighborhood search (ALNS) [19,20], variable neighborhood search (VNS) [21], and ant colony optimization (ACO) [22], have demonstrated success on small-scale problems. However, they often struggle with large-scale instances, require extensive parameter tuning, and are prone to converging on local optima.
The advent of deep reinforcement learning (DRL) has significantly transformed the landscape of solving complex optimization problems. By training deep neural network models, DRL offers new avenues for efficient and innovative solutions. Although DRL can provide desirable solutions for both small-scale and large-scale problems at lower computational cost, potentially outperforming traditional techniques, most current DRL research focuses on problems such as the TSP and VRP, which involve relatively few features and constraints [23,24]. Since these models typically consider distance as the sole objective function, the underlying problems are inherently linear in nature. In contrast, we adopt a multi-constrained EVRP model that uses energy consumption as the primary objective function. This model rigorously incorporates realistic physical constraints, including road slope, charging and discharging efficiency, and battery capacity limits. By transitioning from distance-based to energy-based optimization, our approach significantly enhances the operational reliability of logistics within Smart Cities. It provides the precise energy forecasting required for grid load balancing and ensures that autonomous logistics systems can navigate complex urban environments without the risk of range failure.
AlphaZero, renowned for its superhuman performance in complex games like Go, leverages Monte Carlo Tree Search (MCTS) to guide its decision-making. Its performance, however, is highly dependent on a large number of MCTS simulations. Gumbel AlphaZero (GAZ) addresses this by achieving strong performance with far fewer simulations [25]. A subsequent variant, Gumbel AlphaZero Play-to-Plan (GAZ PTP), has proven effective for fixed-step problems like the TSP and the Job-Shop Scheduling Problem (JSSP) [26]. A critical drawback of GAZ PTP, however, is that its self-play mechanism can become unbalanced; the superior MCTS-guided player can create a large performance gap, preventing the opposing policy network from learning effectively. To overcome this, we introduce a two-stage self-play strategy to enhance the GAZ PTP algorithm. We first validate our TSS GAZ PTP method on the TSP and then extend it to solve the more complex, variable-step multi-constrained EVRP. This is a non-trivial extension, as EV routes can include multiple revisits to the depot and charging stations. Our approach is designed to tackle both the Distance Minimization EVRP (DM-EVRP) and Energy Minimization EVRP (EM-EVRP) variants of the problem [27]. Figure 1 gives an overview of the proposed framework.
The main contributions of this paper are summarized as follows.
  • We propose a Two-stage self-play strategy that resolves the unbalanced competition issue in GAZ PTP. This strategy forces the learning agent to consistently compete against an opponent of comparable strength, fostering more robust and effective policy improvement.
  • We successfully extend our method from fixed-step problems to complex variable-step problems like the multi-constrained EVRP. This addresses a key limitation of the original GAZ PTP framework, enabling its application to a wider range of real-world routing challenges.
  • Our proposed TSS GAZ PTP algorithm achieves state-of-the-art performance not only on fixed-step benchmarks but also on both multi-constrained DM-EVRP and EM-EVRP. It demonstrates significant advantages over traditional heuristics and other learning-based methods, particularly on large-scale instances.

2. Related Work

2.1. Games with Deep Reinforcement Learning

DeepMind ushered in a new AI era by designing algorithms such as AlphaGo, AlphaGo Zero, and AlphaZero to defeat professional players in complex games like Go and Chess [28]. These systems rely heavily on MCTS [29] and upper confidence bounds (UCB) for decision-making within vast state spaces. Originally, AlphaGo used a two-stage process involving supervised learning from expert experience followed by iterative RL policy optimization [30]. However, the updated AlphaGo Zero and AlphaZero removed the need for human input; knowing only the rules, they trained via self-play and surpassed the original AlphaGo [31].
While DRL has thrived in single-agent and two-player settings, the real world requires navigating multi-agent environments involving both cooperation and competition. DeepMind attained human-level performance in the Capture-the-Flag variant of the 3D multiplayer game Quake III Arena by means of population-based reinforcement learning [32]. Furthermore, DeepMind’s AlphaStar [33] demonstrates that multi-agent RL can master one of the most challenging real-time strategy games with vast action spaces, partial observability, and long-term planning, suggesting broader applicability to complex sequential decision-making problems.

2.2. Combinatorial Optimization with Deep Reinforcement Learning

In recent years, the Transformer has shown superior performance in Natural Language Processing (NLP), particularly in machine translation and other sequence-to-sequence tasks. Combinatorial optimization problems (COPs) are inherently sequence optimization problems, as their solutions can be described through sequential decisions. Consequently, typical COPs like the TSP and the EVRP are increasingly being studied using DRL methods. For instance, Wang [24] proposed a Graph Attention Network (GAT)-based encoder capable of providing high-dimensional node and graph embeddings for downstream EVRP tasks. Tang [27] formulated an energy consumption model (as opposed to a traditional distance model) using a Transformer-based DRL approach. To address the computational inefficiency for large-scale EVRP instances, Zhang [34] designed a two-layer model that identifies near-optimal solutions based on predefined feasibility conditions and reward structures.

2.3. AlphaGo Zero’s Inspiration for Combinatorial Optimization

Based on the self-play training strategy, AlphaGo Zero [31] achieved superhuman performance in the game of Go. Wang [35] used MCTS with warm-start enhancements to improve the quality of the games produced by self-play. Although AlphaGo Zero was designed for two-player games, many researchers have attempted to apply the AlphaZero algorithm to single-player tasks by creating competitive environments. Most of them reconstruct the reward mechanism based on self-competition for different problems [36,37,38]. However, these methods do not use MCTS while training the policy network or value network; they only apply the MCTS-guided network after training. To address this issue, Wang [39] improved the policy by employing complete information from the MCTS search tree and learning the trajectory produced by MCTS. To reduce simulation costs, GAZ with Gumbel MCTS was proposed, and researchers further developed a new framework to improve the GAZ-based policy network [25,26].
Through self-play, the agent learns to find strong paths by planning against potential strategies of its previous self, and it has shown higher performance on classical CO problems such as the TSP and JSSP. However, GAZ-based methods have not yet been investigated for the multi-constrained EVRP with variable planning steps, which remains a challenge.

3. EVRP

3.1. Problem Formulation

This section introduces the multi-constrained EVRP. Following [27], we define it on a directed graph $G = (V, E)$ with $V = C \cup D \cup \hat{F}$, where $C = \{1+s, \ldots, n+s\}$ is the set of $n$ customers, $D = \{0\}$ represents the depot, $\hat{F} = \{1, \ldots, s\}$ is the set of recharge stations, and $E = \{(i, j) \mid i, j \in V\}$ is the set of edges connecting two nodes. Each customer $i$ with demand $c_i$ can be served only once. Our goal is to visit all customers and plan complete routes for the electric vehicles that minimize the total distance or total energy consumption while satisfying all the constraints described in the following equations. As shown in Figure 2, electric vehicles start from the depot $D$ with a maximum load capacity $L$ and a maximum battery capacity $Q$, and return to the depot $D$ after serving all customers. Throughout the journey, the remaining capacity of each electric vehicle must not fall below 0, and the vehicle must visit recharge stations in time to avoid being stranded while driving. We also take into account the maximum serving time of the driver, $T_{max}$.
Figure 2. A simple example of the multi-constrained EVRP path planning.
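$$\min f(x) = \sum_{i \in V,\, j \in V,\, i \neq j} E_{ij}\, x_{ij} \tag{1}$$
s.t.
$$\sum_{j \in V,\, i \neq j} x_{ij} = 1, \quad \forall i \in C \tag{2}$$
$$\sum_{j \in V,\, i \neq j} x_{ij} \leq 1, \quad \forall i \in \hat{F} \tag{3}$$
$$\sum_{j \in V,\, i \neq j} x_{ij} - \sum_{j \in V,\, i \neq j} x_{ji} = 0, \quad \forall i \in C \cup \hat{F} \tag{4}$$
$$\tau_0 = 0 \tag{5}$$
$$0 \leq \tau_i \leq T_{max}, \quad \forall i \in C \cup D \cup \hat{F} \tag{6}$$
$$\tau_i + g_i + t_{ij}\, x_{ij} - T_{max}\,(1 - x_{ij}) \leq \tau_j, \quad \forall i, j \in V,\ i \neq j \tag{7}$$
$$l_j \leq l_i - c_i\, x_{ij} + L\,(1 - x_{ij}), \quad \forall i, j \in V,\ i \neq j \tag{8}$$
$$0 \leq l_i \leq L, \quad \forall i \in C \cup D \cup \hat{F} \tag{9}$$
$$u_0 = L \tag{10}$$
$$e_j \leq e_i - E_{ij} + Q\,(1 - x_{ij}), \quad \forall i \in C,\ \forall j \in V,\ i \neq j \tag{11}$$
$$e_i = Q, \quad \forall i \in D \cup \hat{F} \tag{12}$$
$$0 \leq e_i \leq Q, \quad \forall i \in C \cup D \cup \hat{F} \tag{13}$$
$$x_{ij} \in \{0, 1\}, \quad \forall i, j \in V,\ i \neq j \tag{14}$$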
The objective function of the multi-constrained EM-EVRP is given in Equation (1). Constraints (2) and (3) stipulate that each customer is served exactly once, which avoids wasted resources and redundant deliveries, while the charging stations can be visited multiple times. Constraint (4) enforces route continuity, while Constraints (5) and (6) ensure that the driver's serving time does not exceed $T_{max}$ and is reset after returning to the depot. This protects driver safety and health, prevents fatigued driving, and complies with practical operational management standards. Constraint (7) states that the driver's serving time is updated whenever the vehicle arrives at a site. Constraint (8) tracks the electric vehicle's cargo, while Constraints (9) and (10) ensure that electric vehicles leave the depot fully loaded and have a cargo capacity of $L$. Constraint (11) states that the battery level of an electric vehicle is updated whenever it arrives at a site, while Constraints (12) and (13) ensure that the remaining battery capacity never exceeds $Q$ and that the battery is fully charged at the depot and at recharge stations. This is the core feature that distinguishes the EVRP from the traditional VRP: accurate modeling of battery behavior is crucial for preventing "mid-route power depletion" and directly determines whether a path is feasible. Constraint (14) defines the decision variables. The parameters used in these equations and their explanations are listed in Table 1.
The multi-constrained DM-EVRP model shares the same constraints (Equations (2)–(14)), but the objective function for multi-constrained DM-EVRP is defined as (15).
$$\min f(x) = \sum_{i \in V,\, j \in V,\, i \neq j} d_{ij}\, x_{ij} \tag{15}$$
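The load, battery, and time constraints can be checked mechanically for a candidate route. The following is a minimal sketch rather than the authors' implementation: it assumes a single vehicle, precomputed matrices `energy[i][j]` and `travel_time[i][j]`, per-node service times `service[i]`, and no charging time at stations. The function name and all parameter names are hypothetical.

```python
from typing import Sequence, Set


def route_is_feasible(
    route: Sequence[int],
    demand: Sequence[float],
    energy: Sequence[Sequence[float]],
    travel_time: Sequence[Sequence[float]],
    service: Sequence[float],
    stations: Set[int],
    L: float,
    Q: float,
    T_max: float,
    depot: int = 0,
) -> bool:
    """Return True if one vehicle can drive `route` without violating a limit."""
    if len(route) < 2 or route[0] != depot or route[-1] != depot:
        return False
    # The vehicle leaves the depot fully loaded and fully charged.
    load, battery, clock = L, Q, 0.0
    served = set()
    for i, j in zip(route, route[1:]):
        # Battery update along edge (i, j); recovered energy cannot exceed Q.
        battery = min(Q, battery - energy[i][j])
        clock += service[i] + travel_time[i][j]
        if battery < 0 or clock > T_max:
            return False
        if j == depot:
            # Assumption: reload, recharge, and reset the driver's clock.
            load, battery, clock = L, Q, 0.0
        elif j in stations:
            battery = Q  # full recharge at a recharge station
        else:
            if j in served:
                return False  # each customer is served exactly once
            served.add(j)
            load -= demand[j]
            if load < 0:
                return False
    return True
```

In a decoder, a check of this kind would typically be used to mask infeasible next nodes before an action is selected.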

3.2. Energy Consumption

We calculate the energy consumption of the electric vehicles between node $i$ and node $j$ as follows:
$$P_{ij} = \left[ m_{ij} \left( a + g \sin\alpha_{ij} + C_r \cos\alpha_{ij} \right) + S \right] \nu_{ij}$$
$$S = 0.5 \cdot C_d \cdot \rho \cdot A \cdot \nu_{ij}^2$$
where $m_{ij}$ represents the capacity of the electric vehicles between node $i$ and node $j$; $a$ is the acceleration of the vehicle and is set to 0 since we assume the speed $\nu_{ij}$ is constant in our experiments. The gravitational acceleration is denoted by $g$, the air density by $\rho$, the resistance coefficient by $C_r$, the aerodynamic drag coefficient by $C_d$, the frontal area of the vehicle by $A$, and the slope between nodes $i$ and $j$ by $\alpha_{ij}$. After obtaining the mechanical power $P_{ij}$, the energy consumption is calculated as follows:
$$E_{ij} = \begin{cases} \phi_d \cdot \varphi_d \cdot P_{ij} \cdot t_{ij}, & P_{ij}(m_{ij}) \geq 0\ \mathrm{kW} \\ \phi_r \cdot \varphi_r \cdot P_{ij} \cdot t_{ij}, & P_{ij}(m_{ij}) < 0\ \mathrm{kW} \end{cases}$$
Because the slope determines the sign of the mechanical power, the formula for energy consumption is divided into two cases: electric vehicles need more energy when traveling uphill and can recover energy when traveling downhill. Here $t_{ij}$ represents the travel time between node $i$ and node $j$, and $\phi_d$ and $\phi_r$ denote the discharging and charging efficiency of the battery, respectively. Our current EVRP formulation assumes uniform vehicle speed and omits transient energy effects such as those from acceleration and traffic-induced speed variations. While this assumption keeps the model tractable and aligns with common practice in existing DRL-based path planning research, it inevitably overlooks several critical real-world factors inherent to urban driving.
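To make the two-case energy model concrete, the sketch below evaluates it for a single edge. It is an illustration under stated assumptions, not the authors' code: it assumes the power expression groups as $P = [m(a + g\sin\alpha + C_r\cos\alpha) + S]\,\nu$ with $S = 0.5\, C_d\, \rho\, A\, \nu^2$, that the slope is given in radians, and that all quantities share a consistent unit system. The function and argument names are hypothetical, and no vehicle parameter values are implied.

```python
import math


def mechanical_power(mass, speed, slope, *, c_r, c_d, rho, area,
                     accel=0.0, g=9.81):
    """Mechanical power on one edge under the grouping assumed above."""
    drag = 0.5 * c_d * rho * area * speed ** 2  # the term S
    force = mass * (accel + g * math.sin(slope) + c_r * math.cos(slope)) + drag
    return force * speed


def edge_energy(power, travel_time, *, phi_d, varphi_d, phi_r, varphi_r):
    """Energy on one edge: the 'd' coefficients apply when power is
    non-negative, the 'r' coefficients when power is negative (recovery)."""
    if power >= 0:
        return phi_d * varphi_d * power * travel_time
    return phi_r * varphi_r * power * travel_time
```

A negative return value of `edge_energy` corresponds to energy recovered on a downhill edge, which is why the battery update in a simulator should cap the state of charge at the battery capacity.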

3.3. Markov Decision Process of Multi-Constrained EVRP

In this paper, we model the multi-constrained EVRP as a two-player game. The Markov decision process (MDP) can then be defined as a tuple $(S, A, R, P)$, consisting of state $S$, action $A$, reward $R$, and state transition $P$.
  • State: $S$ represents the state space. In a two-player game, $S = (s_t^{1}, s_t^{-1})$: each player starts from the depot with the initial state $s_0^{1}$ or $s_0^{-1}$, respectively, and $s_t^{1}, s_t^{-1}$ represent the states of the two players at time step $t$. The graph node state and the electric vehicle state are represented as $\{s_t^{1}, s_t^{-1}\} = \{x_t^{1}, v_t^{1}, x_t^{-1}, v_t^{-1}\}$. For each node $i$, $x_t^i = (x_s^i, c_t^i)$, where $x_s^i$ and $c_t^i$ are the static and dynamic information of node $i$, respectively: $x_s^i$ holds the two-dimensional coordinates of the node and $c_t^i$ the demand of the customer. For the vehicle state $\{v_t^{1}, v_t^{-1}\} = \{e_t^{1}, \tau_t^{1}, u_t^{1}, e_t^{-1}, \tau_t^{-1}, u_t^{-1}\}$, $e_t$ is the remaining battery of the electric vehicle, $\tau_t$ is the current travel time, and $u_t$ is the remaining capacity of the electric vehicle.
  • Action: $A$ represents the action space. In the two-player game, $A = \{a_0^{1}, a_1^{1}, \ldots, a_T^{1}, a_0^{-1}, a_1^{-1}, \ldots, a_T^{-1}\}$, where $a_t$ is the action chosen at time step $t$.
  • Reward: Unlike a single-player task, which sets the reward as the minimization or maximization of the objective function, the reward $R$ is reshaped into a binary $\pm 1$ signal based on self-competition, obtained by comparing the trajectories $\zeta_0^p = (s_0^p, a_0^p, \ldots, s_{T-1}^p, a_{T-1}^p, s_T^p)$ of the two players $p \in \{1, -1\}$. At intermediate time steps, $R = 0$ if $t < T - 1$.
  • Transition: The state transition is deterministic, $P = F(s_t, a_t)$, with the deterministic state transition function $F : S \times A \to S$.
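The binary self-competition reward can be sketched in a few lines. This is a minimal illustration with hypothetical names: it assumes each player's finished trajectory is summarized by a scalar cost (total distance or energy, lower is better), and that a tie is credited to the player being rewarded.

```python
def self_competition_reward(own_cost, opponent_cost, t, horizon):
    """Reward for one player at time step t of a game with `horizon` steps.

    Intermediate steps give 0; the final step gives +1 if the player's
    trajectory is at least as good as the opponent's, and -1 otherwise.
    """
    if t < horizon - 1:
        return 0
    return 1 if own_cost <= opponent_cost else -1
```

Because only the sign of the comparison matters, the reward scale is the same for every instance size, which is one reason self-competition avoids hand-tuned reward normalization.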

4. Methodology

4.1. Two-Stage Self-Play

This section presents our new GAZ PTP method with the Two-Stage Self-Play strategy. As shown in Figure 3, in stage 1 the learning player uses Gumbel MCTS to choose actions, while the competitor chooses its actions from the best historical policy network. In this stage, only the learning player can take the competitor's state into account during the expansion phase and update the node information through backpropagation. After a period of training, the learning player can no longer find better trajectories: its use of Gumbel MCTS gives it a large advantage in complex tasks, which leads to unbalanced competition. Therefore, we introduce the second stage, in which both players use Gumbel MCTS, so that both keep searching for better trajectories during the competition. This also increases the depth of MCTS, because both players can take the other player's state into account and update the node information through backpropagation.

4.2. Algorithm

Algorithm 1 illustrates our proposed training framework. In each episode, the two players are assigned the roles of learning player and competitor, and training itself is divided into two stages. In the first stage, the learning player chooses its actions according to the policy network $\pi_\theta$ based on GAZ PTP [25], while the competitor chooses its actions greedily from the policy network $\pi_{\theta^B}$, which is the best historical policy network of the learning player. $\pi_{\theta^B}$ is updated only periodically, in the arena mode, when the outcome of greedily rolling out $\pi_\theta$ is better than that of $\pi_{\theta^B}$. This design aims at fast convergence while saving computational resources, since the competitor performs no Monte Carlo simulations and Gumbel MCTS needs only a few simulations. After n episodes of training, the learning player performs much better than the competitor.
Therefore, to ensure that the learning player learns stronger trajectories, in the second stage each player uses the Gumbel MCTS-based policy network, $\pi_\theta$ or $\pi_{\theta^B}$, to select actions. Our proposed Two-Stage Self-Play strategy makes the learning player constantly compete with an opponent of similar playing strength, so that it can learn smarter trajectories from the games. In each episode, self-play is performed with probability $P$, which helps the learning player learn from the trajectories of its own policy and also maintains the diversity of the training data.
Algorithm 1 Gumbel AlphaZero Play-to-Plan with Two-Stage Self-Play
Input: $\rho_0$: initial state distribution; $J_{arena}$: set of initial states sampled from $\rho_0$
Input: $0 \leq P < 1$: self-play parameter
Init policy replay buffer $\mathcal{M}_\pi = \{\}$ and value replay buffer $\mathcal{M}_V = \{\}$
Init parameters $\theta, \nu$ for policy net $\pi_\theta : S \to \Delta A$ and value net $V_\nu : S \times S \to [-1, 1]$
Init 'best' parameters $\theta^B \leftarrow \theta$
Init stage $g = 1$
for episode $= 1, \ldots, N$ do
    Assign learning player: $l \leftarrow \mathrm{random}(\{1, -1\})$
    if $P \geq \mathrm{random}(0, 1)$ then
        $\mu \leftarrow \pi_\theta$
    else
        $\mu \leftarrow \pi_{\theta^B}$
    for step $t = 0, \ldots, T-1$ do
        for player $p = 1, -1$ do
            if player $p \neq l$ and stage $g = 1$ then
                Player chooses action $a_t^p$ according to policy $\mu$ and updates state $s_{t+1}^p$
            else
                Perform policy improvement $\mathcal{I}\pi(s_t^p)$, using $V_\nu$ and $\pi_\theta$ based on MCTS to choose action $a_t^p$ and update state $s_{t+1}^p$
                Store $(s_t^p, \mathcal{I}\pi(s_t^p))$ in policy replay buffer $\mathcal{M}_\pi$
    Trajectories $\zeta^p \leftarrow (s_0^p, a_0^p, \ldots, s_{T-1}^p, a_{T-1}^p, s_T^p)$ for player $p \in \{1, -1\}$
    if $r(\zeta^{1}) \geq r(\zeta^{-1})$ then
        Game outcome $z = 1$
    else
        Game outcome $z = -1$
    Store tuples $(s_t^{1}, s_t^{-1}, z)$ and $(s_t^{-1}, s_{t+1}^{1}, z)$ in value replay buffer $\mathcal{M}_V$
    if $\sum_{s_0 \in J_{arena}} \left( r(\zeta_{0, \pi_\theta}^{greedy}) - r(\zeta_{0, \pi_{\theta^B}}^{greedy}) \right) > 0$ then
        Update $\theta^B \leftarrow \theta$
    if episode $\geq E_n$ then
        Set stage $g = 2$
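The control flow that distinguishes the two stages can be isolated from the networks and the tree search. The following sketch shows only that branching logic; it is not the authors' implementation, every function name is hypothetical, and the neural networks and Gumbel MCTS are deliberately left out. As a simplification, the stage is computed directly from the episode index and a hypothetical `switch_episode` threshold.

```python
import random


def stage_for_episode(episode, switch_episode):
    """Stage 1 before the switch point, stage 2 from the switch point on."""
    return 2 if episode >= switch_episode else 1


def action_mode(player, learner, stage):
    """'policy': act directly from the opponent policy (competitor, stage 1).
    'mcts': act through Gumbel-MCTS-based policy improvement."""
    if player != learner and stage == 1:
        return "policy"
    return "mcts"


def sample_opponent_policy(self_play_prob, rng):
    """With probability P the opponent policy is the current network
    (self-play); otherwise it is the best historical network."""
    return "current" if self_play_prob >= rng.random() else "best"


def should_update_best(current_rewards, best_rewards):
    """Arena test: replace the best network when the current network's greedy
    rollouts achieve a higher summed reward on the same arena instances."""
    return sum(c - b for c, b in zip(current_rewards, best_rewards)) > 0
```

For example, with `switch_episode=20000`, `action_mode(-1, 1, stage_for_episode(100, 20000))` returns `"policy"`, while the same call at episode 25000 returns `"mcts"`, reflecting that the competitor also searches in the second stage.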

4.3. Network Architecture

The policy and value networks are based on the Transformer architecture, with a Transformer block that differs slightly from the vanilla Transformer block, as shown in Figure 4. We apply batch normalization (BN) before the Multi-Head Attention (MHA) and use gate aggregation [7] after the MHA and the feedforward network (FFN) in place of additive aggregation. For the policy network, we use a pointing mechanism based on state attention to compute the probability of each legal action, similar to the method used in [23].

5. Experiments

5.1. Validation on TSP

To compare the performance of our proposed TSS GAZ PTP with the original GAZ PTP method, we first performed experiments on TSP instances with 20, 50 and 100 nodes, which were also tested in [23,26]. The coordinates for each instance are sampled from $[0, 1]^2$. For problems of different scales, we use the same hyperparameter settings as GAZ PTP [26]. As shown in Table 2, our method achieves the best performance among the previous best learning-based methods on the TSP.
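As an illustration of this benchmark setup, the sketch below samples a TSP instance and evaluates a closed tour. It assumes the coordinates are drawn uniformly and independently from the unit square and that tour quality is the Euclidean length of the closed tour; the function names are hypothetical.

```python
import math
import random


def sample_tsp_instance(n_nodes, seed=None):
    """Sample n_nodes points uniformly from the unit square."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random()) for _ in range(n_nodes)]


def tour_length(coords, tour):
    """Euclidean length of the closed tour visiting nodes in `tour` order."""
    tour = list(tour)
    return sum(
        math.dist(coords[a], coords[b])
        for a, b in zip(tour, tour[1:] + tour[:1])
    )
```

In the self-competition setting, the negative tour length would serve as the trajectory outcome that the two players compare.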

5.2. Extension to EVRP

Having validated the advantage of our TSS GAZ PTP method on the TSP, we extend the method to variable-step problems. We therefore conducted experiments on multi-constrained EVRP instances with n = 10, 20 and 50 customers, where each category consists of 512 different instances, the same as in [27]. Specifically, instances with 10 and 20 customers have four recharge stations, while instances with 50 customers have eight. Both the customer sites and the recharge stations are uniformly distributed in the $[0, 100]^2$ km area, and the depot is randomly distributed in the $[25, 75]^2$ km area. The demand of each customer is sampled uniformly from $\{0.25, 0.5, 0.75, 1\}$. The vehicle parameters and their descriptions are given in Table 3. The current model employs a simplified energy consumption function based on average speed and distance traveled. While this approach helps maintain computational feasibility, it fails to adequately reflect the dynamic energy consumption variations caused by frequent starts and stops, traffic lights, congestion, and other factors encountered on actual urban roads. For problems of different scales, we use the same hyperparameter settings as GAZ PTP [26]. The node embedding dimension is 128 and the batch size is 256. We employ the Adam optimizer with a constant learning rate of $10^{-4}$.
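The instance distribution described here can be reproduced with a short generator. This is a sketch under assumptions, not the authors' data pipeline: the depot is assumed to be drawn uniformly from its square, nodes are indexed with the depot first, then the recharge stations, then the customers (following the index sets in Section 3.1), and the function name and dictionary keys are hypothetical.

```python
import random


def generate_evrp_instance(n_customers, n_stations, seed=None):
    """Sample one EVRP instance with coordinates in km."""
    rng = random.Random(seed)
    depot = (rng.uniform(25.0, 75.0), rng.uniform(25.0, 75.0))
    stations = [
        (rng.uniform(0.0, 100.0), rng.uniform(0.0, 100.0))
        for _ in range(n_stations)
    ]
    customers = [
        (rng.uniform(0.0, 100.0), rng.uniform(0.0, 100.0))
        for _ in range(n_customers)
    ]
    demands = [rng.choice((0.25, 0.5, 0.75, 1.0)) for _ in range(n_customers)]
    return {
        # Node order: index 0 is the depot, 1..s the stations, then customers.
        "coords": [depot] + stations + customers,
        "station_ids": list(range(1, n_stations + 1)),
        "customer_ids": list(range(n_stations + 1, n_stations + n_customers + 1)),
        "demands": demands,
    }
```

A C10-S4 test set in this sketch would be `[generate_evrp_instance(10, 4, seed=k) for k in range(512)]`; the seeds here are placeholders, not the seeds used in the experiments.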

5.3. Baselines

We compare the proposed TSS GAZ PTP framework with the following methods:
  • Gurobi: A commercial optimization solver.
  • ACO: An improved ant colony algorithm based meta-heuristics to solve EVRP [22].
  • ALNS: Adaptive large neighborhood search algorithm, which is enhanced by a local search for intensification to solve EVRP [19].
  • AM: A Reinforcement Learning method based on attention mechanism [23].
  • DRL: A DRL method with Transformer specifically for EVRP [27].
  • GAZ PTP: A Reinforcement Learning method based on self-competition [26].
  • GAZ PTP (fine-tuned): The framework is the same as GAZ PTP, but we have fine-tuned parameters for multi-constrained EVRP.

5.4. Results on EVRP

Our methods are implemented on two NVIDIA GeForce RTX 4090 GPUs and an Intel i9-14900K CPU running at 6.00 GHz. Based on a lightweight preliminary test, we split the total of 50 k training episodes into two stages of 20 k and 30 k episodes, with a simulation budget of 100; by comparison, [23,27] train on 128M trajectories. The training durations are 15 h for C10-S4, 40 h for C20-S4, and 120 h for C50-S8.
To present the training curve, we select the category C10-S4 EM-EVRP as an example and show the results in Figure 5. As training proceeds, the total energy consumption gradually decreases. For our TSS GAZ PTP method, state-of-charge (SOC) consumption drops sharply after 20 k episodes, demonstrating the efficacy of the second training stage. The training curve eventually converges, indicating that, despite sporadic oscillations during training, our approach has learned a consistently good policy.
In addition, we conducted a convergence analysis of the complete learning process of TSS GAZ PTP, visualizing the reduction in SOC consumption as the number of training iterations increases. For comparison, we divide the SOC consumption of each EVRP category by the number of customers to obtain the average battery consumption per customer. In Figure 6, the x-axis represents the number of iterations, and the y-axis represents the average SOC consumption for visiting each customer. The decreasing SOC consumption with increasing training iterations demonstrates the effectiveness of the TSS GAZ PTP algorithm on all types of EVRP instances. Importantly, TSS GAZ PTP achieves more significant improvements on larger instances, indicating its potential to handle larger problems.
The complete results of the comparison experiments for instances of different sizes are shown in Table 4. Similar to [27], we measure performance through the total energy consumption $E$ and the normalized energy consumption, which relates $E$ to the best objective value $E_{best}$ across all methods. For Gurobi, the time limit was set to 1 h per instance, consistent with common practice in large-scale combinatorial optimization benchmarks. In categories C10-S4 and C20-S4, the Gurobi solver performs better than the other methods, but it fails to find optimal solutions to large-scale problems within the time limit; we therefore use "T/O" in Table 4 to denote that the time limit was exceeded. The proposed TSS GAZ PTP performs better than the other learning-based models on both the multi-constrained EM-EVRP and DM-EVRP for all instance sizes. As the number of nodes increases, its advantage becomes increasingly notable. TSS GAZ PTP significantly outperforms the previous state-of-the-art methods on all C50-S8 instances, indicating its potential for large-scale multi-constrained problems.
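The two derived quantities used in this analysis are simple to compute. The sketch below is illustrative only: it assumes the normalized energy consumption is the ratio of a method's total energy to the best value across methods, which is one plausible reading of the description above, and the function names and numbers are hypothetical placeholders rather than reported results.

```python
def soc_per_customer(total_soc, n_customers):
    """Average SOC consumption per customer for one instance category."""
    return total_soc / n_customers


def normalized_energy(energy_by_method):
    """Ratio of each method's total energy to the best (lowest) value.

    Assumption: normalization is E / E_best; methods without a result
    (for example a timeout) should be left out of the input mapping.
    """
    best = min(energy_by_method.values())
    return {name: value / best for name, value in energy_by_method.items()}
```

Under this reading, the best method in a category has a normalized value of exactly 1, and larger values indicate proportionally higher energy consumption.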

5.5. Visualization Analysis on EVRP

To visually compare GAZ PTP, GAZ PTP (fine-tuned), and TSS GAZ PTP, we analyzed the solutions they generate. Figure 7 illustrates the solutions of the three algorithms on an EVRP50 instance. Compared with the other two algorithms, TSS GAZ PTP produces the fewest path crossings and uses fewer routes in total. The visualization indicates that our method has strong path planning capabilities and that our training strategy is effective.

6. Conclusions

In this paper, we proposed TSS GAZ PTP, a novel DRL framework inspired by GAZ, for solving the multi-constrained EVRP. Rather than redesigning the network architecture or adding state feature modules for specific tasks, we designed a new training strategy based on self-play that can be applied to more complex and general tasks and enhances the diversity of training data. Our training strategy ensures that the player keeps exploring smarter trajectories through self-play and provides an efficient way to escape local optima. The experimental results show that TSS GAZ PTP consistently improves the solution quality compared to the original GAZ PTP and other deep reinforcement learning baselines, with particularly notable gains on large-scale instances.
For future work, it remains promising to explore more efficient self-play strategies for GAZ-type methods and to apply them to more multi-constrained tasks, especially large-scale instances. Another direction is to explore alternative ways to reduce the computational cost of Monte Carlo simulations in large-scale problems. In addition, combining the proposed approach with multi-agent methods [40] for more complex tasks requires further investigation.

Author Contributions

X.Z. and H.W.: methodology, experiments, writing (original draft); C.M.: conceptualization, writing (review and editing). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Hefei Key Science and Technology Special Projects under Grant 2024SZD006.

Data Availability Statement

In the Experimental Section, we describe how the dataset was generated and provide the fixed parameter seeds. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence this work.

References

  1. Bonfiglio, A.; Minetti, M.; Procopio, R. Vehicle-to-home service via electric vehicle energy storage virtual partitioning. IEEE Trans. Ind. Appl. 2025, 61, 7790–7802.
  2. Van Fan, Y.; Perry, S.; Klemeš, J.J.; Lee, C.T. A review on air emissions assessment: Transportation. J. Clean. Prod. 2018, 194, 673–684.
  3. Kucukoglu, I.; Dewil, R.; Cattrysse, D. The electric vehicle routing problem and its variations: A literature review. Comput. Ind. Eng. 2021, 161, 107650.
  4. Wang, G.; Qin, Z.; Wang, S.; Sun, H.; Dong, Z.; Zhang, D. Towards accessible shared autonomous electric mobility with dynamic deadlines. IEEE Trans. Mob. Comput. 2024, 23, 925–940.
  5. Fan, G.; Yang, Z.; Jin, H.; Gan, X.; Wang, X. Enabling optimal control under demand elasticity for electric vehicle charging systems. IEEE Trans. Mob. Comput. 2022, 21, 955–970.
  6. Yuan, W.; Huang, J.; Jun, Y. Competitive charging station pricing for plug-in electric vehicles. IEEE Trans. Smart Grid 2017, 8, 627–639.
  7. Xu, Y.; Fang, M.; Chen, L.; Xu, G.; Du, Y.; Zhang, C. Reinforcement learning with multiple relational attention for solving vehicle routing problems. IEEE Trans. Cybern. 2022, 52, 11107–11120.
  8. Li, X.; Luo, W.; Yuan, M.; Wang, J.; Lu, J.; Wang, J.; Lü, J.; Zeng, J. Learning to optimize industry-scale dynamic pickup and delivery problems. In Proceedings of the 2021 IEEE 37th International Conference on Data Engineering (ICDE), Chania, Greece, 19–22 April 2021; pp. 2511–2522.
  9. Imran, N.M.; Won, M. Smartpathfinder: Pushing the limits of heuristic solutions for vehicle routing problem with drones using reinforcement learning. In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates, 14–18 October 2024.
  10. Li, J.; Wang, P.; Ma, S. The impact of different transportation infrastructures on urban carbon emissions: Evidence from China. Energy 2024, 295, 131041.
  11. Roselli, S.F.; Götvall, P.-L.; Fabian, M.; Åkesson, K. A compositional algorithm for the conflict-free electric vehicle routing problem. IEEE Trans. Autom. Sci. Eng. 2022, 19, 1405–1421.
  12. Ding, Z.; Teng, F.; Sarikprueck, P.; Hu, Z. Technical review on advanced approaches for electric vehicle charging demand management, Part II: Applications in transportation system coordination and infrastructure planning. IEEE Trans. Ind. Appl. 2020, 56, 5695–5703.
  13. Pan, Y.A.; Song, Y.; Yang, T.; Ding, Y.; Hu, X. Equitable urban electric vehicle charging: Feasibility and benefits of streetlight charging in Kansas City right-of-way. J. Urban Plan. Dev. 2025, 151, 04025066.
  14. Li, J.; Tian, S.; Zhang, N.; Liu, G.; Wu, Z.; Li, W. Optimization strategy for electric vehicle routing under traffic impedance guidance. Appl. Sci. 2023, 13, 11474.
  15. Verma, A. Electric vehicle routing problem with time windows, recharging stations and battery swapping stations. EURO J. Transp. Logist. 2018, 7, 415–451.
  16. Zhang, W.; Fang, X.; Sun, C. The alternative path for fossil oil: Electric vehicles or hydrogen fuel cell vehicles? J. Environ. Manag. 2023, 341, 118019.
  17. Yang, B.; Ren, T.; Yu, H.; Chen, J.; Wang, Y. An evolutionary algorithm driving by dimensionality reduction operator and knowledge model for the electric vehicle routing problem with flexible charging strategy. Swarm Evol. Comput. 2025, 92, 101814.
  18. Moradi, N.; Boroujeni, N.M. Prize-collecting electric vehicle routing model for parcel delivery problem. Expert Syst. Appl. 2025, 259, 125183.
  19. Goeke, D.; Schneider, M. Routing a mixed fleet of electric and conventional vehicles. Eur. J. Oper. Res. 2015, 245, 81–99.
  20. Sistig, H.M.; Sauer, D.U. Metaheuristic for the integrated electric vehicle and crew scheduling problem. Appl. Energy 2023, 339, 120915.
  21. Mao, H.; Shi, J.; Zhou, Y.; Zhang, G. The electric vehicle routing problem with time windows and multiple recharging options. IEEE Access 2020, 8, 114864–114875.
  22. Zhang, S.; Gajpal, Y.; Appadoo, S.S.; Abdulkader, M.M.S. Electric vehicle routing problem with recharging stations for minimizing energy consumption. Int. J. Prod. Econ. 2018, 203, 404–413.
  23. Kool, W.; van Hoof, H.; Welling, M. Attention, learn to solve routing problems! In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
  24. Wang, M.; Wei, Y.; Huang, X.; Gao, S. An end-to-end deep reinforcement learning framework for electric vehicle routing problem. IEEE Internet Things J. 2024, 11, 33671–33682.
  25. Danihelka, I.; Guez, A.; Schrittwieser, J.; Silver, D. Policy improvement by planning with Gumbel. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022.
  26. Pirnay, J.; Göttl, Q.; Burger, J.; Grimm, D.G. Policy-based self-competition for planning problems. In Proceedings of the Eleventh International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023.
  27. Tang, M.; Zhuang, W.; Li, B.; Liu, H.; Song, Z.; Yin, G. Energy-optimal routing for electric vehicles using deep reinforcement learning with transformer. Appl. Energy 2023, 350, 121711.
  28. Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv 2017, arXiv:1712.01815.
  29. Guez, A.; Weber, T.; Antonoglou, I.; Simonyan, K.; Vinyals, O.; Wierstra, D.; Munos, R.; Silver, D. Learning to search with MCTSnets. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018.
  30. Drori, I.; Kharkar, A.; Sickinger, R.; Kates, B.; Ma, Q.; Ge, S.; Dolev, E.; Dietrich, B.; Williamson, D.P.; Udell, M. Learning to solve combinatorial optimization problems on real-world graphs in linear time. In Proceedings of the 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA), Miami, FL, USA, 14–17 December 2020.
  31. Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 2018, 362, 1140–1144.
  32. Jaderberg, M.; Czarnecki, W.M.; Dunning, I.; Marris, L.; Lever, G.; Castañeda, A.G.; Beattie, C.; Rabinowitz, N.C.; Morcos, A.S.; Ruderman, A.; et al. Human-level performance in 3D multiplayer games with population-based reinforcement learning. Science 2019, 364, 859–865.
  33. Vinyals, O.; Babuschkin, I.; Czarnecki, W.M.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D.H.; Powell, R.; Ewalds, T.; Georgiev, P.; et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 2019, 575, 350–354.
  34. Zhang, Y.; Li, M.; Chen, Y.; Chiang, Y.-Y.; Hua, Y. A constraint-based routing and charging methodology for battery electric vehicles with deep reinforcement learning. IEEE Trans. Smart Grid 2023, 14, 2446–2459.
  35. Wang, H.; Preuss, M.; Plaat, A. Adaptive warm-start MCTS in AlphaZero-like deep reinforcement learning. In Proceedings of the Pacific Rim International Conference on Artificial Intelligence, Hanoi, Vietnam, 8–12 November 2021.
  36. Bansal, T.; Pachocki, W.; Sidor, S.; Sutskever, I.; Mordatch, I. Emergent complexity via multi-agent competition. arXiv 2017, arXiv:1710.03748.
  37. Laterre, A.; Fu, Y.; Jabri, M.K.; Cohen, A.-S.; Kas, D.; Hajjar, K.; Dahl, T.S.; Kerkeni, A.; Beguir, K. Ranked reward: Enabling self-play reinforcement learning for combinatorial optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
  38. Wang, H.; Preuss, M.; Emmerich, M.; Plaat, A. Tackling Morpion Solitaire with AlphaZero-like ranked reward reinforcement learning. In Proceedings of the 2020 22nd International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), Timisoara, Romania, 1–4 September 2020; pp. 149–152.
  39. Wang, Q.; Hao, Y.; Cao, J. Learning to traverse over graphs with a Monte Carlo tree search-based self-play framework. Eng. Appl. Artif. Intell. 2021, 105, 104422.
  40. Hao, X.; Hao, J.; Xiao, C.; Li, K.; Li, D.; Zheng, Y. Multiagent Gumbel MuZero: Efficient planning in combinatorial action spaces. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024.
Figure 1. The basic framework of the proposed Two-Stage Self-Play Gumbel AlphaZero (TSS GAZ PTP). The red part represents the action selection made by the learning player with Gumbel MCTS, and the blue part represents the action selection made by the competitor player, which uses a greedy strategy in stage 1 and Gumbel MCTS in stage 2.
Figure 3. The comparison of Gumbel MCTS for Stage 1 and 2.
Figure 4. Vanilla Transformer block (left) and our Transformer block (right) that adds gate aggregation.
Figure 5. Results of comparison experiments among GAZ PTP, GAZ PTP (fine-tuned) and TSS GAZ PTP on category C10-S4 for EVRP.
Figure 6. TSS GAZ PTP training curves on instances of category C10-S4, category C20-S4, and category C50-S8.
Figure 7. Visualization of solutions generated by GAZ PTP, GAZ PTP (fine-tuned) and TSS GAZ PTP on EVRP50. (a) GAZ PTP. (b) GAZ PTP (fine-tuned). (c) TSS GAZ PTP.
Table 1. The parameters and descriptions of the multi-constrained EVRP.

| Parameter | Description |
|---|---|
| $L$ | Capacity of the vehicle |
| $T_{max}$ | Driver’s maximum serving time |
| $Q$ | Battery capacity of the vehicle |
| $c_i$ | Demand of customer $i$ |
| $l_i$ | Remaining capacity when reaching node $i$ |
| $\tau_i$ | Travel time when reaching node $i$ |
| $g_i$ | Serving time when reaching node $i$ |
| $e_i$ | Remaining battery capacity when reaching node $i$ |
| $d_{ij}$ | Distance between node $i$ and node $j$ |
| $t_{ij}$ | Travel time from node $i$ to node $j$ |
| $E_{ij}$ | Energy consumption from node $i$ to node $j$ |
| $u_{ij}$ | Cargo load from node $i$ to node $j$ |
| $v_{ij}$ | Average speed from node $i$ to node $j$ |
Table 2. TSS GAZ PTP vs. baselines for TSP, where each size of the test set consists of 10,000 instances.

| Method | n = 20 Obj. | n = 20 Gap | n = 50 Obj. | n = 50 Gap | n = 100 Obj. | n = 100 Gap |
|---|---|---|---|---|---|---|
| Optimal Solver (Concorde) | 3.84 | 0.00% | 5.70 | 0.00% | 7.76 | 0.00% |
| Kool (Attention) | 3.85 | 0.34% | 5.80 | 1.76% | 8.12 | 4.53% |
| GAZ PTP | 3.84 | 0.17% | 5.78 | 1.55% | 8.01 | 3.16% |
| TSS GAZ PTP (ours) | 3.84 | 0.15% | 5.76 | 1.23% | 7.97 | 2.71% |
Table 3. The parameter values and descriptions for the multi-constrained EVRP setup.

| Parameter | Description | Value |
|---|---|---|
| $L$ | Capacity of the vehicle | 4000 kg |
| $m_c$ | Unladen load | 4100 kg |
| $Q$ | Battery capacity of the vehicle | 80 kWh |
| $A$ | Frontal surface area | 3.912 m² |
| $\rho$ | Atmospheric density | 1.2 kg/m³ |
| $g$ | Gravitational acceleration | 9.81 m/s² |
| $c_r$ | Resistance coefficient | 0.01 |
| $c_d$ | Aerodynamic drag coefficient | 0.7 |
| $\phi_d$ | Propulsion efficiency | 1.18 |
| $\phi_r$ | Regenerative braking efficiency | 0.85 |
| $\varphi_d$ | Charging efficiency | 1.11 |
| $\varphi_r$ | Discharging efficiency | 0.93 |
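
To illustrate how the parameters of this setup typically enter an energy computation, the sketch below uses a simplified longitudinal-dynamics model that is common in the EVRP literature. It is an assumption for illustration, not the energy equation defined in the paper: it assumes a flat road, constant speed on each arc, no regenerative braking, and that the two factors greater than 1 ($\phi_d$ and $\varphi_d$) scale mechanical energy up to the energy drawn from the battery. The names `VehicleParams` and `arc_energy_kwh` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VehicleParams:
    """Default values taken from the parameter table of the EVRP setup."""
    unladen_mass_kg: float = 4100.0   # m_c
    frontal_area_m2: float = 3.912    # A
    air_density: float = 1.2          # rho, kg/m^3
    gravity: float = 9.81             # g, m/s^2
    rolling_coeff: float = 0.01       # c_r
    drag_coeff: float = 0.7           # c_d
    propulsion_factor: float = 1.18   # phi_d (assumed mechanical -> electrical)
    battery_factor: float = 1.11      # varphi_d (assumed electrical -> battery)


def arc_energy_kwh(distance_m: float, speed_mps: float, load_kg: float,
                   p: VehicleParams = VehicleParams()) -> float:
    """Battery energy (kWh) to traverse one arc at constant speed on flat road.

    Simplified model (an assumption, not the paper's equation):
    force = rolling resistance + aerodynamic drag, energy = force * distance.
    """
    mass = p.unladen_mass_kg + load_kg
    rolling_force = p.rolling_coeff * mass * p.gravity
    drag_force = (0.5 * p.drag_coeff * p.air_density
                  * p.frontal_area_m2 * speed_mps ** 2)
    mechanical_joules = (rolling_force + drag_force) * distance_m
    battery_joules = p.battery_factor * p.propulsion_factor * mechanical_joules
    return battery_joules / 3.6e6  # 1 kWh = 3.6e6 J


# A 10 km arc at 15 m/s (54 km/h) carrying 2000 kg of cargo.
energy = arc_energy_kwh(distance_m=10_000, speed_mps=15.0, load_kg=2000.0)
print(f"{energy:.2f} kWh")
```

Under these assumptions the example arc costs roughly 3.5 kWh, and the cost grows with the cargo load through the rolling-resistance term, which is why the load carried on each arc matters for energy-minimizing routes.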
Table 4. TSS GAZ PTP vs. baselines for multi-constrained EM-EVRP and DM-EVRP, where each category of the test set has 512 instances. T/O denotes that the time limit was exceeded.

| Method | EM-EVRP10 Obj. (kWh) | Gap | EM-EVRP20 Obj. (kWh) | Gap | EM-EVRP50 Obj. (kWh) | Gap |
|---|---|---|---|---|---|---|
| Gurobi | 145.25 | 0% | 225.52 | 0% | T/O | T/O |
| ALNS | 151.15 | 4.04% | 241.54 | 7.19% | 480.64 | 3.61% |
| ACO | 151.98 | 4.61% | 240.15 | 6.49% | 476.46 | 2.71% |
| AM (Greedy) | 152.87 | 5.22% | 247.65 | 9.81% | 486.54 | 4.88% |
| AM (Sample1280) | 149.62 | 2.99% | 238.63 | 5.81% | 471.54 | 1.65% |
| AM (Sample12800) | 149.28 | 2.75% | 237.56 | 5.34% | 468.46 | 0.99% |
| DRL (Greedy) | 151.35 | 4.18% | 244.54 | 8.43% | 482.27 | 3.96% |
| DRL (Sample1280) | 148.77 | 2.40% | 237.43 | 5.28% | 466.70 | 0.61% |
| DRL (Sample12800) | 148.43 | 2.17% | 236.33 | 4.79% | 464.69 | 0.18% |
| GAZ PTP | 160.14 | 10.25% | 252.27 | 11.86% | 570.18 | 22.92% |
| GAZ PTP (fine-tuned) | 155.23 | 6.84% | 245.85 | 9.01% | 550.75 | 18.73% |
| TSS GAZ PTP | 147.79 | 1.72% | 234.68 | 4.06% | 463.85 | 0.00% |

| Method | DM-EVRP10 Obj. | Gap | DM-EVRP20 Obj. | Gap | DM-EVRP50 Obj. | Gap |
|---|---|---|---|---|---|---|
| Gurobi | 346.59 | 0% | 542.55 | 0% | T/O | T/O |
| ALNS | 355.01 | 2.43% | 572.21 | 5.47% | 1147.31 | 4.26% |
| ACO | 353.85 | 2.09% | 568.15 | 4.72% | 1140.74 | 3.66% |
| AM (Greedy) | 361.76 | 4.37% | 584.14 | 7.67% | 1153.05 | 4.78% |
| AM (Sample1280) | 357.23 | 3.07% | 571.45 | 5.33% | 1118.94 | 1.68% |
| AM (Sample12800) | 355.01 | 2.43% | 568.34 | 4.75% | 1112.04 | 1.05% |
| DRL (Greedy) | 357.24 | 3.07% | 577.38 | 6.42% | 1139.56 | 3.55% |
| DRL (Sample1280) | 352.72 | 1.77% | 565.17 | 4.17% | 1108.74 | 0.75% |
| DRL (Sample12800) | 352.35 | 1.66% | 562.16 | 3.61% | 1104.03 | 0.32% |
| GAZ PTP | 380.33 | 9.83% | 599.84 | 10.56% | 1328.14 | 20.69% |
| GAZ PTP (fine-tuned) | 366.07 | 5.62% | 582.05 | 7.28% | 1283.49 | 16.63% |
| TSS GAZ PTP | 351.84 | 1.51% | 560.49 | 3.30% | 1100.47 | 0.00% |

Share and Cite

MDPI and ACS Style

Wang, H.; Zhang, X.; Mu, C. TSS GAZ PTP: Towards Improving Gumbel AlphaZero with Two-Stage Self-Play for Multi-Constrained Electric Vehicle Routing Problems. Smart Cities 2026, 9, 21. https://doi.org/10.3390/smartcities9020021
