Article

AlphaRouter: Bridging the Gap Between Reinforcement Learning and Optimization for Vehicle Routing with Monte Carlo Tree Searches

1 Hyundai Glovis, Seoul 685-700, Republic of Korea
2 Department of Industrial Engineering, College of Engineering, Hanyang University, Seoul 133-791, Republic of Korea
* Author to whom correspondence should be addressed.
Entropy 2025, 27(3), 251; https://doi.org/10.3390/e27030251
Submission received: 16 January 2025 / Revised: 19 February 2025 / Accepted: 24 February 2025 / Published: 27 February 2025
(This article belongs to the Special Issue Information-Theoretic Methods in Data Analytics)

Abstract:
Deep reinforcement learning (DRL) as a routing problem solver has shown promising results in recent studies. However, an inherent gap exists between computationally driven DRL and optimization-based heuristics. While a DRL algorithm for a certain problem is able to solve several similar problem instances, traditional optimization algorithms focus on optimizing the solution to one specific problem instance. In this paper, we propose an approach, AlphaRouter, which solves routing problems while bridging the gap between reinforcement learning and optimization. Tailored to routing problems, our approach first introduces attention-based policy and value networks: a policy network that produces a probability distribution over all possible next nodes and a value network that predicts the expected distance from any given state. We modify a Monte Carlo tree search (MCTS) for routing problems and apply it selectively during solution construction. Our experiments demonstrate that the combined approach is promising and yields better solutions than the original reinforcement learning (RL) approaches without MCTS, with performance comparable to classical heuristics.

1. Introduction

In NP-hard combinatorial optimization (CO) problems, finding globally optimal solutions is computationally infeasible. Instead of seeking global optima, numerous heuristics have shown promising results. Despite their high effectiveness, the application of heuristics in real-life industries is often hindered by a variety of problems and uncertain information. Indeed, heuristics, which have mathematical origins, depend on precise problem formulations for proper application. However, an exact formulation of constraints is quite challenging in reality, as some constraints change rapidly or are highly stochastic in a distributional sense. For instance, a few constraints may vanish at a time, while others, e.g., a newly unknown coefficient, must be estimated by an assumed distribution and a certain procedure. As real-life domains are entangled with various participants and requirements, some constraints are too complex to formulate. In such situations, particularly when simulation is possible as in a game, the RL approach has recently attracted attention in the literature and industry.
Mostly, heuristics aim to solve one specific problem. That is to say, heuristics designed for the capacitated vehicle routing problem (CVRP), for example, cannot be applied to bin-packing problems. To deal with versatile constraints and complex problems, the use of deep neural network architectures coupled with RL, called DRL, has recently been considered effective [1,2]. The DRL approach is flexible, as translating a problem into a reinforcement learning framework is straightforward: one appropriately defines both the state and the reward and runs computational simulations. In the long run, the ultimate goal of the DRL approach is to find a new, computational way to solve complex problems that surpasses the performance of mathematically exact algorithms and their heuristics [3].
Nonetheless, the current performance of DRL has not reached that of heuristic solvers, and ongoing studies aim to improve it. Our goal is to improve DRL performance by reducing the gap between heuristic solvers and DRL. Motivated by the AlphaGo series [4,5], we propose a deep-layered network for RL equipped with the selective application of a Monte Carlo Tree Search (MCTS), a general framework applicable to various types of CO problems. We modify some components of the MCTS for application to routing domains, which differ from games.
Unlike the AlphaGo series, we observe that applying MCTS to every action choice is inefficient. To address this, we propose an entropy-based strategy for selectively applying MCTS, which is justified from an information-theory perspective and designed to enhance performance in routing problems. To the best of our knowledge, this represents a novel attempt to refine network architectures in the context of customized MCTS for routing tasks.
The introduced RL framework is quite beneficial if the resulting network is applicable to other similar problems, using the same network architecture as suggested in [1,2,6].
The contributions of our paper are threefold: (1) We propose a deep-layered neural network architecture, fitted to the routing problem, with an exact definition of states and rewards and a policy gradient using a value network. (2) We propose a new MCTS strategy, demonstrating that integrating our MCTS into the neural network architecture and applying it selectively based on entropy improves the solution quality. (3) We also demonstrate that the choice of activation function numerically improves the solution quality. In short, the main focus of this paper is to propose an effective RL architecture with a modified MCTS strategy for routing problems and to improve search performance. Notably, although we propose a neural network architecture specialized to routing problems, any neural network architecture containing a policy and a value, as described in later sections, can be integrated into MCTS [7], and the domain, a class of routing problems in this paper, can be extended to other problems.
We organize the rest of the paper as follows: In Section 2, we provide an overview of previous works related to combinatorial optimization, routing, and MCTS. In Section 3, we briefly introduce a general formulation of capacitated vehicle routing problems and our problem’s objective. In Section 4, we expound on our proposed approach, AlphaRouter. In Section 5, we present our experimental results.

2. Related Works

Routing problems are among the most well-known problems in combinatorial optimization. The traveling salesman problem (TSP), one of the simplest routing problems, seeks the sequence of nodes with the shortest total distance. A vehicle routing problem (VRP), similar to TSP, is a routing problem with the concepts of depots and demand nodes. Numerous variants of VRP exist in the literature, such as VRP with time windows and VRP with pickup and delivery [8]. In this paper, we focus on the capacitated VRP (CVRP), where a vehicle has a limit on its loading amount. Although some variants handle multiple vehicles, we consider only one vehicle for simplicity.
Traditionally, approaches to these problems fall mainly into two types: math-based approaches such as mixed integer programming, and carefully designed heuristics for a specific type of problem. An example of the latter is the Lin–Kernighan heuristic [9]. In the past decade, hybrid genetic searches with advanced diversity controls have been introduced and applied successfully to various CVRP variants, greatly improving computation time and performance [10,11]. We agree that the current stage of DRL does not surpass the performance of analytically driven heuristics, but we emphasize that there should be efforts to solve problems using fewer mathematics-entangled methods, such as dynamic programming and stochastic optimization.
Recent research on neural networks for routing problems can be broadly categorized into two approaches based on the type of input they use: graph modules and sequential modules. Graph modules take a graph’s adjacency matrix as input and employ graph neural networks, which are naturally suited to the graph structure of routing problems [1,12]. In contrast, sequential modules use a list of node coordinates as input and are designed to be compatible with certain types of exact solution inputs. In this paper, we focus on the sequential module approach.
The pointer network [6] was an early model for solving routing problems. It suggested a supervised way to train a modified Seq2Seq (sequence-to-sequence) network [13] with an early attention mechanism [14] by producing an output that is a pointer to the input token. However, a significant disadvantage of the pointer network is that one cannot obtain enough true labels to train large problems since routing problems are NP-hard. To overcome this limitation, the approach in [15] introduced the RL method for training neural networks using the famous and simple policy gradient theorem [16].
Similar to machine translation evolving from Seq2Seq to Transformers [17], routing also adopted the Transformer architecture in [18], using the attention layers to encode the relationships between nodes and the encodings in the decoder to produce a probability distribution over the most promising nodes. Replacing only the internal neural network, they kept the RL training the same as in [15]. Due to their effectiveness in capturing complex relationships between nodes and generating high-quality solutions, attention-based DRL methods have become one of the most commonly used approaches in DRL for routing problems [18,19,20].
In addition to designing neural architectures, some works focus on the search process itself. For example, the work in [2] introduced a parallel in-training search process, named POMO, based on attention network designs. The POMO algorithm assigns a different start node to each of several rollouts and executes multiple episodes concurrently. An episode or rollout can be understood as a process in which a vehicle travels to the next customer until all customers are visited. Among the many episodes, the best solution is selected as the final solution. Although POMO does not introduce any additional parameters to the model, the size of the input is usually bounded by $O(N^2)$, where $N$ is the total number of nodes in the problem. Another approach is to adjust the weights of a pre-trained model during inference to fit the model to a single problem instance, as proposed in [21].
MCTS is a decision-making algorithm commonly used in games, such as chess, Go, and poker. The algorithm selects the next actions by simulating the game and updates the decision policy using data from the simulation. The original algorithm consists of four phases: selection, expansion, rollout, and backpropagation. In the selection phase, the algorithm starts at the root node and recursively goes down to a child node, maximizing the upper confidence bound (UCB) scores. This score balances exploration and exploitation in the selection process by considering the visit counts and the average value (i.e., average winning rate) gained on that node. A more detailed explanation of UCB is presented in [22].
When the selection phase ends and the current node is a leaf node, the expansion phase is executed, in which a new child node is appended to the leaf node following a specific policy named “expansion policy”. Then, the rollout phase, using “rollout policy”, simulates the game until it ends and gathers the result (i.e., win or lose). In the backpropagation phase, by backtracking the path (the sequence of selected nodes), the evaluated result from the rollout policy is updated for each node. For example, the updates increase visit counts by one and update the average win rate with the result from the rollout phase. We note that there are two different policies used in the original MCTS, but in our MCTS implementation, only the expansion policy exists, and the rollout policy is replaced by the value network. We further describe this in Section 4.
Since MCTS shares similarities with routing problems, in that both involve sequential decision-making, we adopt MCTS as an additional search strategy for routing problems. Indeed, there have been efforts to integrate MCTS into CVRP solvers, demonstrating its potential to enhance decision-making in combinatorial optimization tasks. An Upper Confidence Bounded Tree (UCT; an extension of MCTS)-based vehicle routing solver was suggested in [23], and an extension to HF-CVRP (Hybrid Fleet CVRP) was shown in [24]. However, these approaches applied MCTS in a non-selective manner without utilizing DRL methods.
In the game of Go, the next move is selected sequentially based on the board situation. In routing, the next node to visit is selected sequentially based on the current position and other external information, such as node locations and demands. Thus, the similarity between Go and routing is clear.
Neural networks have been successfully integrated with MCTS in AlphaGo [4], and AlphaGo Zero achieved even better results by introducing a self-play methodology for training the network [5]. The first AlphaGo used two different neural networks for “expansion” and “rollout”, incurring a computational burden because of the many recursive calls to the “rollout” network until the end of the game. AlphaGo Zero solved this problem: a single call to the value network predicts the expected result from any given state (or node), which was originally gathered in the rollout phase. In short, one call to the value network replaced numerous calls to the rollout network, saving substantial computation. Our work adopts this idea for efficient MCTS simulation. Neural network architectures of this kind have previously been explored in reinforcement learning (RL) contexts [2,12,18].
Building on these efforts, similar DRL–MCTS combinations applying AlphaZero [4,5] have been studied for the pre-marshalling problem [25] and coordinated route planning [26], and work integrating neural networks with MCTS exists not only in games but also in the qubit routing challenge [27,28]. As far as we know, there is no research on an MCTS modified specifically for routing. Thus, we aim to bridge the gap between DRL and heuristic solvers by selectively applying MCTS to VRP tasks using entropy, making it more effective for addressing various routing challenges.

3. Preliminaries

Before explaining our work, we introduce the formulation of the CVRP with one vehicle to connect it with our routing problem. We note that TSP can easily be formulated from CVRP by modifying some conditions.
We start with a set of $n$ customers, and each customer $i$, $i = 1, \ldots, n$, has a predefined positive demand quantity $q_i$. To fulfill the customers’ demands, the vehicle starts its route at the depot node, indexed as 0. The vehicle must visit each customer only once, and its load cannot exceed its capacity $Q_{max}$. In conventional settings, $Q_{max}$ is presumably set large enough to fulfill all customers’ demands. In reality, however, a vehicle may start with a small $Q_{max}$ due to a lack of information, and its load should be refilled. We observe that the problem formulation below cannot itself reflect vehicle refilling, but we aim to handle this situation as an example of the dynamic routing problem [29], using our RL approach in the next section. For example, as described in Figure 1, we present a scenario where the vehicle refills with a smaller load after its first subroute is created and solved with DRL. In the figure, dots represent customer nodes, and red dots denote visited nodes, which are also connected by the red line. Figure 1a represents the initial setting with $Q_{max} = 1$, Figure 1b shows the situation after the vehicle is refilled with a slightly different $Q_{max} = 0.9$, and Figure 1c shows the routing result with $Q_{max} = 0.9$. Usually, handling these dynamics of the environment requires a complicated mathematical formulation or expert engineering techniques [29]. With DRL, however, one can just adjust $Q_{max}$, which changes only one line in our implementation.
As a graph representation of the problem is common in the literature, we also represent the problem using a graph $G(V, E)$, where $V = \{0, 1, \ldots, n, n+1\}$ contains all the nodes in the problem; nodes $0$ and $n+1$ denote the same depot node. The last node $n+1$ is just an extra term for ease of formulation, serving as the final depot of a tour. We define $\pi_t$ to be the node visited at time $t$, $t \geq 0$, with $\pi_0 = 0$, and a tour $\pi_{t_1,t_2}$ from time $t_1$ up to $t_2$ is defined as a sequence of visited nodes: for example, $\pi_{0,T} = [\pi_0 = 0, \pi_1 = 2, \pi_2 = 7, \ldots, \pi_T = n+1]$, in which $T$ is the last time point in the tour. The terms route and tour are used interchangeably. Additionally,
$$E = \{(i, j) \mid i, j \in V\}$$
refers to all the edges from all node combinations. Note that the demand of the depot node is 0, meaning $q_0 = q_{n+1} = 0$. We also introduce a binary decision variable $x_{ij}$, which is 1 if there is a direct route from customer $i$ to $j$, and 0 otherwise. The distance of edge $(i, j)$ is denoted by $c_{ij}$. The cost $C(\pi_{0,t})$ is the cumulative distance traveled up to time $t$, given the sequence of visited nodes: $C(\pi_{0,t}) = c_{\pi_0, \pi_1} + \cdots + c_{\pi_{t-1}, \pi_t}$, with $C(\pi_{0,0}) = 0$. We formulate the one-vehicle CVRP as follows:
$$\min_{x_{ij}} \sum_{(i,j) \in E} c_{ij} x_{ij} \quad (1)$$
Subject to
$$\sum_{j=1, j \neq i}^{n+1} x_{ij} = 1, \quad i = 1, \ldots, n, \quad (2)$$
$$\sum_{i=0, i \neq h}^{n} x_{ih} - \sum_{j=1, j \neq h}^{n+1} x_{hj} = 0, \quad h = 1, \ldots, n, \quad (3)$$
$$\sum_{j=1}^{n} x_{0j} = 1, \quad (4)$$
$$y_i + q_j x_{ij} - Q_{max}(1 - x_{ij}) \leq y_j, \quad i, j = 0, \ldots, n+1, \quad (5)$$
$$q_i \leq y_i \leq Q_{max}, \quad i = 0, \ldots, n+1, \quad (6)$$
$$x_{ij} \in \{0, 1\}, \quad i, j = 0, \ldots, n+1. \quad (7)$$
We briefly describe the equations as follows: Equation (1) is the objective of the problem, the minimization of the distance traveled by the vehicle; Equation (2) is a constraint ensuring that every customer is visited exactly once; Equation (3) controls the correct flow of a tour, the visit sequence, by ensuring that the number of times a vehicle enters a node equals the number of times it leaves the node; Equation (4) imposes that only one vehicle leaves the depot; and Equations (5) and (6) jointly express the vehicle capacity condition. Note that variants of the constraints are possible, and the main reference for the above formulation is Borcinova [30]. We also note that finding a solution $x_{ij}$ is equivalent to constructing the tour $\pi_{0,T}$: for example, $\pi_1 = 2, \pi_2 = 7$ corresponds to $x_{2,7} = 1$. Noticeably, in the formulation, finding $x_{ij}$ leads to the construction of $y_i$. On the route up to a visit to node $j \in V$, the continuous variable $y_j$ represents the accumulated demands and depends on the decision variable $x_{ij}$: for instance, on tour $\pi_{0,3} = [\pi_0, \pi_1, \pi_2, \pi_3]$, $y_{\pi_3} = q_{\pi_0} + q_{\pi_1} + q_{\pi_2} + q_{\pi_3}$. To migrate this formulation to TSP, one only needs to remove the constraints regarding the vehicle capacity and customer demands, so that only the decision variable $x_{ij}$ remains.
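To make the objective concrete, the following short Python sketch computes the tour cost $C(\pi_{0,T})$ of a given visit sequence from node coordinates. It is illustrative only: the function name is ours, and it assumes Euclidean edge distances $c_{ij}$, which is the standard choice for the randomly generated instances used in Section 5.

import numpy as np

def tour_cost(coords: np.ndarray, tour: list[int]) -> float:
    """Cumulative Euclidean distance C(pi_{0,T}) of a visit sequence.

    coords: (N, 2) array of node coordinates (node 0 is the depot).
    tour:   sequence of visited node indices, e.g., [0, 2, 7, ..., 0].
    """
    legs = np.diff(coords[tour], axis=0)               # consecutive displacement vectors
    return float(np.linalg.norm(legs, axis=1).sum())   # sum of c_{pi_{t-1}, pi_t}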

4. Proposed Network Model, AlphaRouter

In this section, we present our approach, named AlphaRouter, for solving the routing problem using both reinforcement learning and MCTS frameworks. We revise the above routing problem by adding the possibility of refilling the vehicle to reflect realistic situations; we note that the formulation above cannot accommodate the refilling action. We begin by defining the components that bring the environment into our RL problem, followed by the neural network models of the policy and the value. We then outline our idea and implementation for adapting MCTS to the routing problem. Our overall process consists of two stages: training the neural network using reinforcement learning, and combining the pre-trained network with the modified MCTS strategy to search for a better solution, meaning a tour $\pi_{0,T}$, or equivalently $x_{ij}$, in the CVRP formulation. Due to the computational demands of MCTS, we apply our MCTS selectively, when ambiguity arises in choosing the next customer node proposed by the output distribution of the policy network, i.e., the distribution over possible next nodes. This selective application enhances computational efficiency while maintaining the effectiveness of the MCTS strategy.

4.1. Reinforcement Learning Formulation

The input is denoted by $x_i \in \mathbb{R}^2$, which represents the coordinates of customer $i$. The demand of a node can be included in the vector if the problem is a type of CVRP; i.e., the input for CVRP is then $[x_i; q_i] \in \mathbb{R}^3$, where the semicolon $;$ represents a concatenation operation. Also, with $n$ customers, the total number of nodes is $N = n$ for TSP and $N = n + 1$ for CVRP, as one depot node exists. Thus, the input matrix is denoted as $\chi = [x_i]_{i=1,\ldots,N} \in \mathbb{R}^{N \times 2}$ for TSP problems, and $\chi = [x_i; q_i]_{i=1,\ldots,N} \in \mathbb{R}^{N \times 3}$ for CVRP.
To bring the problem into a reinforcement learning framework, we define the state, action, and cost (inversely convertible to reward). In our work, the observation state at time point $t$, denoted by $s_t$, is a collection of the node data $\chi$, containing coordinates and demands; the currently occupied node $\pi_t$; a set of available nodes to visit, denoted by $V_t$; and a masking vector for unavailable nodes $m_t \in \mathbb{R}^N$, whose $p$-th element is 0 if $p \in V_t$ and $-\infty$ if $p \notin V_t$: $s_t = (\chi, \pi_t, V_t, m_t)$. Though the masking vector $m_t$ stems from the available-node set $V_t$ in our formulation, we intentionally include both $m_t$ and $V_t$ in the state $s_t$ so that the masking vector can be adjusted and redefined to reflect domain requirements, just as several masking techniques are possible in the Transformer [17,31,32].
We omit $t$ for $\chi$, as the node data are invariant over time in this problem: for all time points, $\chi$ stays unchanged. However, one could make the node data $\chi$ vary in time depending on the domain requirement, and the proposed network model is able to handle a time-varying $\chi$. For CVRP, the current vehicle load, denoted by $\mathrm{load}_t = Q_{max} - y_{\pi_t}$, is also added to $s_t$: $s_t = (\chi, \pi_t, V_t, m_t, \mathrm{load}_t)$. The node set $V_t \subseteq V$ holds the nodes, not yet visited, whose demands can be fulfilled considering $\mathrm{load}_t$.
The action, denoted by $a_t$, is to choose the next customer node and move to it. The action in an episode, a sequence of possible states, is chosen by our policy neural network, as shown in Figure 2, which outputs a probability distribution over all the nodes given the state at $t$, $s_t$. We use $p_\theta(\cdot \mid s_t)$ to describe the policy network output at time $t$ during the episode rollout. In the training phase, the action is sampled from the action distribution $p_\theta(\cdot \mid s_t)$, $a_t \sim p_\theta(\cdot \mid s_t)$, as the next node to visit, meaning $\pi_{t+1} = a_t$, $t \geq 0$, with $\pi_0 = 0$. The sampling operation gives the vehicle (or agent) a chance to explore a better solution space in the training phase. In the inference phase, however, we choose the action with the maximum probability, meaning $\hat{a}_t = \arg\max_{i \in V_t} p_\theta(i \mid s_t)$ if unvisited nodes exist, and $\hat{a}_t = n + 1$ otherwise.
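The training-time sampling and inference-time greedy choice described above can be summarized in a few lines of Python. This is a minimal sketch under our own naming, not the authors' implementation; it assumes the masked distribution $p_\theta(\cdot \mid s_t)$ has already been computed, so unavailable nodes carry zero probability.

import torch

def select_action(probs: torch.Tensor, training: bool) -> int:
    """Choose the next node from the masked policy distribution p_theta(. | s_t).

    probs: (N,) probabilities over nodes; unavailable nodes already have
    probability 0 because the mask m_t is applied inside the softmax.
    """
    if training:
        return int(torch.multinomial(probs, 1))   # a_t ~ p_theta(. | s_t): exploration
    return int(torch.argmax(probs))               # greedy choice at inference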
A value network is designed to predict the overall cost (or distance) of the episode at state $s_t$. This is later used in updating the MCTS tree’s statistics. We describe in detail how the other components work in Section 4.2. Specifically, an episode, $\tau$, is a rollout process in which the state and action are interleaved over $t = 0, 1, \ldots, T$ until the terminal state $s_T$ is reached: $\tau = (s_0, a_0, C(\pi_{0,0}), \ldots, s_T, a_T, C(\pi_{0,T}))$. In this problem, the terminal state is the state in which all customers have been visited and, for CVRP, the vehicle has returned to the depot. Because of the possibility of multiple refillings of the vehicle, the last time point $T$ can vary across episodes of CVRP problems. For example, even when the problems have the same size (for example, $n = 50$), the optimal solution path can vary due to different customer locations and demands. Upon reaching the terminal state, no more transitions are made, and the overall distance, $C(\pi_{0,T})$, is calculated.

4.2. Architecture of the Proposed Network Model

The neural network architecture of our policy network for calculating the probability distribution $p_\theta(\cdot \mid s_t)$ is similar to those used in previous studies [2,18]. However, to solve the routing problem, we modify the decoder part, relying on the Transformer [17]. We aim to extract meaningful, possibly highly time-dependent and complex, features specific to the current state while maintaining the whole node structure. We make the two networks share the same embedding vector, transformed from the current input $s_t$ at time $t$. The shared input transformation is designed as a deep-layered network, consisting of an encoder and a decoder, to take advantage of both the whole node structure and the current node. The structure of the two networks with the shared feature transformation is reminiscent of the architecture of the AlphaGo series [4,5] and previous related works [2,18]. In essence, the input $s_t$ produces the estimated probabilities of possible next actions via the policy network, $p_\theta(\cdot \mid s_t)$, and the predicted cost via the value network, $v_\theta(s_t)$. For simplicity, we denote all learnable parameters as $\theta$, which consists of the parameters of the shared transformation, the policy network, and the value network.
In detail, we explain the proposed network in three parts: the encoder in the feature transformation, the decoder in the feature transformation, and the policy and value heads. The objective of the encoder is to capture the inter-relationships between nodes. The encoder takes only the node data $\chi$ from $s_t$, passing it through a linear layer to align with a new dimensionality, $d_e$, and then through multi-head attention layers, expressed by $\mathrm{MHA}(Q, K, V)$ with input tensors of query $Q$, key $K$, and value $V$. The output of the multi-head attention is an encoding matrix, denoted by $e \in \mathbb{R}^{N \times d_e}$. The $i$-th row vector of the encoding matrix, denoted by $e_i \in \mathbb{R}^{d_e}$, represents the $i$-th node. So, the encoding of the node occupied at time $t$ is $e_{\pi_t}$, the embedding vector reflecting the complex and interwoven relationships with the other nodes. In summary, the encoder process is self-attention over the input node data, expressed as $e = \mathrm{MHA}(\mathrm{Linear}(\chi), \mathrm{Linear}(\chi), \mathrm{Linear}(\chi))$ and repeated over several layers in the model. Relying on the idea of hidden states and current inputs in recurrent networks, we execute the encoder process once per episode, thereby reducing the computational burden, and use the current-node embedding and the current load $\mathrm{load}_t$ as inputs for the decoder in a sequential manner. We provide a detailed explanation later in this section.
The decoder is responsible for revealing the relationships diluted in the encoding matrix $e$, together with additional information if it is given. Specifically, the decoder captures the relationships between the current node and the others. For example, let us assume that the vehicle is currently on node $i$ and the current node’s embedding is $e_i$. Notice that we omit time $t$ in the encoding matrix since it does not change within an episode: the output of the encoder is reused over the episode once it has been computed. By using this $e_i$ as the query and the whole encoding matrix $e$ as the key and value, the decoder can reveal the relationships between the current node and the others. When passing the query, key, and value, we apply linear transformations to each of them. One should note that TSP and CVRP have different inputs for the query: in CVRP, the current load, $\mathrm{load}_t \in s_t$, is appended to the query input, while in TSP it is not. While there are several layers for the encoder, we use only one MHA layer for the decoder. The decoder is summarized as follows:
$$d = \mathrm{MHA}(Q, K, V) \in \mathbb{R}^{d_e}, \qquad Q = \begin{cases} \mathrm{Linear}(e_{\pi_t}) & \text{for TSP}, \\ \mathrm{Linear}([e_{\pi_t}; \mathrm{load}_t]) & \text{for CVRP}, \end{cases} \qquad K = \mathrm{Linear}(e), \quad V = \mathrm{Linear}(e). \quad (8)$$
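The following PyTorch sketch illustrates the shared encoder–decoder transformation described above. It is a simplified, hypothetical rendering rather than the authors' code: it uses torch.nn.MultiheadAttention as the MHA block, shows the CVRP query $[e_{\pi_t}; \mathrm{load}_t]$, and omits details such as the exact number of linear projections.

import torch
import torch.nn as nn

class SharedEncoderDecoder(nn.Module):
    """Sketch of the shared feature transformation (encoder run once per episode)."""

    def __init__(self, in_dim: int = 3, d_e: int = 128, n_heads: int = 4, n_layers: int = 6):
        super().__init__()
        self.embed = nn.Linear(in_dim, d_e)
        self.enc_layers = nn.ModuleList(
            [nn.MultiheadAttention(d_e, n_heads, batch_first=True) for _ in range(n_layers)]
        )
        self.q_proj = nn.Linear(d_e + 1, d_e)   # query from [e_{pi_t}; load_t] (CVRP case)
        self.k_proj = nn.Linear(d_e, d_e)
        self.v_proj = nn.Linear(d_e, d_e)
        self.dec = nn.MultiheadAttention(d_e, n_heads, batch_first=True)

    def encode(self, chi: torch.Tensor) -> torch.Tensor:
        e = self.embed(chi)                      # (B, N, d_e); executed once per episode
        for mha in self.enc_layers:
            e, _ = mha(e, e, e)                  # self-attention, no residual (Section 4.2)
        return e

    def decode(self, e: torch.Tensor, cur_idx: torch.Tensor, load: torch.Tensor) -> torch.Tensor:
        idx = cur_idx.view(-1, 1, 1).expand(-1, 1, e.size(-1))
        e_cur = e.gather(1, idx)                 # e_{pi_t}: embedding of the current node
        q = self.q_proj(torch.cat([e_cur, load.view(-1, 1, 1)], dim=-1))
        d, _ = self.dec(q, self.k_proj(e), self.v_proj(e))
        return d.squeeze(1)                      # context vector d in R^{d_e}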
The policy layer and value layer are responsible for calculating the final policy $p_\theta(\cdot \mid s_t)$, a probability distribution over all nodes given $s_t$, and the predicted distance $v_\theta(s_t)$, respectively. We compute $p_\theta(\cdot \mid s_t)$ as follows, with a given hyper-parameter $C$ that regulates the clipping:
$$p_\theta(\cdot \mid s_t) = \mathrm{softmax}\left( C \cdot \tanh\left( \frac{d\, e^{T}}{\sqrt{d_e}} \right) + m_t \right). \quad (9)$$
To compute $p_\theta(\cdot \mid s_t)$, we multiply the decoder output $d$ by the transposed encoding matrix $e^{T}$ and divide it by $\sqrt{d_e}$. The result goes through the $\tanh$ function scaled by $C$, and we add the mask for the unavailable nodes using $m_t$. Finally, we apply a softmax operator to this result.
For $v_\theta(s_t)$, we pass the same decoder output $d$ through two linear layers whose shape is similar to the usual feed-forward block in the Transformer: $v_\theta(s_t) = \mathrm{Linear}(\sigma(\mathrm{Linear}(d)))$, in which $\sigma(\cdot)$ is an activation function such as ReLU or SwiGLU [33,34]. A diagram of each neural network design is presented in Figure 3.
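A compact sketch of the policy and value heads follows. It is illustrative: the $\sqrt{d_e}$ scaling and the hidden size of the value head are our assumptions, and the helper is not taken from the released implementation.

import math
import torch
import torch.nn as nn

def policy_and_value(d, e, mask, value_head: nn.Module, C: float = 10.0):
    """Policy distribution (Equation (9)) and value estimate from the decoder output.

    d: (B, d_e) decoder context; e: (B, N, d_e) node encodings;
    mask: (B, N) with 0 for available nodes and -inf for unavailable ones (m_t).
    """
    d_e = e.size(-1)
    scores = torch.bmm(e, d.unsqueeze(-1)).squeeze(-1) / math.sqrt(d_e)  # d e^T / sqrt(d_e)
    p = torch.softmax(C * torch.tanh(scores) + mask, dim=-1)             # p_theta(. | s_t)
    v = value_head(d).squeeze(-1)                                        # v_theta(s_t)
    return p, v

# e.g., value_head = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 1))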
When training the model on an episode, the encoding process is required only once, as the input of the encoder (the coordinates of the nodes) is fixed along the rollout steps. The decoder, on the other hand, takes inputs that change over time, i.e., the current node and the current load. Thus, on the first execution of the model, we execute both the encoder and the decoder; afterwards, we execute only the decoder and the policy and value parts, saving considerable computation. The encoder and decoder parameters are shared by the policy and value networks, while the policy and value heads are not. Figure 2 illustrates the overall process.
Additionally, we intentionally exclude residual connections in the encoder layers, as we have observed that, unlike in the original Transformer and its variants, residual connections greatly harm the performance of the model. Another variation we add to the previous model is the activation function. Recent studies on large language models (LLMs) have exploited different activation functions. We take this into account and test the SwiGLU activation, just as Google’s PaLM did [35]. We report the results in Section 5.

4.3. Training the Neural Network

To train the policy network parameters $\theta$, we use the well-known policy gradient algorithm REINFORCE with baseline [36]. This algorithm deals with the high variance prevalent in policy gradient methods by subtracting a specially calculated value, called the baseline. It collects data during an episode and updates the parameters after each episode ends. For $C(\pi_{0,T})$, the distance traveled by the vehicle following the sequence $\pi_{0,T}$, the policy network aims to learn a stochastic policy that outputs visit sequences with small distances over all problem instances. The gradient of the objective function for the policy network is formulated as follows:
$$\nabla J_\theta(\pi) \approx \mathbb{E}_{\pi \sim p_\theta(\cdot \mid s)}\left[ \left( C(\pi_{0,T}) - b(s) \right) \nabla \log p_\theta(\pi \mid s) \right], \quad (10)$$
where $p_\theta(\pi \mid s) = p_\theta(\pi_0 \mid s_0) \prod_{k=1}^{T-1} p_\theta(\pi_k \mid s_k, \pi_{k-1}),$
in which $b(s)$ is a deterministic greedy rollout from the best policy trained so far, used as a baseline to reduce the variance of the original formulation [18]. After training the model parameters $\theta$ for an epoch, we evaluate them on a validation problem set, setting $b(s)$ to the cost evaluated in the validation. One can think of this procedure as the training–validation mechanism in general machine learning.
The mere use of this baseline incurs additional computational cost arising from the rollouts of several extra episodes, which is an expensive procedure. To alleviate this burden, we introduce a value network, $b(s) = v_\theta(s)$, instead of the greedy rollout baseline.
The value network’s objective is to learn the expected cost at the end of the episode from any state during episode rollout. We keep track of the value network’s output throughout a rollout and train the network with the loss function
$$L_v = \sum_{t=0}^{T} \left( C(\pi_{0,T}) - v_\theta(s_t) \right)^2. \quad (11)$$
As in the POMO approach [2], we also test a baseline computed as the average cost over a batch of episodes, in addition to the baseline using the value network $v_\theta(s_t)$. For instance, with a batch size of 64, representing the number of concurrent episode runs, we calculate the baseline as the mean cost over the 64 episodes. The value network is also used in the MCTS process described in the next section. Since our model shares the encoder and decoder parameters between the policy network and the value network, an update to the value network also affects the parameters of the policy network through the gradient of the final loss:
$$L = J_\theta(\pi) + L_v. \quad (12)$$
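The combined update of Equations (10)–(12) can be expressed as a single loss whose gradient matches the policy-gradient term plus the value regression. The sketch below assumes the log-probabilities and value outputs were collected along a batch of rollouts; the tensor shapes and function name are illustrative, not the authors' code.

import torch

def training_loss(log_probs, values, tour_cost, baseline=None):
    """REINFORCE-with-baseline plus value regression (sketch of Equations (10)-(12)).

    log_probs: (B, T) log p_theta(pi_k | s_k) collected along the rollout;
    values:    (B, T) value-network outputs v_theta(s_t);
    tour_cost: (B,)   final distances C(pi_{0,T});
    baseline:  optional tensor for the mean-over-batches baseline; if None,
               the detached value network serves as b(s).
    """
    b = values.detach() if baseline is None else baseline
    advantage = (tour_cost.unsqueeze(-1) - b).detach()                        # C(pi_{0,T}) - b(s)
    policy_loss = (advantage * log_probs).sum(dim=-1).mean()                  # policy-gradient term
    value_loss = ((tour_cost.unsqueeze(-1) - values) ** 2).sum(dim=-1).mean() # L_v
    return policy_loss + value_loss                                           # L = J_theta(pi) + L_v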

4.4. Proposed MCTS for the Routing

The main idea of MCTS is to improve the generally good solutions of the trained policy and value networks into problem-specific ones by further investigating possible actions. In essence, without MCTS, we make a transition from $s_t$ to $s_{t+1}$ by taking action $a_t$, which is determined by the policy network output only. In our proposed MCTS, as described in Figure 2, we instead select the next node by also considering costs, estimated by the value network, in addition to the prior probabilities from the policy network. Moreover, we apply the MCTS selectively at time $t$, when the highest probability from the current policy network fails to dominate, meaning that actions other than the highest-probability action need to be considered. In practice, when the difference between the highest probability and the 5th-highest probability is less than 0.75, we apply the MCTS, as expounded below.
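The selective-application rule itself is a one-line check on the policy output. The following sketch implements the rule as stated in the text (gap between the highest and the fifth-highest probability below 0.75); the function name and the handling of problems with fewer than five candidate nodes are our own choices.

import torch

def should_run_mcts(probs: torch.Tensor, k: int = 5, diff_cut: float = 0.75) -> bool:
    """Return True when the policy output is ambiguous and MCTS should be applied.

    probs: (N,) policy distribution p_theta(. | s_t).
    """
    top = torch.topk(probs, min(k, probs.numel())).values   # k largest probabilities
    return bool(top[0] - top[-1] < diff_cut)                # ambiguous -> run MCTS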
MCTS comprises three distinct phases: selection, expansion, and backpropagation. They iterate with a tree, initialized at the current node $\pi_t$ and updated as iterations continue, for a given number of simulations, denoted by $n_s$, the total number of MCTS iterations. At each iteration, the tree keeps expanding, and the statistics of some nodes in the tree are updated. As a result, a different set of tree node paths is explored throughout the MCTS iterations. Figure 4 describes an MCTS procedure in which a few MCTS iterations are run. Given time $t$, we use $s_{k|t} = (\chi, \pi_{k|t}, V_{k|t}, m_{k|t})$ to represent a tree node positioned at level $k$. The definition of $s_{k|t}$ is the same as that of $s_t$, the only difference being that $s_{k|t}$ represents the inner time step $k$ temporarily used in the MCTS selection. Thus, in an MCTS iteration, with $t$ fixed, the level $k$ advances as different levels are selected in the selection phase.
In the beginning, we initialize the root tree node $s_{0|t}$ with $s_t$, meaning that MCTS starts from $s_t$; therefore, the vehicle position in $s_{0|t}$ is the same as the position at $t$, $\pi_{0|t} = \pi_t$. To describe the MCTS phases, we introduce new notation: for the $i$-th customer (or depot) node, $H_{k|t}(i)$ denotes the accumulated visit count, and $W_{k|t}(i)$ the accumulated total cost, both at the $k$-th level of the tree. We then compute the ratio $Q_{k|t}(i) = W_{k|t}(i) / H_{k|t}(i)$, called the Q-value. The Q-value $Q_{k|t}(i)$ for node $i$ represents the averaged cost at level $k$. We normalize all Q-values in the simulation by min-max normalization.
In the selection phase, given the current MCTS tree, we recursively choose child nodes until we reach a leaf node in the tree. For instance, at the $k$-th level of the tree, among the possible nodes, denoted by $V_{k|t}$, we select the next node at $s_{k|t}$ according to Equation (13), thereby moving to a tree node at the $(k+1)$-th level:
$$\pi_{k+1|t} = \hat{a}_{k|t} = \arg\max_{i \in V_{k|t}} \left[ -Q_{k+1|t}(i) + c_{\mathrm{puct}} \, \frac{\sqrt{\sum_{j \in V_{k|t}} H_{k+1|t}(j)}}{1 + H_{k+1|t}(i)} \, p_\theta(i \mid s_{k|t}) \right], \quad (13)$$
in which the hyper-parameter $c_{\mathrm{puct}}$ adjusts the contribution of the policy-network evaluation $p_\theta(\cdot \mid s_{k|t})$ relative to the negative of the averaged cost $Q_{k|t}(i)$ for node $i$. Let us use $\ell$ to denote the leaf level reached in the selection phase. We obtain an inner state path $s_{0:\ell|t} = [s_{0|t}, \ldots, s_{\ell|t}]$ and an inner node path $\pi_{0:\ell|t} = [\pi_{0|t}, \pi_{1|t}, \ldots, \pi_{\ell|t}]$. Then, the total node path from time 0 to level $\ell$ becomes a concatenation of the outer-path nodes $\pi_{0,t-1}$ and the inner-path nodes $\pi_{0:\ell|t}$: $[\pi_{0,t-1}; \pi_{0:\ell|t}]$. The selection phase continues until no more child nodes are available to traverse from the current position, meaning that the node is a leaf node in the tree. In Figure 2, for instance, node 4 is selected, highlighted in red, from the root node in the first selection phase, and $\ell = 1$. Note that, in the next MCTS iteration, the selection phase starts again from the root node $s_{0|t}$, not from the leaf node selected in the previous iteration.
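For illustration, the selection score of Equation (13) can be evaluated over all children of a tree node as follows. This is a hedged sketch: it assumes the Q-values have already been min-max normalized as described in Section 4.4, and it uses a PUCT-style exploration term whose exact form we reconstructed from the surrounding text.

import numpy as np

def select_child(q_norm, visits, priors, c_puct: float = 1.1) -> int:
    """Pick the child maximizing the selection score of Equation (13) (sketch).

    q_norm: min-max-normalized average costs Q_{k+1|t}(i) of the children;
    visits: visit counts H_{k+1|t}(i); priors: p_theta(i | s_{k|t}).
    """
    u = c_puct * priors * np.sqrt(visits.sum()) / (1.0 + visits)   # exploration term
    return int(np.argmax(-q_norm + u))                             # negate Q to minimize distance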
After the selection phase, the expansion phase starts, updating the MCTS tree by appending new child nodes from $V_{\ell|t}$ to node $\pi_{\ell|t}$, and then moving to the backpropagation phase. Note that in the early stages of the MCTS iterations, the tree may not have expanded enough to select a terminal node, meaning $V_{\ell|t} \neq \emptyset$ and $\ell < T - t$. As the MCTS iterations advance, the tree expands enough that the final node selected in the selection phase, $\phi_{\ell|t}$, becomes a terminal node, $V_{\ell|t} = \emptyset$ and $\ell = T - t$, meaning that routing has ended with no available node to move to. In the latter case, the MCTS iterations continue until reaching $n_s$ in order to explore a variety of possible node paths.
Finally, in the backpropagation phase, tracing back $\pi_{0:\ell|t}$, we update $H_{k|t}(i)$ and $W_{k|t}(i)$ for all selected tree levels $k \in [\ell, \ell-1, \ldots, 0]$ and all selected customer nodes $i \in \pi_{0:\ell|t}$. Specifically, the update follows the rule below:
$$H_{k|t}(i) = H_{k|t}(i) + 1, \quad (14)$$
$$W_{k|t}(i) = \begin{cases} W_{k|t}(i) + C([\pi_{0,t-1}; \pi_{0:\ell|t}]), & \ell = T - t, \\ W_{k|t}(i) + v_\theta(\phi_{\ell|t}), & \ell < T - t. \end{cases} \quad (15)$$
As the MCTS iterations continue, the selected leaf node can be either a terminal node ($\ell = T - t$), meaning that the routing has ended, or a non-terminal node ($\ell < T - t$). In the former case, $W_{k|t}(i)$ is updated with the cost obtained by evaluating the selected path of customer nodes, $C([\pi_{0,t-1}; \pi_{0:\ell|t}])$. In the latter, we instead use the predicted distance $v_\theta(\phi_{\ell|t})$. This is possible because we train the value network $v_\theta(\cdot)$ to predict the final distance from any state following Equation (11). In updating the accumulated total cost $W_{k|t}(i)$ as in Equation (15), we obtain the predicted cost using $v_\theta(\phi_{\ell|t})$ at the final selected node $\phi_{\ell|t}$, rather than by greedily selecting the next customer nodes until routing finishes.
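A minimal sketch of the backpropagation update in Equations (14) and (15) is shown below, with the tree statistics kept in dictionaries keyed by (level, node); this data layout is our own simplification.

def backpropagate(levels, nodes, cost, H, W, Q):
    """Update visit counts, total costs, and Q-values along the selected path (sketch).

    levels: [l, l-1, ..., 0]; nodes: the customer node chosen at each level;
    cost:   C([pi_{0,t-1}; pi_{0:l|t}]) for a terminal leaf, else v_theta(phi_{l|t});
    H, W, Q: dictionaries keyed by (level, node).
    """
    for k, i in zip(levels, nodes):
        H[(k, i)] = H.get((k, i), 0) + 1          # Equation (14)
        W[(k, i)] = W.get((k, i), 0.0) + cost     # Equation (15)
        Q[(k, i)] = W[(k, i)] / H[(k, i)]         # averaged cost; min-max normalized afterwards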
When all simulations have finished, we collect a visit-count distribution from the child nodes of $s_{0|t}$ and choose the most visited node as $a_t$, the next node to visit in the rollout:
$$\pi_{t+1} = \hat{a}_t = \arg\max_{i \in V_{0|t}} H_{1|t}(i). \quad (16)$$
Algorithm 1 summarizes the overall process of our MCTS. However, applying the MCTS at every step is computationally expensive, making it impractical for real-world use. For each time $t$, the entropy of the probability distribution $p_\theta(\cdot \mid s_t)$ is computed as $-\sum_{j} p_\theta(j \mid s_t) \log p_\theta(j \mid s_t)$. We find that most $p_\theta(\cdot \mid s_t)$ outputs have low entropy, meaning that the highest probability, $\max_i p_\theta(i \mid s_t)$, dominates the other values. Our idea is to selectively apply our MCTS in the rollout only when $\max_i p_\theta(i \mid s_t)$ fails to dominate, i.e., when the difference between the highest probability and the fifth-highest probability is less than 0.75. This empirically derived strategy improves solution quality at the cost of a modest increase in computation time.
Algorithm 1 Overall simulation flow in MCTS
Require: $s_{0|t}$: root state initialized by $s_t$; $p_\theta$: trained policy network; $v_\theta$: trained value network; $n_s$: number of simulations to run
1: Initialize the MCTS tree with $s_{0|t}$
2: while $i < n_s$ do
3:     $\phi_{\ell|t}, s_{0:\ell|t}, \pi_{0:\ell|t}$ = Select($s_{0|t}$)    ▹ A leaf node in the MCTS tree is chosen
4:     Expand($p_\theta$, $\phi_{\ell|t}$)    ▹ Expand the MCTS tree from the leaf node using the available nodes $V_{\ell|t}$
5:     if $V_{\ell|t} = \emptyset$ then    ▹ The selection reached a terminal node
6:         $c = C([\pi_{0,t-1}; \pi_{0:\ell|t}])$    ▹ (Equation (15))
7:     else
8:         $c = v_\theta(\phi_{\ell|t})$    ▹ Use the predicted cost for non-terminal leaf nodes
9:     end if
10:    Backpropagate($s_{0:\ell|t}$, $c$)
11:    $i = i + 1$
12: end while
13: return $\arg\max_{i \in V_{0|t}} H_{1|t}(i)$
We present the pseudo-code for each MCTS phase in Algorithm 2 and highlight the modifications made to adapt MCTS to routing problems. Firstly, we apply min-max normalization to the Q-values calculated during the entire search. Since the Q-value range is $[0, \infty)$, equal to the range of the cost (distance), this can cause a computational issue, as the term $c_{\mathrm{puct}} \frac{\sqrt{\sum_{j} H_{k+1|t}(j)}}{1 + H_{k+1|t}(i)} p_\theta(i \mid s_{k|t})$ typically falls within the range $[0, 1]$. Using a naïve Q-value could lead to a heavy reliance on the Q-value when selecting a child node because of the scale difference. To apply min-max normalization to the Q-values in the implementation, we record the maximum and minimum values during the backpropagation phase. Secondly, we negate the Q-value so that the search strategy aims to minimize distance. In the pseudo-code, the STEP procedure, which we do not include in the paper due to its complexity, accepts the chosen action as input and processes the transition to the next state. Internally, we update the current position of the vehicle to the chosen action, as well as the current load of the vehicle if the problem is CVRP. In addition, the mask for unavailable nodes, $m_t$, is updated to prevent the vehicle from revisiting visited nodes.
Algorithm 2 List of functions in MCTS
Require: $c_{\mathrm{puct}} = 1.1$: hyper-parameter
1: function Select($s_{0|t}$)
2:     $node \leftarrow s_{0|t}$
3:     $s_{0:\ell|t} = [s_{0|t}]$
4:     $\pi_{0:\ell|t} = [\pi_{0|t}]$
5:     $k = 0$
6:     while $node$ has a child do
7:         $\pi_{k+1|t} = \arg\max_{i \in V_{k|t}} \left[ -Q_{k+1|t}(i) + c_{\mathrm{puct}} \frac{\sqrt{\sum_{j \in V_{k|t}} H_{k+1|t}(j)}}{1 + H_{k+1|t}(i)} p_\theta(i \mid s_{k|t}) \right]$    ▹ (Equation (13))
8:         Append $\pi_{k+1|t}$ to $\pi_{0:\ell|t}$
9:         $node = s_{k+1|t}$, updated with the selected child node
10:        Append $node$ to $s_{0:\ell|t}$
11:        $k = k + 1$
12:    end while
13:    $\phi_{\ell|t} = node$    ▹ Also, $\ell = k$
14:    return $\phi_{\ell|t}$, $s_{0:\ell|t}$, $\pi_{0:\ell|t}$
15: end function
16: function Expand($p_\theta$, $\phi_{\ell|t}$)
17:    for all $i \in V_{\ell|t}$ do
18:        $s$, $cost$, $done$ = Step($i$, $\phi_{\ell|t}$)    ▹ Run STEP on the leaf node’s state for the given $i$
19:        Create a new child node and assign $s$ as its state
20:        Append the child node to $\phi_{\ell|t}$
21:    end for
22: end function
23: function Backpropagate($s_{0:\ell|t}$, $\pi_{0:\ell|t}$, $c$)
24:    Get $[\ell, \ell-1, \ldots, 0]$ from $s_{0:\ell|t}$
25:    for all $k \in [\ell, \ell-1, \ldots, 0]$ and $i \in \pi_{0:\ell|t}$ do    ▹ $k$ denotes a level from the leaf to the root
26:        $H_{k|t}(i) \mathrel{+}= 1$
27:        $W_{k|t}(i) \mathrel{+}= c$
28:        $Q_{k|t}(i) = W_{k|t}(i) / H_{k|t}(i)$
29:        Normalize $Q_{k|t}(i)$
30:    end for
31: end function

5. Experiments

First, we generate problems by constructing $N$ random coordinates, where $N$ is the number of all nodes and each coordinate is uniformly distributed in the range $[0, 1]$. For CVRP, we set the first node as the depot node. In addition, the demand of each customer, $q_i$, is assigned an integer between 1 and 10, scaled by 30, 40, and 50 for problem sizes ($n$) of 20, 50, and 100, respectively. We also apply POMO [2] in our training, setting the POMO size to the number of customer nodes. In the inference phase, however, we exclude POMO, as combining it with MCTS is infeasible. Our implementation, available on GitHub at https://github.com/glistering96/AlphaRouter (accessed on 19 February 2025), is built on PyTorch Lightning [37] and PyTorch [38]. For the MCTS settings, we set $c_{\mathrm{puct}}$ to 1.1 and vary the total number of simulations, $n_s$, over 100, 500, and 1000. We measure performance on 100 randomly generated problems as described above. In all tables presented in this section, “dist” refers to the average total distance traveled across all instances, while “time” represents the average inference time. For the selective MCTS approach, “time” includes both the DRL inference time and the additional computational cost incurred by the selective application of MCTS.
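For reproducibility, the instance generation described above can be sketched as follows; the function is illustrative and not taken from the repository, and it assumes the depot is stored at index 0 with zero demand.

import numpy as np

def generate_cvrp_instance(n: int, rng: np.random.Generator):
    """Random CVRP instance: uniform coordinates in [0, 1]^2 and scaled integer demands."""
    scale = {20: 30, 50: 40, 100: 50}[n]                       # demand scaling per problem size
    coords = rng.uniform(0.0, 1.0, size=(n + 1, 2))            # depot + n customers
    demands = rng.integers(1, 11, size=n + 1) / scale          # integers in [1, 10], scaled
    demands[0] = 0.0                                           # depot has no demand
    return coords, demands

# e.g., coords, demands = generate_cvrp_instance(100, np.random.default_rng(0))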
For the encoder and decoder settings, the dimension of each head is 32 with 4 heads, summing to the embedding dimension $d_e = 128$, and 6 encoder layers are used. We train the model for 300 epochs with a batch size of 64 and 100,000 episodes. Note that this batch size is the number of parallel rollouts in training, meaning that 64 episodes are executed simultaneously. We fix the clipping parameter $C$ to 10. We use Adam [39] with a learning rate of $1 \times 10^{-4}$, an eps of $1 \times 10^{-7}$, and betas $\{0.9, 0.95\}$, without any warm-up or scheduling. For fast training, we use 16-bit mixed precision.
We conduct the experiments on a machine equipped with an i5-13600KF CPU, an RTX 4080 16 GB GPU, and 32 GB RAM, running Windows 11. For the heuristic solvers in the experiments, we use the same machine, but under WSL on Ubuntu 22.04.

5.1. Performance Comparison

In this section, we compare the performance of the proposed MCTS-equipped model with that of several heuristics for the two routing problems, TSP and CVRP. The baseline models for TSP are the heuristic solvers LKH3 [40], Concorde [41], Google’s OR-Tools [42], and the Nearest Insertion strategy [43]. These heuristic solvers, developed by optimization experts, serve as benchmarks for assessing optimization capability in solving routing problems. For Google’s OR-Tools, we add a guided-search option, and the results reported here are better than those reported in previous research [2,18]. For a fair comparison, the time limit for OR-Tools is set to be similar to the longest MCTS runtime, $n_s = 1000$, for each $n$.
For comparison, we additionally denote the proposed model without the MCTS strategy as the attention model (AM), which leverages solely the proposed neural network. When integrating the MCTS strategy with the network model, we vary the number of simulations to investigate its impact on performance. We evaluate each case on 100 instances randomly generated by the same generation strategy employed during training. The comparative results for the TSP and CVRP problems are summarized in Table 1 and Table 2, respectively. The column named $n_s$ represents the number of simulations in the MCTS strategy, and $n_s = 0$ denotes the AM result without the MCTS strategy. Thus, the ‘DRL’ methods include both the proposed method and the AM method. The column named ‘baseline’ represents the baseline $b(s)$ used in Equation (10), in which ‘mean’ denotes the mean-over-batches baseline and ‘value’ denotes the baseline using the value network $v_\theta$. This baseline differs from the heuristic baselines reported in the tables. The best results among the DRL methods are presented in bold.
For TSP, none of the records from AM, denoted by $n_s = 0$, are presented in bold, meaning that applying MCTS improves solution quality. For CVRP, some records using AM are in bold, but the best records all come from cases with MCTS. We provide a visualization of the results of the two methods in Figure 5 for a clearer understanding of the effectiveness of MCTS. Visually and quantitatively, the solution of the proposed model in Figure 5b is better than that of the AM in Figure 5a. As for scalability, our results indicate that as the problem size increases, the runtime exhibits an upward trend. As shown in our tables, the computational cost increases substantially when scaling from smaller instances (e.g., 20 nodes) to larger instances (e.g., 100 nodes). This trend is expected, as larger problem sizes inherently introduce greater complexity in both the DRL-based inference process and the selective MCTS execution. Furthermore, our results demonstrate that as the number of simulations ($n_s$) increases, the runtime also increases correspondingly. This observation aligns with the expected behavior, as a higher number of MCTS iterations leads to a more extensive search process, thereby incurring additional computational overhead.
The results reveal that while the application of the MCTS contributes to performance enhancement compared to the ones without the MCTS, it still falls short of the performance achieved by the heuristic models, as other research shows [2,18]. Contrary to our expectations, an increase in the number of simulations does not consistently lead to solution improvement, i.e., a decrease in distance. The analysis indicates the lack of a discernible relationship between the number of simulations and the resulting distance. Specifically, for problems with a size of 50, Pearson’s correlation coefficient between the two is −0.72, and for the case of CVRP with a size of 100, it is −0.47. In other cases, correlation scores are generally low, below 0.2 .
In addition, the MCTS strategy introduces little runtime overhead compared to the method without MCTS, AM. For CVRP problems, the average runtimes of the proposed method are considerably shorter than those of the LKH3 method. However, the runtime increases with the problem size $n$. Our explanation is that, as problems get larger, some instances are hard to solve using only the probability outputs from $p_\theta$, and therefore MCTS is invoked more often. We argue that improving the solution quality of AM would require numerous sampled solutions, which take a huge amount of time to generate; this is also shown in the experimental results of [2]. We also point out that training the network for more epochs to shave off a small amount of distance takes quite a long time. For example, lowering the distance by about 0.015 by training the network beyond 300 epochs requires approximately 24 h, as described in Figure 6, in which the orange line is regressed over the observations. With MCTS, however, we can consistently obtain better results within a few seconds, and the method is deterministic, unlike sampling methods. Nonetheless, we believe there is still room to improve the runtime of MCTS. The heuristic solvers are written in C, while our MCTS is written in Python 3.10.12, which is much slower. Implementing MCTS in C++ with parallelism should decrease the runtime.
To statistically confirm the effectiveness of the proposed MCTS, we include paired t-test results for two different cases. Table 3 reports the test results under the same conditions, including the activation function and baseline. The records in the AM column follow the format “activation-baseline”, and the records in the MCTS column follow “activation-baseline-number of MCTS simulations”. In this setting, the test shows that applying the MCTS improves the solution except for TSP-20 and CVRP-50. We suspect that for relatively small problems, relying on the policy network only (AM) can be good enough, but as the problem grows, introducing the MCTS can yield better solutions. Table 4 shows the test results regardless of the conditions, and thus the lowest p-value for each problem type and size. Applying the MCTS appears worthwhile for a given problem type, even when changes in the activation and baseline are allowed.
We also show the entropy of $p_\theta(\cdot)$ for all test cases of TSP with size 100 in Figure 7. We find that 86% of the entropies are below 0.1, meaning that the outputs of $p_\theta(\cdot)$ are dominated by a few suggestions and that merely applying $p_\theta(\cdot)$ might lead to local optima. Therefore, applying MCTS selectively while controlling the time overhead is a good strategy. We show the evaluation results for selective MCTS application in the ablation analysis in Section 5.2.

5.2. Ablation Study

In this section, we present some ablation results for the activation function and baseline. We aggregate the results based on the activation function used and then calculate the mean and standard deviation of the score over all settings, i.e., AM, MCTS-100, MCTS-500, and MCTS-1000, and the results are described in Table 5.
We can easily see that as the problem size increases, SwiGLU produces shorter distances overall than ReLU. Also, as the problem size grows, the difference between ReLU and SwiGLU becomes more apparent. For example, for the CVRP problem type, when the problem size is $n = 20$, the difference in distances is about 0.01, while at $n = 100$ it reaches around 0.07. The scalability of SwiGLU is thus much better than that of ReLU.
For the baseline, we consider two different approaches: the mean over batches and a value-net-based approach. This baseline is used in Equation (10) as $b(s)$. We calculate the mean and standard deviation of the distances over all settings based on the type of baseline used, as we did in the activation-function analysis, and the result is described in Table 6. Surprisingly, the mean baseline dominates the value-net baseline except for CVRP with problem size 100, which is the hardest problem in our settings. We presume that as problems become more complex, the value-net baseline may perform better than the mean baseline. For instance, CVRP with time windows and pickups may be solved better with the value-net baseline.
We also report the non-selective MCTS experiment results in Table 7 and Table 8. Readers should be aware that the hardware environment here is slightly different from that of the experiment section. For non-selective MCTS, we run the experiment on a shared workstation running Linux with an RTX 4090 24 GB GPU and an Intel Xeon w7-2475X CPU. Therefore, the runtimes recorded in the tables below cannot be directly compared to the results in the tables from the experiments, but they do indicate trends in runtime. Despite the different hardware and computing settings, the difference in runtime between selective and non-selective MCTS is substantial, suggesting that selective MCTS is meaningful.
We can see that, compared to the selective MCTS results in Table 1, a huge amount of runtime is required for non-selective MCTS. Also, the bold values for each group are almost the same or slightly better with the selective MCTS strategy.

We observe similar trends for CVRP. One can easily see the huge deviation in runtime for the $n_s = 1000$ records in both TSP and CVRP. Applying MCTS to every transition does not fully account for this deviation; the shared-resource nature of the workstation may also have affected it. Therefore, considering both runtime and performance, selective MCTS is the better approach.

5.3. MCTS Application Rule

In this section, we compare two different rules for applying MCTS. The first is the introduced method, which uses the difference between the highest probability and the fifth-highest probability (diff_cut), and the second uses the entropy of the policy network directly (ent_cut). For each method, we compare how different pivot values affect performance and runtime. Based on the activation results, we use only the SwiGLU activation for this test, and for better visibility of the changes, we focus on problems of size 100. The results of the difference method are reported in Table 9 and Table 10, and those of the entropy method in Table 11 and Table 12.
Table 9 and Table 10 show the results of different pivot values selected in the MCTS application. For example, diff_cut = 0.25 denotes that we apply MCTS when the difference between the highest probability and the fifth-highest probability is less than 0.25. If the difference is large, suggestions from p θ are convincing, and if the difference is small, relying on p θ only is not enough. Therefore, the higher the diff_cut is, the more MCTS is applied, and vice versa. For both problem types, not much difference is observed when the diff_cut < 0.5. However, if the diff_cut ≥ 0.5, we observe that applying MCTS does make a greater difference. We also find that SwiGLU coupled with the value baseline outputs generally produces better solutions than with the mean baseline outputs. Moreover, for the sake of intuitive visualization, we illustrate the results of SwiGLU coupled with the value baseline outputs for both TSP and CVRP problems in Figure 8 and Figure 9.
Table 11 and Table 12 show the results for different pivot values when MCTS is triggered by the entropy of p θ: MCTS is applied only when the entropy of p θ exceeds the given pivot value, so the higher the ent_cut, the less often MCTS is applied. Overall, the ent_cut rule does not outperform the diff_cut rule. In addition, the value baseline generally works better with MCTS than the mean baseline on CVRP. We suspect that the diff_cut rule induces a narrower search boundary than the ent_cut rule, analogous to a trust region in gradient descent.
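To make the two rules concrete, the following minimal sketch (our own illustration; the function name should_apply_mcts and its arguments are assumptions, not part of the released implementation) shows how a decoding step could decide whether to invoke MCTS from the policy distribution p θ(·|s).

```python
import torch

def should_apply_mcts(probs: torch.Tensor, rule: str = "diff_cut", pivot: float = 0.75) -> bool:
    """Decide whether to run MCTS at the current step, given the policy distribution.

    probs: 1-D tensor of action probabilities p_theta(. | s) over feasible nodes.
    """
    if rule == "diff_cut":
        # Apply MCTS when the policy is not decisive: the gap between the
        # highest and the fifth-highest probability is below the pivot.
        top5 = torch.topk(probs, k=min(5, probs.numel())).values
        return (top5[0] - top5[-1]).item() < pivot
    elif rule == "ent_cut":
        # Apply MCTS when the policy entropy is high, i.e., the policy is uncertain.
        entropy = -(probs * torch.log(probs + 1e-12)).sum()
        return entropy.item() > pivot
    raise ValueError(f"unknown rule: {rule}")
```

In the selective strategy, p θ is queried at every step and MCTS is launched only when the rule fires; otherwise, the node suggested by p θ is taken directly.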
In summary, applying MCTS too aggressively can yield unsatisfactory performance while significantly increasing runtime, so finding a suitable pivot for MCTS application is an important issue. We therefore adopt the diff_cut rule with a pivot of 0.75 as the final choice in the proposed work.

5.4. Modified CVRP Problems

In this section, we report results on modified CVRP problems to demonstrate the flexibility of the proposed method. We consider two cases, a variable vehicle-refill amount and multiple depots, to which neither the classical formulation nor its heuristic solvers apply in their original form: both would have to be reformulated from scratch. In contrast, our method only requires one pre-training of the AM per modified CVRP problem. For this reason, only the results of our proposed method are reported here; these results naturally differ from those in the previous sections. We select two difference-cut hyper-parameters, 0.25 and 0.75, use SwiGLU as the activation function, and set the total number of nodes n to 100.
In the first case, we vary the refill amount received when visiting the depot node, using 0.8 and 1.2 in addition to the standard value of 1 used in all other experiments, and pretrain the AM for each refill amount. As the refill amount decreases, the optimal solution should have a higher score because the vehicle must visit the depot node more frequently, and we expect our model's solutions to exhibit this tendency. As expected, Table 13 shows that our model finds better solutions when the refill amount is larger, indicating that the proposed method is robust to changes in the refill amount. Moreover, as the refill amount increases, the application of MCTS improves performance.
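As a rough illustration of why this modification requires no reformulation, the sketch below (an assumed, simplified environment state; the class and attribute names are ours, not taken from the actual code) shows how the capacity update changes when the refill amount becomes a parameter.

```python
class ModifiedCVRPState:
    """Tracks remaining vehicle capacity with a configurable depot refill amount."""

    def __init__(self, demands, refill_amount=1.0, depot_ids=(0,)):
        self.demands = demands            # demand per node, normalized to capacity 1.0
        self.refill_amount = refill_amount
        self.depot_ids = set(depot_ids)
        self.capacity = 1.0               # vehicle starts fully loaded

    def step(self, node):
        if node in self.depot_ids:
            # Visiting a depot resets the load to the configured refill amount
            # (0.8, 1.0, or 1.2 in our experiments) instead of always 1.0.
            self.capacity = self.refill_amount
        else:
            self.capacity -= self.demands[node]

    def feasible(self, node):
        # A customer is feasible only if its demand fits the remaining capacity;
        # depots are always feasible (used for masking in the decoder).
        return node in self.depot_ids or self.demands[node] <= self.capacity
```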
In the second case, we examine the modified CVRP problem with three depot nodes, pretraining the AM for each number of depot nodes. As the number of depot nodes increases, the score of the optimal solution should decrease because the vehicle has more flexibility in choosing a depot when refilling is required. As expected, Table 14 shows that our model finds better solutions with more depot nodes, demonstrating that the model is robust to changes in the number of depots. However, it is difficult to conclude that applying MCTS improves upon the AM alone in multi-depot CVRP problems, since only a few MCTS-application cases show significant improvement.
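Reusing the sketch above, the multi-depot case is handled by the feasibility mask rather than by a new formulation; the example below (again purely illustrative, with node indices and demands chosen arbitrarily) lets the vehicle refill at any of three depot nodes.

```python
# Reuses ModifiedCVRPState from the sketch above; indices 0-2 act as depots here.
demands = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.4, 4: 0.7, 5: 0.3}
visited = {n: False for n in demands}
state = ModifiedCVRPState(demands, refill_amount=1.0, depot_ids=(0, 1, 2))

# Decoder-side mask: a customer must fit the remaining capacity and be unvisited;
# every depot remains selectable, so the vehicle can choose where to refill.
mask = [state.feasible(n) and (n in state.depot_ids or not visited[n]) for n in demands]
```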
In summary, our proposed method is robust to changes in the CVRP formulation and sufficiently flexible to apply to modified problem formulations. However, while applying MCTS improves the solutions under the vehicle-refill change, its benefit remains unclear under the multi-depot change.

6. Conclusions and Future Works

We applied MCTS selectively to routing problems to determine whether it yields better solutions. Although the performance is still inferior to that of heuristic solvers, applying MCTS did produce better solutions than the same model without MCTS. We also confirmed that the SwiGLU activation can produce better solutions than the typical ReLU. The baseline experiments remain inconclusive, but the mean-over-batches baseline generally helps produce better solutions. We believe that applying MCTS to VRP variants with more complex constraints may reveal the efficacy of the value-network-based baseline.
Several directions for future work follow from this paper. First, the runtime of MCTS could be reduced by implementing it in C++, which is much faster than Python; in such an implementation, a parallelized version of MCTS [44] could further reduce simulation time. Second, MCTS can be extended to other NP-hard problems, e.g., bin packing and the knapsack problem. Third, it is worth checking whether MCTS helps generalization across different problem sizes. Finally, beyond changes in refill amounts and depot numbers, other meaningful changes that real-world VRPs face, such as changes in traffic conditions or vehicle availability, are worth considering.

Author Contributions

Conceptualization, W.-J.K.; software, W.-J.K.; writing—original draft preparation, W.-J.K. and K.L.; writing—review and editing, T.K. and J.J.; supervision, K.L.; project experiments, W.-J.K., J.J. and T.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Ministry of Education of the Republic of Korea and the National Research Foundation of Korea (NRF-2018R1A5A7059549).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data will be available upon request.

Conflicts of Interest

Author Won-Jun Kim was employed by the company Hyundai Glovis. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Kwon, Y.D.; Choo, J.; Yoon, I.; Park, M.; Park, D.; Gwon, Y. Matrix encoding networks for neural combinatorial optimization. Adv. Neural Inf. Process. Syst. 2021, 34, 5138–5149. [Google Scholar]
  2. Kwon, Y.D.; Choo, J.; Kim, B.; Yoon, I.; Gwon, Y.; Min, S. POMO: Policy Optimization with Multiple Optima for Reinforcement Learning. Adv. Neural Inf. Process. Syst. 2020, 33, 21188–21198. [Google Scholar]
  3. Fawzi, A.; Balog, M.; Huang, A.; Hubert, T.; Romera-Paredes, B.; Barekatain, M.; Novikov, A.; Ruiz, F.; Schrittwieser, J.; Swirszcz, G.; et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature 2022, 610, 47–53. [Google Scholar] [CrossRef] [PubMed]
  4. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; van den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the Game of Go with Deep Neural Networks and Tree Search. Nature 2016, 529, 484–489. [Google Scholar] [CrossRef] [PubMed]
  5. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. Mastering the Game of Go without Human Knowledge. Nature 2017, 550, 354–359. [Google Scholar] [CrossRef] [PubMed]
  6. Vinyals, O.; Fortunato, M.; Jaitly, N. Pointer Networks. arXiv 2017, arXiv:1506.03134. [Google Scholar]
  7. Schrittwieser, J.; Antonoglou, I.; Hubert, T.; Simonyan, K.; Sifre, L.; Schmitt, S.; Guez, A.; Lockhart, E.; Hassabis, D.; Graepel, T.; et al. Mastering atari, go, chess and shogi by planning with a learned model. Nature 2020, 588, 604–609. [Google Scholar] [CrossRef] [PubMed]
  8. Kumar, S.N.; Panneerselvam, R. A Survey on the Vehicle Routing Problem and Its Variants. Intell. Inf. Manag. 2012, 4, 66–74. [Google Scholar] [CrossRef]
  9. Lin, S.; Kernighan, B.W. An Effective Heuristic Algorithm for the Traveling-Salesman Problem. Oper. Res. 1973, 21, 498–516. [Google Scholar] [CrossRef]
  10. Vidal, T.; Crainic, T.G.; Gendreau, M.; Lahrichi, N.; Rei, W. A Hybrid Genetic Algorithm for Multidepot and Periodic Vehicle Routing Problems. Oper. Res. 2012, 60, 611–624. [Google Scholar] [CrossRef]
  11. Vidal, T. Hybrid genetic search for the CVRP: Open-source implementation and SWAP* neighborhood. Comput. Oper. Res. 2022, 140, 105643. [Google Scholar] [CrossRef]
  12. Dai, H.; Khalil, E.B.; Zhang, Y.; Dilkina, B.; Song, L. Learning Combinatorial Optimization Algorithms over Graphs. arXiv 2018, arXiv:1704.01665. [Google Scholar]
  13. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to Sequence Learning with Neural Networks. arXiv 2014, arXiv:1409.3215. [Google Scholar]
  14. Luong, M.T.; Pham, H.; Manning, C.D. Effective Approaches to Attention-based Neural Machine Translation. arXiv 2015, arXiv:1508.04025. [Google Scholar]
  15. Bello, I.; Pham, H.; Le, Q.V.; Norouzi, M.; Bengio, S. Neural Combinatorial Optimization with Reinforcement Learning. arXiv 2017, arXiv:1611.09940. [Google Scholar]
  16. Sutton, R.S.; McAllester, D.; Singh, S.; Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. Adv. Neural Inf. Process. Syst. 1999, 12, 1057–1063. [Google Scholar]
  17. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  18. Kool, W.; van Hoof, H.; Welling, M. Attention, Learn to Solve Routing Problems! In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  19. Kim, M.; Park, J. Learning collaborative policies to solve np-hard routing problems. Adv. Neural Inf. Process. Syst. 2021, 34, 10418–10430. [Google Scholar]
  20. Xin, L.; Song, W.; Cao, Z.; Zhang, J. Multi-decoder attention model with embedding glimpse for solving vehicle routing problems. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 19–21 May 2021; Volume 35, pp. 12042–12049. [Google Scholar]
  21. Hottung, A.; Kwon, Y.D.; Tierney, K. Efficient Active Search for Combinatorial Optimization Problems. arXiv 2022, arXiv:2106.05126. [Google Scholar]
  22. Bubeck, S.; Cesa-Bianchi, N. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. arXiv 2012, arXiv:1204.5721. [Google Scholar]
  23. Mańdziuk, J.; Świechowski, M. Simulation-based approach to Vehicle Routing Problem with traffic jams. In Proceedings of the 2016 IEEE Symposium Series on Computational Intelligence (SSCI), Athens, Greece, 6–9 December 2016; pp. 1–8. [Google Scholar] [CrossRef]
  24. Barletta, C.; Garn, W.; Turner, C.; Fallah, S. Hybrid fleet capacitated vehicle routing problem with flexible Monte–Carlo Tree search. Int. J. Syst. Sci. Oper. Logist. 2023, 10, 2102265. [Google Scholar] [CrossRef]
  25. Hottung, A.; Tanaka, S.; Tierney, K. Deep learning assisted heuristic tree search for the container pre-marshalling problem. Comput. Oper. Res. 2020, 113, 104781. [Google Scholar] [CrossRef]
  26. Luo, G.; Wang, Y.; Zhang, H.; Yuan, Q.; Li, J. AlphaRoute: Large-Scale Coordinated Route Planning via Monte Carlo Tree Search. In Proceedings of the AAAI Conference on Artificial Intelligence 2023, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 12058–12067. [Google Scholar] [CrossRef]
  27. Cohen-Solal, Q.; Cazenave, T. Minimax strikes back. arXiv 2020, arXiv:2012.10700. [Google Scholar]
  28. Sinha, A.; Azad, U.; Singh, H. Qubit Routing Using Graph Neural Network Aided Monte Carlo Tree Search. In Proceedings of the AAAI Conference on Artificial Intelligence 2022, Online, 22 February–1 March 2022; Volume 36, pp. 9935–9943. [Google Scholar] [CrossRef]
  29. Kilby, P.; Prosser, P.; Shaw, P. Dynamic VRPs: A Study of Scenarios; University of Strathclyde Technical Report; University of Strathclyde: Glasgow, UK, 1998; Volume 1. [Google Scholar]
  30. Borcinova, Z. Two models of the capacitated vehicle routing problem. Croat. Oper. Res. Rev. 2017, 8, 463–469. [Google Scholar] [CrossRef]
  31. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299. [Google Scholar]
  32. Du, Y.; Xie, P.; Wang, M.; Hu, X.; Zhao, Z.; Liu, J. Full transformer network with masking future for word-level sign language recognition. Neurocomputing 2022, 500, 115–123. [Google Scholar] [CrossRef]
  33. Agarap, A.F. Deep learning using rectified linear units (relu). arXiv 2018, arXiv:1803.08375. [Google Scholar]
  34. Shazeer, N. Glu variants improve transformer. arXiv 2020, arXiv:2002.05202. [Google Scholar]
  35. Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. Palm: Scaling language modeling with pathways. arXiv 2022, arXiv:2204.02311. [Google Scholar]
  36. Williams, R.J. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Mach. Learn. 1992, 8, 229–256. [Google Scholar] [CrossRef]
  37. The PyTorch Lightning Team. PyTorch Lightning; Open-Source Software, 2023. Available online: https://github.com/Lightning-AI/lightning (accessed on 19 February 2025).
  38. Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic Differentiation in Pytorch. 2017. Available online: https://openreview.net/forum?id=BJJsrmfCZ (accessed on 19 February 2025).
  39. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  40. Helsgaun, K. An Extension of the Lin-Kernighan-Helsgaun TSP Solver for Constrained Traveling Salesman and Vehicle Routing Problems: Technical Report; Roskilde Universitet: Roskilde, Denmark, 2017. [Google Scholar]
  41. Applegate, D.; Bixby, R.; Chvátal, V.; Cook, W. Concorde TSP Solver. 03.12.19. Available online: https://en.wikipedia.org/wiki/Concorde_TSP_Solver (accessed on 19 February 2025).
  42. Furnon, V.; Perron, L. OR-Tools Routing Library. 2023. Available online: https://developers.google.com/optimization (accessed on 19 February 2025).
  43. Rosenkrantz, D.J.; Stearns, R.E.; Lewis, P.M., II. An Analysis of Several Heuristics for the Traveling Salesman Problem. SIAM J. Comput. 1977, 6, 563–581. [Google Scholar] [CrossRef]
  44. Chaslot, G.M.B.; Winands, M.H.; van Den Herik, H.J. Parallel Monte-Carlo Tree Search. In Proceedings of the Computers and Games: 6th International Conference, CG 2008, Beijing, China, 29 September–1 October 2008; Springer: Berlin/Heidelberg, Germany, 2008; pp. 60–71. [Google Scholar]
Figure 1. Routing results in red in dynamic load settings.
Figure 2. Overall process of routing using our proposed neural networks. It auto-regressively selects the next node. The encoder is executed once per episode, and the decoder is executed at every timestep t.
Figure 3. Components of the neural network.
Figure 4. An overall process of transition using MCTS depicts a situation in which MCTS is run at t = 2, and some simulation iterations are performed.
Figure 5. Visualization of the two methods’ routing results in red on the same test data.
Figure 6. Training score with the distances in blue and the fitting in orange according to training time after 300 epochs on the CVRP-100 problem, smoothed with a window size of 2.
Figure 7. Histogram of entropies of the policy network p θ(·) for TSP with size 100.
Figure 8. Visualization of TSP results on runtime and score.
Figure 9. Visualization of CVRP results on runtime and score.
Table 1. Results of TSP problems.
Problem Size (n)2050100
Method Activation Baseline ns Dist Time Dist Time Dist Time
LKH3N.A.3.8402 ± 0.050.04 ± 0.015.6705 ± 0.050.411 ± 0.077.7352 ± 0.051.167 ± 0.24
OR Tools3.8402 ± 0.301.001 ± 0.005.6807 ± 0.241.00 ± 0.007.9003 ± 0.287.001 ± 0.00
Nearest Insertions4.3742 ± 0.080.001 ± 0.006.7550 ± 0.070.00 ± 0.009.4517 ± 0.080.003 ± 0.00
Concorde3.8402 ± 0.050.124 ± 0.035.6705 ± 0.051.292 ± 0.237.7352 ± 0.055.143 ± 0.69
DRLReLUmean03.8492 ± 0.090.037 ± 0.005.7361 ± 0.060.079 ± 0.007.9852 ± 0.080.153 ± 0.00
1003.8486 ± 0.090.043 ± 0.005.7375 ± 0.060.205 ± 0.017.9854 ± 0.080.627 ± 0.06
5003.8491 ± 0.090.059 ± 0.005.7345 ± 0.060.592 ± 0.237.9826 ± 0.082.588 ± 1.82
10003.8491 ± 0.090.077 ± 0.015.7339 ± 0.061.051 ± 0.857.9807 ± 0.085.509 ± 9.14
val03.8489 ± 0.080.038 ± 0.005.7386 ± 0.060.078 ± 0.008.0481 ± 0.080.154 ± 0.00
1003.8483 ± 0.080.038 ± 0.005.7310 ± 0.070.194 ± 0.018.0465 ± 0.080.648 ± 0.08
5003.8483 ± 0.080.044 ± 0.005.7291 ± 0.070.533 ± 0.218.0687 ± 0.092.897 ± 2.03
10003.8483 ± 0.080.052 ± 0.015.7301 ± 0.060.959 ± 0.888.0564 ± 0.085.915 ± 9.45
SwiGLUmean03.8464 ± 0.080.037 ± 0.005.7267 ± 0.070.080 ± 0.007.9562 ± 0.070.155 ± 0.00
1003.8462 ± 0.080.045 ± 0.005.7248 ± 0.070.173 ± 0.017.9523 ± 0.070.594 ± 0.09
5003.8461 ± 0.080.065 ± 0.015.7260 ± 0.070.458 ± 0.177.9530 ± 0.072.362 ± 2.72
10003.8461 ± 0.080.089 ± 0.025.7257 ± 0.070.799 ± 0.627.9539 ± 0.074.941 ± 13.23
val03.8482 ± 0.090.038 ± 0.005.7405 ± 0.070.081 ± 0.007.9551 ± 0.070.156 ± 0.00
1003.8474 ± 0.090.047 ± 0.005.7377 ± 0.070.192 ± 0.017.9536 ± 0.070.646 ± 0.07
5003.8478 ± 0.090.069 ± 0.015.7377 ± 0.070.550 ± 0.197.9630 ± 0.072.781 ± 1.93
10003.8478 ± 0.090.093 ± 0.025.7373 ± 0.070.961 ± 0.707.9522 ± 0.075.822 ± 9.99
Table 2. CVRP problem result.
Problem Size (n)2050100
Method Activation Baseline ns Dist Time Dist Time Dist Time
LKH3N.A.6.1528 ± 0.163.948 ± 0.4610.2951 ± 0.2414.617 ± 1.1515.4804 ± 0.3326.03 ± 1.87
OR Tools6.2049 ± 0.851.002 ± 0.0010.5973 ± 1.255.000 ± 0.0016.273 ± 1.7618.000 ± 0
DRLReLUmean06.4097 ± 0.750.045 ± 0.0010.8050 ± 1.650.095 ± 0.0016.4418 ± 2.990.178 ± 0.00
1006.4010 ± 0.760.151 ± 0.0110.8225 ± 1.590.473 ± 0.0316.4627 ± 3.001.326 ± 0.12
5006.4065 ± 0.750.398 ± 0.1310.8192 ± 1.601.737 ± 0.8116.4596 ± 2.996.652 ± 3.91
10006.4073 ± 0.760.685 ± 0.4510.8087 ± 1.633.258 ± 2.9516.4387 ± 2.9613.633 ± 17.61
value06.4553 ± 0.800.046 ± 0.0010.8634 ± 1.650.095 ± 0.0016.4463 ± 3.000.177 ± 0.00
1006.4452 ± 0.830.176 ± 0.0110.8718 ± 1.610.466 ± 0.0316.4750 ± 3.081.409 ± 0.13
5006.4449 ± 0.840.489 ± 0.1510.8651 ± 1.651.773 ± 0.7816.4494 ± 3.027.079 ± 3.34
10006.4479 ± 0.800.831 ± 0.4710.8774 ± 1.673.475 ± 3.6516.4468 ± 2.9814.504 ± 14.40
SwiGLUmean06.4231 ± 0.730.046 ± 0.0010.7940 ± 1.700.097 ± 0.0016.4043 ± 3.090.180 ± 0.00
1006.4228 ± 0.750.141 ± 0.0110.7747 ± 1.710.464 ± 0.0516.4190 ± 3.101.243 ± 0.14
5006.4230 ± 0.730.346 ± 0.0810.7878 ± 1.721.742 ± 1.0216.4169 ± 3.106.018 ± 4.62
10006.4292 ± 0.740.562 ± 0.2310.7847 ± 1.673.260 ± 3.7116.4117 ± 3.1112.221 ± 20.71
val06.4201 ± 0.720.046 ± 0.0010.8402 ± 1.480.096 ± 0.0016.3575 ± 3.040.178 ± 0.00
1006.4020 ± 0.740.154 ± 0.0110.8420 ± 1.530.517 ± 0.0416.3950 ± 2.971.188 ± 0.11
5006.4055 ± 0.730.422 ± 0.1010.8336 ± 1.491.959 ± 1.0216.3475 ± 3.055.671 ± 3.51
10006.4101 ± 0.720.686 ± 0.2810.8456 ± 1.533.683 ± 3.6416.3492 ± 3.0211.610 ± 14.31
Table 3. A t-test report with the same conditions.
Problem Type | Problem Size | AM | MCTS | p-Value | <0.05
tsp | 20 | SwiGLU-val | SwiGLU-val-100 | 0.1 | FALSE
tsp | 50 | ReLU-val | ReLU-val-500 | 0.0144 | TRUE
tsp | 100 | ReLU-mean | ReLU-mean-1000 | 0.0262 | TRUE
cvrp | 20 | SwiGLU-val | SwiGLU-val-1000 | 0.0066 | TRUE
cvrp | 50 | SwiGLU-mean | SwiGLU-mean-100 | 0.0994 | FALSE
cvrp | 100 | SwiGLU-val | SwiGLU-val-100 | 0.0339 | TRUE
Table 4. A t-test report regardless of the conditions.
Problem Type | Problem Size | AM | MCTS | p-Value | <0.05
tsp | 20 | SwiGLU-val | SwiGLU-val-100 | 0.1 | FALSE
tsp | 50 | ReLU-val | ReLU-val-500 | 0.0144 | TRUE
tsp | 100 | ReLU-val | SwiGLU-mean-100 | 0.0001 | TRUE
cvrp | 20 | SwiGLU-val | SwiGLU-val-1000 | 0.0066 | TRUE
cvrp | 50 | ReLU-val | SwiGLU-mean-100 | 0.0064 | TRUE
cvrp | 100 | SwiGLU-val | ReLU-mean-100 | 0.0056 | TRUE
Table 5. Performance results according to activation functions.
Problem Type | Activation | Score (20) | Runtime (20) | Score (50) | Runtime (50) | Score (100) | Runtime (100)
TSP | ReLU | 3.8487 ± 0.09 | 0.049 ± 0.00 | 5.7338 ± 0.06 | 0.462 ± 0.27 | 8.0192 ± 0.08 | 2.312 ± 2.82
TSP | SwiGLU | 3.8470 ± 0.08 | 0.060 ± 0.01 | 5.7321 ± 0.07 | 0.412 ± 0.21 | 7.9549 ± 0.07 | 2.182 ± 3.50
CVRP | ReLU | 6.4272 ± 0.79 | 0.353 ± 0.15 | 10.8416 ± 1.63 | 1.421 ± 1.03 | 16.4526 ± 3.00 | 5.620 ± 4.94
CVRP | SwiGLU | 6.4170 ± 0.73 | 0.300 ± 0.09 | 10.8128 ± 1.60 | 1.477 ± 1.18 | 16.3876 ± 3.06 | 4.789 ± 5.43
Table 6. Aggregated results of baselines.
Problem Type | Baseline | Score (20) | Runtime (20) | Score (50) | Runtime (50) | Score (100) | Runtime (100)
TSP | mean | 3.8476 ± 0.09 | 0.057 ± 0.01 | 5.7306 ± 0.06 | 0.429 ± 0.24 | 7.9687 ± 0.08 | 2.116 ± 3.38
TSP | value | 3.8481 ± 0.08 | 0.052 ± 0.00 | 5.7352 ± 0.07 | 0.444 ± 0.25 | 8.0054 ± 0.08 | 2.378 ± 2.94
CVRP | mean | 6.4153 ± 0.75 | 0.297 ± 0.11 | 10.7996 ± 1.66 | 1.391 ± 1.07 | 16.4319 ± 3.04 | 5.181 ± 5.89
CVRP | value | 6.4289 ± 0.77 | 0.356 ± 0.13 | 10.8549 ± 1.57 | 1.508 ± 1.15 | 16.4083 ± 3.02 | 5.227 ± 4.48
Table 7. Non-selective MCTS applied on TSP.
Problem Size2050100
Method Activation Baseline ns Dist Time Dist Time Dist Time
LKH3N.A.3.8402 ± 0.050.04± 0.015.6705 ± 0.050.411 ± 0.077.7352 ± 0.051.167 ± 0.24
OR Tools3.8402 ± 0.305.001 ± 0.005.6807 ± 0.245.00 ± 0.007.9154 ± 0.285.001 ± 0.00
Nearest Insertions4.3742 ± 0.080.001 ± 0.006.7550 ± 0.070.00 ± 0.009.4517 ± 0.080.003 ± 0.00
Concorde3.8402 ± 0.050.124 ± 0.035.6705 ± 0.051.292 ± 0.237.7352 ± 0.055.143 ± 0.69
DRLReLUmean03.8492 ± 0.090.107 ± 0.005.7361 ± 0.060.231 ± 0.007.9852 ± 0.080.450 ± 0.00
1003.8491 ± 0.090.864 ± 0.035.7356 ± 0.064.614 ± 0.557.9817 ± 0.0810.342 ± 2.31
5003.8491 ± 0.093.079 ± 0.255.7365 ± 0.0617.099 ± 2.767.9855 ± 0.0855.007 ± 24.24
10003.8491 ± 0.095.727 ± 0.345.7360 ± 0.0631.821 ± 5.267.9848 ± 0.08110.943 ± 76.95
value03.8489 ± 0.080.111 ± 0.005.7386 ± 0.060.230 ± 0.008.0481 ± 0.080.451 ± 0.00
1003.8489 ± 0.080.879 ± 0.075.7294 ± 0.074.450 ± 0.608.0459 ± 0.089.434 ± 3.04
5003.8483 ± 0.082.962 ± 0.085.7362 ± 0.0616.844 ± 3.028.0453 ± 0.0856.209 ± 35.65
10003.8483 ± 0.085.715 ± 0.285.7314 ± 0.0731.341 ± 4.818.0456 ± 0.08112.687 ± 105.85
SwiGLUmean03.8464 ± 0.080.110 ± 0.005.7267 ± 0.070.237 ± 0.007.9562 ± 0.070.454 ± 0.00
1003.8463 ± 0.080.940 ± 0.135.7263 ± 0.074.292 ± 0.387.9528 ± 0.0710.239 ± 3.44
5003.8459 ± 0.082.979 ± 0.125.7266 ± 0.0716.516 ± 1.837.9542 ± 0.0756.392 ± 29.80
10003.8461 ± 0.085.716 ± 0.325.7265 ± 0.0730.746 ± 2.697.9542 ± 0.07112.849 ± 84.50
value03.8482 ± 0.090.113 ± 0.005.7405 ± 0.070.234 ± 0.007.9551 ± 0.070.456 ± 0.00
1003.8481 ± 0.090.860 ± 0.045.7403 ± 0.074.429 ± 0.457.9509 ± 0.0710.046 ± 3.79
5003.8481 ± 0.093.040 ± 0.165.7391 ± 0.0716.917 ± 2.967.9516 ± 0.0755.425 ± 39.04
10003.8481 ± 0.095.705 ± 0.275.7391 ± 0.0731.496 ± 4.657.9516 ± 0.07111.637 ± 113.14
Table 8. Non-selective MCTS applied on CVRP.
Problem Size2050100
Method Activation Baseline ns Dist Time Dist Time Dist Time
LKH3N.A.6.1528 ± 0.163.948 ± 0.4610.2951 ± 0.2414.617 ± 1.1515.4804 ± 0.3326.03 ± 1.87
OR Tools6.2049 ± 0.851.002 ± 0.0010.5973 ± 1.255.000 ± 0.0016.273 ± 1.7618.000 ± 0
DRLReLUmean06.4097 ± 0.750.131 ± 0.0010.8050 ± 1.650.279 ± 0.0016.4418 ± 2.990.523 ± 0.00
1006.4048 ± 0.771.067 ± 0.1610.8133 ± 1.603.444 ± 0.5716.4840 ± 3.066.453 ± 2.58
5006.4096 ± 0.773.258 ± 1.1010.8100 ± 1.5914.640 ± 6.7216.4517 ± 2.9935.803 ± 47.22
10006.4230 ± 0.795.631 ± 2.5410.8113 ± 1.6427.046 ± 20.5216.4454 ± 3.0173.464 ± 156.85
value06.4553 ± 0.800.132 ± 0.0010.8634 ± 1.650.276 ± 0.0016.4463 ± 3.000.525 ± 0.00
1006.4425 ± 0.831.185 ± 0.1910.8717 ± 1.623.426 ± 0.5016.4812 ± 3.086.455 ± 2.61
5006.4412 ± 0.823.515 ± 1.1810.8708 ± 1.6614.678 ± 6.8216.4507 ± 3.0235.825 ± 42.25
10006.4475 ± 0.815.890 ± 2.8510.8715 ± 1.6727.310 ± 20.2316.4530 ± 2.9873.602 ± 166.00
SwiGLUmean06.4231 ± 0.730.133 ± 0.0010.7940 ± 1.700.278 ± 0.0016.4043 ± 3.090.527 ± 0.00
1006.4226 ± 0.751.047 ± 0.1510.7908 ± 1.723.371 ± 0.5316.4193 ± 3.116.441 ± 3.31
5006.4248 ± 0.733.051 ± 0.6310.7959 ± 1.7114.491 ± 7.0016.4164 ± 3.1135.644 ± 49.24
10006.4333 ± 0.755.338 ± 1.4810.7933 ± 1.6926.904 ± 22.8316.4148 ± 3.1373.034 ± 205.68
value06.4201 ± 0.720.132 ± 0.0010.8402 ± 1.480.277 ± 0.0016.3575 ± 3.040.525 ± 0.00
1006.4047 ± 0.751.129 ± 0.2110.8389 ± 1.523.560 ± 0.6616.3928 ± 3.016.475 ± 2.65
5006.4105 ± 0.743.314 ± 0.8410.8328 ± 1.4915.411 ± 7.9016.3479 ± 3.0435.775 ± 52.90
10006.4138 ± 0.735.554 ± 1.4910.8422 ± 1.5128.610 ± 27.2116.3497 ± 3.0372.945 ± 187.09
Table 9. TSP 100 difference pivot result.
ActivationBaselinensdiff_cut = 0.10diff_cut = 0.25diff_cut = 0.50diff_cut = 0.75
Score Runtime Score Runtime Score Runtime Score Runtime
SwiGLUmean07.9562 ± 0.070.121 ± 0.007.9562 ± 0.070.121 ± 0.007.9562 ± 0.070.121 ± 0.007.9562 ± 0.070.121 ± 0.00
1007.9541 ± 0.070.124 ± 0.007.9541 ± 0.070.124 ± 0.007.9541 ± 0.070.135 ± 0.007.9523 ± 0.070.621 ± 0.13
5007.9541 ± 0.070.124 ± 0.007.9541 ± 0.070.124 ± 0.007.9554 ± 0.070.198 ± 0.057.9530 ± 0.072.586 ± 3.31
10007.9541 ± 0.070.126 ± 0.007.9541 ± 0.070.126 ± 0.007.9540 ± 0.070.283 ± 0.217.9539 ± 0.075.553 ± 18.05
value07.9551 ± 0.070.156 ± 0.007.9551 ± 0.070.156 ± 0.007.9551 ± 0.070.156 ± 0.007.9551 ± 0.070.156 ± 0.00
1007.9544 ± 0.070.123 ± 0.007.9544 ± 0.070.123 ± 0.007.9547 ± 0.070.143 ± 0.007.9536 ± 0.070.679 ± 0.12
5007.9544 ± 0.070.123 ± 0.007.9544 ± 0.070.123 ± 0.007.9527 ± 0.070.244 ± 0.077.9630 ± 0.073.114 ± 3.52
10007.9544 ± 0.070.124 ± 0.007.9544 ± 0.070.124 ± 0.007.9528 ± 0.070.400 ± 0.377.9522 ± 0.076.532 ± 17.42
Table 10. CVRP 100 difference pivot result.
ActivationBaselinensdiff_cut = 0.10diff_cut = 0.25diff_cut = 0.5diff_cut = 0.75
Score Runtime Score Runtime Score Runtime Score Runtime
SwiGLUmean016.4043 ± 3.090.142 ± 0.0016.4043 ± 3.090.142 ± 0.0016.4043 ± 3.090.142 ± 0.0016.4043 ± 3.090.142 ± 0.00
10016.4043 ± 3.090.351 ± 0.0116.4043 ± 3.090.293 ± 0.0116.4132 ± 3.090.213 ± 0.0116.4190 ± 3.101.253 ± 0.15
50016.4043 ± 3.090.329 ± 0.0016.4043 ± 3.090.310 ± 0.0316.4095 ± 3.110.528 ± 0.2516.4169 ± 3.106.293 ± 5.36
100016.4043 ± 3.090.335 ± 0.0116.4043 ± 3.090.346 ± 0.1616.4115 ± 3.120.931 ± 1.0616.4117 ± 3.1112.700 ± 23.47
value016.3575 ± 3.040.142 ± 0.0016.3575 ± 3.040.142 ± 0.0016.3575 ± 3.040.142 ± 0.0016.3575 ± 3.040.142 ± 0.00
10016.3595 ± 3.050.341 ± 0.0116.3597 ± 3.050.301 ± 0.0116.3536 ± 2.980.217 ± 0.0116.3950 ± 2.971.201 ± 0.12
50016.3595 ± 3.050.327 ± 0.0016.3597 ± 3.050.318 ± 0.0416.3549 ± 3.040.541 ± 0.2716.3475 ± 3.055.947 ± 4.08
100016.3595 ± 3.050.343 ± 0.0116.3597 ± 3.050.333 ± 0.1016.3551 ± 3.040.952 ± 1.1416.3492 ± 3.0212.072 ± 16.35
Table 11. Results for TSP 100 with entropy pivots.
ActivationBaselinensent_cut = 0.25ent_cut = 0.5ent_cut = 0.75ent_cut = 1
Score Runtime Score Runtime Score Runtime Score Runtime
SwiGLUmean07.9562 ± 0.070.123 ± 0.007.9562 ± 0.070.123 ± 0.007.9562 ± 0.070.123 ± 0.007.9562 ± 0.070.123 ± 0.00
1007.9532 ± 0.071.180 ± 0.217.9532 ± 0.070.499 ± 0.047.9541 ± 0.070.358 ± 0.027.9541 ± 0.070.314 ± 0.01
5007.9544 ± 0.075.739 ± 7.987.9542 ± 0.071.681 ± 1.167.9553 ± 0.070.728 ± 0.677.9554 ± 0.070.496 ± 0.25
10007.9543 ± 0.0712.532 ± 41.437.9543 ± 0.073.405 ± 5.957.9543 ± 0.071.277 ± 3.427.9547 ± 0.070.729 ± 1.12
value07.9551 ± 0.070.124 ± 0.007.9551 ± 0.070.124 ± 0.007.9551 ± 0.070.124 ± 0.007.9551 ± 0.070.124 ± 0.00
1007.9537 ± 0.071.361 ± 0.257.9541 ± 0.070.556 ± 0.047.9554 ± 0.070.394 ± 0.037.9547 ± 0.070.317 ± 0.01
5007.9578 ± 0.076.963 ± 8.417.9584 ± 0.072.081 ± 1.457.9506 ± 0.070.925 ± 0.947.9527 ± 0.070.576 ± 0.41
10007.9523 ± 0.0715.041 ± 43.927.9529 ± 0.074.243 ± 6.657.9514 ± 0.071.711 ± 4.567.9525 ± 0.070.867 ± 1.67
Table 12. Results for CVRP 100 with entropy pivots.
ActivationBaselinensent_cut = 0.25ent_cut = 0.5ent_cut = 0.75ent_cut = 1
Score Runtime Score Runtime Score Runtime Score Runtime
SwiGLUmean016.4043 ± 3.090.180 ± 0.0016.4043 ± 3.090.180 ± 0.0016.4043 ± 3.090.180 ± 0.0016.4043 ± 3.090.180 ± 0.00
10016.4242 ± 3.092.854 ± 0.6316.4164 ± 3.090.910 ± 0.1016.4145 ± 3.070.441 ± 0.0816.4143 ± 3.080.218 ± 0.01
50016.4106 ± 3.1116.915 ± 20.9716.4118 ± 3.104.370 ± 2.9916.4146 ± 3.111.578 ± 2.7116.4130 ± 3.110.547 ± 0.22
100016.4100 ± 3.1134.895 ± 87.5716.4100 ± 3.118.889 ± 13.8016.4120 ± 3.123.125 ± 11.7716.4088 ± 3.120.974 ± 0.93
value016.3575 ± 3.040.178 ± 0.0016.3575 ± 3.040.178 ± 0.0016.3575 ± 3.040.178 ± 0.0016.3575 ± 3.040.178 ± 0.00
10016.3839 ± 3.052.730 ± 0.6116.3802 ± 3.020.899 ± 0.1116.3680 ± 3.020.459 ± 0.0816.3666 ± 3.030.232 ± 0.01
50016.3561 ± 3.0615.386 ± 15.6016.3530 ± 3.054.338 ± 2.8016.3575 ± 3.061.658 ± 1.8916.3583 ± 3.050.647 ± 0.29
100016.3558 ± 3.0431.816 ± 66.7916.3531 ± 3.038.700 ± 11.0716.3573 ± 3.043.256 ± 8.1416.3572 ± 3.041.168 ± 1.18
Table 13. Performance for vehicle refills.
Refill Amount0.811.2
Method diff_cut Baseline n s Score Runtime Score Runtime Score Runtime
SwiGLU0.25mean019.5630 ± 5.290.140 ± 0.0016.4043 ± 3.090.142 ± 0.0014.8674 ± 2.940.133 ± 0.00
10019.5774 ± 5.360.166 ± 0.0016.4043 ± 3.090.293 ± 0.0014.8581 ± 2.780.146 ± 0.00
50019.5735 ± 5.360.328 ± 0.1816.4043 ± 3.090.310 ± 0.0314.8632 ± 2.790.196 ± 0.08
100019.5733 ± 5.360.513 ± 0.8416.4043 ± 3.090.346 ± 0.1614.8567 ± 2.760.253 ± 0.26
value019.9787 ± 5.840.138 ± 0.0016.3575 ± 3.030.142 ± 0.0014.7268 ± 1.960.133 ± 0.00
10020.0338 ± 5.780.232 ± 0.0116.3597 ± 3.050.301 ± 0.0114.7260 ± 1.960.136 ± 0.00
50020.0108 ± 5.830.715 ± 0.4516.3597 ± 3.050.318 ± 0.0414.7253 ± 1.960.141 ± 0.00
100020.0072 ± 5.851.306 ± 1.9216.3597 ± 3.050.333 ± 0.1014.7253 ± 1.960.146 ± 0.01
0.75mean019.5630 ± 5.290.140 ± 0.0016.4043 ± 3.090.142 ± 0.0014.8674 ± 2.940.133 ± 0.00
10019.6385 ± 5.141.369 ± 0.1316.4190 ± 3.101.253 ± 0.1514.8840 ± 2.801.073 ± 0.17
50019.6096 ± 5.437.649 ± 4.3916.4169 ± 3.106.293 ± 5.3614.8629 ± 2.786.567 ± 5.25
100019.6208 ± 5.3916.177 ± 19.0516.4117 ± 3.1112.700 ± 23.4714.8810 ± 2.8114.262 ± 26.05
value019.9787 ± 5.840.138 ± 0.0016.3575 ± 3.040.142 ± 0.0014.7268 ± 1.960.133 ± 0.00
10020.0404 ± 6.181.768 ± 0.1816.3950 ± 2.971.201 ± 0.1214.7533 ± 2.001.117 ± 0.10
50020.0736 ± 6.2910.515 ± 6.0416.3475 ± 4.055.947 ± 4.0814.7204 ± 1.946.996 ± 3.75
100020.0147 ± 5.6722.416 ± 26.6216.3492 ± 3.0212.072 ± 16.3514.7116 ± 1.9715.164 ± 16.69
Table 14. Performance for multiple depot nodes.
Number of Depot Nodes13
Method diff_cut Baseline n s Score Runtime Score Runtime
SwiGLU0.25mean016.4043 ± 3.090.142 ± 0.0015.5005 ± 2.540.136 ± 0.00
10016.4043 ± 3.090.293 ± 0.0015.5076 ± 2.560.139 ± 0.00
50016.4043 ± 3.090.310 ± 0.0315.5076 ± 2.560.164 ± 0.02
100016.4043 ± 3.090.346 ± 0.1615.5077 ± 2.560.197 ± 0.08
value016.3575 ± 3.030.142 ± 0.0015.5699 ± 2.610.135 ± 0.00
10016.3597 ± 3.050.301 ± 0.0115.5754 ± 2.620.139 ± 0.00
50016.3597 ± 3.050.318 ± 0.0415.5696 ± 2.610.163 ± 0.01
100016.3597 ± 3.050.333 ± 0.1015.5696 ± 2.610.195 ± 0.06
0.75mean016.4043 ± 3.090.142 ± 0.0015.5005 ± 2.540.136 ± 0.00
10016.4190 ± 3.101.253 ± 0.1515.5638 ± 2.811.066 ± 0.11
50016.4169 ± 3.106.293 ± 5.3615.5298 ± 2.636.209 ± 4.21
100016.4117 ± 3.1112.700 ± 23.4715.5026 ± 2.5713.324 ± 18.63
value016.3575 ± 3.040.142 ± 0.0015.5699 ± 2.610.135 ± 0.00
10016.3950 ± 2.971.201 ± 0.1215.5163 ± 2.561.072 ± 0.08
50016.3475 ± 4.055.947 ± 4.0815.5701 ± 2.736.516 ± 3.38
100016.3492 ± 3.0212.072 ± 16.3515.5553 ± 2.6714.033 ± 15.54
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
