In this section, we present our approach, named AlphaRouter, to solving the routing problem with a combination of reinforcement learning and MCTS. We revise the routing problem described above by adding the possibility of refilling the vehicle, an action the original formulation cannot express, to reflect realistic situations. We begin by defining the components that cast the environment as an RL problem, followed by the neural network models of the policy and the value. We then outline our idea and implementation for adapting MCTS to the routing problem. Our overall process consists of two stages: training the neural network using reinforcement learning, and combining the pre-trained network with the modified MCTS strategy to search for a better solution, i.e., a tour, or a set of routes in the CVRP formulation. Because of the computational demands of MCTS, we apply it selectively, only when ambiguity arises in choosing the next customer node proposed by the output distribution of the policy network, i.e., the distribution over possible next nodes. This selective application improves computational efficiency while maintaining the effectiveness of the MCTS strategy.
4.1. Reinforcement Learning Formulation
The input is denoted by $x_i$, which represents the coordinates of customer $i$. The demand of a node can be included in the vector if the problem is a type of CVRP, i.e., the input for CVRP is then $x_i = (x_i^{\mathrm{coord}}; d_i)$, where the semicolon ";" represents a concatenation operation and $d_i$ is the demand of node $i$. Also, with $n$ customers, the total number of nodes is $n$ for TSP and $n+1$ for CVRP, as one depot node exists. Thus, the input matrix is denoted as $X \in \mathbb{R}^{n \times 2}$ for TSP problems and $X \in \mathbb{R}^{(n+1) \times 3}$ for CVRP.
To bring the problem into a reinforcement learning framework, we define the state, action, and cost (inversely convertible to reward). In our work, the observation state at time point $t$, denoted by $s_t$, is a collection of the node data $X$, containing coordinates and demands; the currently positioned node $c_t$; a set of available nodes to visit, denoted by $A_t$; and a masking vector for unavailable nodes $m_t$, of which the $i$-th element is 0 if $i \in A_t$ and $-\infty$ if $i \notin A_t$: $s_t = (X, c_t, A_t, m_t)$. Though the masking vector $m_t$ stems from the available-node set $A_t$ in our formulation, we intentionally add both $A_t$ and $m_t$ to the state $s_t$ so that the masking vector can be adjusted and redefined to reflect domain requirements, just as several masking techniques are possible in the Transformer [17,31,32].
We omit $t$ for $X$, as the node data are invariant over time in this problem: $X$ stays unchanged for all time points. However, one could make the node data time-varying depending on the domain requirement, and the proposed network model is able to handle a time-varying $X_t$. For CVRP, the current vehicle load, denoted by $l_t$, is also added to $s_t$: $s_t = (X, c_t, A_t, m_t, l_t)$. The node set $A_t$ holds the not-yet-visited nodes whose demands can be fulfilled given $l_t$.
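For illustration, the state described above could be held in a simple container such as the following sketch; the field names are illustrative assumptions, not the identifiers used in our implementation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class State:
    """Illustrative container for the observation s_t; names are assumptions."""
    X: np.ndarray               # (n_nodes, 2) coordinates, or (n_nodes, 3) with demands for CVRP
    current_node: int           # c_t, the node the vehicle currently occupies
    available: set[int]         # A_t, unvisited nodes whose demand fits the remaining load
    mask: np.ndarray            # m_t, 0 for available nodes and -inf for unavailable ones
    load: float | None = None   # l_t, remaining vehicle load (CVRP only)
```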
The action, denoted by $a_t$, is to choose the next customer node and move to it. The action in an episode, a sequence of possible states, is chosen by our policy neural network, as shown in Figure 2, which outputs a probability distribution over all the nodes given the state at $t$, $\pi_\theta(\cdot \mid s_t)$. We use $\pi_\theta(\cdot \mid s_t)$ to describe the policy network output at time $t$ during the episode rollout. In the training phase, the action is sampled from the action distribution as the next node to visit, $a_t \sim \pi_\theta(\cdot \mid s_t)$, with $a_t \in A_t$. The sampling operation gives the vehicle (or agent) a chance to explore a better solution space in the training phase. In the inference phase, however, we choose the action with the maximum probability, meaning $a_t = \arg\max_{i \in A_t} \pi_\theta(i \mid s_t)$ if unvisited nodes exist, and the return to the depot otherwise.
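A minimal sketch of this sampling-versus-greedy action selection, assuming the policy network outputs masked log-probabilities as a PyTorch tensor, is as follows.

```python
import torch

def select_action(log_probs: torch.Tensor, greedy: bool) -> torch.Tensor:
    """log_probs: (batch, n_nodes) log-probabilities from the policy network,
    with unavailable nodes already masked to -inf before the softmax."""
    if greedy:
        # Inference: pick the highest-probability node.
        return log_probs.argmax(dim=-1)
    # Training: sample to keep exploring alternative tours.
    return torch.distributions.Categorical(logits=log_probs).sample()
```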
A value network is designed to predict the overall cost (or distance) of the episode from state $s_t$. This is later used in updating the MCTS tree's statistics. We describe in detail how the other components work in Section 4.2. Specifically, an episode, $\tau$, is a rollout process in which states and actions are interleaved over time until the terminal state $s_T$ is reached: $\tau = (s_0, a_0, s_1, a_1, \ldots, s_T)$. In this problem, the terminal state is the state in which all customers have been visited and, for CVRP, the vehicle has returned to the depot. Because of the possibility of multiple refillings of the vehicle, the last time point $T$ can vary across episodes of CVRP problems. For example, even when problems have the same size, the optimal solution path can vary due to different customer locations and demands. Upon reaching the terminal state, no more transitions are made, and the overall distance, $L(\tau)$, is calculated.
4.2. Architecture of the Proposed Network Model
The neural network architecture of our policy network for calculating the probability distribution $\pi_\theta(\cdot \mid s_t)$ is similar to the one used in previous studies [2,18]. However, to solve the routing problem, we modify the decoder part, relying on the transformer [17]. We aim to extract meaningful, possibly highly time-dependent and complex, features that are specific to the current state while maintaining the whole node structure. We make the two networks share the same embedding vector, transformed from the current input $s_t$ at time $t$. The shared input transformation is designed as a deep-layered network, consisting of an encoder and a decoder, to take advantage of both the whole node structure and the current node. The structure of the two networks with the shared feature transformation is reminiscent of the architecture from the AlphaGo series [4,5] and previous related works [2,18]. In essence, the input $s_t$ produces the estimated probabilities of possible next actions via the policy network, $\pi_\theta(\cdot \mid s_t)$, and the predicted cost via the value network, $v_\theta(s_t)$. For simplicity, we denote all learnable parameters as $\theta$, which consists of the parameters of the shared transformation, the policy network, and the value network.
In detail, we explain the proposed network in three parts: the encoder in the feature transformation, the decoder in the feature transformation, and the policy and value layers. The objective of the encoder is to capture the inter-relationships between nodes. The encoder takes only the node data $X$ from the state, passing it through a linear layer to align with a new dimensionality $d_h$ and then through the multi-head attention layers, expressed by $\mathrm{MHA}(Q, K, V)$ with input tensors of query $Q$, key $K$, and value $V$. The output of the multi-head attention is an encoding matrix, denoted by $H$. Each row vector in the encoding matrix, denoted by $h_i$, represents the $i$-th node. So, the encoding of the currently positioned node at time $t$ is $h_{c_t}$, an embedding vector reflecting the complex and interwoven relationships with the other nodes. In summary, the encoder applies self-attention to the embedded node data, expressed as $H = \mathrm{MHA}(Q = H', K = H', V = H')$ with $H'$ denoting the linearly embedded input; this is repeated over several layers in the model. Relying on the idea of hidden states and current inputs in recurrent networks, we execute the encoder process once per episode, thereby reducing the computational burden, and use the current-node embedding and the current load as inputs for the decoder in a sequential manner. We provide a detailed explanation later in this section.
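A minimal PyTorch sketch of such an encoder, without residual connections, is given below; the layer sizes and number of heads are illustrative assumptions, not the settings reported in this paper.

```python
import torch
from torch import nn

class Encoder(nn.Module):
    """Self-attention encoder sketch: a linear embedding followed by several
    multi-head self-attention layers (residual connections intentionally omitted)."""
    def __init__(self, in_dim: int = 3, d_model: int = 128, n_heads: int = 8, n_layers: int = 3):
        super().__init__()
        self.embed = nn.Linear(in_dim, d_model)   # align node features with the new dimensionality d_h
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads, batch_first=True) for _ in range(n_layers)]
        )

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        # X: (batch, n_nodes, in_dim) node coordinates (and demands for CVRP)
        H = self.embed(X)
        for mha in self.layers:
            H, _ = mha(H, H, H)                   # self-attention: Q = K = V = H
        return H                                  # encoding matrix, one row per node
```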
The decoder is responsible for revealing the relationships diluted in the encoding matrix $H$, with additional information if it is given. Specifically, the decoder captures the relationships between the current node $c_t$ and the others. For example, let us assume that the vehicle is currently on node $i$ and the current node's embedding is $h_i$. Notice that we ignore time $t$ in the encoding matrix, since it does not change within an episode: the output of the encoder is reused over the episode once it has been computed. By using this $h_i$ as the query and the whole encoding matrix $H$ as the key and value, the decoder can reveal the relationships between the current node and the others. When passing the query, key, and value, we apply linear transformations to each of them. One should note that TSP and CVRP have different inputs for the query: in CVRP, the current load $l_t$ is appended to the query input, while in TSP it is not. While there are several layers for the encoder, we use only one MHA layer for the decoder. A summarization of the decoder is as follows:

$g_t = \mathrm{MHA}\big(Q = W^Q\,[h_{c_t};\, l_t],\; K = W^K H,\; V = W^V H\big),$

where $g_t$ denotes the decoder output and the load term $l_t$ is included only for CVRP.
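A corresponding sketch of the single-layer decoder, again with illustrative dimensions, could look as follows.

```python
import torch
from torch import nn

class Decoder(nn.Module):
    """One multi-head attention layer sketch: the current node's embedding (plus the
    remaining load for CVRP) forms the query; the encoding matrix H is key and value."""
    def __init__(self, d_model: int = 128, n_heads: int = 8, cvrp: bool = True):
        super().__init__()
        q_in = d_model + 1 if cvrp else d_model            # append the scalar load l_t for CVRP
        self.project_q = nn.Linear(q_in, d_model)
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, H: torch.Tensor, current_idx: torch.Tensor, load: torch.Tensor | None = None):
        # H: (batch, n_nodes, d_model); current_idx: (batch,); load: (batch,) or None
        h_cur = H[torch.arange(H.size(0)), current_idx]    # embedding of the current node
        if load is not None:
            h_cur = torch.cat([h_cur, load.unsqueeze(-1)], dim=-1)
        q = self.project_q(h_cur).unsqueeze(1)             # (batch, 1, d_model)
        g, _ = self.mha(q, H, H)                           # cross-attention over all nodes
        return g.squeeze(1)                                # decoder output g_t
```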
The policy layer and value layer are responsible for calculating the final policy $\pi_\theta(\cdot \mid s_t)$, a probability distribution over all nodes given $s_t$, and the predicted distance $v_\theta(s_t)$, respectively. We compute $\pi_\theta(\cdot \mid s_t)$ as follows, with a given hyper-parameter $C$ that regulates the clipping:

$\pi_\theta(\cdot \mid s_t) = \mathrm{softmax}\!\left( C \tanh\!\left( \frac{g_t H^\top}{\sqrt{d_h}} \right) + m_t \right).$

That is, to compute $\pi_\theta(\cdot \mid s_t)$, we multiply the decoder output $g_t$ by the transposed encoding matrix $H^\top$ and divide the result by $\sqrt{d_h}$. The output goes through the tanh function and is scaled by $C$, and we add the mask $m_t$ for the unavailable nodes. Finally, we apply a softmax operator to this result.
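The following sketch illustrates this computation; the clipping value C = 10 is an assumed setting, not necessarily the one used in our experiments.

```python
import math
import torch

def policy_distribution(g_t: torch.Tensor, H: torch.Tensor, mask: torch.Tensor, C: float = 10.0):
    """Policy-head sketch: scaled compatibility scores between the decoder output
    and every node encoding, clipped with C*tanh, masked, then softmaxed."""
    d_h = H.size(-1)
    scores = torch.einsum("bd,bnd->bn", g_t, H) / math.sqrt(d_h)   # (batch, n_nodes)
    logits = C * torch.tanh(scores) + mask                         # mask holds 0 / -inf entries
    return torch.softmax(logits, dim=-1)                           # pi_theta(. | s_t)
```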
For $v_\theta(s_t)$, we pass the same decoder output $g_t$ to two linear layers whose shape is similar to the usual feed-forward block in the transformer: $v_\theta(s_t) = W_2\,\sigma(W_1 g_t)$, in which $\sigma$ is an activation function such as ReLU or SwiGLU [33,34]. A diagram of each neural network design is presented in Figure 3.
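A sketch of such a value head, with an assumed hidden width and a ReLU activation, is shown below.

```python
import torch
from torch import nn

class ValueHead(nn.Module):
    """Two linear layers in a feed-forward-block shape, mapping the decoder
    output to a scalar predicted tour length. Sizes are illustrative."""
    def __init__(self, d_model: int = 128, d_ff: int = 512):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, 1))

    def forward(self, dec_out: torch.Tensor) -> torch.Tensor:
        # dec_out: (batch, d_model) decoder output g_t; returns (batch,) predicted cost
        return self.ff(dec_out).squeeze(-1)
```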
When training the model for an episode, the encoding process is required only once, as the input of the encoder (the coordinates of the nodes) is fixed along the rollout steps. The decoder, on the other hand, takes inputs that change over time, i.e., the current node and the current load. Thus, on the first execution of the model, we execute both the encoder and the decoder; after the first execution, we execute only the decoder and the policy and value parts, saving considerable computation. The encoder and decoder parameters are shared between the policy and value networks, while the parameters of the policy and value heads are not. Figure 2 illustrates the overall process.
Additionally, we intentionally exclude residual connections in the encoder layers, as we have observed that, unlike in the original transformer and its variants, residual connections greatly harm the performance of the model. Another variation we add to the previous model concerns the activation functions. Recent studies on large language models (LLMs) have exploited different activation functions; we take this into account and test the SwiGLU activation, just as Google's PaLM did [35]. We report the results in Section 5.
4.3. Training the Neural Network
To train the policy network $\pi_\theta$, we use the well-known policy gradient algorithm REINFORCE with a baseline [36]. This algorithm deals with the high variance prevalent in policy gradient methods by subtracting a specially calculated value, called the baseline. It collects data during an episode and updates the parameters after each episode ends. With $L(\tau)$ denoting the distance traveled by the vehicle following the sequence $\tau$, the policy network aims to learn a stochastic policy that outputs visit sequences with small distances over all problem instances. The gradient of the objective function for the policy network is formulated as follows:

$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\cdot \mid X)}\!\left[ \big(L(\tau) - b(X)\big)\, \nabla_\theta \log p_\theta(\tau \mid X) \right],$

in which $b(X)$ is the cost of a deterministic greedy rollout from the best policy trained so far, used as a baseline to reduce the variance of the original formulation [18]. After training the model parameters $\theta$ for an epoch, we evaluate them on a validation problem set, setting $b(X)$ as the cost evaluated in the validation. One can think of this procedure as the training-validation mechanism in general machine learning.
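A compact sketch of the resulting per-batch loss, assuming the summed log-probability of the sampled tour and a greedy-rollout baseline are available as tensors, is as follows.

```python
import torch

def reinforce_loss(log_prob_tour: torch.Tensor, tour_len: torch.Tensor, baseline: torch.Tensor):
    """REINFORCE-with-baseline sketch. log_prob_tour is the summed log-probability
    of the sampled tour, tour_len its total distance L(tau), and baseline the cost
    of a greedy rollout by the best model so far (all shaped (batch,))."""
    advantage = tour_len - baseline            # positive => worse than the baseline
    # Detach the advantage so gradients flow only through the log-probabilities.
    return (advantage.detach() * log_prob_tour).mean()
```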
The mere use of such a baseline incurs additional computational cost, since it requires the rollout of several episodes, an expensive procedure. To alleviate this burden, we introduce a value network, $v_\theta$, instead of the greedy rollout baseline.
The value network's objective is to learn the expected cost at the end of the episode from any state during the episode rollout. We keep track of the value network's output throughout a rollout and train the network with the loss function

$\mathcal{L}_v(\theta) = \frac{1}{T} \sum_{t=0}^{T-1} \big(v_\theta(s_t) - L(\tau)\big)^2.$
As in the POMO approach [2], we test a baseline using the average cost over a batch of episodes in addition to the baseline using the value network $v_\theta$. For instance, with a batch size of 64, the number of concurrent episode runs, we calculate the baseline as the mean cost of all 64 episodes. This value network is also used in the MCTS process described in the next section. Since our model shares the encoder and decoder parameters between the policy network and the value network, an update of the value network affects the parameters of the policy network through the gradient of the final loss that combines the policy and value objectives.
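As a rough illustration of such a combined update, the following sketch adds a mean-squared value term to the policy loss; the relative weighting `value_coef` is an assumption rather than the coefficient used in this work.

```python
import torch

def total_loss(policy_loss: torch.Tensor, value_pred: torch.Tensor, tour_len: torch.Tensor,
               value_coef: float = 0.5):
    """Sketch of a combined update over the shared encoder/decoder parameters."""
    value_loss = torch.nn.functional.mse_loss(value_pred, tour_len)  # regress toward L(tau)
    return policy_loss + value_coef * value_loss
```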
4.4. Proposed MCTS for the Routing
The main idea of MCTS is to improve the generally good solutions of the trained policy and value networks into problem-specific ones by further investigating possible actions. In essence, without MCTS, we make the transition from $s_t$ to $s_{t+1}$ by taking the action $a_t$ output by the policy network alone. In our proposed MCTS, as described in Figure 2, we instead select the next node by considering costs, the output of the value network, in addition to the prior probabilities from the policy network. Moreover, we apply the MCTS selectively at time $t$, namely when the highest probability from the current policy network fails to dominate, meaning that actions other than the highest-probability action need to be considered. In practice, when the difference between the highest probability and the fifth-highest probability is less than a preset threshold, we apply the MCTS, expounded below.
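The trigger can be sketched as follows, with the threshold left to the caller since its exact value is a tuned setting not reproduced here.

```python
import torch

def needs_mcts(probs: torch.Tensor, gap_threshold: float) -> bool:
    """Selective-MCTS trigger sketch: run MCTS only when the top probability
    does not clearly dominate the fifth-highest one."""
    top5 = torch.topk(probs, k=min(5, probs.numel())).values   # descending order
    return (top5[0] - top5[-1]).item() < gap_threshold
```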
MCTS comprises three distinct phases: selection, expansion, and backpropagation. They iterate on a tree, initialized by the current node $c_t$ and updated as iterations continue, for a given number of simulations, denoted by $n_{\mathrm{sim}}$, the total number of MCTS iterations. At each iteration, the tree keeps expanding, and the statistics of some nodes in the tree are updated. As a result, a different set of tree-node paths is explored throughout the MCTS iterations.
Figure 4 describes an MCTS procedure in which a few MCTS iterations are run. Given time $t$, we use $\tilde{s}_{(k)}$ to represent a tree node positioned at level $k$. The definition of $\tilde{s}_{(k)}$ is the same as that of $s_t$, with the only difference being that $k$ represents an inner time step temporarily used in the MCTS selection. Thus, in an MCTS iteration with fixed $t$, the level $k$ advances as different levels are selected in the selection phase.
In the beginning, we initialize the root tree node with $\tilde{s}_{(0)} = s_t$, meaning that MCTS starts from $s_t$; therefore, the vehicle position in $\tilde{s}_{(0)}$ is the same as the position at $t$, namely $c_t$. To describe the MCTS phases, we introduce new notation: for customer (or depot) node $i$, $N_{(k)}^i$ denotes an accumulated visit count and $W_{(k)}^i$ an accumulated total cost, both at the $k$-th level of the tree. Then, we compute the ratio $Q_{(k)}^i = W_{(k)}^i / N_{(k)}^i$, called the Q-value. The Q-value $Q_{(k)}^i$ for the $i$-th node represents an averaged cost at level $k$. We normalize all Q-values in the simulation by min-max normalization.
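For illustration, these per-node statistics and the min-max normalization could be maintained as in the following sketch; the class and helper names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class NodeStats:
    """Per-(level, customer-node) MCTS statistics: visit count N, accumulated
    cost W, and the averaged cost Q = W / N."""
    N: int = 0
    W: float = 0.0

    @property
    def Q(self) -> float:
        return self.W / self.N if self.N > 0 else 0.0

def minmax_normalize(q: float, q_min: float, q_max: float) -> float:
    # Map Q onto [0, 1] so it is comparable with the policy prior in the selection score.
    return (q - q_min) / (q_max - q_min) if q_max > q_min else 0.0
```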
In the selection phase, given the current MCTS tree, we recursively choose child nodes until we reach a leaf node of the tree. For instance, at the $k$-th level of the tree, among the possible nodes, denoted by $A_{(k)}$, we select the next node according to Equation (13), thereby moving to a tree node at the $(k{+}1)$-th level. In Equation (13), the hyper-parameter $\lambda$ adjusts the contribution of the policy-network evaluation $\pi_\theta(i \mid \tilde{s}_{(k)})$ in comparison with the negative of the averaged cost $Q_{(k)}^i$ for node $i$. Let $\ell$ denote the leaf level reached in the selection phase. We obtain an inner state path $(\tilde{s}_{(0)}, \ldots, \tilde{s}_{(\ell)})$ and an inner node path $(\tilde{c}_{(1)}, \ldots, \tilde{c}_{(\ell)})$. Then, the total node path from time 0 to level $\ell$ becomes a concatenation of the outer-path nodes $(c_0, \ldots, c_t)$ and the inner-path nodes $(\tilde{c}_{(1)}, \ldots, \tilde{c}_{(\ell)})$: $(c_0, \ldots, c_t, \tilde{c}_{(1)}, \ldots, \tilde{c}_{(\ell)})$. The selection phase continues until no more child nodes are available to traverse from the current position, meaning that the node is a leaf node of the tree. In Figure 2, for instance, node 4, highlighted in red, is selected from the root node in the first selection phase, and $\ell = 1$. Note that, in the next MCTS iteration, the selection phase starts from the root node again, not from the leaf node selected in the previous iteration.
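The exact form of Equation (13) is not reproduced here; the following sketch shows one PUCT-style score consistent with the description above, using the normalized, negated Q-value and the hyper-parameter value 1.1 from Algorithm 2.

```python
import math

def selection_score(q_norm: float, prior: float, n_visits: int, parent_visits: int,
                    lam: float = 1.1) -> float:
    """One PUCT-style selection score in the spirit of Equation (13), not the equation itself:
    q_norm is the min-max-normalized average cost of the child, prior the policy
    probability pi_theta(i | s), and lam the hyper-parameter (1.1 in Algorithm 2)."""
    exploration = prior * math.sqrt(parent_visits) / (1 + n_visits)
    return -q_norm + lam * exploration        # the child with the highest score is selected
```

Negating the normalized cost makes lower-distance children preferable, while the exploration term favors children that the policy rates highly but that have been visited rarely.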
After the selection phase, the expansion phase starts, updating the MCTS tree by expanding new child nodes from $A_{(\ell)}$ at node $\tilde{s}_{(\ell)}$, and then moving to the backpropagation phase. Note that in the early stages of the MCTS iterations, the tree may not have expanded enough for a terminal node to be selected, meaning $\tilde{s}_{(\ell)} \neq s_T$. As the MCTS iterations advance, the tree expands enough that the final node selected in the selection phase, $\tilde{s}_{(\ell)}$, becomes the terminal node $s_T$, meaning that routing has ended with no available node to move to. In the latter case, the MCTS iteration still continues until it reaches $n_{\mathrm{sim}}$ in order to explore a variety of possible node paths.
Finally, in the backpropagation phase, tracing back the selected path, we update $N_{(k)}^i$ and $W_{(k)}^i$ for all selected tree nodes in $(\tilde{s}_{(0)}, \ldots, \tilde{s}_{(\ell)})$ and all selected customer nodes $(\tilde{c}_{(1)}, \ldots, \tilde{c}_{(\ell)})$. Specifically, the update follows the rule below:

$N_{(k)}^i \leftarrow N_{(k)}^i + 1, \qquad W_{(k)}^i \leftarrow W_{(k)}^i + c, \qquad Q_{(k)}^i \leftarrow W_{(k)}^i / N_{(k)}^i,$

where $c$ is the cost assigned to the current iteration. As the MCTS iterations continue, the selected leaf node can be either a terminal node ($\tilde{s}_{(\ell)} = s_T$), meaning that the routing has ended, or a non-terminal node ($\tilde{s}_{(\ell)} \neq s_T$). In the former case, the cost is determined by evaluating the distance of the selected path of customer nodes, $(c_0, \ldots, c_t, \tilde{c}_{(1)}, \ldots, \tilde{c}_{(\ell)})$. In the latter, we use the predicted distance $v_\theta(\tilde{s}_{(\ell)})$. This is possible because we train the value network $v_\theta$ to predict the final distance at any state, following Equation (11). In updating the accumulated total cost $W_{(k)}^i$ as in Equation (15), we obtain the predicted cost using $v_\theta$ at the final selected node $\tilde{s}_{(\ell)}$ and then by greedily selecting the next customer nodes until the routing finishes.
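A sketch of this backpropagation update, which also records the running minimum and maximum Q-values used later for min-max normalization, is as follows; the container names are assumptions.

```python
def backpropagate(path, cost, stats, q_bounds):
    """Sketch of the update rule above: path is the list of (level, node) pairs selected
    in this iteration, stats maps each pair to its [N, W] counters, and
    q_bounds = [q_min, q_max] records the extremes used to normalize the Q-values."""
    for level, node in path:
        N, W = stats.get((level, node), [0, 0.0])
        N, W = N + 1, W + cost
        stats[(level, node)] = [N, W]
        q = W / N                                  # Q-value: averaged cost at this level
        q_bounds[0] = min(q_bounds[0], q)
        q_bounds[1] = max(q_bounds[1], q)
```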
When all simulations are finished, we collect the visit-count distribution over the child nodes of the root $\tilde{s}_{(0)}$ and choose the most visited node as $a_t$, the next node to visit in the rollout:

$a_t = \arg\max_{i} N_{(1)}^i.$
Algorithm 1 summarizes the overall process of our MCTS. Applying the MCTS at every step is computationally expensive, making it impractical for real-world use. For each time point $t$, the entropy of the probability distribution $\pi_\theta(\cdot \mid s_t)$ is computed as $\mathcal{H} = -\sum_i \pi_\theta(i \mid s_t) \log \pi_\theta(i \mid s_t)$. We find that most policy outputs have low entropy, meaning that the highest probability dominates the other values. Our idea is therefore to apply the MCTS selectively during the rollout, only when the highest probability fails to dominate, i.e., when the difference between the highest probability and the fifth-highest probability is less than a preset threshold. We empirically find that this strategy improves solution quality with an acceptable computation-time trade-off.
Algorithm 1 Overall simulation flow in MCTS

Require: $\tilde{s}_{(0)}$: root state initialized by $s_t$; $\pi_\theta$: trained policy network; $v_\theta$: trained value network; $n_{\mathrm{sim}}$: number of simulations to run

1: Initialize the MCTS tree with $\tilde{s}_{(0)}$
2: while the number of completed simulations is less than $n_{\mathrm{sim}}$ do
3:   $\tilde{s}_{(\ell)}$, state path, node path = Select(tree)   ▹ A leaf node in the MCTS tree is chosen
4:   Expand($\tilde{s}_{(\ell)}$, $A_{(\ell)}$)   ▹ Expand the MCTS tree from the leaf node using available nodes
5:   if $\tilde{s}_{(\ell)} = s_T$ then   ▹ The selection reached the terminal node
6:     $c$ = distance of the selected node path   ▹ (Equation (15))
7:   else
8:     $c = v_\theta(\tilde{s}_{(\ell)})$   ▹ Use the predicted cost for non-terminal leaf nodes
9:   end if
10:  Backpropagate(state path, node path, $c$)
11:  increment the simulation counter
12: end while
13: return $a_t = \arg\max_i N_{(1)}^i$
We present the pseudo-code for each MCTS phase in Algorithm 2 and highlight the modifications made to adapt MCTS to the routing problems. Firstly, we apply min-max normalization to the Q-values calculated during the entire search. The Q-value has the same range as the cost (distance), whereas the policy-prior term typically falls within the range [0, 1]; because of this scale difference, using a naïve Q-value could lead to a heavy reliance on the Q-value when selecting a child node. To apply min-max normalization in the implementation, we record the maximum and minimum Q-values in the backpropagation phase. Secondly, we negate the Q-value so that the search strategy aims to minimize distance. In the pseudo-code, the STEP procedure, which we do not include in the paper due to its complexity, accepts the chosen action as input and processes the transition to the next state. Internally, we update the current position of the vehicle to the chosen action and, if the problem is CVRP, the current load of the vehicle. In addition, the mask for unavailable nodes, $m$, is updated to prevent the vehicle from returning to visited nodes.
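A minimal sketch of such a STEP transition, assuming a state object with fields like those in the earlier State sketch and the depot at index 0 with normalized demands, is as follows.

```python
import copy
import numpy as np

def step(action: int, state, capacity: float = 1.0):
    """Illustrative STEP transition: move the vehicle, update the remaining load
    (CVRP), and mask the visited customer. Field names and the depot index are assumptions."""
    nxt = copy.deepcopy(state)
    nxt.current_node = action
    if nxt.load is not None:                         # CVRP only
        if action == 0:
            nxt.load = capacity                      # refill at the depot
        else:
            nxt.load -= float(nxt.X[action, -1])     # serve the customer's demand
    if action != 0:                                  # customers cannot be revisited; the depot can
        nxt.available.discard(action)
        nxt.mask[action] = -np.inf
    # Re-filtering `available` for demands exceeding the new load is omitted in this sketch.
    return nxt
```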
Algorithm 2 List of functions in MCTS

Require: $\lambda = 1.1$: hyper-parameter

1: function Select(tree)
2:   $\tilde{s}$ = root node of the tree
3:   state path = []
4:   node path = []
5:   $k$ = 0
6:   while $\tilde{s}$ has a child do
7:     select the child node $i$ of $\tilde{s}$   ▹ (Equation (13))
8:     Append $\tilde{s}$ to the state path
9:     $\tilde{s}$ is updated with the child node selected
10:    Append $i$ to the node path
11:    $k$ = $k$ + 1
12:  end while
13:  $\ell$ = $k$   ▹ Also, $\tilde{s}_{(\ell)} = \tilde{s}$
14:  return $\tilde{s}_{(\ell)}$, state path, node path
15: end function
16: function Expand($\tilde{s}_{(\ell)}$, $A_{(\ell)}$)
17:  for all $i \in A_{(\ell)}$ do
18:    s, _, _ = Step($i$, $\tilde{s}_{(\ell)}$)   ▹ Run STEP with the leaf node's state for the given $i$
19:    create a new child node and assign s as its state
20:    append the child node to $\tilde{s}_{(\ell)}$
21:  end for
22: end function
23: function Backpropagate(state path, node path, $c$)
24:  get the levels [$\ell, \ldots, 0$] from the state path
25:  for all $k \in [\ell, \ldots, 0]$ and $i$ in the node path do   ▹ $k$ denotes a level from the leaf to the root
26:    $N_{(k)}^i = N_{(k)}^i + 1$
27:    $W_{(k)}^i = W_{(k)}^i + c$
28:    $Q_{(k)}^i = W_{(k)}^i / N_{(k)}^i$
29:    Normalize $Q_{(k)}^i$
30:  end for
31: end function