Article

A Deep Reinforcement Learning-Based Decision-Making Approach for Routing Problems

1 School of Electrical Engineering, Xi’an Jiaotong University, Xi’an 710049, China
2 State Key Laboratory of Electrical Insulation and Power Equipment, Xi’an Jiaotong University, Xi’an 710049, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(9), 4951; https://doi.org/10.3390/app15094951
Submission received: 31 March 2025 / Revised: 19 April 2025 / Accepted: 21 April 2025 / Published: 29 April 2025

Abstract

In recent years, routing problems have attracted significant attention in the fields of operations research and computer science due to their fundamental importance in logistics and transportation. However, most existing learning-based methods employ simplistic context embeddings to represent the routing environment, which constrains their capacity to capture real-time visitation dynamics. To address this limitation, we propose a deep reinforcement learning-based decision-making framework (DRL-DM) built upon an encoder–decoder architecture. The encoder incorporates a batch normalization fronting mechanism and a gate-like threshold block to enhance the quality of node embeddings and improve convergence speed. The decoder constructs a dynamic-aware context embedding that integrates relational information among visited and unvisited nodes, along with the start and terminal locations, thereby enabling effective tracking of real-time state transitions and graph structure variations. Furthermore, the proposed approach exploits the intrinsic symmetry and circularity of routing solutions and adopts an actor–critic training paradigm with multiple parallel trajectories to improve exploration of the solution space. Comprehensive experiments conducted on both synthetic and real-world datasets demonstrate that DRL-DM consistently outperforms heuristic and learning-based baselines, achieving up to an 8.75% reduction in tour length. Moreover, the proposed method exhibits strong generalization capabilities, effectively scaling to larger problem instances and diverse node distributions, thereby highlighting its potential for solving complex, real-life routing tasks.

1. Introduction

Routing problems constitute a fundamental class of combinatorial optimization challenges that have been extensively explored in operations research and computer science [1]. Broadly, these problems seek to determine the shortest feasible tour for serving geographically distributed customer nodes while satisfying a set of problem-specific constraints. A range of well-known problems, such as the traveling salesman problem (TSP) [2] and the capacitated vehicle routing problem (CVRP) [3], have attracted significant attention due to their broad applicability in domains including robot route planning [4], express logistics management [5], transportation control [6], and so forth.
Traditional approaches to routing can be broadly categorized into exact and heuristic algorithms. Exact algorithms, like branch-and-price-and-cut [7], provide theoretical guarantees of finding optimal solutions, but are computationally prohibitive for large-scale problems due to their exponential complexity. In comparison, heuristic algorithms, such as ant colony optimization [8], employ manually designed rules to navigate the solution space, striking a balance between solution quality and computational efficiency. Nevertheless, these heuristic methods are highly dependent on extensive domain expertise, limiting their adaptability to complex real-world applications.
In recent times, with the promising advancements of neural networks and reinforcement learning in various domains [9,10], there has been a growing interest in leveraging deep reinforcement learning (DRL) techniques to tackle a variety of routing problems [11,12,13]. DRL-based approaches formulate the routing issue as a sequential decision-making process and exploit the parallel computing capabilities of graphics processing units (GPUs) to enhance search efficiency. Notably, DRL models are able to identify latent patterns from large-scale instances and autonomously learn effective decision-making strategies, enabling the construction of high-quality routes within reasonable time frames.
Despite their notable success, DRL-based approaches face several limitations. First of all, most learning-based methods rely on classical transformer architectures to generate high-dimensional embeddings of customer nodes, yet these conventional designs fall short of capturing effective state representations. Second, existing methods employ simplistic context embeddings to encode environmental information, limiting their ability to accurately model real-time state transitions and graph variations. Third, current methods fail to account for the inherent symmetry and circularity of routing solutions, restricting their capacity to learn more efficient decision-making strategies from a fixed number of training instances.
To address these challenges, we introduce a deep reinforcement learning-based decision-making (DRL-DM) framework, which leverages an encoder–decoder architecture to solve complex routing problems. In the encoding phase, we integrate a batch normalization fronting mechanism and a gate-like threshold block to enhance feature extraction and representation, producing more informative node embeddings for decision-making. As shown in Figure 1, in the decoding phase, we design a dynamic-aware context embedding that captures time-varying environmental information. This refined embedding constructs multiple relational structures, incorporating embeddings of visited and unvisited nodes, as well as the starting and ending points, thereby enabling a more accurate representation of state transitions. To further optimize policy learning, we account for the inherent symmetry and circularity of routing solutions by introducing multiple starting nodes per instance. This design facilitates policy optimization across multiple trajectories, promoting broader exploration of the solution space while leveraging the same set of instances.
To assess the performance of our DRL-DM model, we conduct comprehensive experiments on both synthetically generated datasets and real-world benchmarks, spanning various problem sizes and node distributions. Empirical results demonstrate that DRL-DM achieves substantial performance gains, reducing tour length by up to 8.75% compared to heuristic and learning-based baselines. Moreover, our approach exhibits superior generalization capabilities relative to existing DRL models, underscoring its robustness and scalability across diverse routing scenarios.
In summary, our key contributions are as follows:
  • We propose a DRL-based decision-making approach designed to address routing problems across varying problem sizes and node distributions.
  • We develop a dynamic-aware context embedding that explicitly captures state transitions and graph variations during route construction, enhancing the model’s ability to adapt to changing environments.
  • We conduct extensive experiments demonstrating that our approach achieves substantial improvements in both solution quality and computational efficiency, establishing a new benchmark for learning-based routing optimization.
The remainder of this paper is structured as follows. Section 2 reviews the related work on both traditional and DRL-based methods for routing problems. Section 3 introduces the overall framework of the proposed approach, detailing the encoder, decoder, and training algorithm. Section 4 presents the experimental setup and results. Finally, Section 5 concludes the paper and outlines potential directions for future research.

2. Related Work

2.1. Traditional Methods

Routing problems have garnered significant attention in recent decades due to their broad applicability across diverse domains. Traditional solution strategies fall into two primary categories: exact and heuristic algorithms. Exact algorithms, such as Concorde [14], branch-and-cut [15], and branch-cut-and-price [16], leverage pruning and partitioning techniques to systematically reduce the search space and identify theoretically optimal solutions. Notable advancements include the work of Pereira et al. [17], who introduce novel valid inequalities for routing problems and develop cut-pool and separation procedures to enhance solution quality. Similarly, Yang et al. [18] propose a price-cut-and-enumerate approach that establishes lower bounds for tour optimization, accelerating the overall solving process. While exact algorithms guarantee optimal solutions for small-scale instances, their computational complexity escalates exponentially with problem size, limiting their practical applicability to larger-scale routing scenarios.
By contrast, heuristic algorithms, while lacking guarantees of optimality, offer a pragmatic balance between solution quality and computational efficiency, making them well-suited for large-scale routing problems. Wang et al. [19] introduce a genetic algorithm with an extended sorting scheme to approximate Pareto-optimal solutions for TSPs. Vincent et al. [20] develop a simulated annealing approach integrated with variable neighborhood search to optimize routing for heterogeneous fleets, providing valuable decision-support tools for government agencies and logistics enterprises. Jia et al. [21] address the challenges of electric vehicle routing by incorporating charging constraints into a hierarchical ant colony optimization framework. Despite their effectiveness, heuristic algorithms require extensive trial-and-error tuning and rely heavily on domain-specific expertise, limiting their adaptability across diverse problem settings.

2.2. DRL-Based Methods

The rapid advancement of artificial intelligence and deep learning has spurred significant interest in leveraging deep reinforcement learning (DRL) to address complex routing challenges in logistics and transportation. By framing the routing problem as a sequential decision-making process, DRL learns routing strategies by identifying patterns across massive instances. Existing DRL-based approaches can be broadly categorized into construction-based and improvement-based methods, each differing in their route generation mechanisms. Construction-based methods iteratively build a solution by selecting and appending unvisited nodes to a growing partial route. A key breakthrough in this area is the attention model (AM) [22], proposed by Kool et al., which pioneers the application of transformer networks to routing problems such as TSPs and CVRPs, achieving superior performance. Building on this foundation, Kwon et al. [23] exploit the inherent circularity of routing solutions to develop POMO, a framework that trains policies with multiple optimal candidates to enhance solution diversity and robustness. In comparison, improvement-based methods start with a complete route and refine it through iterative modifications using local search operators such as node swaps and 2-opt moves. Wu et al. [24] introduce a novel encoding scheme and network structure to learn improvement policies, employing an actor–critic framework that yields strong routing performance. Similarly, Ma et al. [25] develop a cross-aspect attention mechanism for node embeddings and integrate proximal policy optimization to accelerate model convergence and enhance stability. These advances highlight the growing potential of DRL in optimizing routing solutions at scale.

3. Methodology

3.1. Mathematical Framework

A routing instance $s$ contains a set of $N$ order-invariant customer nodes $X = \{x_1, \dots, x_N\}$, spatially distributed in the Euclidean space. $x_i$ denotes the 2-dimensional coordinate of node $i$ in TSP, whereas in CVRP, it encapsulates both the coordinate and the associated demand. The goal of the DRL-based routing model is to construct a permutation $\Pi = (\pi_1, \dots, \pi_N)$ that defines the traversal sequence, where $\pi_t$ represents the node to be visited at time $t$. Starting from an arbitrary depot $x_{\pi_1}$, the objective of the routing problem is to minimize the total length $L(\Pi \mid s)$ while adhering to two constraints: (1) each customer node must be visited exactly once; and (2) each salesman departs from the single depot, visits all customer nodes, and ultimately returns to the depot.
$$L(\Pi \mid s) = \big\lVert x_{\pi_N} - x_{\pi_1} \big\rVert_2 + \sum_{t=2}^{N} \big\lVert x_{\pi_t} - x_{\pi_{t-1}} \big\rVert_2 .$$
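For concreteness, the tour-length objective above can be evaluated directly from the node coordinates and a candidate permutation. The sketch below is illustrative only; the array names `coords` and `perm` are our own and not part of the original formulation.

```python
import numpy as np

def tour_length(coords: np.ndarray, perm: np.ndarray) -> float:
    """Euclidean length of the closed tour that visits `coords` in the order `perm`.

    coords: (N, 2) array of node coordinates.
    perm:   (N,) permutation of node indices, i.e., pi_1, ..., pi_N.
    """
    ordered = coords[perm]                        # nodes in visiting order
    legs = np.diff(ordered, axis=0)               # segments pi_{t-1} -> pi_t for t = 2..N
    closing = ordered[0] - ordered[-1]            # return leg pi_N -> pi_1
    return float(np.linalg.norm(legs, axis=1).sum() + np.linalg.norm(closing))

# Example: the four corners of the unit square visited in order give a tour of length 4.
square = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]])
assert abs(tour_length(square, np.arange(4)) - 4.0) < 1e-9
```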
We formulate the routing problem as a sequential decision-making process within an encoder–decoder framework, leveraging a constructive strategy to derive the optimal route $\Pi$. Given an instance as input, the encoder extracts and represents features by embedding raw node attributes through $L$ attention layers. During decoding, the decoder iteratively processes $N$ time steps to construct a complete traversal route. At each time step $t$, a compatibility layer determines the action $\pi_t$, selecting the next node to visit based on the current state $s_t$, which encapsulates both instance-specific and dynamic visitation information. The reward is defined as the negative of the tour length, $R = -L(\Pi \mid s)$. The objective is to learn a policy, parameterized by $\theta$, that optimizes the sequential decision-making process to maximize the cumulative reward. The probability of the entire action sequence can be factorized using the chain rule as follows:
$$p_\theta(\Pi \mid s) = \prod_{t=1}^{N} p_\theta(\pi_t \mid s_t, \pi_{1:t-1}) .$$

3.2. Encoder

The encoder in our framework comprises L stacked layers, each consisting of a multi-head attention (MHA) sublayer followed by a feedforward network (FFN) sublayer. While conventional DRL-based methods adopt residual connections and post-layer batch normalization within the transformer architecture to produce high-dimensional node embeddings, this design often falls short of capturing effective state representations. To address this limitation, we draw upon advances in natural language processing [26] and the gated transformer-XL (GTrXL) architecture [27] by incorporating a batch normalization fronting (BNF) mechanism and a gate-like threshold (GT) block. These components jointly enhance the encoder’s capacity for feature extraction and representation learning, leading to improved modeling of the routing environment.
The encoding process begins by projecting the raw input features into a high-dimensional space to obtain the initial node embeddings $G^{(0)} \in \mathbb{R}^{N \times d_g}$ through a linear projection:
$$G^{(0)} = X W_{\mathrm{linear}} + b_{\mathrm{linear}} ,$$
where $W_{\mathrm{linear}}$ and $b_{\mathrm{linear}}$ denote trainable weight and bias parameters, respectively. The resulting matrix $G^{(0)}$ is then passed through a sequence of $L$ attention layers. For the $l$-th layer, where $l \in \{1, 2, \dots, L\}$, the input $G^{(l-1)}$ is first normalized in a batch-wise manner and subsequently processed by the MHA block to capture contextual dependencies among nodes.
$$G_{\mathrm{BN}}^{(l)} = \mathrm{BN}\big(G^{(l-1)}\big),$$
$$\hat{G}^{(l)} = \mathrm{MHA}\big(G_{\mathrm{BN}}^{(l)} W^{Q}, \, G_{\mathrm{BN}}^{(l)} W^{K}, \, G_{\mathrm{BN}}^{(l)} W^{V}\big) = \mathrm{Concat}\big(Z_1^{(l)}, \dots, Z_Y^{(l)}\big) W^{O},$$
$$Z_y^{(l)} = \mathrm{Softmax}\!\left(\frac{Q_{l,y} K_{l,y}^{\top}}{\sqrt{d_g}}\right) V_{l,y},$$
$$Q_{l,y} = G_{\mathrm{BN}}^{(l)} W^{Q}, \quad K_{l,y} = G_{\mathrm{BN}}^{(l)} W^{K}, \quad V_{l,y} = G_{\mathrm{BN}}^{(l)} W^{V},$$
where $\mathrm{BN}(\cdot)$ denotes the batch normalization operator, $\mathrm{Concat}(\cdot)$ represents matrix concatenation, and $\mathrm{Softmax}(\cdot)$ is the softmax activation function. The trainable projection matrices for the query, key, value, and output transformations are defined as $W^{Q}, W^{K} \in \mathbb{R}^{d_g \times d_q}$, $W^{V} \in \mathbb{R}^{d_g \times d_v}$, and $W^{O} \in \mathbb{R}^{d_g \times d_g}$, respectively. Here, the dimensionalities satisfy $d_g = d_q = d_k = Y d_v$, where $Y$ denotes the number of attention heads. An overview of the MHA mechanism is provided in Figure 2a.
Building upon this design, we replace the conventional residual connection with a GT block, as illustrated in Figure 2b, to further enhance state representation learning. In particular, the GT block computes a gating weight from the input vector $g_{\mathrm{in}}$ and uses it to modulate the output vector $g_{\mathrm{out}}$. This mechanism enables adaptive control of information flow, thereby improving the quality of the learned representations.
$$\mathrm{GT}\big(g_{\mathrm{in}}, g_{\mathrm{out}}\big) = g_{\mathrm{in}} + \mathrm{Sigmoid}\big(g_{\mathrm{in}} W_{\mathrm{gate}} + b_{\mathrm{gate}}\big) \odot g_{\mathrm{out}},$$
where $\mathrm{Sigmoid}(\cdot)$ denotes the sigmoid function, $\odot$ denotes the element-wise product, and $W_{\mathrm{gate}}$ and $b_{\mathrm{gate}}$ are trainable parameters. The output of the MHA sublayer can therefore be expressed as:
$$\tilde{G}^{(l)} = \mathrm{GT}\Big(G^{(l-1)}, \, \mathrm{MHA}\big(\mathrm{BN}(G^{(l-1)})\big)\Big).$$
In the subsequent FFN sublayer, the input $\tilde{G}^{(l)}$ is first processed by the BNF block and then transformed by the FFN block. The resulting output $G^{(l)}$ is aggregated using the GT block, as defined by:
$$G^{(l)} = \mathrm{GT}\Big(\tilde{G}^{(l)}, \, \mathrm{FFN}\big(\mathrm{BN}(\tilde{G}^{(l)})\big)\Big),$$
where $\mathrm{FFN}(\cdot)$ denotes a two-layer feedforward network with a rectified linear unit (ReLU) activation function applied between the linear transformations. Conceptually, during the encoding phase, the initial node embeddings $G^{(0)}$ are progressively refined through $L$ attention layers. Within each layer, the representation of every node is updated by computing attention interactions with all other nodes in the instance, including itself. The final output embedding matrix $G^{(L)} = \big[g_1^{(L)}; \dots; g_N^{(L)}\big]$ serves as a fixed representation of node-level features for the decoding stage. Additionally, a global graph embedding $g_{\mathrm{graph}}$ is computed by averaging the final node embeddings $\{g_i^{(L)}\}_{i=1}^{N}$, providing a holistic summary of the entire input graph.
$$g_{\mathrm{graph}} = \frac{1}{N} \sum_{i=1}^{N} g_i^{(L)} .$$
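To make the encoder design concrete, the following is a minimal PyTorch sketch of one encoder layer with batch normalization fronting and GT-gated aggregation, assuming the hyperparameters reported in Section 4.1 ($d_g = 128$, $Y = 8$ heads, FFN hidden size 512). All module and variable names are our own; the authors' exact implementation may differ.

```python
import torch
import torch.nn as nn

class GateThreshold(nn.Module):
    """Gate-like threshold (GT) block: GT(g_in, g_out) = g_in + Sigmoid(g_in W + b) * g_out."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, dim)

    def forward(self, g_in: torch.Tensor, g_out: torch.Tensor) -> torch.Tensor:
        return g_in + torch.sigmoid(self.gate(g_in)) * g_out

class EncoderLayer(nn.Module):
    """One DRL-DM-style encoder layer: BN-fronted MHA and FFN sublayers, each aggregated by a GT block."""
    def __init__(self, dim: int = 128, heads: int = 8, ffn_dim: int = 512):
        super().__init__()
        self.bn1, self.bn2 = nn.BatchNorm1d(dim), nn.BatchNorm1d(dim)
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))
        self.gt1, self.gt2 = GateThreshold(dim), GateThreshold(dim)

    @staticmethod
    def _bn(bn: nn.BatchNorm1d, g: torch.Tensor) -> torch.Tensor:
        b, n, d = g.shape                              # batch-wise normalization of node features
        return bn(g.reshape(b * n, d)).reshape(b, n, d)

    def forward(self, g: torch.Tensor) -> torch.Tensor:
        h = self._bn(self.bn1, g)                      # batch normalization fronting (BNF)
        attn, _ = self.mha(h, h, h)                    # self-attention over all nodes
        g = self.gt1(g, attn)                          # GT block in place of the residual connection
        h = self._bn(self.bn2, g)
        return self.gt2(g, self.ffn(h))

# Usage: embed raw coordinates, stack L = 3 layers, and average to obtain the graph embedding.
coords = torch.rand(16, 20, 2)                         # a batch of 16 TSP20 instances
g = nn.Linear(2, 128)(coords)                          # initial projection G^(0)
for layer in [EncoderLayer() for _ in range(3)]:
    g = layer(g)
g_graph = g.mean(dim=1)                                # global graph embedding
```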

3.3. Decoder

In the decoding process, most existing DRL-based methods utilize a static context embedding, denoted as $g^{\mathrm{context}}$, to compute the compatibility scores for node selection. Typically, this context embedding is formed by concatenating the graph embedding $g_{\mathrm{graph}}$, the depot node embedding $g_{\pi_1}^{(L)}$, and the last-visited node embedding $g_{\pi_{t-1}}^{(L)}$. However, such a design fails to effectively capture the evolving dynamics inherent in routing state transitions. To overcome this limitation, we propose a dynamic-aware context embedding $g_t^{\mathrm{context}}$, which explicitly encodes both state transitions and graph variations over time. To be more specific, $g_t^{\mathrm{context}}$ integrates the embeddings of the visited subgraph $g_t^{\mathrm{vis}}$, the unvisited subgraph $g_t^{\mathrm{unv}}$, the depot node $g_{\pi_1}^{(L)}$, and the last-visited node $g_{\pi_{t-1}}^{(L)}$. This enriched and adaptive representation provides a more informative and flexible encoding of the current routing state, thereby enhancing the model’s decision-making capability during the decoding process.
$$g_t^{\mathrm{context}} = \mathrm{Concat}\big(g_t^{\mathrm{vis}}, \, g_t^{\mathrm{unv}}, \, g_{\pi_1}^{(L)}, \, g_{\pi_{t-1}}^{(L)}\big),$$
$$g_t^{\mathrm{vis}} = \frac{1}{t-1} \sum_{i=1}^{t-1} g_{\pi_i}^{(L)},$$
$$g_t^{\mathrm{unv}} = \frac{1}{N-t+1} \sum_{i=1}^{N} g_i^{(L)} - \frac{t-1}{N-t+1} \, g_t^{\mathrm{vis}} .$$
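As a concrete illustration, the dynamic-aware context can be assembled from the final node embeddings and the partial route as sketched below. The function and variable names are our own assumptions, not part of the paper's implementation.

```python
import torch

def context_embedding(g: torch.Tensor, route: list, t: int) -> torch.Tensor:
    """Dynamic-aware context at step t (t >= 2): concat of [g_vis, g_unv, depot, last node].

    g:     (N, d) final node embeddings g^(L).
    route: visited node indices pi_1, ..., pi_{t-1}.
    """
    n = g.size(0)
    visited = torch.as_tensor(route[: t - 1])
    g_vis = g[visited].mean(dim=0)                          # mean embedding of the visited subgraph
    # Mean of the unvisited subgraph, using sum(all) = (t-1) * g_vis + (N-t+1) * g_unv.
    g_unv = (g.sum(dim=0) - (t - 1) * g_vis) / (n - t + 1)
    return torch.cat([g_vis, g_unv, g[route[0]], g[route[t - 2]]])

# Example: a 10-node instance after visiting nodes 0, 4, and 7 (so t = 4).
g = torch.rand(10, 128)
ctx = context_embedding(g, route=[0, 4, 7], t=4)
assert ctx.shape == (4 * 128,)
```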
In the decoder, we employ an MHA layer followed by a single-head attention (SHA) layer to assess the compatibility between the context embedding and node embeddings, enabling iterative node selection and route construction. Within the MHA layer, a glimpse embedding is computed to aggregate information from the context and all nodes. The query vector in MHA is derived from the dynamic-aware context embedding $g_t^{\mathrm{context}}$, while the key and value vectors originate from the graph embedding. The computation process is formally defined as follows:
$$g_t^{\mathrm{glimpse}} = \mathrm{MHA}\big(g_t^{\mathrm{context}} W_g^{Q}, \, g_{\mathrm{graph}} W_g^{K}, \, g_{\mathrm{graph}} W_g^{V}\big),$$
where $W_g^{Q}$, $W_g^{K}$, and $W_g^{V}$ are trainable parameter matrices.
Then, an SHA layer is utilized to calculate the probability distribution over the next node to visit. To promote exploration and enhance marginal performance, a tanh function is used to clip the compatibility within the range $[-C, C]$.
$$p_\theta(\pi_t = i \mid s) = \frac{\exp(u_{i,t})}{\sum_{j} \exp(u_{j,t})},$$
$$u_{i,t} = C \cdot \tanh\!\left(\frac{Q_t K_{i,t}^{\top}}{\sqrt{d_g}}\right),$$
$$Q_t = g_t^{\mathrm{glimpse}} W_c^{Q}, \quad K_{i,t} = g_i^{(L)} W_c^{K},$$
where $W_c^{Q}$ and $W_c^{K}$ are trainable matrices and $C$ is a constant. To ensure solution feasibility during the decoding process, an adaptive masking strategy is employed. Specifically, in routing problems, nodes that have already been visited are masked to prevent revisitation. For CVRPs, additional feasibility constraints are imposed by masking customer nodes whose demand exceeds the current remaining payload of the vehicle. This mechanism ensures that only serviceable nodes are considered at each decision step, thereby preserving the validity of the constructed solution.
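The sketch below illustrates one decoding step under these equations: a glimpse computed by MHA, followed by tanh-clipped single-head compatibility scores and masking. It is a simplified reading of the description above (for instance, the keys and values of the glimpse attention are taken directly from the node embeddings), and parameter names such as `w_gq` are our own.

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One decoding step: glimpse via MHA, then tanh-clipped single-head compatibility scores."""
    def __init__(self, dim: int = 128, heads: int = 8, clip: float = 10.0):
        super().__init__()
        self.w_gq = nn.Linear(4 * dim, dim, bias=False)   # projects the 4d context to dimension d
        self.mha = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.w_cq = nn.Linear(dim, dim, bias=False)
        self.w_ck = nn.Linear(dim, dim, bias=False)
        self.clip = clip

    def forward(self, ctx: torch.Tensor, g: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        """ctx: (B, 4d) context; g: (B, N, d) node embeddings; mask: (B, N), True = infeasible."""
        q = self.w_gq(ctx).unsqueeze(1)                   # (B, 1, d) query from the context embedding
        glimpse, _ = self.mha(q, g, g)                    # aggregate information over all nodes
        scores = self.w_cq(glimpse) @ self.w_ck(g).transpose(1, 2)   # (B, 1, N) compatibilities
        u = self.clip * torch.tanh(scores.squeeze(1) / g.size(-1) ** 0.5)
        u = u.masked_fill(mask, float("-inf"))            # forbid visited / over-capacity nodes
        return torch.softmax(u, dim=-1)                   # p_theta(pi_t = i | s_t)

# Usage: pick the next node greedily for a batch of 16 TSP20 instances (no nodes masked here).
dec = DecoderStep()
probs = dec(torch.rand(16, 4 * 128), torch.rand(16, 20, 128),
            torch.zeros(16, 20, dtype=torch.bool))
next_node = probs.argmax(dim=-1)
```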

3.4. Trainer

Following the encoding and decoding process, a complete solution $\Pi$ is constructed that traverses all customer nodes. To further enhance the performance of our proposed approach, we introduce multiple starting nodes within each instance $s$, enabling policy training across multiple trajectories. Specifically, for a routing instance with $M$ customer nodes, the number of trajectories is denoted as $\Psi$, where $\Psi < M$ always holds. Our DRL-DM model consists of a policy network parameterized by $\theta$ and an auxiliary baseline $b(s)$. The model is trained using the REINFORCE algorithm to maximize the cumulative reward, effectively minimizing the total route length. For each trajectory, a gradient descent strategy is employed to optimize the policy parameters $\theta$ as follows:
$$\nabla_\theta L(\theta) = \frac{1}{\Psi} \sum_{\psi=1}^{\Psi} \big(R(\Pi_\psi) - b(s)\big) \, \nabla_\theta \log p_\theta(\Pi_\psi \mid s),$$
$$b(s) = \frac{1}{\Psi} \sum_{\psi=1}^{\Psi} R(\Pi_\psi),$$
where $b(s)$ is a shared baseline that reduces gradient variance and speeds up convergence.
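A compact sketch of this multi-trajectory update is given below, where the baseline is the mean reward over the Ψ rollouts of an instance, as in the equation above. The tensors `rewards` and `log_probs` are assumed to come from rolling out Ψ trajectories per instance; the rollout itself is omitted.

```python
import torch

def reinforce_loss(rewards: torch.Tensor, log_probs: torch.Tensor) -> torch.Tensor:
    """Shared-baseline REINFORCE loss over multiple trajectories.

    rewards:   (B, Psi) rewards R(Pi_psi) = -L(Pi_psi | s) per instance and trajectory.
    log_probs: (B, Psi) sum of log p_theta over each trajectory's decisions.
    """
    baseline = rewards.mean(dim=1, keepdim=True)     # b(s): mean reward across the Psi trajectories
    advantage = rewards - baseline                   # variance-reduced learning signal
    return -(advantage.detach() * log_probs).mean()  # minimizing this ascends the REINFORCE objective

# Usage inside one training step (policy rollout omitted):
# loss = reinforce_loss(rewards, log_probs)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```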

4. Experiments

4.1. Experimental Settings

The performance of the proposed DRL-DM approach is evaluated against a variety of heuristic and DRL-based baselines on both the TSP and the CVRP, using synthetic and real-world routing datasets. For the synthetic dataset, we adopt widely used protocols [22,23] to generate instances containing 20, 50, and 100 nodes, respectively, where node coordinates are uniformly sampled from the unit square $[0,1] \times [0,1]$. In the CVRP setting, customer demands are randomly selected from the integer set $\{1, 2, \dots, 9\}$, while vehicle capacities are set to 30, 40, and 50 for instances with 20, 50, and 100 nodes, respectively. For real-world benchmarks, we utilize instances from the TSPLIB and CVRPLIB repositories, which feature diverse node distributions and problem scales distinct from those of the synthetic dataset. To assess routing performance, we report the mean route length, optimality gap, and average running time. Specifically, the optimality gap is defined as the relative difference between the average route length obtained by a method and the corresponding optimal solution.
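For reference, the sketch below samples a synthetic CVRP instance under the protocol just described (uniform coordinates in the unit square, integer demands in {1, ..., 9}, and size-dependent capacities). The depot sampling and field names are our own assumptions.

```python
import numpy as np

def sample_cvrp_instance(n: int, rng: np.random.Generator) -> dict:
    """Sample one synthetic CVRP instance with n customers, following the described protocol."""
    capacity = {20: 30, 50: 40, 100: 50}[n]          # capacities 30/40/50 for 20/50/100 nodes
    return {
        "depot": rng.random(2),                      # depot location (assumed uniform in [0, 1]^2)
        "coords": rng.random((n, 2)),                # customer coordinates in [0, 1]^2
        "demands": rng.integers(1, 10, size=n),      # integer demands drawn from {1, ..., 9}
        "capacity": capacity,
    }

rng = np.random.default_rng(1234)                    # fixed seed, matching the reported setup
instance = sample_cvrp_instance(50, rng)
```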
The proposed method is implemented within an encoder–decoder architecture, employing a node embedding dimension of $d_g = 128$. The encoder comprises $L = 3$ stacked layers, each consisting of an MHA sublayer with $Y = 8$ attention heads, followed by an FFN sublayer with a hidden dimension of 512. The decoder includes an 8-head attention block followed by a single-head attention block. To enhance training stability, the tanh clipping constant is set to $C = 10$. The number of parallel rollout trajectories is fixed at $\Psi = 3$. For both TSP and CVRP tasks, a total of 128,000 instances are generated on the fly for training, with an additional 1000 independently generated instances used for evaluation. The model is trained for 100 epochs, with each epoch comprising 1000 iterations and a batch size of 512. Optimization is performed using the Adam optimizer, initialized with a learning rate of $1 \times 10^{-4}$ and a decay factor of 0.995 per epoch. For consistency and reproducibility, the random seed is fixed at 1234 across all learning-based models. An overview of the training procedure is depicted in Figure 3. All algorithms are implemented on a personal computer equipped with a GeForce RTX 4080-Super GPU (16 GB VRAM) and an Intel® Ultra U9-285K CPU.
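The optimizer schedule implied by these settings can be written as follows; the policy network is replaced by a placeholder so the snippet runs standalone, and the structure of the inner loop is only indicative.

```python
import torch

model = torch.nn.Linear(2, 128)                      # placeholder for the DRL-DM policy network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.995)   # decay per epoch

for epoch in range(100):                             # 100 epochs of 1000 iterations, batch size 512
    for _ in range(1000):
        pass                                         # sample a batch, roll out Psi = 3 trajectories, REINFORCE step
    scheduler.step()                                 # apply the 0.995 learning-rate decay once per epoch
```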

4.2. Comparison Analysis

The detailed descriptions of baseline methods are listed as follows:
(1)
Concorde [14]: A mature exact solver tailored for TSPs, which has been applied to scenarios such as vehicle routing, gene mapping, and so on.
(2)
Gurobi [28]: A powerful commercial optimizer that models routing problems as mathematical programs and attains excellent solutions.
(3)
LKH-3 [29]: A practical heuristic solver that achieves state-of-the-art performance in a variety of routing problems.
(4)
Google OR Tools (v9.11) [30]: A fast and portable software suite specialized for combinatorial optimization problems.
(5)
GA [31]: The genetic algorithm, which utilizes crossover and mutation operations to search for optimal solutions based on natural selection.
(6)
SA [32]: The simulated annealing approach that involves heating and controlled cooling strategies to approximate the global optimum in TSPs and CVRPs.
(7)
ACO [33]: The ant colony optimization algorithm, which uses artificial ants to explore the environment via pheromone-based communication, ultimately finding satisfactory routing results.
(8)
AM [22]: A milestone DRL-based method that, for the first time, introduces a transformer-based encoder–decoder framework to tackle different kinds of routing problems.
(9)
Wu et al. [24]: A DRL-based approach that learns improvement heuristics to handle routing problems.
(10)
DACT [25]: The improvement-based DRL model which exploits the circularity attribute inherent in traversal routes and designs a dual-aspect attention mechanism to find optimal solutions.
(11)
Neural-2-Opt [34]: The DRL-based model that learns to select 2-opt operators for local search and exhibits near-optimal results with relatively fast solving speed.
(12)
POMO [23]: A state-of-the-art DRL-based approach that learns parallel policies with multiple optima and achieves superior performance in various routing problems.
(13)
NeuRewriter [35]: A neural-based method that learns a policy to iteratively rewrite local components of the current route and improves it until convergence.
Table 1 and Table 2 compare the routing performance of the baseline methods and our approach on TSPs and CVRPs. From an overall perspective, our DRL-DM consistently outperforms the other heuristic and learning-based methods by a clear margin, and its advantage becomes more pronounced as the problem size increases. This convincingly demonstrates the effectiveness and efficiency of the proposed model in tackling practical large-scale problems in real-life scenarios.
For TSPs, the optimal results are calculated based on the exact solver Concorde [14] with sufficient solving time. As shown in Table 1, DRL-DM outstrips the milestone DRL-based method, AM [22], by gap reductions of 0.52%, 1.57%, and 2.06% on TSP20, TSP50, and TSP100, respectively. When compared with heuristic methods, the advantage becomes more significant. For example, on TSP100, the optimality gap of our approach is 3.22%, 2.96%, and 2.83% lower than that of GA [31], SA [32], and ACO [33], respectively. Regarding the state-of-the-art POMO [23], our proposed approach achieves shorter route lengths while maintaining the same magnitude of running time on TSPs of various scales. These results reveal the superiority of our well-designed GT block and dynamic-aware context embedding.
Pertaining to CVRPs, the optimal solutions are obtained by the powerful LKH-3 solver [29], as shown in Table 2. Compared with the heuristic optimizer OR Tools [30], the mean route length of our approach is 0.34, 0.85, and 1.37 shorter on CVRP20, CVRP50, and CVRP100, respectively, which shows that DRL-DM possesses a powerful decision-making capability for tackling complex routing problems. Moreover, the proposed approach also exhibits superior routing performance when compared with DRL-based methods. Specifically, on CVRP100, DRL-DM achieves remarkable optimality gap reductions of 6.65%, 3.45%, 2.62%, 2.17%, and 0.96% relative to AM [22], Wu et al. [24], DACT [25], NeuRewriter [35], and POMO [23], respectively. It is worth noting that our proposed approach outperforms the second-best model, POMO [23], by gap reductions of 0.49%, 0.77%, and 0.96% on CVRPs with 20, 50, and 100 nodes, respectively. These results demonstrate the excellent ability of our DRL-DM to tackle different types of routing problems.

4.3. Generalization Study

Training a model from scratch for every complicated routing problem is computationally expensive and practically intractable. Hence, we evaluate the generalization ability of our proposed approach in three types of scenarios: (1) larger-scale instances with the same uniform distribution; (2) same-scale but out-of-distribution problems relative to the training set; and (3) the real-world benchmarks TSPLIB and CVRPLIB with varying problem sizes and node distributions.

4.3.1. Cross-Size Generalization

To assess the cross-size generalization ability, we generate 1000 instances with 20, 50, 100, 150, and 200 nodes following the uniform distribution, and employ models pretrained on TSP20 and CVRP20 to handle them. We compare our approach with competitive DRL-based methods (i.e., AM, DACT, and POMO), and the experimental results are illustrated in Figure 4. DRL-DM attains an optimality gap of 2.18% on TSP200, outstripping AM, DACT, and POMO by 4.64%, 3.53%, and 2.31%, respectively. In addition, the proposed approach consistently outperforms the state-of-the-art POMO by 1.26%, 1.76%, and 2.31% on large-scale TSPs with 100, 150, and 200 nodes, respectively. As for CVRPs, the optimality gaps of our DRL-DM are 0.49%, 2.26%, 3.18%, 4.05%, and 5.26% on problems with 20, 50, 100, 150, and 200 nodes, which are 4.59%, 6.41%, 7.29%, 8.01%, and 8.97% lower than those of the milestone AM. These results verify the careful design of our model, especially when generalizing to larger-scale in-distribution scenarios.

4.3.2. Cross-Distribution Generalization

To further evaluate the cross-distribution generalization ability of the proposed approach, we randomly generate 1000 instances with 20, 50, and 100 nodes following the explosion and implosion distributions, and employ six models pretrained on TSP20, TSP50, TSP100, CVRP20, CVRP50, and CVRP100 with the uniform distribution to handle them. Focusing on instances with the explosion distribution, as shown in Figure 5, the proposed approach achieves optimality gaps of 0.97%, 1.33%, and 1.75% on TSP20, TSP50, and TSP100, respectively, outstripping the state-of-the-art DRL model, POMO, by 1.54%, 1.51%, and 1.28%. In addition, the optimality gap of DRL-DM on CVRP100 is 1.72%, which is 1.83%, 1.13%, and 0.96% lower than that of AM, DACT, and POMO. As for instances following the implosion distribution, the comparative results are reported in Figure 6. It can be observed that our DRL-DM exhibits the best generalization performance and dominates the other DRL-based methods in terms of the optimality gap. In comparison with POMO, DRL-DM obtains gap reductions of 0.62%, 0.83%, and 1.12% on TSPs with 20, 50, and 100 nodes, respectively, while for CVRP20, CVRP50, and CVRP100, the gap reductions are 0.42%, 0.77%, and 0.94%.

4.3.3. Generalization on Real-World Benchmarks

To assess the routing performance of our approach in real-life scenarios, we apply the models trained on synthetic datasets to instances from the real-world benchmarks TSPLIB [36] and CVRPLIB [37]. To be more specific, we employ the models trained on TSP100 and CVRP100 with the uniform distribution to solve instances in TSPLIB and CVRPLIB. The final experimental results are reported in Table 3 and Table 4. For TSPLIB, our proposed approach achieves an optimality gap of 7.12%, outperforming AM, DACT, and POMO by decreases of 20.28%, 16.27%, and 13.58%, respectively. Taking rat99 as an example, DRL-DM attains a route length of 1232, which is 8.93%, 6.33%, and 4.63% shorter than that of AM, DACT, and POMO. As for CVRPLIB, the superiority of our proposed approach is also significant. The optimality gap of DRL-DM is 8.91%, which is 10.52%, 6.56%, and 3.91% lower than that of AM, DACT, and POMO. These excellent results demonstrate the potential of our proposed approach to handle a series of challenging and complicated routing problems in real-world scenarios.
Furthermore, the proposed DRL-DM framework is evaluated on real-world logistics scenarios in Xi’an City, China. Specifically, the model trained on TSP50 is deployed to service 50 spatially distributed customers across the main urban district. Figure 7 presents the routing outcomes generated by DRL-DM in comparison with three representative DRL-based baselines: POMO, DACT, and AM. As illustrated, DRL-DM achieves a tour length of 97.90 km, reflecting improvements of 0.90%, 1.54%, and 4.50% over POMO, DACT, and AM, respectively. These results underscore the practical effectiveness and superior route optimization capability of DRL-DM in addressing complex, real-world vehicle routing tasks.

4.4. Ablation Study

To evaluate the effectiveness of the proposed DRL-DM architecture, we conduct an ablation study focusing on its three key components: the BNF module, the GT block, and the dynamic-aware context embedding. Table 5 and Table 6 present the routing performance resulting from the incremental integration of these components into a vanilla attention model (i.e., AM [22]) across both TSPs and CVRPs. For the TSP100 benchmark, incorporating the BNF module, GT block, and dynamic-aware context embedding individually results in average route length reductions of 0.26%, 0.78%, and 1.03%, respectively. When all components are combined, the optimality gap is reduced to 0.52%, representing a 2.06% improvement over the baseline. A similar trend is observed for CVRPs. Specifically, the complete integration of all three components yields optimality gap reductions of 4.59%, 5.30%, and 6.65% on CVRP20, CVRP50, and CVRP100, respectively, when compared with the vanilla version. These results clearly demonstrate the effectiveness and scalability of the proposed architectural enhancements across routing problems of varying complexity.

4.5. Sensitivity Analysis

To enable policy training with enhanced trajectory diversity and thereby improve overall routing performance, we introduce multiple starting nodes for each instance. This design allows the model to generate multiple solution trajectories during training. The number of trajectories, denoted by Ψ , plays a critical role in determining both solution quality and computational efficiency. To investigate its effect, we evaluate the routing performance of our proposed DRL-DM under varying values of Ψ (i.e., 1 to 5) on both TSP and CVRP benchmarks. As reported in Table 7, increasing the number of trajectories results in progressively shorter tour lengths, at the cost of increased running time. Notably, setting Ψ = 3 achieves an optimal balance between solution quality and execution efficiency. Under this configuration, DRL-DM exhibits superior performance on TSP20, TSP50, and CVRP20, while maintaining relatively low computational overhead. Based on these empirical observations, we adopt Ψ = 3 for all subsequent comparative and generalization experiments.

5. Conclusions

In this study, we present a novel deep reinforcement learning-based decision-making framework (DRL-DM) to address routing problems across varying problem sizes and node distributions. To enhance representation learning, the encoding phase incorporates a batch normalization fronting mechanism and a gate-like threshold block to generate more informative node embeddings. In the decoding phase, we introduce a dynamic-aware context embedding to effectively capture state transitions and graph variations. Additionally, our training strategy leverages multiple starting nodes per instance, enabling broader exploration of the search space through multiple trajectories. Experimental results on two classes of routing problems demonstrate that DRL-DM consistently outperforms heuristic and learning-based baselines while exhibiting superior generalization capabilities.
Notably, recent advancements, such as graph neural network (GNN)-based approaches [38] and hybrid heuristic–DRL methods [39], are not covered in this study. In GNN-based approaches, the routing environment is represented as a graph in which nodes correspond to customers or locations, and edges reflect the connectivity or distance between them. Through iterative message-passing mechanisms, GNNs can capture complex spatial dependencies and relational patterns among nodes, enabling more informed decision-making. Moreover, hybrid methods that combine heuristic strategies with DRL have gained traction. These methods typically employ DRL to learn high-level policies while integrating domain-specific heuristics for local search or post-processing. For example, hybrid heuristic–DRL approaches may use a learned policy to generate an initial solution, which is then refined through traditional heuristics such as 2-opt or simulated annealing. This integration capitalizes on the global learning capacity of DRL while retaining the efficiency and effectiveness of heuristics in local optimization.
In addition, we recognize that the deployment of DRL in large-scale routing also carries potential societal and technical risks that warrant careful consideration. First, the training of DRL-DM requires substantial computational resources and prolonged GPU utilization, which can translate into significant energy consumption and associated carbon emissions. Mitigating this impact will require the development of more energy-efficient training schedules, model sparsification techniques, or hardware-aware optimizations. Second, our empirical evaluation has been confined to problem sizes of up to 200 nodes; the memory footprint and inference latency of the current model may become prohibitive in larger or more dynamic environments (e.g., thousands of stops and real-time traffic fluctuations). Finally, while our benchmarks are representative, real-world routing scenarios can exhibit richer variability—seasonal demand shifts, regulatory constraints, and heterogeneous vehicle capabilities—that may limit the direct applicability of DRL-DM. Addressing these limitations in future work will be critical to ensuring that advanced DRL routing systems are both environmentally sustainable and robust in complex operational contexts.

Author Contributions

Conceptualization, D.Y. and Q.G.; methodology, Q.G.; software, B.O.; validation, D.Y., Q.G. and B.O.; formal analysis, B.Y. and Z.Z.; investigation, Z.Z.; resources, H.C.; data curation, B.O.; writing—original draft preparation, Q.G.; writing—review and editing, D.Y.; visualization, B.Y.; supervision, H.C.; project administration, H.C.; funding acquisition, D.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant No. 62306232, Natural Science Basic Research Program of Shaanxi Province under Grant No. 2023-JC-QN-0662, and the State Key Laboratory of Electrical Insulation and Power Equipment under Grant No. EIPE23416.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available in a publicly accessible repository at https://github.com/wouterkool/attention-learn-to-route.

Acknowledgments

We thank Xi’an Jiaotong University for helping us with the Article Processing Charge for publication of the article in Open Access.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DRL Deep reinforcement learning
DRL-DM Deep reinforcement learning-based decision-making
TSP Traveling salesman problem
CVRP Capacitated vehicle routing problem
GPU Graphics processing unit
AM Attention model
MHA Multi-head attention
FFN Feedforward network
BNF Batch normalization fronting
GT Gate-like threshold
ReLU Rectified linear unit
SHA Single-head attention
GA Genetic algorithm
SA Simulated annealing
ACO Ant colony optimization

References

  1. Guan, Q.; Cao, H.; Jia, L.; Yan, D.; Chen, B. Synergetic attention-driven transformer: A Deep reinforcement learning approach for vehicle routing problems. Expert Syst. Appl. 2025, 274, 126961. [Google Scholar] [CrossRef]
  2. Guan, Q.; Hong, X.; Ke, W.; Zhang, L.; Sun, G.; Gong, Y. Kohonen self-organizing map based route planning: A revisit. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 7969–7976. [Google Scholar]
  3. Hao, Y.; Chen, Z.; Sun, X.; Tong, L. Planning of truck platooning for road-network capacitated vehicle routing problem. Transp. Res. Part E Logist. Transp. Rev. 2025, 194, 103898. [Google Scholar] [CrossRef]
  4. Guan, Q.; Cao, H.; Zhong, X.; Yan, D.; Xue, S. Hisom: Hierarchical Self-Organizing Map for Solving Multiple Traveling Salesman Problems. Networks, 2025; early view. [Google Scholar]
  5. Zhou, G.; Li, D.; Bian, J.; Zhang, Y. Two-echelon time-dependent vehicle routing problem with simultaneous pickup and delivery and satellite synchronization. Comput. Oper. Res. 2024, 167, 106600. [Google Scholar] [CrossRef]
  6. Shi, W.; Wang, N.; Zhou, L.; He, Z. The bi-objective mixed-fleet vehicle routing problem under decentralized collaboration and time-of-use prices. Expert Syst. Appl. 2025, 273, 126875. [Google Scholar] [CrossRef]
  7. Xia, Y.; Zeng, W.; Zhang, C.; Yang, H. A branch-and-price-and-cut algorithm for the vehicle routing problem with load-dependent drones. Transp. Res. Part B Methodol. 2023, 171, 80–110. [Google Scholar] [CrossRef]
  8. Hou, Y.; Guo, X.; Han, H.; Wang, J. Knowledge-driven ant colony optimization algorithm for vehicle routing problem in instant delivery peak period. Appl. Soft Comput. 2023, 145, 110551. [Google Scholar] [CrossRef]
  9. Tan, J.; Xue, S.; Guan, Q.; Niu, T.; Cao, H.; Chen, B. Unmanned aerial-ground vehicle finite-time docking control via pursuit-evasion games. Nonlinear Dyn. 2025, 113, 16757–16777. [Google Scholar] [CrossRef]
  10. Tan, J.; Xue, S.; Guan, Q.; Qu, K.; Cao, H. Finite-time Safe Reinforcement Learning Control of Multi-player Nonzero-Sum Game for Quadcopter Systems. Inf. Sci. 2025, 712, 122117. [Google Scholar]
  11. Wang, Y.; Hong, X.; Wang, Y.; Zhao, J.; Sun, G.; Qin, B. Token-based deep reinforcement learning for Heterogeneous VRP with Service Time Constraints. Knowl.-Based Syst. 2024, 300, 112173. [Google Scholar]
  12. Wang, C.; Cao, Z.; Wu, Y.; Teng, L.; Wu, G. Deep reinforcement learning for solving vehicle routing problems with backhauls. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 4779–4793. [Google Scholar]
  13. Yan, D.; Ou, B.; Guan, Q.; Zhu, Z.; Cao, H. Edge-Driven Multiple Trajectory Attention Model for Vehicle Routing Problems. Appl. Sci. 2025, 15, 2679. [Google Scholar] [CrossRef]
  14. Applegate, D.L.; Bixby, R.E.; Chvátal, V.; Cook, W.J. Concorde TSP Solver. 2020. Available online: http://www.math.uwaterloo.ca/tsp/concorde/ (accessed on 12 March 2020).
  15. Khachai, D.; Sadykov, R.; Battaia, O.; Khachay, M. Precedence constrained generalized traveling salesman problem: Polyhedral study, formulations, and branch-and-cut algorithm. Eur. J. Oper. Res. 2023, 309, 488–505. [Google Scholar] [CrossRef]
  16. Hintsch, T.; Irnich, S.; Kiilerich, L. Branch-price-and-cut for the soft-clustered capacitated arc-routing problem. Transp. Sci. 2021, 55, 687–705. [Google Scholar] [CrossRef]
  17. Pereira, A.H.; Mateus, G.R.; Urrutia, S.A. Valid inequalities and branch-and-cut algorithm for the pickup and delivery traveling salesman problem with multiple stacks. Eur. J. Oper. Res. 2022, 300, 207–220. [Google Scholar] [CrossRef]
  18. Yang, Y. An exact price-cut-and-enumerate method for the capacitated multitrip vehicle routing problem with time windows. Transp. Sci. 2023, 57, 230–251. [Google Scholar] [CrossRef]
  19. Wang, Y.; Wei, Y.; Wang, X.; Wang, Z.; Wang, H. A clustering-based extended genetic algorithm for the multidepot vehicle routing problem with time windows and three-dimensional loading constraints. Appl. Soft Comput. 2023, 133, 109922. [Google Scholar] [CrossRef]
  20. Vincent, F.Y.; Anh, P.T.; Gunawan, A.; Han, H. A simulated annealing with variable neighborhood descent approach for the heterogeneous fleet vehicle routing problem with multiple forward/reverse cross-docks. Expert Syst. Appl. 2024, 237, 121631. [Google Scholar]
  21. Jia, Y.H.; Mei, Y.; Zhang, M. A bilevel ant colony optimization algorithm for capacitated electric vehicle routing problem. IEEE Trans. Cybern. 2021, 52, 10855–10868. [Google Scholar] [CrossRef]
  22. Kool, W.; van Hoof, H.; Welling, M. Attention, Learn to Solve Routing Problems! In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
  23. Kwon, Y.D.; Choo, J.; Kim, B.; Yoon, I.; Gwon, Y.; Min, S. Pomo: Policy optimization with multiple optima for reinforcement learning. Adv. Neural Inf. Process. Syst. 2020, 33, 21188–21198. [Google Scholar]
  24. Wu, Y.; Song, W.; Cao, Z.; Zhang, J.; Lim, A. Learning improvement heuristics for solving routing problems. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 5057–5069. [Google Scholar] [CrossRef]
  25. Ma, Y.; Li, J.; Cao, Z.; Song, W.; Zhang, L.; Chen, Z.; Tang, J. Learning to Iteratively Solve Routing Problems with Dual-Aspect Collaborative Transformer. Adv. Neural Inf. Process. Syst. 2021, 34, 11096–11107. [Google Scholar]
  26. Labatie, A.; Masters, D.; Eaton-Rosen, Z.; Luschi, C. Proxy-normalizing activations to match batch normalization while removing batch dependence. Adv. Neural Inf. Process. Syst. 2021, 34, 16990–17006. [Google Scholar]
  27. Parisotto, E.; Song, F.; Rae, J.; Pascanu, R.; Gulcehre, C.; Jayakumar, S.; Jaderberg, M.; Kaufman, R.L.; Clark, A.; Noury, S.; et al. Stabilizing transformers for reinforcement learning. In Proceedings of the International Conference on Machine Learning. PMLR, Virtual, 13–18 July 2020; pp. 7487–7498. [Google Scholar]
  28. Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual. 2022. Available online: https://www.gurobi.com/ (accessed on 28 June 2022).
  29. Helsgaun, K. An extension of the Lin-Kernighan-Helsgaun TSP solver for constrained traveling salesman and vehicle routing problems. Roskilde Rosk. Univ. 2017, 12, 966–980. [Google Scholar]
  30. Gunjan, V.K.; Kumari, M.; Kumar, A.; Rao, A.A. Search engine optimization with Google. Int. J. Comput. Sci. Issues (IJCSI) 2012, 9, 206. [Google Scholar] [CrossRef]
  31. Hussain, A.; Muhammad, Y.S.; Nauman Sajid, M.; Hussain, I.; Mohamd Shoukry, A.; Gani, S. Genetic algorithm for traveling salesman problem with modified cycle crossover operator. Comput. Intell. Neurosci. 2017, 2017, 7430125. [Google Scholar] [CrossRef]
  32. Ilhan, I.; Gökmen, G. A list-based simulated annealing algorithm with crossover operator for the traveling salesman problem. Neural Comput. Appl. 2022, 34, 7627–7652. [Google Scholar] [CrossRef]
  33. Wang, Y.; Han, Z. Ant colony optimization for traveling salesman problem based on parameters optimization. Appl. Soft Comput. 2021, 107, 107439. [Google Scholar] [CrossRef]
  34. da Costa, P.R.; Rhuggenaath, J.; Zhang, Y.; Akcay, A. Learning 2-opt heuristics for the traveling salesman problem via deep reinforcement learning. In Proceedings of the Asian Conference on Machine Learning, PMLR, Bangkok, Thailand, 18–20 November 2020; pp. 465–480. [Google Scholar]
  35. Chen, X.; Tian, Y. Learning to perform local rewriting for combinatorial optimization. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  36. Reinelt, G. TSPLIB—A traveling salesman problem library. ORSA J. Comput. 1991, 3, 376–384. [Google Scholar] [CrossRef]
  37. Uchoa, E.; Pecin, D.; Pessoa, A.; Poggi, M.; Vidal, T.; Subramanian, A. New benchmark instances for the capacitated vehicle routing problem. Eur. J. Oper. Res. 2017, 257, 845–858. [Google Scholar] [CrossRef]
  38. Luo, J.; Li, C.; Fan, Q.; Liu, Y. A graph convolutional encoder and multi-head attention decoder network for TSP via reinforcement learning. Eng. Appl. Artif. Intell. 2022, 112, 104848. [Google Scholar] [CrossRef]
  39. Liu, H.; Zong, Z.; Li, Y.; Jin, D. NeuroCrossover: An intelligent genetic locus selection scheme for genetic algorithm using reinforcement learning. Appl. Soft Comput. 2023, 146, 110680. [Google Scholar] [CrossRef]
Figure 1. Overall framework of the proposed DRL-DM approach. Our DRL-DM is built upon an encoder–decoder architecture designed to facilitate efficient routing decisions. The encoder is responsible for transforming raw input features into high-dimensional node embeddings, enabling effective extraction and representation of structural patterns within the routing instance. It comprises L stacked layers, each consisting of an MHA sublayer followed by an FFN sublayer, which jointly models complex inter-node dependencies. The decoder operates in an autoregressive manner to construct routing solutions sequentially. At each decision step, it employs an MHA sublayer and a SHA sublayer to compute a context-aware probability distribution over the candidate nodes. This distribution guides the selection of the next node to visit, progressively generating a feasible route until completion.
Figure 2. An illustration of (a) MHA mechanism and (b) gate-like threshold block. To effectively model complex interactions among nodes, the encoder incorporates an MHA mechanism within each layer. The MHA module enables the model to jointly attend to information from different representation subspaces, thereby enhancing its capacity to capture diverse relational patterns in the graph structure. Moreover, the GT block is introduced to further enhance feature representation and enable more effective control over information flow within the encoder.
Figure 3. The training procedure of our DRL-DM. The training set comprises both synthetic datasets and real-world benchmarks with diverse node distributions, including uniform, explosion, and implosion patterns. As illustrated in the training pipeline, the encoder (depicted in the gray box) is designed to extract underlying structural features from the input nodes. The decoder (shown in the light green box) then iteratively utilizes these encoded representations to determine the next node to visit, thereby constructing feasible routing solutions in a sequential manner.
Figure 4. Generalization results among DRL-based methods on TSPs and CVRPs.
Figure 5. Generalization results among DRL-based methods on TSPs and CVRPs with explosion node distribution.
Figure 6. Generalization results among DRL-based methods on TSPs and CVRPs with implosion node distribution.
Figure 7. Visualization of the routing performance between three comparative learning-based methods and our proposed DRL-DM in the real-life logistics case in Xi’an, China.
Table 1. Comparison results among baseline methods on TSP20-100.
| Method | TSP20 Length | Gap | Time | TSP50 Length | Gap | Time | TSP100 Length | Gap | Time |
| Concorde [14] | 3.84 | 0.00% | 6 min | 5.70 | 0.00% | 15 min | 7.76 | 0.00% | 1 h |
| Gurobi [28] | 3.84 | 0.00% | 8 s | 5.70 | 0.00% | 3 min | 7.76 | 0.00% | 19 min |
| LKH-3 [29] | 3.84 | 0.00% | 45 s | 5.70 | 0.00% | 8 min | 7.76 | 0.00% | 28 min |
| Nearest Search | 4.33 | 12.76% | 1 s | 6.78 | 18.95% | 3 s | 9.47 | 22.04% | 8 s |
| Random Search | 4.01 | 4.43% | 1 s | 6.15 | 7.89% | 2 s | 8.55 | 10.18% | 3 s |
| Farthest Search | 3.95 | 2.86% | 1 s | 6.03 | 5.79% | 4 s | 8.37 | 7.86% | 9 s |
| OR Tools [30] | 3.87 | 0.78% | 1 min | 5.86 | 2.81% | 6 min | 8.09 | 4.25% | 25 min |
| GA [31] | 3.86 | 0.52% | 1 min | 5.84 | 2.46% | 5 min | 8.05 | 3.74% | 24 min |
| SA [32] | 3.86 | 0.52% | 1 min | 5.83 | 2.28% | 5 min | 8.03 | 3.48% | 23 min |
| ACO [33] | 3.86 | 0.52% | 1 min | 5.82 | 2.11% | 4 min | 8.02 | 3.35% | 22 min |
| AM [22] | 3.86 | 0.52% | 2 s | 5.80 | 1.75% | 5 s | 7.96 | 2.58% | 14 s |
| Wu et al. [24] | 3.85 | 0.26% | 13 min | 5.78 | 1.40% | 18 min | 7.94 | 2.32% | 28 min |
| DACT [25] | 3.84 | 0.00% | 9 s | 5.74 | 0.70% | 23 s | 7.92 | 2.06% | 1 min |
| Neural-2-Opt [34] | 3.84 | 0.00% | 16 min | 5.73 | 0.53% | 32 min | 7.89 | 1.68% | 45 min |
| POMO [23] | 3.84 | 0.00% | 5 s | 5.72 | 0.35% | 22 s | 7.83 | 0.90% | 1 min |
| Ours | 3.84 | 0.00% | 5 s | 5.71 | 0.18% | 19 s | 7.80 | 0.52% | 49 s |
Table 2. Comparison results among baseline methods on CVRP20-100.
| Method | CVRP20 Length | Gap | Time | CVRP50 Length | Gap | Time | CVRP100 Length | Gap | Time |
| Concorde [14] | 6.10 | 0.00% | 12 h | - | - | - | - | - | - |
| Gurobi [28] | 6.10 | 0.00% | 5 h | 10.37 | 0.00% | 16 h | - | - | - |
| LKH-3 [29] | 6.10 | 0.00% | 2 h | 10.37 | 0.00% | 7 h | 15.65 | 0.00% | 14 h |
| OR Tools [30] | 6.47 | 6.07% | 2 min | 11.29 | 8.87% | 15 min | 17.15 | 9.58% | 51 min |
| GA [31] | 6.40 | 4.92% | 2 min | 11.12 | 7.23% | 15 min | 16.87 | 7.80% | 49 min |
| SA [32] | 6.36 | 0.52% | 2 min | 11.05 | 2.28% | 14 min | 16.78 | 7.22% | 47 min |
| ACO [33] | 6.32 | 3.61% | 2 min | 10.98 | 5.88% | 13 min | 16.70 | 6.71% | 45 min |
| AM [22] | 6.41 | 5.08% | 3 s | 10.99 | 5.98% | 8 s | 16.82 | 7.48% | 23 s |
| Wu et al. [24] | 6.19 | 1.48% | 24 min | 10.72 | 3.38% | 49 min | 16.32 | 4.28% | 1 h |
| DACT [25] | 6.17 | 1.15% | 1 min | 10.63 | 2.51% | 4 min | 16.19 | 3.45% | 12 min |
| NeuRewriter [35] | 6.16 | 0.98% | 31 min | 10.58 | 2.03% | 56 min | 16.12 | 3.00% | 1 h |
| POMO [23] | 6.16 | 0.98% | 15 s | 10.52 | 1.45% | 1 min | 15.93 | 1.79% | 4 min |
| Ours | 6.13 | 0.49% | 14 s | 10.44 | 0.68% | 1 min | 15.78 | 0.83% | 3 min |
Table 3. Comparison results among DRL-based methods on TSPLIB.
| Instance | Opt. | AM [22] | DACT [25] | POMO [23] | Ours |
| eil51 | 426 | 439 | 438 | 436 | 427 |
| berlin52 | 7542 | 8361 | 8062 | 7864 | 7549 |
| st70 | 675 | 693 | 688 | 684 | 682 |
| rat99 | 1211 | 1342 | 1310 | 1289 | 1232 |
| KroA100 | 21,282 | 42,661 | 40,238 | 38,624 | 25,163 |
| KroB100 | 22,141 | 36,035 | 34,526 | 33,521 | 26,591 |
| KroC100 | 20,749 | 32,937 | 31,456 | 30,468 | 23,627 |
| KroD100 | 21,294 | 33,826 | 31,011 | 29,135 | 23,668 |
| KroE100 | 22,068 | 29,036 | 27,415 | 26,335 | 24,257 |
| rd100 | 7910 | 8256 | 8209 | 8178 | 8034 |
| lin105 | 14,379 | 15,153 | 15,016 | 14,926 | 14,658 |
| pr107 | 44,303 | 53,849 | 53,252 | 52,855 | 47,562 |
| ch150 | 6528 | 6931 | 6882 | 6849 | 6702 |
| rat195 | 2323 | 2611 | 2576 | 2553 | 2406 |
| KroA200 | 29,368 | 35,638 | 35,210 | 34,926 | 33,282 |
| Avg. Gap | 0.00% | 27.40% | 23.39% | 20.70% | 7.12% |
Table 4. Comparison results among DRL-based methods on CVRPLIB.
| Instance | Opt. | AM [22] | DACT [25] | POMO [23] | Ours |
| X-n101-k25 | 27,591 | 36,248 | 32,262 | 29,605 | 28,967 |
| X-n106-k14 | 26,362 | 27,936 | 27,568 | 27,323 | 27,296 |
| X-n110-k13 | 14,971 | 16,318 | 16,060 | 15,889 | 15,219 |
| X-n115-k10 | 12,747 | 14,056 | 13,990 | 13,946 | 13,834 |
| X-n120-k6 | 13,332 | 14,453 | 14,392 | 14,352 | 14,231 |
| X-n125-k30 | 55,539 | 72,349 | 70,676 | 69,562 | 62,384 |
| X-n129-k18 | 28,940 | 30,896 | 30,458 | 30,167 | 30,026 |
| X-n134-k13 | 10,916 | 13,561 | 13,288 | 13,107 | 12,805 |
| X-n139-k10 | 13,590 | 14,756 | 14,403 | 14,169 | 13,856 |
| X-n143-k7 | 15,700 | 18,137 | 18,003 | 17,915 | 17,623 |
| X-n153-k22 | 21,220 | 29,034 | 27,434 | 26,368 | 25,103 |
| X-n157-k13 | 16,876 | 21,966 | 20,345 | 19,265 | 18,236 |
| X-n181-k23 | 25,628 | 27,865 | 27,596 | 27,418 | 27,048 |
| X-n190-k8 | 16,980 | 23,028 | 21,443 | 20,387 | 19,339 |
| X-n200-k36 | 58,578 | 76,035 | 74,289 | 73,126 | 67,238 |
| Avg. Gap | 0.00% | 19.43% | 15.47% | 12.82% | 8.91% |
Table 5. Ablation study of different components of our DRL-DM on TSPs.
| BNF | GT | Context | TSP20 Length | Gap | Time | TSP50 Length | Gap | Time | TSP100 Length | Gap | Time |
|  |  |  | 3.86 | 0.52% | 2 s | 5.80 | 1.75% | 5 s | 7.96 | 2.58% | 14 s |
|  |  |  | 3.86 | 0.52% | 2 s | 5.79 | 1.58% | 8 s | 7.94 | 2.32% | 21 s |
|  |  |  | 3.85 | 0.26% | 3 s | 5.77 | 1.23% | 10 s | 7.90 | 1.80% | 26 s |
|  |  |  | 3.85 | 0.26% | 3 s | 5.76 | 1.05% | 11 s | 7.88 | 1.55% | 30 s |
|  |  |  | 3.85 | 0.26% | 4 s | 5.76 | 1.05% | 13 s | 7.87 | 1.42% | 33 s |
|  |  |  | 3.85 | 0.26% | 4 s | 5.74 | 0.70% | 14 s | 7.86 | 1.29% | 37 s |
|  |  |  | 3.84 | 0.00% | 5 s | 5.73 | 0.53% | 16 s | 7.83 | 0.90% | 42 s |
|  |  |  | 3.84 | 0.00% | 5 s | 5.71 | 0.18% | 19 s | 7.80 | 0.52% | 49 s |
Table 6. Ablation study of different components of our DRL-DM on CVRPs.
| BNF | GT | Context | CVRP20 Length | Gap | Time | CVRP50 Length | Gap | Time | CVRP100 Length | Gap | Time |
|  |  |  | 6.41 | 5.08% | 3 s | 10.99 | 5.98% | 8 s | 16.82 | 7.48% | 23 s |
|  |  |  | 6.37 | 4.43% | 5 s | 10.91 | 5.21% | 18 s | 16.66 | 6.45% | 54 s |
|  |  |  | 6.31 | 3.44% | 7 s | 10.80 | 4.15% | 26 s | 16.46 | 5.18% | 1 min |
|  |  |  | 6.27 | 2.79% | 8 s | 10.72 | 3.38% | 31 s | 16.30 | 4.15% | 2 min |
|  |  |  | 6.26 | 2.62% | 9 s | 10.70 | 3.18% | 37 s | 16.28 | 4.03% | 2 min |
|  |  |  | 6.23 | 2.13% | 10 s | 10.63 | 2.51% | 42 s | 16.14 | 3.13% | 2 min |
|  |  |  | 6.17 | 1.15% | 12 s | 10.52 | 1.45% | 50 s | 15.93 | 1.79% | 3 min |
|  |  |  | 6.13 | 0.49% | 14 s | 10.44 | 0.68% | 1 min | 15.78 | 0.83% | 3 min |
Table 7. Sensitivity analysis of different numbers of trajectories on TSPs and CVRPs.
| Value | TSP20 Length | Time | TSP50 Length | Time | TSP100 Length | Time | CVRP20 Length | Time | CVRP50 Length | Time | CVRP100 Length | Time |
| Ψ = 1 | 3.86 | 4 s | 5.74 | 15 s | 7.85 | 42 s | 6.17 | 11 s | 10.51 | 1 min | 15.89 | 2 min |
| Ψ = 2 | 3.85 | 5 s | 5.72 | 17 s | 7.82 | 46 s | 6.15 | 12 s | 10.47 | 1 min | 15.82 | 3 min |
| Ψ = 3 | 3.84 | 5 s | 5.71 | 19 s | 7.80 | 49 s | 6.13 | 14 s | 10.44 | 1 min | 15.78 | 3 min |
| Ψ = 4 | 3.84 | 6 s | 5.71 | 21 s | 7.80 | 53 s | 6.13 | 16 s | 10.44 | 2 min | 15.78 | 4 min |
| Ψ = 5 | 3.84 | 7 s | 5.71 | 24 s | 7.79 | 58 s | 6.13 | 19 s | 10.43 | 2 min | 15.77 | 5 min |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
