1. Introduction
Combinatorial optimization is a fundamental problem in operations research, computer science, and applied mathematics [1], and it has garnered significant attention in recent years. Vehicle routing problems (VRPs), as representative combinatorial optimization problems [2,3], have widespread applications in real-world scenarios [4,5,6]. The objective of a VRP can be succinctly described as determining the optimal or shortest route for vehicles to visit all customer nodes [7]. Over the past few decades, numerous effective approaches have been proposed to address VRPs and their variants [8,9]. Traditional methodologies for solving VRPs can be broadly classified into exact algorithms [10] and heuristic algorithms [11,12]. Exact methods guarantee that the optimal path is found; however, they are computationally expensive on large-scale or complex graphs and respond poorly to dynamic changes. Heuristic algorithms, such as genetic algorithms [13], ant colony optimization [14,15], and Dirac delta-based methods [16,17,18], leverage heuristic information to guide the search process and can yield good solutions within an acceptable time frame, which has led to their widespread use. However, the performance of these algorithms is highly dependent on domain expertise and the design of heuristic functions [19]. Inappropriately designed heuristics may result in inefficient search or suboptimal outcomes.
Recent advancements in deep reinforcement learning (DRL) have significantly transformed the approach to solving combinatorial optimization problems, including vehicle routing problems [20,21]. Early efforts to tackle related challenges, such as the traveling salesman problem (TSP) [8], primarily employed supervised learning techniques [22]. However, research focus has since shifted toward reinforcement learning, owing to its ability to optimize task objectives directly without depending on large volumes of labeled data. These DRL-based approaches not only enhance solution quality but also maintain computational efficiency within acceptable bounds. In response to the increasing demand for optimal VRP solutions, substantial research has been directed toward end-to-end deep learning models. Notable architectures in this domain include recurrent neural networks (RNNs) [23], graph neural networks (GNNs) [24], and transformer-based models [25,26,27]. Innovations in attention mechanisms have further propelled the development of reinforcement learning frameworks based on encoder–decoder architectures, such as the attention model (AM) [25], which has demonstrated significant improvements in both performance and computational speed. A key advantage of DRL-based approaches lies in their scalability. Once trained, these models can solve problems of varying sizes that share the same combinatorial structure, obviating the need for retraining when the problem scale changes. This scalability is particularly valuable in addressing real-world VRP instances, where problem size can fluctuate significantly.
Although notable progress has been made, existing attention-based DRL models still face several limitations. First, they overlook the potential of edge-related information. The input to a VRP often includes not only node information but also edge information between nodes. Traditional attention-based DRL models typically focus on node feature representations and fail to learn edge feature representations, thereby leaving the rich feature information embedded in the graph topology underutilized. Second, existing models exhibit limited sensitivity to state transitions during the decoding process. For instance, AM [25] leverages graph embeddings to compute context embeddings. However, these graph embeddings are fixed as the average embedding of all nodes within an episode, so the information supplied during decoding is static and fails to capture how the graph changes as nodes are visited. This limitation may impair the solution quality. Third, these approaches often struggle to generate sufficiently diverse solution trajectories. A broader range of candidate routes could enhance exploration within the solution space, potentially yielding better routing outcomes. However, current approaches typically rely on a single strategy trained by a single decoder, and this limited variability restricts the model’s ability to explore better routing configurations.
To address the aforementioned limitations, we propose a novel Edge-Driven Multiple Trajectory Attention Model (E-MTAM) for vehicle routing problems, building on the existing encoder–decoder framework. First, we emphasize the integration of edge information by incorporating both edge and node embeddings as inputs to the encoder. This allows the model to capture comprehensive edge feature representations alongside node features. During the encoding phase, we enhance the multi-head attention mechanism so that it is driven by edge information, thereby improving the encoder’s capacity to model graph topology and its associated relational data. Second, in the decoding phase, we incorporate dynamic visitation information into the static graph embeddings via a mask, so that the graph embedding fed to the decoder evolves in real time with the visited graph, giving the decoder an enhanced capacity to represent dynamic information. Third, our model employs a multi-decoder structure in which the decoders are identical in architecture but do not share parameters. We introduce a regularization loss to encourage these decoders to generate diverse trajectories, which enhances exploration of the solution space and increases the likelihood of obtaining superior results. Finally, extensive experimental evaluations across three distinct VRP variants demonstrate that our proposed E-MTAM significantly outperforms a range of heuristic algorithms and DRL-based models. Our results indicate that E-MTAM not only addresses the conventional TSP effectively but is also applicable to other types of vehicle routing problems (CVRP and OP), showcasing its potential for application in real-world scenarios.
In general, the main contributions of this work are as follows:
We propose a model called E-MTAM for VRPs, which effectively incorporates edge information to drive the node encoding process, thereby fully leveraging the rich edge features embedded within the graph topology.
We combine visitation information with graph embeddings to integrate dynamic information into static graph embeddings, enabling the decoder to perceive real-time changes in the visited graph at each time step.
We employ a multi-decoder framework and introduce a regularization term to encourage the decoders to generate diverse trajectories, thereby learning distinct routing strategies and fostering a more comprehensive exploration of the solution space.
We conduct extensive experiments on three distinct VRP variants, and the results demonstrate that our E-MTAM model consistently outperforms a broad spectrum of alternative methods, underscoring its superior performance and robustness.
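To make the second contribution concrete, the sketch below contrasts a static graph embedding (the mean of all node embeddings, as described above for AM) with a mask-dependent one. The masked-mean form and both function names are assumptions chosen for illustration; the exact way E-MTAM combines the mask with the graph embedding may differ.

```python
from typing import List, Sequence


def static_graph_embedding(node_emb: Sequence[Sequence[float]]) -> List[float]:
    """Mean of all node embeddings; stays constant for the whole episode."""
    n = len(node_emb)
    dim = len(node_emb[0])
    return [sum(h[d] for h in node_emb) / n for d in range(dim)]


def dynamic_graph_embedding(
    node_emb: Sequence[Sequence[float]], visited: Sequence[bool]
) -> List[float]:
    """Mean over unvisited nodes only (assumed form, not necessarily the paper's)."""
    remaining = [h for h, v in zip(node_emb, visited) if not v]
    if not remaining:  # every node visited: fall back to the static mean
        return static_graph_embedding(node_emb)
    return static_graph_embedding(remaining)


emb = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
print(static_graph_embedding(emb))                         # [1.0, 1.0]
print(dynamic_graph_embedding(emb, [True, False, False]))  # [1.0, 1.5]
```

The static embedding is the same at every decoding step, whereas the masked version changes as soon as a node is marked visited, which is the kind of state sensitivity the contribution targets.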
4. Experiments
4.1. Experiment Settings
We focus on three types of VRPs in our experiments: (1) TSP; (2) CVRP; and (3) OP. Generally, the objective of the TSP is to determine the shortest path that starts and ends at a depot while visiting all nodes exactly once. The CVRP, a generalized case of TSP, requires that each vehicle departs from a central depot, visits a set of customer nodes, and completes all delivery tasks within specified capacity constraints. In the OP, each node is associated with a prize value, and the goal is to construct a single path starting and ending at the depot, aiming to maximize the total prize collected while staying within the maximum routing length constraint. Furthermore, following adjustments to the input, masking rules, and decoder context vectors, the model is also capable of addressing other variants of VRP as well as more complex real-world routing challenges.
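The objectives described above can be evaluated with a few lines of code. The following is a minimal sketch, assuming Euclidean distances and routes given as lists of node indices; the function names and the toy instance are hypothetical and not taken from the paper.

```python
import math
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]


def closed_tour_length(coords: Sequence[Point], tour: Sequence[int]) -> float:
    """Length of the cycle that visits `tour` in order and returns to tour[0]."""
    return sum(
        math.dist(coords[tour[i]], coords[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )


def op_collected_prize(
    coords: Sequence[Point],
    prizes: Sequence[float],
    route: Sequence[int],
    max_length: float,
) -> Optional[float]:
    """Total prize of a depot-to-depot route, or None if it exceeds the budget.

    `route` starts at the depot; the return leg to the depot is implied.
    """
    if closed_tour_length(coords, route) > max_length:
        return None
    return sum(prizes[i] for i in route)


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(closed_tour_length(square, [0, 1, 2, 3]))  # 4.0
print(op_collected_prize(square, [0, 1, 2, 3], [0, 1, 2], max_length=4.0))  # 3
```

For TSP the tour length is minimized over all permutations, while for OP the prize is maximized over routes that keep the length within the budget.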
For each problem, we follow [25,41] to conduct experiments with node sizes of 20, 50, and 100
. For each problem instance, the coordinates of the nodes are randomly generated within the region
. The vehicle capacities of CVRP are set to
for problems with
nodes, while the demand at each customer node is assigned values from
. For OP, the maximum path length constraints for instances of different node sizes are set to
, respectively.
We train the models for 100 epochs. In each epoch, 1,280,000 instances are generated on the fly and processed with a batch size of 512 (for CVRP50 and CVRP100, we instead generate 320,000 instances and use a batch size of 128). We use the rollout baseline as the baseline estimator and employ the Adam optimizer for model training. Additionally, we construct a test set consisting of 1000 instances, which follows the same distribution as the training data. Other relevant hyperparameter settings used in our framework are detailed in Table 1. All our experiments are conducted on an Intel Core i5-12600KF CPU and an NVIDIA GeForce RTX 4060 GPU.
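For readers unfamiliar with baseline-corrected policy gradients, the sketch below shows the surrogate quantity such training typically minimizes. It assumes the standard REINFORCE form used with a rollout baseline in AM [25], where the baseline is the cost of a greedy rollout from a frozen copy of the policy; whether E-MTAM uses exactly this form is an assumption, and the function name is hypothetical.

```python
from typing import Sequence


def reinforce_surrogate(
    costs: Sequence[float],
    baseline_costs: Sequence[float],
    log_probs: Sequence[float],
) -> float:
    """Batch mean of (cost - baseline) * log-probability of the sampled tour.

    In an autodiff framework the advantage (cost - baseline) is treated as a
    constant, and gradients flow only through the log-probabilities.
    """
    terms = [(c - b) * lp for c, b, lp in zip(costs, baseline_costs, log_probs)]
    return sum(terms) / len(terms)


# Two sampled tours: one better and one worse than the baseline rollout.
print(reinforce_surrogate([10.0, 12.0], [11.0, 11.0], [-2.0, -3.0]))  # -0.5
```

Tours that beat the baseline get a negative advantage, so minimizing the surrogate raises their probability; tours that are worse than the baseline are pushed down.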
For evaluation, we consider the average tour length, optimality gap, and running time to assess the performance of a method on a given problem. The optimality gap quantifies the relative difference between the obtained result and the best result:
$$\mathrm{Gap} = \frac{\bar{L} - L^{*}}{L^{*}} \times 100\%,$$
where $\bar{L}$ denotes the average tour length across all test instances, while $L^{*}$ represents the best-known result for the given problem. Similarly, we calculate the optimality gap of OP using the average collected prize $\bar{P}$ and the maximum prize $P^{*}$:
$$\mathrm{Gap} = \frac{P^{*} - \bar{P}}{P^{*}} \times 100\%.$$
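A minimal sketch of both gap computations follows, assuming the usual relative-gap form (difference from the best-known value divided by the best-known value). The function names are hypothetical and the numbers are purely illustrative, not results from the paper.

```python
def gap_minimization(avg_length: float, best_length: float) -> float:
    """Optimality gap in percent for a minimization problem (TSP, CVRP)."""
    return (avg_length - best_length) / best_length * 100.0


def gap_maximization(avg_prize: float, best_prize: float) -> float:
    """Optimality gap in percent for a maximization problem (OP)."""
    return (best_prize - avg_prize) / best_prize * 100.0


# Illustrative values only.
print(gap_minimization(10.5, 10.0))  # 5.0
print(gap_maximization(9.0, 10.0))   # 10.0
```

The sign is flipped for OP so that, in both cases, a smaller gap means a result closer to the best-known one.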
We compare our proposed E-MTAM against a range of traditional and DRL-based approaches for solving VRPs. These baselines include the following:
- (1) Concorde [29]: a specialized exact solver for solving TSP;
- (2) LKH [15]: a state-of-the-art heuristic optimization solver;
- (3) Gurobi [44]: a commercial optimization solver;
- (4) OR Tools [45]: an open source software suite developed by Google for solving optimization problems such as routing, scheduling, and linear programming;
- (5) ACO [46]: a heuristic algorithm inspired by the foraging behavior of ants, used to solve combinatorial optimization problems;
- (6) EMA [13]: an evolutionary optimization method designed to solve multiple vehicle routing problems simultaneously, leveraging knowledge transfer between tasks to enhance performance;
- (7) PtrNet [35]: a neural model that employs attention to select elements from an input sequence, solving combinatorial problems with variable-sized outputs;
- (8) GCN [47]: a graph convolutional network that efficiently builds TSP graph representations and outputs tours through a parallelized beam search;
- (9) DACT [43]: a dual-aspect collaborative transformer that improves vehicle routing problems by separately learning node and positional embeddings with a novel cyclic positional encoding method;
- (10) AM [25]: a milestone DRL-based model with an attention mechanism and encoder–decoder scheme;
- (11) POMO [41]: a state-of-the-art DRL method that achieves competitive performance on various routing problems.
In this work, we focus solely on models that employ greedy selection, eliminating the impact of the sampling strategy on model performance to ensure a fair comparison.
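Greedy selection means that, at every decoding step, the unvisited node with the highest score under the policy is chosen deterministically. The sketch below illustrates this with a masked argmax. The scoring function is a stand-in (negative Euclidean distance, which reduces to nearest-neighbor selection) rather than a trained decoder, and all names are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def greedy_decode(
    score_fn: Callable[[int], Sequence[float]], n_nodes: int, start: int = 0
) -> List[int]:
    """Build a tour by repeatedly taking the best-scoring unvisited node."""
    visited = [False] * n_nodes
    visited[start] = True
    tour = [start]
    while len(tour) < n_nodes:
        scores = score_fn(tour[-1])
        # Masking: visited nodes are excluded from the argmax.
        candidates = [j for j in range(n_nodes) if not visited[j]]
        nxt = max(candidates, key=lambda j: scores[j])
        visited[nxt] = True
        tour.append(nxt)
    return tour


coords = [(0.0, 0.0), (3.0, 0.0), (1.0, 0.0), (2.0, 0.0)]


def nearest_scores(current: int) -> List[float]:
    """Stand-in for decoder logits: closer nodes receive higher scores."""
    return [-math.dist(coords[current], c) for c in coords]


print(greedy_decode(nearest_scores, len(coords)))  # [0, 2, 3, 1]
```

Because no sampling is involved, repeated runs on the same instance return the same tour, which is what makes the comparison independent of the sampling strategy.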
4.2. Comparison Results
Table 2 presents a comparison between our proposed E-MTAM and baseline methods across three typical vehicle routing problems: (1) TSP; (2) CVRP; and (3) OP. Our results demonstrate that E-MTAM consistently outperforms existing state-of-the-art baseline methods. For TSP, E-MTAM achieves the best optimality gaps of 0.26%, 0.35%, and 0.77% for problem sizes of 20, 50, and 100 nodes, respectively. Notably, as the problem size increases, the advantage of E-MTAM becomes more pronounced. For the most challenging TSP100, E-MTAM reduces the gap by 0.65% compared to the second-best method, POMO. This significant advantage is also observed in the CVRP experiments, where E-MTAM achieves gap reductions of 4.26%, 5.11%, and 6.39% compared to the milestone method (AM) for CVRP20, CVRP50, and CVRP100, respectively. In comparison to heuristic algorithms (i.e., ACO and EMA), E-MTAM attains gap reductions of 7.93% and 6.84% for CVRP100 while requiring less computational time. For OP, E-MTAM clearly outperforms other DRL-based methods, achieving the highest reward values of 5.35, 16.06, and 32.85 for OP20, OP50, and OP100, respectively.
Regarding running time, while the encoder introduces additional computational overhead due to the aggregation of edge information, and the multi-decoder architecture of E-MTAM results in slightly slower computation compared to AM, E-MTAM still outperforms exact and heuristic algorithms by orders of magnitude in terms of speed. Furthermore, compared to other DRL-based methods, E-MTAM strikes a better balance between effectiveness and efficiency.
To facilitate a more comprehensive comparison, we present the training curves of E-MTAM and AM on CVRP20, CVRP50, and CVRP100. As shown in Figure 4, the average tour length of both models steadily decreases during the training process and eventually converges in the later stages, demonstrating the effectiveness of our training framework. Furthermore, E-MTAM exhibits a faster learning speed, although it does not attain a shorter tour length in the early stages (especially on CVRP20 and CVRP50). E-MTAM ultimately delivers superior results across all problems, approaching the best results attained by LKH (indicated by the horizontal dashed line). Notably, as the problem size increases, the advantages of E-MTAM become more pronounced.
4.3. Generalization Analysis
In practical applications, the number of nodes in problems can vary significantly, making it impractical to train a model from scratch for each possible node configuration. Therefore, models trained on given problems should exhibit generalization capabilities, allowing them to perform well on instances with different node scales. To this end, we analyze the generalization ability of E-MTAM on TSP, CVRP, and OP, with AM serving as the baseline. Specifically, the performance of models trained on problems with 20 and 50 nodes is evaluated across problems with 20, 50, and 100 nodes, as presented in Table 3.
The experimental results demonstrate that our E-MTAM consistently outperforms AM in terms of generalization ability across problems of all scales. Specifically, when models trained on TSP20 are tested on TSP50, E-MTAM achieves a reduction in the optimality gap by 0.53% compared to AM. This reduction reaches 0.87% and 1.23% on CVRP and OP, respectively. Notably, although models trained and tested on the same problem scale perform better than those on different scales, models trained on 50 nodes typically outperform those trained on 20 nodes when confronted with larger problem sizes. In tests on TSP100, CVRP100, and OP100, the best optimality gaps of 4.51%, 4.60%, and 6.51%, respectively, are all achieved by E-MTAM trained on 50 nodes. These results underscore the exceptional generalization capability of the proposed method, highlighting its effectiveness in addressing the complex routing problems encountered in real-world applications.
4.4. Ablation Analysis
4.4.1. Effect of Each Component
The proposed E-MTAM incorporates three innovative components: (1) the EDMHA block; (2) dynamic graph embedding; and (3) the multi-decoder-based JS loss. To investigate the individual and combined effects of these components on model performance, we conduct ablation experiments on TSP, CVRP, and OP, as shown in Table 4, Table 5, and Table 6. Specifically, we start with a vanilla model and progressively integrate the components, evaluating the model’s performance on different problems at each stage. Taking TSP100 as an example, individually adding the EDMHA block, dynamic graph embedding, and JS loss to the vanilla model yields improved routing performance with gaps of 2.45%, 3.22%, and 2.84%, respectively. These correspond to absolute gap reductions of 1.80%, 1.03%, and 1.41% compared to the baseline. For CVRP100, the individual addition of each component leads to gap reductions of 4.92%, 3.77%, and 4.47%, respectively. In the case of OP, the complete model demonstrates gap reductions of 0.97%, 0.58%, and 0.82% for instances with 20, 50, and 100 nodes, respectively, relative to the baseline. Moreover, paired combinations of these components yield further improvements in performance compared to their individual contributions. These results substantiate the efficacy of incorporating the proposed components in enhancing the performance of E-MTAM.
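As background for the JS loss component, the sketch below computes a Jensen-Shannon divergence among the next-node distributions produced by several decoders. It uses the generalized, uniformly weighted form (entropy of the mixture minus the mean entropy), which is an assumption; how E-MTAM weights this term and combines it with the policy-gradient loss is not restated here, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)


def js_divergence(dists: Sequence[Sequence[float]]) -> float:
    """Generalized JS divergence with uniform weights: H(mean) - mean(H)."""
    k = len(dists)
    mixture: List[float] = [
        sum(p[i] for p in dists) / k for i in range(len(dists[0]))
    ]
    return entropy(mixture) - sum(entropy(p) for p in dists) / k


# Identical decoders: no diversity.
print(js_divergence([[0.7, 0.3], [0.7, 0.3]]))  # 0.0
# Decoders that always pick different nodes: maximal diversity for k = 2.
print(js_divergence([[1.0, 0.0], [0.0, 1.0]]))  # ln 2, about 0.693
```

A regularizer that rewards larger values of this quantity pushes the decoders toward different node choices, which matches the stated goal of generating diverse trajectories.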
4.4.2. Effect of Hyperparameters
We follow the work of [25] to conduct a sensitivity analysis under different learning rates and random seed configurations. Table 7 presents the experimental results for Random Seeds 1234 and 1235 under two learning rate strategies. The results show that, across TSPs, CVRPs, and OPs of varying scales, the variations in average tour lengths and optimality gaps are minimal, demonstrating the robustness and reliability of our method under different hyperparameter settings. Taking CVRP100 as an example, the difference in the optimality gap for E-MTAM across different settings does not exceed 0.19%.
5. Conclusions
In this work, we propose a novel Edge-Driven Multiple Trajectory Attention Model (E-MTAM) to address VRPs of different scales. Our model is built upon the existing encoder–decoder architecture and employs a deep reinforcement learning approach for routing, eliminating the reliance on manually designed rules. We introduce three pivotal innovations to further improve the routing performance of the model. First, we integrate edge information into the encoder via an edge-driven multi-head attention block, enhancing the model’s ability to capture the graph’s topological structure. Second, we combine visitation information with graph embeddings to incorporate dynamic updates, enabling the decoder to adapt to real-time changes in the graph. Finally, we employ a multi-decoder architecture with a regularization term, encouraging the generation of diverse trajectories and thereby fostering a more comprehensive exploration of the solution space.
We conduct extensive experiments on three types of routing problems: TSP, CVRP, and OP. We evaluate the performance of E-MTAM against a range of traditional and DRL-based methods. The comparison results demonstrate that our proposed E-MTAM significantly outperforms a variety of alternative approaches. Furthermore, generalization experiments demonstrate that our model exhibits stronger generalization capabilities than the baseline method on larger-scale problems. Finally, ablation studies validate the effectiveness of the three key improvements we introduced. Regarding future work, we acknowledge that there remains a gap between our E-MTAM and exact algorithms. On the one hand, the utilization of graph topology information in our model is still not fully optimized. Therefore, we will focus on exploring the underlying structure of the graph to enhance this aspect. On the other hand, we plan to incorporate other novel techniques to further optimize our encoder–decoder framework and to extend this solution to more complex variants of the VRP.