Article

A Reinforcement Learning Framework for Scalable Partitioning and Optimization of Large-Scale Capacitated Vehicle Routing Problems

by Chaima Ayachi Amar 1,2,*, Khadra Bouanane 1,2 and Oussama Aiadi 1,2

1 Department of Computer Science and Information Technologies, University Kasdi Merbah Ouargla, Ouargla 30000, Algeria
2 Laboratory of Artificial Intelligence and Information Technologies LINATI, Ouargla 30000, Algeria
* Author to whom correspondence should be addressed.
Electronics 2025, 14(19), 3879; https://doi.org/10.3390/electronics14193879
Submission received: 26 August 2025 / Revised: 26 September 2025 / Accepted: 28 September 2025 / Published: 29 September 2025
(This article belongs to the Section Artificial Intelligence)

Abstract

The Capacitated Vehicle Routing Problem (CVRP) is a central challenge in combinatorial optimization, with critical applications in logistics and transportation. Traditional methods struggle with large-scale instances due to their computational demands, while learned construction models often suffer from degraded solution quality and constraint violations. This work proposes SPORL, a Scalable Partitioning and Optimization via Reinforcement Learning framework for large-scale CVRPs. SPORL decomposes the problem using a learned partitioning strategy, followed by parallel subproblem solving, and employs a greedy decoding scheme at inference to ensure scalability for instances with up to 1000 customers. A key innovation is a context-based attention mechanism that incorporates sub-route embeddings, enabling more informed and constraint-aware partitioning decisions. Extensive experiments on benchmark datasets with up to 1000 customers demonstrated that SPORL consistently outperformed state-of-the-art learning-based baselines (e.g., AM, POMO) and achieved competitive performance relative to strong heuristics such as LKH3, while reducing inference time from hours to seconds. Ablation studies confirmed the critical role of the proposed context embedding and decoding strategy in achieving high solution quality.

1. Introduction

The routing problem (RP) is a fundamental challenge in combinatorial optimization, with diverse applications in logistics, communication networks, and transportation systems [1,2,3,4,5,6,7]. It involves determining the optimal routes for vehicles to deliver services to a set of locations, while minimizing operational costs, such as distance or time. Variants of the routing problem often introduce additional constraints, such as capacity limitations and time windows [8], or vehicle fleet restrictions, making these problems computationally complex and difficult to solve, especially as the problem size grows.
A prominent variant of the routing problem is the Capacitated Vehicle Routing Problem (CVRP), which requires planning routes for a fleet of vehicles to serve a set of customers, while respecting vehicle capacity constraints. The CVRP is a cornerstone of logistics optimization and has been extensively studied, due to its real-world importance and theoretical complexity. As problem instances scale to hundreds or thousands of customers, solving them becomes increasingly difficult.
Exact methods [9] such as Branch and Bound [10], Branch and Cut [11], Column Generation [12], and commercial solvers like Gurobi and CPLEX can provide globally optimal solutions for small-scale VRPs with theoretical guarantees. However, due to the NP-hard nature of the problem [1], these methods become computationally expensive and impractical for large-scale instances.
Classical heuristics and metaheuristic solvers can efficiently produce near-optimal solutions for small-scale VRPs and are more scalable. Several solvers, such as OR-Tools [13], LKH3 [14], the Clarke–Wright (CW) parallel-savings heuristic [15], and Hybrid Genetic Search (HGS) [16,17], have been adapted to tackle large-scale problems. Nonetheless, achieving high-quality solutions often requires a significant number of iterations, making real-time solving of large-scale VRPs a persistent challenge.
In recent years, machine learning (ML) has emerged as a promising alternative for solving routing problems [18,19,20,21,22,23,24,25,26,27,28]. Deep learning methods, particularly Transformer-based architectures [29], have shown potential as solvers that generate solutions for various problem instances by leveraging end-to-end learning and reinforcement learning techniques, which align well with the VRP's sequential decision-making nature. These approaches aim to replace traditional heuristics with data-driven models that can generalize across different VRP scenarios. They generally fall into two main categories: learning to construct and learning to improve. Learn-to-construct methods build feasible solutions from scratch, one node at a time, often using pointer-based or attention-based architectures [19,20,21,22,30], whereas learn-to-improve methods start with an initial solution and iteratively refine it using learned modification strategies [24,25,26,27,28].
In the learn-to-construct paradigm, two main scenarios are commonly considered. The first is a two-phase approach, where a reinforcement learning model is utilized to train a transformer model to partition the large problem into smaller, more manageable subproblems. This partitioning respects constraints such as fleet size and capacity. In the second phase, these sub-problems are solved in parallel, each treated as an independent routing problem in the form of a traveling salesman problem (TSP). This decomposition not only reduces computational complexity, but also enhances overall solution quality. The second scenario follows a sequential construction strategy, where sub-tours are built greedily by selecting and adding nodes one at a time based on the current state and learned policy.
The learn-to-improve paradigm typically yields high-quality solutions through iterative refinement. However, its increased computational demands often make it less suitable for large-scale instances, where efficiency is critical. Learn-to-construct models provide more efficient and scalable inference by generating solutions from scratch; however, when applied to large-scale instances, they often struggle to deliver high-quality solutions and frequently fail to ensure compliance with critical constraints, such as fleet-size limitations. For large-scale VRPs, existing approaches within the learn-to-construct paradigm often rely on Transformer-based architectures. Applying these models to real-world-sized instances remains particularly challenging, due to the quadratic memory and computation complexity of self-attention mechanisms with respect to the number of nodes. Consequently, most existing works [19,20,21,22] have been evaluated on relatively small instances, typically with fewer than 200 customers.
In recent work, ref. [30] attempted to extend learned heuristics to large-scale CVRP instances. However, the model's decisions are based on very limited local information, which prevents it from capturing the broader context of the routing problem. Without access to local structural information for each sub-route, the model may overlook better routing options, resulting in inefficient tours and degraded performance, particularly in complex or large-scale scenarios.
To tackle these challenges, we integrate the strengths of traditional heuristic approaches with the fast inference capabilities of learning-based models. We propose SPORL, a Scalable Partitioning and Optimization via Reinforcement Learning framework, designed specifically for large-scale CVRP instances. SPORL follows a two-phase learning approach: the first phase focuses on partitioning the original problem into smaller subproblems, and the second phase solves them in parallel using learned policies or classical solvers. It is important to emphasize that SPORL is a solver-agnostic framework. Its core contribution is the learned partitioning policy. To demonstrate its versatility and evaluate its performance, we implemented it with different backend solvers—including the Attention Model (AM), POMO, and the LKH3 heuristic—denoted as SPORL-AM, SPORL-POMO, and SPORL-LKH3, respectively. To enhance the quality of the model's decisions during the partitioning phase, we adopt a context-based attention mechanism that extracts relevant information from the input graph. Specifically, we introduce a sub-route embedding within each partial tour, which enriches the context representation. This enables the attention mechanism to focus on meaningful local structures, leading to more informed decisions that improve both the coherence of subproblems and the overall solution quality.
Unlike previous work [30], which provided very limited information about the existing sub-routes, this additional information provides a more comprehensive representation of the current state, allowing the model to make more informed routing decisions. Specifically, it helps the model understand the spatial distribution of visited customers, which is crucial for handling large-scale instances.
To validate the effectiveness of our proposed approach, we conducted extensive experiments on large-scale CVRP benchmarks involving up to 1000 customers. Our results demonstrate that SPORL achieved a compelling balance between solution quality and computational efficiency, outperforming several state-of-the-art learning-based baselines and competitive heuristics. Furthermore, we performed an ablation study to isolate the contribution of key components, such as the context-based attention mechanism, confirming their critical role in improving partition quality and ensuring constraint satisfaction. These findings highlight the potential of SPORL as a scalable and practical solution for real-world vehicle routing applications. We summarize the main contributions of this work as follows:
  • We propose SPORL, a novel two-phase RL framework that effectively decomposes large-scale CVRP instances into manageable subproblems for parallel solving, addressing the critical scalability challenge.
  • We introduce a context-based attention mechanism enhanced with sub-route embeddings. This innovation provides a richer representation of the local solution structure during partitioning, leading to more informed and constraint-aware decisions than the previous method [30].
  • We demonstrate through extensive experiments that SPORL achieves a superior balance between solution quality and computation time. It significantly outperforms pure learning-based methods (AM, POMO) and enables classical solvers like LKH3 to find high-quality solutions orders of magnitude faster.
  • We provide comprehensive ablation studies and analyses that validate the effectiveness of our core components and offer insights into the performance characteristics of hybrid learning-optimization systems.
Our experimental results demonstrate that SPORL provides a practical solution for real-world logistics challenges. The scale of problems addressed in this work (300 to 1000+ nodes) is highly relevant to modern logistics operations. Major e-commerce and logistics firms such as Amazon, UPS, and DHL manage delivery networks where daily route planning involves serving thousands of customers across a city or region. For these companies, achieving high solution quality within a short computation time is critical for operational efficiency. SPORL’s ability to generate near-optimal solutions in seconds, as demonstrated in our experiments, directly addresses this need for scalability and speed in practical applications.

2. Related Work

The Capacitated Vehicle Routing Problem (CVRP) is a classical NP-hard combinatorial optimization problem with widespread applications in logistics and transportation. Traditional exact algorithms such as Branch-and-Bound [10], Branch-and-Cut [11], Column Generation [12], and commercial solvers like Gurobi and CPLEX [9] can compute globally optimal solutions for small-scale instances with theoretical guarantees. However, their exponential complexity renders them impractical for large-scale instances. As a result, various heuristic and metaheuristic approaches, including OR-Tools [13], LKH3 [14], the Clarke–Wright savings heuristic [15], and Hybrid Genetic Search (HGS) [16,17], offer more scalable alternatives. These methods can produce near-optimal solutions efficiently, but often require many iterations or problem-specific tuning to achieve high solution quality on large-scale problems, making real-time deployment challenging. Recent developments in machine learning (ML) have introduced a new class of heuristics, typically categorized into learn-to-construct and learn-to-improve paradigms.
Learn-to-improve methods start with an initial solution and iteratively refine it. For example, ref. [31] proposed a two-stage framework involving a region-picking policy and a rule-picking policy to guide local modifications. Similarly, ref. [24] introduced Efficient Active Search (EAS), which fine-tunes specific model parameters during inference to enhance performance. In large-scale settings, ref. [26] proposed a hybrid framework that tackles large instances by first decomposing the problem into smaller subproblems using a strategy inspired by the POPMUSIC metaheuristic [32]. A reinforcement learning-based heuristic is then applied to solve each subproblem, and the resulting sub-solutions are integrated using a set-partitioning-based MILP [33] solver. The learn-to-delegate approach in [27], inspired by POPMUSIC [32], iteratively merges or improves sub-routes using classical solvers like LKH3. While learn-to-improve approaches often yield high solution quality, they come with higher computational costs, which limits their scalability [24].
Learn-to-construct approaches aim to generate feasible solutions from scratch in a sequential manner. Early works such as the Pointer Network-based model of [19] and the attention-based model of [21] demonstrated how policy gradient and reinforcement learning techniques could be used to solve routing problems like TSP and CVRP. Subsequent studies [20,22,23,24] extended these methods using Transformer architectures to enhance solution quality and generalization. However, due to the quadratic memory and computation requirements of self-attention, these models have mostly been limited to small-scale instances (typically fewer than 200 nodes). To improve scalability for solving large-scale CVRPs, recent learn-to-construct methods have adopted a two-phase strategy, where the original instance is partitioned into smaller subproblems [30]. Each subinstance is then solved using classical TSP solvers or reinforcement learning policies, and the resulting partial solutions are merged into a global route. This decomposition-based approach significantly reduces computational complexity and supports parallelism, making it attractive for real-time applications. Hou et al. [30] advance this paradigm by introducing a hierarchical partitioning framework along with a Global Mask Function, a key mechanism that ensures global feasibility during route generation. This function enforces vehicle capacity constraints and prevents subtour formation by masking out invalid nodes based on global routing context. Their theoretical analysis proves that this masking strategy is both necessary and sufficient to guarantee valid CVRP solutions at inference time. While effective in enforcing constraints, their method still exhibits limitations in context modeling. The routing and partitioning decisions are largely based on general features (e.g., global graph information and the first and last visited nodes), without incorporating a deeper understanding of the local structure within the sub-routes of the instance. As a result, the merging step may miss better global configurations, especially in cases of irregular or unbalanced customer distributions.
Our contribution builds upon this line of work by proposing SPORL (Scalable Partitioning and Optimization via Reinforcement Learning), a hybrid two-phase learning framework that combines the fast inference of constructive methods with the structural advantages of graph decomposition. We focus on the partitioning of large CVRP instances using a transformer-based model guided by reinforcement learning. What distinguishes SPORL is the introduction of enriched context embeddings, particularly the incorporation of sub-route embeddings, to capture the local route structure more effectively. This enhanced context enables more effective and constraint-compliant partitioning, resulting in high-quality solutions, while maintaining scalability. Unlike previous methods, SPORL leverages contextual information across partial routes, offering a more informed and adaptive solution construction process.

3. Problem Formulation

3.1. Capacitated Vehicle Routing Problem

The Capacitated Vehicle Routing Problem (CVRP) is defined on an undirected graph $G = (V, E)$, where $V = \{0, 1, \dots, n\}$ represents a set of $n + 1$ vertices, and $E$ denotes the set of edges. Vertex 0 corresponds to the depot, while the remaining vertices $V' = V \setminus \{0\}$ represent $n$ customer locations. Each edge $\{i, j\} \in E$ is associated with a non-negative travel cost $c_{ij}$. Every customer $i \in V'$ has a positive demand $d_i$ that must be fulfilled. A fleet of $m$ identical vehicles, each with a maximum capacity $Q$, is stationed at the depot and is responsible for serving the customer demands. Each vehicle is required to start and end its route at the depot. A route $r$ is defined as an elementary cycle $(0, i_1, \dots, i_h, 0)$ in the graph $G$, where the total demand of the customers visited along the route satisfies the capacity constraint:
$$\sum_{i \in r} d_i \le Q \qquad \forall r \in R \tag{1}$$
where $R$ is the set of all possible routes. The cost $c_r$ of a route is determined by the total cost of traversing the edges in $r$, equivalent to solving a Traveling Salesman Problem (TSP) over the subset of nodes in the route. The objective of the CVRP is to determine a set of such routes that collectively serve all customers, satisfy the vehicle capacity constraints, and minimize the total routing cost.
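For concreteness, the following minimal Python sketch (our own illustration; the class and function names are not from the paper) encodes a CVRP instance, the cost of a candidate route, and the capacity check of Equation (1).
```python
from dataclasses import dataclass
import math

@dataclass
class CVRPInstance:
    coords: list      # (x, y) per vertex; index 0 is the depot
    demands: list     # d_i per vertex; demands[0] = 0 for the depot
    capacity: float   # vehicle capacity Q
    fleet_size: int   # number of identical vehicles m

def route_cost(inst: CVRPInstance, route: list) -> float:
    """Cost of the elementary cycle (0, i_1, ..., i_h, 0)."""
    tour = [0] + route + [0]
    return sum(math.dist(inst.coords[a], inst.coords[b])
               for a, b in zip(tour, tour[1:]))

def route_is_feasible(inst: CVRPInstance, route: list) -> bool:
    """Capacity constraint (1): total route demand must not exceed Q."""
    return sum(inst.demands[i] for i in route) <= inst.capacity
```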

3.2. Set Partitioning Formulation for CVRP

The CVRP can be formulated as a set partitioning problem [8], a well-known NP-hard optimization problem, in which each feasible vehicle route is associated with a binary decision variable. Let $R$ denote the set of all possible routes. For each customer vertex $i \in V'$ and route $r \in R$, define the binary parameter $a_{ir}$ such that $a_{ir} = 1$ if customer $i$ is served in route $r$, and 0 otherwise. The depot node is included in every route, which implies $a_{0r} = 1, \forall r \in R$. Each route $r \in R$ has an associated cost $c_r$, representing the total travel cost of that route. Let $x_r$ be a binary decision variable equal to 1 if route $r$ is selected in the final solution, and 0 otherwise. The mathematical formulation of the set partitioning for CVRP is given below:
$$\min \sum_{r \in R} c_r x_r \tag{2}$$
$$\text{s.t.} \quad \sum_{r \in R} a_{ir} x_r = 1 \qquad \forall i \in V', \tag{3}$$
$$\sum_{r \in R} x_r \le m, \tag{4}$$
$$x_r \in \{0, 1\} \qquad \forall r \in R, \tag{5}$$
$$a_{ir} \in \{0, 1\} \qquad \forall r \in R,\ i \in V'. \tag{6}$$
Constraint (3) guarantees that each customer is included in exactly one selected route, whereas constraint (4) limits the total number of selected routes to at most m.
Although set partitioning offers a general and flexible framework for modeling the CVRP, including the incorporation of additional constraints such as capacity constraints, the formulation becomes computationally infeasible for large instances, due to the exponential growth in the number of feasible routes with respect to the number of customers. In practice, only a small fraction of these routes contribute to an optimal or near-optimal solution [34]. Therefore, solving the set partitioning problem efficiently relies on identifying a high-quality subset of feasible routes. For large-scale problems, constructing this subset effectively is essential for tractability. Our contribution addresses this challenge by introducing a learning-based heuristic to guide the construction of candidate routes. Rather than relying on handcrafted rules or random sampling, our method leverages a neural policy trained to select feasible, cost-effective routes that are likely to contribute to high-quality solutions. By learning to anticipate which routes are most valuable, we can drastically reduce the search space while maintaining solution quality, enabling the practical use of set partitioning even in large-scale CVRP instances.
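To make the formulation concrete, the sketch below solves model (2)–(6) over a pre-generated pool of candidate routes. PuLP with its bundled CBC solver is our assumption for illustration only; the paper's contribution is precisely the learned policy that supplies a high-quality route pool.
```python
import pulp

def solve_set_partitioning(routes, costs, customers, m):
    """routes: list of customer sets; costs: c_r per route; m: fleet size."""
    prob = pulp.LpProblem("CVRP_SetPartitioning", pulp.LpMinimize)
    x = [pulp.LpVariable(f"x_{r}", cat="Binary") for r in range(len(routes))]
    # Objective (2): minimize the total cost of the selected routes
    prob += pulp.lpSum(costs[r] * x[r] for r in range(len(routes)))
    # Constraint (3): each customer is covered by exactly one selected route
    for i in customers:
        prob += pulp.lpSum(x[r] for r, rt in enumerate(routes) if i in rt) == 1
    # Constraint (4): at most m routes may be selected
    prob += pulp.lpSum(x) <= m
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [routes[r] for r in range(len(routes)) if x[r].value() == 1]
```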

4. Reinforcement Learning for the Partitioning Problem

The core idea of SPORL is to learn the best partitioning strategy under fleet size constraints, while the node sequence inside each partition is handled in the second stage. The partitioning model is built on the Attention Model, following [21]: the mask and the decoder context are redefined according to our modifications, while the encoder and the input remain the same as in [21]. The model is trained with reinforcement learning, and the partitioning task is formulated as a Markov Decision Process (MDP).

4.1. MDP Definition and Alignment with CVRP

The Capacitated Vehicle Routing Problem (CVRP) can be naturally cast as a sequential decision-making process: at each step, an agent selects a customer to be assigned to the current route, such that feasibility is preserved and the global objective of minimizing travel cost is optimized. This sequential structure maps directly to a Markov Decision Process (MDP) $(S, A, P, R)$, following formulations established in [21,30], which demonstrated the suitability of MDPs for routing and combinatorial optimization tasks. Our formulation extends these foundations to explicitly incorporate the partitioning perspective.
  • State $s_t \in S$: The state at time $t$ encodes the set of unassigned customers, the remaining vehicle capacity of the current route, and the partial sub-routes constructed so far.
  • Action $a_t \in A(s_t)$: The agent selects the next unvisited customer (or the depot) to extend or close the current sub-route, thereby incrementally contributing to a partition of the original customer set.
  • Reward $R$: The episodic reward is defined as the negative total travel distance once all customers have been assigned and all routes have been completed. Infeasible solutions (e.g., exceeding the fleet size $m$) receive a large negative penalty.
  • Policy $p_\theta(a|s)$: The reinforcement learning agent learns a stochastic policy that maps states $s$ to partitioning decisions, improving over time through accumulated experience. The policy over a complete solution $\pi$ can be factorized and parameterized as
    $$p_\theta(\pi|s) = \prod_{t=1}^{T} p_\theta(\pi_t \mid s, \pi_{1:t-1}) \tag{7}$$
    where $\pi_t$ denotes the action selected at step $t$, conditioned on the state $s$ and the sequence of past actions $\pi_{1:t-1}$ (a sampling sketch is given at the end of this subsection).
This MDP formulation aligns with the set partitioning view of the CVRP, where feasible routes correspond to sequences of valid actions and the terminal reward represents the cost of a selected subset of routes.
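As a concrete illustration of the factorization in Equation (7), the sketch below (our own PyTorch scaffolding; `policy`, `state.apply`, and `state.done` are assumed interfaces, not the released implementation) samples a complete partition and accumulates the log-probability needed later for the REINFORCE update.
```python
import torch

def sample_partition(policy, state, max_steps):
    """Sample pi ~ p_theta(pi|s) by chaining per-step decisions (Eq. (7))."""
    actions, log_probs = [], []
    for _ in range(max_steps):
        probs = policy(state)                        # p_theta(pi_t | s, pi_1:t-1)
        dist = torch.distributions.Categorical(probs)
        a = dist.sample()                            # depot (0) closes the sub-route
        log_probs.append(dist.log_prob(a))
        actions.append(int(a))
        state = state.apply(int(a))                  # assumed transition hook
        if state.done():                             # all customers assigned
            break
    return actions, torch.stack(log_probs).sum()     # log p_theta(pi | s)
```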

4.2. Model Architecture for Partitioning Problem

SPORL consists of an attention-based encoder–decoder model that defines a stochastic policy $p_\theta(\pi|s)$ for selecting a partitioning solution $\pi$ given a problem instance $s$. The architecture comprises the following components:
  • Encoder: The encoder computes the initial $d_h$-dimensional node embeddings, denoted as $h_i^N$, from the input $x_i \in \mathbb{R}^3$, which consists of the node's coordinates and demand, along with the graph embedding $\bar{h}_g = \frac{1}{n+1} \sum_{i=0}^{n} h_i^N$. Following the Transformer architecture [29] and the Attention Model in [21], both the node embeddings $h_i^N$ and the graph embedding $\bar{h}_g$ serve as inputs to the decoder.
  • Decoder: The decoder follows the same architecture as proposed in [21], generating solutions sequentially. At each time step $t$, it selects a node for each sub-route based on the encoder embeddings and the previously selected nodes. The decoding process utilizes a special context node $c$ to facilitate efficient decision-making. Final selection probabilities are computed using a single-head attention mechanism:
    $$p(\pi_t \mid s_t) = \mathrm{softmax}(u_{c,i}) \tag{8}$$
    $$u_{c,i} = \begin{cases} \tanh\!\left(\dfrac{q_c^{t\top} k_i}{\sqrt{d_k}}\right) & \text{if } i \text{ is unvisited and feasible} \\ -\infty & \text{otherwise} \end{cases} \tag{9}$$
    where $q_c^t = W^Q h_c^t$ is the context query at time $t$; $k_i = W^K h_i$ is the key; $v_i = W^V h_i$ is the value; $W^Q$, $W^K$, and $W^V$ are learned parameters; and $d_k$ is the dimension of the hidden state. This masking strategy ensures that capacity constraints are never violated during the construction process. The fleet size constraint is enforced naturally by the process itself: the episode terminates once all nodes have been visited. If the number of routes created equals the fleet size ($m + 1$) but unvisited nodes remain, the solution is invalid and receives a large negative reward during training. This incentivizes the RL agent to learn a policy that partitions the graph into at most $m + 1$ capacity-feasible clusters (a sketch of this masked selection step is given after this list).
  • Capacity constraints: To enforce the vehicle capacity constraint $Q$, the model must track the remaining capacity throughout the route construction process. We define a state variable $\delta_t^r$ representing the normalized remaining capacity of the current vehicle at time step $t$, where a value of 1 represents full capacity and 0 represents an empty vehicle. This variable is initialized and updated as follows:
    $$\delta_{t+1}^r = \begin{cases} \max\!\left(0,\ \delta_t^r - \frac{d_i}{Q}\right) & \delta_t^r > 0 \\ 1 & \delta_t^r = 0 \text{ or } t = 1 \end{cases} \tag{10}$$
    The update rule decreases the normalized capacity by the relative demand of the served customer ($\frac{d_i}{Q}$). The value is reset to 1 when the vehicle's capacity is exhausted ($\delta_t^r = 0$) or at the start of a new route ($t = 1$), signifying a new vehicle beginning with full capacity. This check ensures that only feasible actions are available for selection, guaranteeing that no vehicle is ever overloaded (the update is transcribed in the sketch after this list).
  • Context Embedding: The context node $\bar{h}_c$ represents the current state of the decoding process. The context node $c$ of the decoder at time $t$ is built from the encoder output and the decisions made up to time $t$. In existing works [30], the context embedding typically consists of the graph embedding $\bar{h}_g$ and the embeddings of the first and last selected nodes (e.g., $h_{\pi_0}$, $h_{\pi_{t-1}}$), formulated as
    $$\bar{h}_c = [\bar{h}_g, h_{\pi_0}, h_{\pi_{t-1}}]$$
    However, this formulation lacks specific information about the local structure of the sub-routes being constructed. As a result, it provides insufficient contextual signals to the decoder, which may lead to suboptimal routing decisions, especially in complex scenarios with multiple interacting sub-tours.
    To address this limitation, we provide sufficient information to the decoder at any time step $t$ by extending the context node $\bar{h}_c$. In our case, it consists of the graph embedding $\bar{h}_g$, the embedding of the subgraph containing the nodes of the current sub-route, $\bar{h}_t^r$, and the previous (last) node $\pi_{t-1}$. However, in cases where the vehicle capacity is saturated or $t = 1$, the context embedding consists of the node embedding of the depot $h_0$ and a randomly selected unvisited node $h_t^r$, with the remaining capacity set as in (10). Formally, the context embedding is defined as
    $$\bar{h}_c = \begin{cases} [\bar{h}_g, \bar{h}_t^r, h_{\pi_{t-1}}, \delta_t^r] & \delta_t^r > 0 \\ [\bar{h}_g, h_0, h_t^r, \delta_t^r] & \delta_t^r = 0 \end{cases} \tag{11}$$
    where $\delta_t^r$ is the remaining vehicle capacity, and $\bar{h}_t^r$ is the embedding of the current sub-route, computed as the average of the node embeddings of the customers already visited on that route:
    $$\bar{h}_t^r = \frac{1}{|R_t|} \sum_{i \in R_t} h_i \tag{12}$$
    The set $R_t$ contains all nodes assigned to the current vehicle's tour, forming a partial solution. This averaging produces a compact representation that encodes the structural properties of the sub-route, effectively summarizing its current state. By integrating this rich contextual information, the decoder gains awareness of the local structure of the evolving solution, leading to more adaptive decision-making and significantly improved partition quality (see the sketch after this list).
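To make the decoder mechanics concrete, the following sketches illustrate Equations (8)–(12). They are our own PyTorch illustrations under assumed tensor shapes, not the released implementation. The first computes the masked single-head selection probabilities: visited or infeasible nodes receive a score of $-\infty$, so their probability is exactly zero after the softmax.
```python
import math
import torch

def selection_probs(h_c, h_nodes, W_Q, W_K, feasible_mask):
    """h_c: (d_h,) context; h_nodes: (n+1, d_h); feasible_mask: (n+1,) bool."""
    d_k = W_K.shape[0]
    q_c = W_Q @ h_c                                   # context query q_c^t
    k = h_nodes @ W_K.T                               # one key k_i per node
    u = torch.tanh(k @ q_c / math.sqrt(d_k))          # compatibilities u_{c,i}
    u = u.masked_fill(~feasible_mask, float("-inf"))  # mask visited/infeasible
    return torch.softmax(u, dim=-1)                   # p(pi_t | s_t), Eq. (8)
```
The second sketch transcribes the capacity update of Equation (10) and assembles the context embedding of Equations (11)–(12), where the sub-route embedding is the mean of the node embeddings already assigned to the current vehicle.
```python
import torch

def update_capacity(delta, d_i, Q, t):
    """Normalized remaining-capacity update of Equation (10)."""
    if delta == 0.0 or t == 1:          # exhausted vehicle or new route
        return 1.0                      # reset to full capacity
    return max(0.0, delta - d_i / Q)    # subtract the relative demand d_i/Q

def context_embedding(h_graph, h_nodes, route_nodes, last_node, delta):
    """Context h_bar_c of Equation (11); h_nodes: (n+1, d_h), depot at row 0."""
    if delta > 0 and route_nodes:
        h_sub = h_nodes[route_nodes].mean(dim=0)   # sub-route mean, Eq. (12)
        h_last = h_nodes[last_node]                # last visited node pi_{t-1}
    else:                                          # saturated vehicle or t = 1
        h_sub = h_nodes[0]                         # depot embedding h_0
        h_last = h_nodes[last_node]                # a randomly chosen unvisited node
    cap = h_graph.new_tensor([delta])              # remaining capacity delta_t^r
    return torch.cat([h_graph, h_sub, h_last, cap])
```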

4.3. Training Process

The training process follows a reinforcement learning framework, specifically using the REINFORCE algorithm [35]. The policy network is trained to maximize the expected reward by adjusting the model parameters through gradient-based updates.

4.3.1. Loss Function

The loss function is defined as
$$J(\theta) = \mathbb{E}_{p_\theta(\pi|s)}[R(\pi)] \tag{13}$$
where $R(\pi)$ represents the reward function that evaluates the quality of the solution $\pi$ based on the total route cost and constraint satisfaction. To ensure that the reward of a selected sub-route $r_i$ remains invariant to the order of nodes within that sub-route, we propose using its optimal route length as the reward:
$$R_i = \min_{\sigma \in S(r_i)} \mathrm{dist}(\sigma) \tag{14}$$
where $r_i$ represents a selected sub-route, $S(r_i)$ denotes the set of all possible permutations of nodes within $r_i$, and $\mathrm{dist}(\sigma)$ calculates the total travel distance for a given sequence $\sigma$. This formulation ensures that the reward assigned to $r_i$ is invariant to the order of the nodes by selecting the optimal sequence with the minimum possible distance. The optimization method used to compute $\min \mathrm{dist}(\sigma)$ can be an exact solver, a traditional heuristic, or a learned heuristic, depending on the computational constraints and performance trade-offs. The accumulated reward $R(\pi)$ is then computed as
$$R(\pi) = \sum_{i=1}^{m} R_i \tag{15}$$
where $m$ denotes the total number of routes generated. This accumulated reward is used to train our policy.
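As an illustration of Equations (14)–(15), the brute-force sketch below enumerates all permutations of a sub-route; this is tractable only for very small sub-routes, and in practice this role is played by a TSP solver (AM, POMO, or LKH3), as described in Section 4.3.4.
```python
import itertools
import math

def subroute_reward(coords, route):
    """R_i = min over sigma in S(r_i) of dist(sigma); depot is vertex 0."""
    best = math.inf
    for perm in itertools.permutations(route):   # factorial cost: tiny routes only
        tour = (0,) + perm + (0,)
        dist = sum(math.dist(coords[a], coords[b])
                   for a, b in zip(tour, tour[1:]))
        best = min(best, dist)
    return best

def total_reward(coords, routes):
    """Accumulated reward R(pi) = sum of R_i over all sub-routes (Eq. (15))."""
    return sum(subroute_reward(coords, r) for r in routes)
```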

4.3.2. Policy Optimization

Gradient updates are performed using
$$\nabla L(\theta) = \mathbb{E}_{p_\theta(\pi|s)}\big[(R(\pi) - b(s))\, \nabla_\theta \log p_\theta(\pi|s)\big] \tag{16}$$
The baseline $b(s)$ is implemented as a greedy rollout of the best policy found so far ($\theta^{BL}$), a standard technique to reduce the variance of the policy gradient estimates and stabilize training [35].
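A minimal PyTorch sketch of this update (our own illustration; tensor shapes are assumptions) is given below: the advantage $(R(\pi) - b(s))$ is detached so that only the log-likelihood term carries gradients.
```python
import torch

def reinforce_loss(log_prob, cost, baseline_cost):
    """Policy-gradient surrogate for Equation (16).

    log_prob: summed log p_theta(pi|s) per sampled partition, shape (B,).
    cost / baseline_cost: route costs of the sampled and greedy solutions.
    """
    advantage = (cost - baseline_cost).detach()   # (R(pi) - b(s)), no gradient
    return (advantage * log_prob).mean()          # minimized by gradient descent

# usage: loss = reinforce_loss(logp, cost_sampled, cost_greedy)
#        loss.backward(); optimizer.step()
```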

4.3.3. Training Procedure

Our model SPORL is trained on a dynamically generated dataset of CVRP instances following [20,21], spanning various graph sizes (the vehicle capacity used when generating training instances of CVRP with 300, 500, and 1000 customers is set to 100, 150, and 300, respectively). The reinforcement learning agent iteratively explores solutions, evaluates rewards, and updates its parameters using stochastic gradient descent. To improve training stability, a moving average baseline is applied to normalize rewards and mitigate variance in gradient updates. Algorithm 1 provides a detailed implementation of our training process. The algorithm begins by initializing the policy network parameters $\theta$ to partition an instance into a set of sub-TSPs, while respecting fleet size constraints. These sub-TSPs are optimized in parallel using the route optimization procedure described in Section 4.3.4. The reward values are computed according to Equation (15), and the policy parameters $\theta$ are updated using the Adam optimizer. This process is repeated iteratively until a satisfactory policy is obtained.
Algorithm 1 Training Algorithm
Require: policy network $p_\theta$, number of epochs $E$, batch size $B$, number of training iterations $T$, route optimization (TSP solver)
Ensure: optimized policy parameters $\theta$
1: Initialize model parameters $\theta$ and baseline $\theta^{BL}$
2: for each epoch $e = 1$ to $E$ do
3:     for each training iteration $t = 1$ to $T$ do
4:         $s_i \leftarrow$ generate random instances, $\forall i \in \{1, \dots, B\}$
5:         $\pi_i \leftarrow$ sample partition $\pi_i$ using $p_\theta$, $\forall i \in \{1, \dots, B\}$
6:         $\pi_i^{BL} \leftarrow$ greedy partition $\pi_i$ using $p_{\theta^{BL}}$, $\forall i \in \{1, \dots, B\}$
7:         Calculate the losses $L(\pi_i)$ and $L(\pi_i^{BL})$ using route optimization according to Equation (15), in parallel, $\forall i \in \{1, \dots, B\}$
8:         $\nabla L \leftarrow \frac{1}{B} \sum_{i=1}^{B} \left( L(\pi_i) - L(\pi_i^{BL}) \right) \nabla_\theta \log p_\theta(\pi \mid s)$
9:         $\theta \leftarrow \mathrm{Adam}(\theta, \nabla L)$
10:        if $\theta$ is better than $\theta^{BL}$ then
11:            $\theta^{BL} \leftarrow \theta$
12:        end if
13:    end for
14: end for

4.3.4. Route Optimization

We perform route optimization within each subgraph to determine the optimal visiting sequence for the customers within each sub-route. We explored both learning-based and traditional methods for route optimization:
  • Learning-based methods: We utilized neural solvers inspired by the Attention Model (AM) proposed by [21] and POMO (policy optimization with multiple optima) introduced by [22]. These models are trained as TSP solvers and leverage deep reinforcement learning to learn efficient routing strategies.
  • Traditional Heuristics: As a baseline, we also considered classical optimization algorithms such as LKH3 (a variant of the Lin–Kernighan heuristic).
To adapt learning-based methods to the sub-problems generated during partitioning, we trained multiple TSP models on datasets of instances with 20, 50, and 100 nodes. During inference, for each sub-TSP, we select the model whose training size is closest to the number of nodes in the sub-problem. This strategy ensures better generalization and performance when solving sub-routes of varying sizes.
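A minimal sketch of this selection rule (the model identifiers are placeholders of our own, not released artifacts):
```python
TSP_MODELS = {20: "tsp20", 50: "tsp50", 100: "tsp100"}  # placeholder names

def pick_tsp_model(num_nodes: int) -> str:
    """Return the pretrained solver whose training size is closest."""
    size = min(TSP_MODELS, key=lambda s: abs(s - num_nodes))
    return TSP_MODELS[size]

# pick_tsp_model(37) -> "tsp50"; pick_tsp_model(140) -> "tsp100"
```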

5. Experiments

In this section, we present the results of the computational experiments conducted to evaluate and compare various learning-based heuristics used to solve the CVRP. Our study emphasized large and complex instances, particularly those from the benchmark datasets of [36]. The primary objective was to assess the performance of our proposed learning-based framework when combined with different route optimization strategies, denoted as SPORL-AM, SPORL-LKH3, and SPORL-POMO. We compared these approaches against state-of-the-art methods, including AM [21] and POMO [22], as well as the classical heuristic LKH3. We present full comparative results on standard benchmarks (Table 1, Table 2, Table 3 and Table 4) to ensure transparency and provide a complete resource for future comparisons. The best results are highlighted in bold. While the recent work proposed by [30] targeted similar scalability challenges, a direct empirical comparison was not feasible due to the unavailability of open-source code or pretrained models. Moreover, the experimental settings and instance formats used in their work differ significantly from those adopted in our study, which makes a fair and consistent comparison challenging. Instead, we performed an ablation study to isolate and evaluate the contribution of our proposed context embedding mechanism. This analysis helped to highlight the importance of incorporating sub-route structural information in improving partition quality and routing performance.

5.1. Experimental Setup

To ensure a fair and reproducible evaluation of our proposed SPORL framework, we designed a comprehensive experimental setup. This section details the hardware environment, hyperparameter configuration, and the training and evaluation protocols.
Hardware Environment: All experiments were conducted on a machine equipped with a single NVIDIA RTX 3090 GPU.
Training Details and Model Hyperparameters: The SPORL model was trained using the REINFORCE algorithm [35] with a moving average baseline to reduce variance. We used the Adam optimizer with a learning rate of $\eta = 10^{-4}$ and a batch size of 512. The model was trained for 200 epochs to ensure a fair and consistent comparison.
Evaluation Protocol: During inference, the trained SPORL policy was applied to the CVRPLIB benchmark instances. For each instance, the partitioning policy was run once to decompose the problem into sub-TSPs, which were then solved in parallel. For SPORL-LKH3, we used the LKH-3 solver. For SPORL-AM and SPORL-POMO, we used the pre-trained models, selecting the model whose training size was closest to the sub-problem size. The final solution cost was the sum of the costs of all sub-routes. The performance was evaluated based on the optimality gap to the Best Known Solution (BKS) and the total computation time, as defined in Section 5.3. The reported results are averaged over five independent runs to ensure statistical reliability.

5.2. Datasets

We evaluated the performance of our approach using CVRPLIB benchmark datasets (http://vrp.galgos.inf.puc-rio.br/index.php/en/, (accessed on 3 September 2025)), provided by [36]. CVRPLIB is widely recognized as the most comprehensive and authoritative benchmark library for the Capacitated VRP. It is the standard for comparing algorithmic performance in leading operations research and machine learning papers [26,30]. Our experiments specifically focused on larger problem instances with more than 300 customers, where scalability becomes a critical factor. To facilitate analysis, we divided the dataset into two categories: medium-sized instances (denoted as XM), comprising 300 to 700 customers, and large instances (denoted as XL), containing 700 to 1000 customers. This partitioning enabled a systematic assessment of our approach’s effectiveness across varying levels of complexity.

5.3. Evaluation Metric

We evaluated the solutions by comparing the solving time and performance with the best known solutions (BKS) available during the study. This comparison is expressed as a gap, calculated using Equation (17), which measures the difference between the objective function value $z$ of the current solution and the objective function value $z_{BKS}$ of the best known solution, as follows:
$$\mathrm{Gap} = \frac{z - z_{BKS}}{z_{BKS}} \times 100 \tag{17}$$
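Equivalently, in code (our direct transcription of Equation (17)):
```python
def optimality_gap(z: float, z_bks: float) -> float:
    """Percentage gap to the best known solution, Equation (17)."""
    return (z - z_bks) / z_bks * 100.0

# optimality_gap(108.5, 100.0) == 8.5  (i.e., 8.5% above the BKS)
```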

5.4. Performance Evaluation

To assess the effectiveness of our proposed learning-based partitioning framework for solving large-scale CVRP instances, we present a summary of the experimental results in this section. Performance was evaluated in terms of the optimality gap relative to the best known solution (BKS), as defined in Equation (17), and the overall computation time. A comparison across different learning-based heuristics revealed significant variations in performance depending on the problem instance and the specific solution strategy employed. Key findings include:
  • Medium-Sized Instances XM:
    Table 1 presents the performance results for medium-sized problem instances in the benchmark set. Our proposed approach, SPORL-LKH3, achieved an average optimality gap of 8.53% on the XM dataset, demonstrating a substantial improvement over the AM and POMO baselines, which yielded gaps of 241.29% and 179.86%, respectively. In particular, incorporating the LKH3 heuristic enabled our method to achieve significantly smaller gaps than the other approaches. Moreover, the computation time in Table 3 is drastically reduced (0.45 s on average) compared to standalone LKH3, which required up to 15 h. Compared to state-of-the-art solvers, SPORL-LKH3 consistently generated the best solutions in under 2 s, clearly outperforming both AM and POMO in terms of solution quality and efficiency.
  • Large-Sized Instances XL:
    Table 2 reports the performance on the larger problem instances from the benchmark set. Consistent with the results observed on medium-sized instances, our method SPORL-LKH3 achieved a mean optimality gap of less than 3.3% relative to the best known solutions. This marks a significant improvement over the AM and POMO heuristics, which exhibited average gaps of 80.4% and 74%, respectively, across all instances in this category.
  • Comparison of Route Optimization Methods:
    To understand the effect of the downstream solver, we compared the performance of SPORL-LKH3, SPORL-AM, and SPORL-POMO. The results in Table 1 and Table 2 show that SPORL-LKH3 consistently struck the best balance between solution quality and computational time. For example, on the X-n701-k44 instance, SPORL-LKH3 achieved a gap of 11.77% in 2.07 s, outperforming SPORL-AM and SPORL-POMO, which achieved gaps of 17.02% and 16.58%, respectively, with similar runtimes in Table 3 and Table 4. This highlights the effectiveness of combining our learned partitioning strategy with a powerful heuristic solver such as LKH3.
Figure 1 and Figure 2 present a comprehensive comparison of solution quality across various configurations of our proposed framework in medium- and large-sized CVRP instances. Figure 1 focuses on medium-sized instances. In Figure 1a, the box plot shows the performance of our framework when combined with the LKH3 routing solver and the performance of AM and POMO. It is evident that SPORL-LKH3 achieved the lowest gap distribution, demonstrating its superiority in producing high-quality solutions. The learning-based heuristics AM and POMO resulted in significantly higher average gaps, with wider distributions and more outliers. This validates the effectiveness of incorporating the LKH3 solver for enhanced route construction after partitioning. Figure 1b compares different variants of our framework when combined with three different routing solvers: LKH3, AM, and POMO (SPORL-LKH3, SPORL-AM, and SPORL-POMO). Although all three versions benefited from our learning-based partitioning strategy, the box plot clearly illustrates that the inclusion of LKH3 as a post-optimization solver yielded more consistent and lower gap solutions. This suggests that the quality of the underlying route optimizer was a critical factor in the final performance.
Figure 2 provides a similar comparison but for large-sized instances. Again, Figure 2a highlights the average gap distribution across the LKH3, AM, and POMO solvers. As with the medium instances, SPORL-LKH3 outperformed the other methods, exhibiting smaller gaps and less variance. In contrast, AM and POMO struggled to scale, showing significant degradation in solution quality as the problem size increased. This reinforces the scalability limitations of purely learning-based heuristics in large problem spaces. Figure 2b evaluates the impact of different route optimizers within our framework for larger instances. Consistent with previous findings, SPORL-LKH3 achieved superior performance, with both lower gaps and a tighter distribution compared to SPORL-AM and SPORL-POMO. This performance gap became more pronounced at scale, emphasizing the importance of hybrid approaches that combine data-driven partitioning with powerful classical solvers.

5.5. Ablation Study: Impact of Context Embedding

To evaluate the contribution of our proposed context embedding mechanism to the overall model performance, we conducted an ablation study comparing two versions of the model: one that used the full context embedding described in Equation (11), with LKH3 for route optimization, and a baseline version that excluded the sub-route embedding and used only the graph embedding and the first and last visited nodes, similarly to previous works such as [30]. The results are summarized in Table 5. On medium-sized instances, removing the sub-route embedding increased the optimality gap from 8.50% to 10.12%, while on large-scale instances the degradation was much more pronounced, with the gap rising from 20.03% to 41.07%.
This comparison highlights an important distinction between Hou et al.'s [30] approach and ours. In Hou et al. [30], the masking function is used to guarantee feasibility: capacity violations are prevented, and a global mask mechanism ensures that the number of routes does not exceed the fleet size. Our method adopts a similar masking strategy, ensuring that every partition respects both capacity and fleet constraints. However, we extend this foundation by enriching the state with sub-route embeddings. This additional local structural awareness enables the policy to make more informed choices within the feasible action space, particularly when forming sub-routes that are both compact and cost-efficient. The results confirm that this enriched context embedding substantially improved solution quality, with only minimal impact on computation time, addressing a fundamental limitation of relying solely on global features (graph, first, and last node).
Empirically, the inference time differed only marginally between the two variants across all tested scales, confirming that the overhead of maintaining sub-route embeddings was small. For example, the average runtime per inference on medium instances was 1.39 s with sub-route embeddings versus 1.07 s without; on large instances, the averages were 2.84 s and 1.20 s with and without sub-route embeddings, respectively. This marginal increase was outweighed by the much lower optimality gaps achieved.

6. Discussion

The experimental results comprehensively demonstrate the effectiveness and advantages of the proposed SPORL framework. Our analysis reveals several key strengths that distinguish it from existing state-of-the-art methods.
First, regarding scalability, traditional high-performance heuristics such as LKH3 are often hindered by their rapidly growing computational cost, becoming prohibitive for instances exceeding a few hundred customers, as evidenced by runtimes exceeding 15 h. In contrast, SPORL's reinforcement-learning-based partitioning agent efficiently decomposes these large problems into smaller, manageable subproblems. This decomposition enables parallel solving, collapsing the total computation time from hours to a matter of seconds (2.71 s on average for XL instances), while still leveraging the power of sophisticated solvers like LKH3.
In terms of solution quality, SPORL consistently outperformed pure learning-based baselines. As shown in Table 1 and Table 2, end-to-end models like AM and POMO, while fast, failed to generalize to large-scale instances, resulting in optimality gaps exceeding 100% and 200% on medium and large instances, respectively. Their sequential, autoregressive nature leads to myopic decisions that accumulate into highly suboptimal solutions at this scale. In [22], the authors themselves showed POMO's performance degrading as the problem size increased beyond the training size. SPORL directly addresses this by using RL to learn a partitioning policy that explicitly respects fleet-size and vehicle-capacity constraints, guaranteeing feasible solutions. Crucially, the inclusion of our novel sub-route embedding mechanism (Equation (11)) provides the model with a rich, structured representation of the current state. This allows it to make informed, context-aware decisions that lead to more coherent and compact clusters, which is the key driver behind the significant improvement in solution quality over other learning-based methods.
A pivotal advantage of SPORL is its demonstrated flexibility and its role as a force multiplier for existing solvers. The framework is agnostic to the backend optimizer, accommodating both learned solvers (AM, POMO) and classical heuristics (LKH3). This allows users to make a strategic trade-off between speed and quality based on their specific application needs. Most importantly, SPORL acts as a guiding mechanism that enhances the performance of conventional solvers on notoriously difficult instances. By providing LKH3 with well-structured, capacity-feasible subproblems, SPORL directs its powerful local search toward promising regions of the solution space, enabling it to find high-quality solutions in a fraction of the time it would require to solve the monolithic problem.
The computational complexity of SPORL is primarily driven by the attention mechanism in the Transformer model, which is $O(n^2)$ for a graph with $n$ nodes [29]. However, the key to scalability is that the subsequent TSP solving is performed on much smaller subgraphs ($n_{sub} \ll n$) and in parallel. The practical inference time, as shown in Table 3 and Table 4, is therefore linear to sub-quadratic in practice, a drastic improvement over the exponential complexity of pure exact methods.
A direct numerical comparison with Hou et al. [30] is challenging due to substantial differences in experimental design. Nevertheless, their reported gap of 10.06% on medium instances and 11.71% on large instances demonstrates the scalability of the decomposition approach. Our SPORL-LKH3 variant achieved an average gap of 8.50% on medium instances and 20.03% on large, complex benchmark instances. The core contribution of our work is not to claim superiority over [30] on a single metric, but rather to introduce a key architectural improvement. Hou et al. ensure feasibility primarily through masking: customer nodes that would exceed vehicle capacity are blocked, and a global mask mechanism controls fleet size to guarantee that the solution does not use more than the available vehicles. Our approach similarly enforces feasibility through masking and infeasibility penalties, but extends beyond this by incorporating sub-route embeddings into the state. The ablation results in Table 5 demonstrate the importance of this design choice: when the sub-route embedding was removed and the context representation was reduced to resemble that of [30], the optimality gap on large instances more than doubled, from 20.03% to 41.07%. This strongly suggests that enriched local context is crucial for partition quality and that the limitations of Hou et al.’s global-only context can be addressed by our approach. Hou et al. [30] provided a groundbreaking proof-of-concept for RL-based decomposition at extreme scale. Our work, SPORL, builds upon this foundation by proposing a more sophisticated context representation that captures both global feasibility (through masking and penalties) and local structural properties (through sub-route embeddings). We view the two approaches as complementary: Hou et al.’s global scaling strategy, combined with our enriched local context, presents a promising avenue for future research toward solving truly massive and heterogeneous real-world routing problems.
Notwithstanding its strong overall performance, SPORL-LKH3 exhibits certain limitations. The optimality gap can increase on particularly challenging instances, such as those with highly irregular customer distributions or tight capacity constraints (e.g., X-n1001-k43, gap: 31.39%). This variability stems from the inherent difficulty of these problems and is further influenced by the fact that our RL policy is trained on randomly generated instances. While this promotes generalization, the policy may not be perfectly attuned to the specific and sometimes unique topological features of every benchmark instance. Consequently, even a near-optimal partition may inherently limit the final solution quality for these pathological cases. Furthermore, the upfront computational cost of training the RL agent is non-trivial, potentially posing a barrier to adoption for users without access to high-performance GPUs. These limitations, however, indicate clear directions for future work, including the refinement of the partitioning policy via curriculum learning or instance-specific fine-tuning, and the integration of more advanced metaheuristics.

Real-World Applicability

Figure 3 illustrates the application of our framework to real-world CVRP instances with 1000 nodes. The visualizations demonstrate how the RL-based partitioning creates capacity-feasible and well-structured clusters, and the subsequent TSP optimization generates efficient routes. This capability is particularly valuable for logistics companies dealing with large-scale delivery networks, where both scalability and solution quality are critical.

7. Conclusions

In this work, we proposed SPORL (Scalable Partitioning and Optimization via Reinforcement Learning), a novel reinforcement-learning-based framework for solving large-scale Capacitated Vehicle Routing Problems (CVRP). SPORL addresses key challenges in scalability and solution quality by decomposing the problem into two stages: a learned graph partitioning phase, followed by parallel route optimization using Traveling Salesman Problem (TSP) solvers. This two-phase strategy allows the model to efficiently handle instances with hundreds or thousands of customers, which are typically infeasible for end-to-end learning approaches.

A key distinguishing feature of SPORL lies in its context embedding mechanism, which integrates richer information during decision-making. Unlike previous approaches, SPORL includes subgraph-level embeddings to better capture local route context and constraints. This enhanced context representation empowers the model to make more informed and adaptive routing decisions, particularly in environments with complex customer distributions and tight vehicle capacity limits. The empirical results confirm that this innovation contributed significantly to improved solution quality and robustness across a range of medium to large instances.

Experimental evaluations demonstrated that SPORL outperformed state-of-the-art learning-based baselines such as AM and POMO in both solution quality and computational efficiency. Furthermore, when combined with classical heuristics like LKH3, SPORL acts as a guiding mechanism that enhances the effectiveness of classical solvers by generating structured, high-quality subproblems. This enables heuristics such as LKH3 to find good solutions within a reasonable time frame, even for hard, large-scale instances.

In summary, SPORL effectively bridges the gap between the scalability of deep learning models and the solution quality of traditional optimization algorithms, establishing a new state-of-the-art for scalable solving of large-scale CVRPs. Looking ahead, future work will focus on developing multi-objective reward functions to optimize for route compactness and capacity utilization, integrating theoretical feasibility guarantees from complementary methods, and extending the framework to dynamic and stochastic routing environments. By tackling these challenges, we aim to further advance the practical applicability of learning-based optimization in real-world logistics and supply chain management.

Author Contributions

Conceptualization, C.A.A., K.B. and O.A.; methodology, C.A.A.; software, C.A.A.; validation, C.A.A., K.B. and O.A.; formal analysis, C.A.A.; writing—original draft preparation, C.A.A.; writing—review and editing, C.A.A., K.B. and O.A.; visualization, C.A.A.; supervision, K.B. and O.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The benchmark instances used in this study are publicly available through the CVRPLIB hosted by PUC-Rio’s GalgoS group (PUC-Rio Capacitated Vehicle Routing Problem Library), accessible at vrp.galgos.inf.puc-rio.br/index.php/en (accessed on 3 September 2025).

Acknowledgments

The authors would like to express their gratitude to the Laboratory of Artificial Intelligence and Information Technologies (LINATI) for their support and for providing a stimulating research environment that contributed to the completion of this work.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Golden, B.L.; Raghavan, S.; Wasil, E.A. The Vehicle Routing Problem: Latest Advances and New Challenges; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2008; Volume 43.
  2. Bullo, F.; Frazzoli, E.; Pavone, M.; Savla, K.; Smith, S.L. Dynamic vehicle routing for robotic systems. Proc. IEEE 2011, 99, 1482–1504.
  3. Hosoda, J.; Maher, S.J.; Shinano, Y.; Villumsen, J.C. A parallel branch-and-bound heuristic for the integrated long-haul and local vehicle routing problem on an adaptive transportation network. Comput. Oper. Res. 2024, 165, 106570.
  4. Zhao, W.; Bian, X.; Mei, X. An Adaptive Multi-Objective Genetic Algorithm for Solving Heterogeneous Green City Vehicle Routing Problem. Appl. Sci. 2024, 14, 6594.
  5. Campuzano, G.; Lalla-Ruiz, E.; Mes, M. The two-tier multi-depot vehicle routing problem with robot stations and time windows. Eng. Appl. Artif. Intell. 2025, 147, 110258.
  6. Ma, H.; Yang, T. Improved Adaptive Large Neighborhood Search Combined with Simulated Annealing (IALNS-SA) Algorithm for Vehicle Routing Problem with Simultaneous Delivery and Pickup and Time Windows. Electronics 2025, 14, 2375.
  7. Wang, C.; Lan, H.; Saldanha-da-Gama, F.; Chen, Y. On Optimizing a Multi-Mode Last-Mile Parcel Delivery System with Vans, Truck and Drone. Electronics 2021, 10, 2510.
  8. Baldacci, R.; Mingozzi, A.; Roberti, R. Recent exact algorithms for solving the vehicle routing problem under capacity and time window constraints. Eur. J. Oper. Res. 2012, 218, 1–6.
  9. Toth, P.; Vigo, D. The Vehicle Routing Problem; SIAM: Philadelphia, PA, USA, 2002.
  10. Christofides, N.; Mingozzi, A.; Toth, P. Exact algorithms for the vehicle routing problem, based on spanning tree and shortest path relaxations. Math. Program. 1981, 20, 255–282.
  11. Naddef, D.; Rinaldi, G. Branch-and-Cut Algorithms for the Capacitated VRP. In The Vehicle Routing Problem; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2002; pp. 53–84.
  12. Ribeiro, C.C.; Hansen, P.; Desaulniers, G.; Desrosiers, J.; Solomon, M.M. Accelerating Strategies in Column Generation Methods for Vehicle Routing and Crew Scheduling Problems; Springer: Berlin/Heidelberg, Germany, 2002.
  13. Perron, L.; Furnon, V. OR-Tools. 2019. Available online: https://developers.google.com/optimization (accessed on 3 September 2025).
  14. Helsgaun, K. An Extension of the Lin-Kernighan-Helsgaun TSP Solver for Constrained Traveling Salesman and Vehicle Routing Problems; Roskilde University: Roskilde, Denmark, 2017; Volume 12, pp. 966–980.
  15. Clarke, G.; Wright, J.W. Scheduling of vehicles from a central depot to a number of delivery points. Oper. Res. 1964, 12, 568–581.
  16. Vidal, T.; Crainic, T.G.; Gendreau, M.; Lahrichi, N.; Rei, W. A hybrid genetic algorithm for multidepot and periodic vehicle routing problems. Oper. Res. 2012, 60, 611–624.
  17. Vidal, T. Hybrid genetic search for the CVRP: Open-source implementation and SWAP* neighborhood. Comput. Oper. Res. 2022, 140, 105643.
  18. Vinyals, O.; Fortunato, M.; Jaitly, N. Pointer networks. Adv. Neural Inf. Process. Syst. 2015, 28.
  19. Bello, I.; Pham, H.; Le, Q.V.; Norouzi, M.; Bengio, S. Neural Combinatorial Optimization with Reinforcement Learning. arXiv 2017, arXiv:1611.09940.
  20. Nazari, M.; Oroojlooy, A.; Snyder, L.V.; Takáč, M. Deep Reinforcement Learning for Solving the Vehicle Routing Problem. arXiv 2018, arXiv:1802.04240.
  21. Kool, W.; van Hoof, H.; Welling, M. Attention, Learn to Solve Routing Problems! arXiv 2019, arXiv:1803.08475.
  22. Kwon, Y.; Choo, J.; Kim, B.; Yoon, I.; Min, S.; Gwon, Y. POMO: Policy Optimization with Multiple Optima for Reinforcement Learning. arXiv 2020, arXiv:2010.16011.
  23. Kool, W.; van Hoof, H.; Gromicho, J.; Welling, M. Deep Policy Dynamic Programming for Vehicle Routing Problems. arXiv 2021, arXiv:2102.11756.
  24. Hottung, A.; Kwon, Y.D.; Tierney, K. Efficient Active Search for Combinatorial Optimization Problems. arXiv 2022, arXiv:2106.05126.
  25. Hottung, A.; Tierney, K. Neural large neighborhood search for the capacitated vehicle routing problem. In ECAI 2020; IOS Press: Amsterdam, The Netherlands, 2020; pp. 443–450.
  26. Fitzpatrick, J.; Ajwani, D.; Carroll, P. A scalable learning approach for the capacitated vehicle routing problem. Comput. Oper. Res. 2024, 171, 106787.
  27. Li, S.; Yan, Z.; Wu, C. Learning to Delegate for Large-scale Vehicle Routing. arXiv 2021, arXiv:2107.04139.
  28. Zong, Z.; Wang, H.; Wang, J.; Zheng, M.; Li, Y. RBG: Hierarchically Solving Large-Scale Routing Problems in Logistic Systems via Reinforcement Learning. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '22), Washington, DC, USA, 14–18 August 2022; pp. 4648–4658.
  29. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762.
  30. Hou, Q.; Yang, J.; Su, Y.; Wang, X.; Deng, Y. Generalize learned heuristics to solve large-scale vehicle routing problems in real-time. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023.
  31. Chen, X.; Tian, Y. Learning to perform local rewriting for combinatorial optimization. In Proceedings of the 33rd International Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; Curran Associates Inc.: Red Hook, NY, USA, 2019; Volume 32.
  32. Ribeiro, C.C.; Hansen, P.; Taillard, E.D.; Voss, S. POPMUSIC—Partial optimization metaheuristic under special intensification conditions. In Essays and Surveys in Metaheuristics; Springer: Boston, MA, USA, 2002; pp. 613–629.
  33. Ayachi Amar, C.; Bouanane, K.; Aiadi, O. Learning Different Separations in Branch and Cut: A Survey. In Proceedings of the International Conference on Intelligent Systems and Pattern Recognition, Budva, Montenegro, 9–11 October 2024; Springer: Cham, Switzerland, 2024; pp. 214–226.
  34. Subramanian, A.; Uchoa, E.; Ochi, L. A hybrid algorithm for the vehicle routing problem with simultaneous pickup and delivery. Comput. Oper. Res. 2013, 40, 1050–1065.
  35. Williams, R.J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 1992, 8, 229–256.
  36. Uchoa, E.; Pecin, D.; Pessoa, A.; Poggi, M.; Vidal, T.; Subramanian, A. New benchmark instances for the capacitated vehicle routing problem. Eur. J. Oper. Res. 2017, 257, 845–858.
Figure 1. Average performance comparison on medium-sized instances: (a) our method combined with the LKH3 heuristic compared with the learning heuristics AM and POMO; (b) our method combined with different route optimizers.
Figure 2. Average performance comparison on larger-sized instances: (a) our method combined with the LKH3 heuristic compared with the learning heuristics AM and POMO; (b) our method combined with different route optimizers.
Figure 3. A comparative visualization of solutions for a large-scale CVRP instance with 1000 customers. (a) Solution generated by the attention model (AM). (b) Solution from the SPORL-AM variant. (c) Solution generated by POMO. (d) Solution from the SPORL-POMO variant. (e) High-quality, feasible solution generated by the SPORL-LKH3 variant, demonstrating how effective partitioning creates balanced, efficient clusters. The SPORL-AM and SPORL-POMO solutions improve on pure AM and POMO but remain weaker than the LKH3-powered variant, highlighting the importance of the backend optimizer.
Table 1. Optimality gap (%) on X_M instances.

Instance       LKH3 [14]  AM [21]  POMO [22]  SPORL-AM  SPORL-POMO  SPORL-LKH3
X-n303-k21     2.27       28.91    18.25      6.25      7.12        5.01
X-n308-k13     0.86       39.32    45.48      12.89     16.61       2.83
X-n313-k71     7.03       176.05   57.07      5.24      9.14        6.64
X-n317-k53     8.74       186.84   73.48      8.41      9.85        4.85
X-n322-k28     0.43       66.75    57.45      7.48      8.42        5.44
X-n327-k20     4.33       678.42   301.48     9.12      7.45        6.63
X-n331-k15     1.29       823.16   153.48     10.84     6.45        4.82
X-n336-k84     0.80       71.49    84.95      5.78      7.58        2.73
X-n344-k43     2.28       399.86   48.24      9.21      9.65        4.82
X-n351-k40     4.70       94.47    147.28     7.58      7.95        5.74
X-n359-k29     1.40       621.88   86.72      11.45     9.97        7.91
X-n367-k17     8.90       711.52   502.87     13.47     14.25       9.57
X-n376-k94     1.40       188.74   80.25      8.24      8.92        7.51
X-n384-k52     5.60       30.47    59.71      10.47     8.64        7.52
X-n393-k38     4.80       66.10    30.70      9.58      9.73        7.34
X-n401-k29     1.76       20.63    34.77      20.63     10.77       9.73
X-n411-k19     2.21       58.24    89.54      19.24     13.24       8.14
X-n420-k130    1.91       42.10    48.50      5.78      4.85        4.24
X-n429-k61     2.60       65.12    73.40      12.36     8.74        4.37
X-n439-k37     5.66       55.85    81.25      14.58     9.77        8.64
X-n449-k29     3.70       52.38    80.74      12.40     13.74       9.24
X-n459-k26     5.02       29.80    67.85      13.80     11.85       9.54
X-n469-k138    0.74       85.54    64.12      4.25      8.47        7.51
X-n480-k70     3.26       440.64   89.42      9.74      10.71       15.58
X-n491-k59     -          88.42    327.15     8.97      8.12        9.54
X-n502-k39     -          410.83   772.65     11.54     9.98        10.41
X-n513-k21     -          791.83   524.78     10.89     11.86       11.75
X-n524-k153    -          201.05   120.95     5.96      7.82        6.24
X-n536-k96     -          85.91    15.32      10.48     15.32       7.15
X-n548-k50     -          479.21   130.81     8.25      6.47        7.14
X-n561-k42     -          917.54   842.37     6.12      7.48        9.27
X-n573-k30     -          234.28   425.12     12.41     11.32       10.97
X-n586-k159    -          155.88   98.42      8.24      6.14        13.08
X-n599-k92     -          36.62    59.18      7.25      10.45       9.12
X-n613-k62     -          86.12    124.48     14.82     15.48       13.50
X-n627-k43     -          624.64   325.49     20.16     20.47       16.48
X-n641-k35     -          70.86    430.12     17.86     19.47       12.27
X-n655-k131    -          179.86   283.67     13.48     16.48       10.65
X-n670-k130    -          202.12   152.96     17.56     18.01       16.48
X-n685-k75     -          52.45    184.28     13.57     10.95       9.95
Average        3.40       241.29   179.86     10.91     10.74       8.50
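For readers reproducing these figures, note that an optimality gap of this kind is conventionally computed against the best-known solution (BKS) values distributed with the X benchmark instances [36]. Assuming that convention (an assumption, since the exact reference costs are not restated here), a minimal helper would be:

```python
def optimality_gap(cost: float, bks: float) -> float:
    """Percent gap of a solution `cost` to the best-known solution `bks`."""
    return 100.0 * (cost - bks) / bks
```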
Table 2. Optimality gap (%) on X_L instances.

Instance       AM [21]   POMO [22]  SPORL-AM  SPORL-POMO  SPORL-LKH3
X-n701-k44     432.03    419.62     17.02     16.58       11.77
X-n716-k35     30.02     116.67     20.30     18.23       14.19
X-n733-k159    297.96    437.21     15.44     17.23       16.84
X-n749-k98     70.83     53.22      14.23     15.42       21.34
X-n766-k71     423.98    464.65     20.01     19.45       16.84
X-n783-k48     50.32     27.11      10.05     12.04       21.77
X-n801-k40     977.09    829.38     16.40     14.32       27.23
X-n819-k171    49.26     174.11     22.10     20.14       12.80
X-n837-k142    272.51    198.39     22.61     20.43       20.35
X-n856-k95     582.17    684.60     18.75     17.35       23.47
X-n876-k59     24.73     45.51      25.94     25.68       24.99
X-n895-k37     263.25    724.37     23.62     24.12       35.76
X-n916-k207    249.15    200.09     32.64     31.89       11.33
X-n936-k151    318.02    143.33     32.12     32.01       13.70
X-n957-k87     80.82     714.06     20.08     30.45       17.39
X-n979-k58     81.14     139.66     20.07     30.24       19.35
X-n1001-k43    1033.88   699.56     39.29     38.41       31.39
Average        308.06    357.14     21.80     22.58       20.03
Table 3. Runtime comparison on X_M instances (LKH3 in hours; all other columns in seconds).

Instance       LKH3 [14]  AM [21]  POMO [22]  SPORL-AM  SPORL-POMO  SPORL-LKH3
X-n303-k21     15 (h)     0.60     0.58       0.49      0.23        0.45
X-n308-k13     18 (h)     0.62     1.02       0.56      0.37        1.02
X-n313-k71     18 (h)     0.77     0.83       0.64      0.75        1.42
X-n317-k53     19 (h)     0.88     0.92       0.68      0.81        1.98
X-n322-k28     20 (h)     0.54     0.72       0.72      0.68        1.36
X-n327-k20     20 (h)     0.82     0.65       0.42      0.47        2.06
X-n331-k15     18 (h)     0.97     0.89       0.43      0.38        1.06
X-n336-k84     20 (h)     0.80     0.94       0.59      0.64        1.04
X-n344-k43     22 (h)     0.92     0.81       0.73      0.80        1.72
X-n351-k40     21 (h)     0.60     0.74       0.69      0.49        1.02
X-n359-k29     19 (h)     1.03     1.02       0.83      0.95        1.87
X-n367-k17     17 (h)     1.03     1.20       0.64      0.71        2.07
X-n376-k94     23 (h)     1.00     1.03       0.63      0.68        0.98
X-n384-k52     19 (h)     0.99     1.05       0.73      0.82        0.85
X-n393-k38     19 (h)     1.14     1.30       1.02      1.15        1.05
X-n401-k29     16 (h)     0.64     0.79       1.17      1.04        1.04
X-n411-k19     17 (h)     0.90     0.89       1.13      1.10        1.35
X-n420-k130    24 (h)     1.20     0.97       1.24      1.19        1.23
X-n429-k61     22 (h)     0.70     0.82       1.01      1.00        0.73
X-n439-k37     21 (h)     0.93     1.02       1.06      1.11        1.68
X-n449-k29     25 (h)     0.73     0.80       1.14      1.19        1.48
X-n459-k26     26 (h)     0.71     0.95       1.16      1.03        1.56
X-n469-k138    30 (h)     1.00     0.99       1.21      1.41        1.38
X-n480-k70     31 (h)     1.27     1.30       1.19      1.16        1.85
X-n491-k59     -          1.17     1.20       1.23      1.21        1.23
X-n502-k39     -          1.31     1.24       1.14      1.09        1.82
X-n513-k21     -          1.11     1.15       1.13      1.23        1.42
X-n524-k153    -          1.16     1.07       1.17      1.17        0.82
X-n536-k96     -          1.02     1.06       1.09      1.12        1.09
X-n548-k50     -          1.35     1.24       1.19      1.09        1.62
X-n561-k42     -          1.47     1.50       1.03      1.18        1.58
X-n573-k30     -          1.35     1.42       1.26      1.23        1.90
X-n586-k159    -          1.46     1.28       1.13      1.24        1.68
X-n599-k92     -          1.01     1.06       1.41      1.20        1.84
X-n613-k62     -          1.02     1.10       1.32      1.35        1.42
X-n627-k43     -          1.65     1.80       1.27      1.30        1.78
X-n641-k35     -          1.61     1.50       1.15      1.07        1.83
X-n655-k131    -          1.55     1.08       1.22      1.29        1.05
X-n670-k130    -          1.67     1.42       1.17      1.24        1.31
X-n685-k75     -          1.03     1.04       1.03      1.07        1.13
Average        20.83 (h)  1.04     1.05       0.97      0.98        1.93
Table 4. Runtime (s) comparison on X_L instances.

Instance       AM [21]  POMO [22]  SPORL-AM  SPORL-POMO  SPORL-LKH3
X-n701-k44     1.78     1.68       1.98      1.47        2.07
X-n716-k35     1.03     1.22       2.00      1.89        2.01
X-n733-k159    1.44     1.85       2.01      2.03        1.04
X-n749-k98     1.19     1.82       1.98      1.99        2.16
X-n766-k71     1.92     1.64       1.65      1.53        2.98
X-n783-k48     1.21     1.07       1.89      1.87        2.57
X-n801-k40     2.10     1.65       2.03      1.52        2.84
X-n819-k171    1.47     1.86       2.07      2.01        1.32
X-n837-k142    2.81     1.97       2.14      2.18        1.14
X-n856-k95     1.94     2.01       2.06      2.10        3.06
X-n876-k59     1.43     1.47       2.84      2.65        3.01
X-n895-k37     1.41     2.03       3.01      2.98        3.36
X-n916-k207    2.36     2.14       3.45      3.49        3.53
X-n936-k151    2.14     2.07       4.04      3.98        3.65
X-n957-k87     2.47     1.98       4.15      4.20        4.02
X-n979-k58     1.58     2.04       5.02      5.08        4.01
X-n1001-k43    2.81     2.23       5.13      5.20        5.53
Average        1.82     1.80       2.79      2.84        2.71
Table 5. Evaluating the impact of context embedding on performance.

Benchmark   Gap (%)             Time (s)
            With     Without    With     Without
X_M         8.50     10.12      1.39     1.07
X_L         20.03    41.07      2.84     1.20
