Article

A Q-Learning-Assisted Evolutionary Optimization Method for Solving the Capacitated Vehicle Routing Problem

1 School of Mechanical Engineering, Xi'an Jiaotong University, Xi'an 710049, China
2 Guangzhou Institute of Technology, Xidian University, Guangzhou 510555, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(17), 9332; https://doi.org/10.3390/app15179332
Submission received: 14 July 2025 / Revised: 12 August 2025 / Accepted: 21 August 2025 / Published: 25 August 2025

Abstract

The Capacitated Vehicle Routing Problem (CVRP) is a classic combinatorial optimization problem in logistics and distribution, with significant theoretical and practical importance. To address the limitations of traditional evolutionary algorithms—particularly their use of fixed operator selection and simplistic search strategies—this paper proposes a Q-learning-based evolutionary algorithm (QEA). By incorporating a reinforcement learning mechanism, the QEA adaptively selects among multiple neighborhood search operators, effectively balancing global exploration and local exploitation. In addition, a novel insertion-based crossover operator and a set of diverse neighborhood search strategies are designed to further enhance solution quality and search efficiency. Experimental results on a variety of standard CVRP benchmark instances show that the QEA demonstrates a superior performance and strong robustness, significantly outperforming several representative state-of-the-art algorithms for solving the CVRP. These results confirm the effectiveness and practical value of the proposed method.

1. Introduction

With the rapid development of logistics distribution, intelligent transportation, and warehouse management, how to efficiently plan vehicle routing has become an increasingly important issue [1]. The Capacitated Vehicle Routing Problem (CVRP) is representative of this challenge. Its objective is to design routes for multiple vehicles originating from a central depot to serve a set of customers, while minimizing the total travel distance and ensuring that each vehicle’s load does not exceed its capacity [2]. As a well-known NP-hard problem [3], the complexity of the CVRP grows rapidly with the problem size, making it difficult to obtain optimal solutions within an acceptable computational time. To address this, a Q-learning-assisted evolutionary algorithm is proposed to enhance the solution quality for the CVRP by adaptively selecting effective search operators.
To tackle the CVRP, a wide range of heuristic and metaheuristic methods have been proposed. Classical heuristic algorithms such as the savings algorithm [4], nearest neighbor [5], and sweep algorithm [6] are capable of generating feasible solutions quickly, but often suffer from local optima. In recent years, evolutionary algorithms (EAs) have been widely applied to the CVRP due to their powerful global search capabilities. Typical approaches include Genetic Algorithms [7], Differential Evolution [8], Ant Colony Optimization (ACO) [9], and Particle Swarm Optimization [10]. To further improve performance, many studies have integrated mechanisms such as Large Neighborhood Search (LNS) [11] and adaptive search [12], leading to sophisticated hybrid intelligent algorithms.
İlhan et al. [13] proposed an Improved Simulated Annealing algorithm for the CVRP, which incorporates crossover operators from Genetic Algorithms and employs a hybrid selection strategy to enhance convergence speed. Li et al. [14] developed an Adaptive Genetic Algorithm with self-adaptive crossover and mutation operators, achieving a promising performance on CVRP instances. Altabeeb et al. [15] addressed the local optimum issue of the Firefly Algorithm (FA) by integrating two local search strategies and genetic components, resulting in a hybrid method named CVRP-FA with an enhanced solution quality. Souza et al. [16] introduced novel mutation operators within a discrete DE framework, augmented with multiple local search operators to solve the CVRP effectively. Xiao et al. [17] proposed a Variable Neighborhood Simulated Annealing algorithm by combining Variable Neighborhood Search and Simulated Annealing, which performs well on large-scale CVRP instances. Akpinar [18] hybridized LNS and ACO to develop LNS-ACO, leveraging ACO’s solution construction mechanism to improve LNS’s performance. Additionally, Queiroga et al. [19] proposed a partial optimization metaheuristic under special intensification conditions for the CVRP, which combines neighborhood enumeration with mathematical programming to achieve high-quality solutions, particularly on large-scale instances.
Recent research has also focused on fuzzy and uncertain demand variants of the CVRP. Zacharia et al. [20] studied the Vehicle Routing Problem with Fuzzy Payloads considering fuel consumption, emphasizing energy consumption. Yang et al. [21] and Abdulatif et al. [22] investigated fuzzy demand VRPs with soft time windows, addressing uncertainty and temporal flexibility in delivery scheduling.
However, traditional EAs often rely on fixed or randomly selected operators, which limits their ability to adapt search strategies based on the current solution state or search stage. This can result in a low search efficiency and slow convergence. In recent years, reinforcement learning (RL) mechanisms have been introduced into evolutionary frameworks [23,24], enabling adaptive operator scheduling. For example, Q-learning-based local search scheduling methods [25,26] have shown potential in enhancing search intelligence. Irtiza et al. [27] incorporated Q-learning into an evolutionary algorithm (EA) to schedule local search operators, demonstrating an improved performance on CVRP with Time Windows (CVRPTW) instances. Costa et al. [28] employed deep reinforcement learning to intelligently guide two-opt operations, enabling general local search strategies to adapt automatically to specific routing problems such as the CVRP. Kalatzantonakis et al. [29] developed a reinforcement learning–variable neighborhood search method for the CVRP, achieving superior adaptability. Zong et al. [30] proposed an RL-based framework for solving multiple vehicle routing problems with time windows. Zhang et al. [31] integrated RL into multi-objective evolutionary algorithms for assembly line balancing under uncertain demand, demonstrating RL’s broader applicability. A comprehensive survey by Song et al. [32] reviews reinforcement learning-assisted evolutionary algorithms and highlights future research directions. Xu et al. [33] presented a learning-to-search approach for VRP with multiple time windows, further proving RL’s potential in routing problems.
Motivated by these developments, the objective of this study is to develop and evaluate a Q-learning-based Evolutionary Algorithm (QEA) for solving the CVRP, with the aim of improving the solution quality and adaptability through intelligent operator scheduling. The core idea of the QEA is to combine the global search ability of evolutionary algorithms with the adaptive learning capability of Q-learning. By dynamically adjusting the selection of neighborhood search operators based on individual fitness states, the QEA achieves an intelligent balancing of search direction and resource allocation, while preserving population diversity. The main contributions of this work are as follows:
(1)
A Q-learning-based evolutionary framework is proposed to solve the CVRP. By introducing a reinforcement learning mechanism for operator scheduling, the algorithm overcomes the limitations of traditional fixed operator selection, enabling adaptive strategy evolution.
(2)
A novel insertion crossover operator is designed specifically for the CVRP evolutionary framework to generate high-quality offspring. Instead of recombining individual nodes, it operates on complete route segments, allowing better preservation of feasible structures and encouraging diverse yet valid route combinations.
(3)
Three neighborhood search operators targeting different search stages were designed, focusing, respectively, on global perturbation, accelerated convergence, and local optimization. By adaptively adjusting the search scope and direction, and dynamically allocating search resources based on individual fitness, these operators enable intelligent control of the search strategy, effectively enhancing the overall search efficiency and solution quality of the algorithm.
The rest of this paper is organized as follows. Section 2 introduces the definition and mathematical model of the CVRP. Section 3 presents the detailed design of the QEA, including the encoding scheme, evolutionary framework, Q-learning-based operator selection, and algorithm components. Section 4 reports the experimental setup and numerical results, followed by performance comparisons with existing methods. Finally, Section 5 concludes this paper and outlines directions for future research.

2. Problem Background

The Capacitated Vehicle Routing Problem (CVRP) is one of the most classic and important variants of routing optimization problems, with broad theoretical significance and practical applications [34]. The problem involves a set of customers to be served by a fleet of homogeneous vehicles departing from a common depot, each vehicle having the same capacity constraint. The objective is to design delivery routes for each vehicle such that all customer demands are satisfied while minimizing the total travel distance of all vehicles.
The CVRP can be represented by a graph G = (N, E), where the node set N = {0, 1, …, n} and the edge set E = {(i, j): i, j ∈ N} connects these nodes. Here, node 0 denotes the depot, and nodes {1, 2, …, n} represent the customers. Each customer i ∈ N′ = N \ {0} has a demand mi, and each edge has an associated cost dij, which corresponds to the travel distance from node i to node j. The mathematical model of the CVRP is formulated as follows [35].
The objective function:
$$\text{Minimize} \quad \sum_{k=1}^{K} \sum_{i=0}^{N} \sum_{j=0}^{N} d_{ij}\, x_{ijk} \qquad (1)$$
Subject to
$$\sum_{k=1}^{K} \sum_{i=0}^{N} x_{ijk} = 1 \qquad \forall j \in \{1, \dots, N\},\; i \neq j \qquad (2)$$
$$\sum_{k=1}^{K} \sum_{j=0}^{N} x_{ijk} = 1 \qquad \forall i \in \{1, \dots, N\},\; i \neq j \qquad (3)$$
$$\sum_{i=0}^{N} \sum_{j=0}^{N} m_{i}\, x_{ijk} \le Q_{k} \qquad \forall k \in \{1, \dots, K\} \qquad (4)$$
$$\sum_{j=1}^{N} x_{0jk} = \sum_{j=1}^{N} x_{j0k} \le 1 \qquad \forall k \in \{1, \dots, K\} \qquad (5)$$
$$\sum_{k=1}^{K} \sum_{j=1}^{N} x_{0jk} \le K \qquad (6)$$
The binary decision variable $x_{ijk}$ takes the value 1 if vehicle k travels from node i to node j, and 0 otherwise. Equation (1) is the objective function, aiming to minimize the total travel distance of all vehicles. Constraints (2) and (3) ensure that each customer node is visited exactly once by a single vehicle. Constraint (4) guarantees that the total demand of customer nodes assigned to a route does not exceed the vehicle’s capacity. Constraint (5) ensures that each vehicle starts and ends its route at the depot after serving its assigned customers. Constraint (6) limits the number of routes used to serve customers to a maximum of K, corresponding to the total number of available vehicles.
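To make the model concrete, the short Python sketch below evaluates a candidate set of routes against the objective (1) and the capacity constraint (4). It is only an illustrative check under assumed data; the distance matrix, demand list, capacity, and function names are ours, not from the paper.

```python
from typing import List, Sequence

def route_cost(route: Sequence[int], dist: List[List[float]]) -> float:
    """Total travel distance of one depot-to-depot route."""
    return sum(dist[route[i]][route[i + 1]] for i in range(len(route) - 1))

def evaluate_solution(routes: List[List[int]],
                      dist: List[List[float]],
                      demand: List[float],
                      capacity: float) -> tuple:
    """Return (total distance, total capacity overload) for a set of routes."""
    total, overload = 0.0, 0.0
    for route in routes:                        # each route starts and ends at depot 0
        total += route_cost(route, dist)
        load = sum(demand[c] for c in route if c != 0)
        overload += max(0.0, load - capacity)   # violation of constraint (4), if any
    return total, overload

# Toy example (assumed data): depot 0 and three customers
dist = [[0, 2, 4, 3], [2, 0, 3, 5], [4, 3, 0, 1], [3, 5, 1, 0]]
demand = [0, 4, 3, 5]
routes = [[0, 1, 2, 0], [0, 3, 0]]
print(evaluate_solution(routes, dist, demand, capacity=8))   # (15.0, 0.0)
```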

3. QEA for CVRP

This section systematically presents the structure and key components of the QEA. First, the overall framework of the QEA is outlined. Then, the encoding and decoding strategies for individuals, along with the initialization method for the population, are described. Subsequently, the selection, crossover, and mutation operators are introduced. This is followed by a detailed explanation of the Q-learning-driven operator selection strategy and three specially designed neighborhood search operators. Through the seamless integration of these components, QEA effectively balances solution quality and search efficiency, demonstrating strong global exploration and local exploitation capabilities.

3.1. QEA Framework Overview

As illustrated in Figure 1, the QEA integrates the global search capability of evolutionary algorithms with the adaptive learning mechanism of reinforcement learning, aiming to efficiently solve the CVRP. The QEA represents solutions using a path-based encoding scheme, and it evolves high-quality solutions through population initialization, insertion crossover, neighborhood search, and a Q-learning-driven operator selection strategy. The core idea is to leverage the Q-learning mechanism during each local search step to adaptively select the most appropriate neighborhood search operator, thereby dynamically adjusting the search strategy and enhancing the algorithm’s robustness and adaptability.

3.2. Individual Encoding and Decoding

Encoding: In the QEA, a solution is represented using serial number coding, which offers a compact structure, ease of manipulation, and compatibility with crossover and mutation operations. Each individual is encoded as an integer sequence that includes all customer node indices, interspersed with several separators (represented by the depot index 0) that divide the routes among multiple vehicles. For example, the sample individual {0, 5, 6, 0, 1, 2, 7, 0, 3, 4, 0}, illustrated in Figure 2, indicates that the depot dispatches three vehicles to serve seven customer nodes.
Decoding: For each individual, the sequence is read from left to right, starting from the depot. Each occurrence of the depot marks the end of a vehicle route. The customer nodes between two depot entries are assigned to a vehicle, and the route load is updated in real time to ensure capacity constraints are considered. For example, the individual above can be decoded into three vehicle routes as follows: Vehicle 1: 0–5–6–0, Vehicle 2: 0–1–2–7–0, Vehicle 3: 0–3–4–0. These three routes together form a feasible solution to the CVRP instance.
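As a minimal illustration of this decoding rule, the sketch below splits a depot-delimited sequence into per-vehicle routes. The individual is the example from Figure 2; the function name is ours.

```python
def decode(individual):
    """Split a depot-delimited chromosome into per-vehicle routes (each 0-...-0)."""
    routes, current = [], [0]
    for node in individual[1:]:
        current.append(node)
        if node == 0:                 # a depot marks the end of the current route
            if len(current) > 2:      # skip empty routes such as [0, 0]
                routes.append(current)
            current = [0]
    return routes

print(decode([0, 5, 6, 0, 1, 2, 7, 0, 3, 4, 0]))
# [[0, 5, 6, 0], [0, 1, 2, 7, 0], [0, 3, 4, 0]]
```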

3.3. Initialization Phase

In the initialization phase of the QEA, a set of individuals is randomly generated, each representing a feasible or partially feasible solution to the CVRP. The generation process is based on a random permutation of customer nodes, followed by capacity-constrained route segmentation. The procedure is as follows:
First, all customer nodes are randomly shuffled. The algorithm then sequentially scans this shuffled sequence, adding each customer to the current route until adding another customer would violate the vehicle’s capacity constraint. At that point, the depot 0 is inserted at the beginning and end of the route. A new route is then initiated, and the process continues until all customers have been assigned.
If all customer nodes are successfully assigned to routes that respect capacity limits, the individual is considered feasible. However, in some cases, a subset of customers may remain unassigned due to insufficient remaining capacity. These remaining nodes are grouped into the final route, which may violate capacity constraints. In this case, the individual is treated as an infeasible solution, and a penalty term is introduced into the fitness function to penalize the total overload in the violated routes. The fitness of an individual is defined as
$$fitness = \begin{cases} F + \tau \cdot Penalty, & \text{if } Penalty > 0 \\ F, & \text{otherwise} \end{cases} \qquad (7)$$
where F is the total travel distance of all routes, Penalty is the total overload across all routes violating the capacity constraint, and τ is the penalty coefficient, set to 20 in our experiments to effectively distinguish infeasible individuals; the influence of τ on the performance of the QEA is analyzed in Section 4.2.
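As a rough sketch (not the authors' implementation), the functions below combine the random-permutation initialization with the penalized fitness of Equation (7). They reuse the decode and evaluate_solution helpers from the earlier sketches; for simplicity, this variant opens a new route whenever capacity would be exceeded, whereas the paper groups any leftover customers into a final, possibly overloaded route that the penalty term then discourages.

```python
import random

def random_individual(num_customers, demand, capacity):
    """Shuffle customers, then cut a route whenever adding one more would exceed capacity."""
    customers = list(range(1, num_customers + 1))
    random.shuffle(customers)
    individual, load = [0], 0.0
    for c in customers:
        if load + demand[c] > capacity:   # close the current route and open a new one
            individual.append(0)
            load = 0.0
        individual.append(c)
        load += demand[c]
    individual.append(0)
    return individual

def fitness(individual, dist, demand, capacity, tau=20.0):
    """Equation (7): total distance plus tau times the total overload of violating routes."""
    total, overload = evaluate_solution(decode(individual), dist, demand, capacity)
    return total + tau * overload if overload > 0 else total
```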

3.4. Evolution Phase

3.4.1. Binary Tournament Selection

To select high-quality individuals from the current population for crossover and subsequent search operations, the QEA adopts the classical Binary Tournament Selection operator [36]. This method maintains an appropriate selection pressure while preserving population diversity. Specifically, in each iteration, two distinct individuals are randomly selected from the population. Their fitness values are compared, and the individual with the better fitness is selected to enter the next-generation population. This process is repeated until a new population of the predefined size is generated.
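A minimal sketch of binary tournament selection as described above; population is a list of individuals, fitness_of maps an individual to its fitness value, and both names are ours.

```python
import random

def binary_tournament(population, fitness_of, size):
    """Repeatedly pick two distinct individuals at random and keep the fitter one (minimization)."""
    selected = []
    while len(selected) < size:
        a, b = random.sample(population, 2)
        selected.append(a if fitness_of(a) <= fitness_of(b) else b)
    return selected
```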

3.4.2. Insertion Crossover

Traditional crossover operators, such as Order Crossover (OX) [37] and Cycle Crossover (CX) [38], were originally designed for the Traveling Salesman Problem (TSP) [39]. However, these operators are not well suited to solving the CVRP due to the inherent differences in problem structure and constraints. Additionally, while OX has been adapted in some CVRP variants, such as those involving fuzzy constraints [1], it does not inherently preserve the feasibility of capacity-constrained solutions. In the CVRP, capacity limits must be strictly satisfied, and using OX or CX directly may lead to infeasible solutions. To address this issue, we propose a novel Insertion Crossover operator tailored specifically for the CVRP.
The proposed operator not only preserves useful path information from parent individuals but also enhances the global exploration ability of the algorithm. Specifically, it begins by randomly selecting two parent individuals and extracting a subroute (i.e., a sequence between two depots) from one parent. This subroute is then inserted at the end of the second parent’s chromosome while maintaining the node order of the receiving parent. After constructing the new chromosome, the penultimate depot 0 is removed and then reinserted at every possible location within the last two routes to reconstruct feasible solutions. Among all resulting candidates, the one with the best fitness is selected as the final offspring. The entire procedure is illustrated in Figure 3.
This strategy guarantees that all generated offspring remain feasible without requiring repair operations, which is a critical advantage over traditional operators. It effectively combines promising substructures from both parents and explores diverse path organizations, leading to an improved solution quality and search efficiency. The detailed procedure is illustrated in Algorithm 1.
Algorithm 1 Insertion Crossover
Input:
two parent individuals p1 and p2.
Output:
two offspring individuals.
1: Begin
2:   For each (P_from, P_to) in [(p1, p2), (p2, p1)] do
3:     randomly select one sub-path s from P_from
4:     remove all customers in s from P_to → get base sequence B
5:     insert sub-path s at the end of B → form extended sequence E
6:     remove second-to-last depot 0 from E
7:     For each possible position i to reinsert the depot do
8:       insert 0 at position i → form candidate sequence Ei
9:       decode Ei and evaluate fitness Fi
10:      End for
11:      select the sequence E_best with the minimal Fi
12:      assign E_best as one offspring
13:    End for
14: End
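For an executable view of Algorithm 1, the sketch below implements one direction (a single donor–receiver pair). It reuses the decode and fitness helpers from the earlier sketches and is our simplified reading of the operator under those assumptions, not the authors' reference code.

```python
import random

def insertion_crossover(p_from, p_to, dist, demand, capacity, tau=20.0):
    """One offspring: move a random sub-route of p_from into p_to (Algorithm 1, one direction)."""
    # 1. Extract a random sub-route (customers between two consecutive depots) from the donor.
    depots = [i for i, v in enumerate(p_from) if v == 0]
    k = random.randrange(len(depots) - 1)
    sub = p_from[depots[k] + 1:depots[k + 1]]

    # 2. Remove those customers from the receiver while preserving its node order.
    base = [v for v in p_to if v == 0 or v not in sub]

    # 3. Append the sub-route as an extra route, then drop the second-to-last
    #    depot so that the last two routes are merged into one segment.
    extended = base + list(sub) + [0]
    zeros = [i for i, v in enumerate(extended) if v == 0]
    core = extended[:zeros[-2]] + extended[zeros[-2] + 1:]

    # 4. Reinsert the depot at every position inside the merged tail segment
    #    and keep the candidate with the best (lowest) fitness.
    zs = [i for i, v in enumerate(core) if v == 0]
    best, best_fit = None, float("inf")
    for i in range(zs[-2] + 1, zs[-1] + 1):
        cand = core[:i] + [0] + core[i:]
        f = fitness(cand, dist, demand, capacity, tau)
        if f < best_fit:
            best, best_fit = cand, f
    return best
```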

3.4.3. Two-Swap Mutation

To enhance population diversity and avoid premature convergence, this study employs a two-swap mutation operator. For each individual, the mutation process randomly selects two customer nodes within the chromosome and exchanges their positions. This swapping procedure is repeated five times to generate a mutated individual. After mutation, if the fitness of the new individual is better than that of the original parent, the mutated individual is accepted into the next generation; otherwise, the original parent is retained. This simple yet effective operator facilitates a local search around the current solution and introduces beneficial variations while maintaining the feasibility of routes.
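A small sketch of the two-swap mutation with the improve-or-keep acceptance rule described above; the helper names follow the earlier sketches and are ours.

```python
import random

def two_swap_mutation(individual, dist, demand, capacity, swaps=5, tau=20.0):
    """Swap two randomly chosen customer positions `swaps` times; keep the mutant only if it improves."""
    mutant = list(individual)
    customer_pos = [i for i, v in enumerate(mutant) if v != 0]   # depot separators are never moved
    for _ in range(swaps):
        i, j = random.sample(customer_pos, 2)
        mutant[i], mutant[j] = mutant[j], mutant[i]
    if fitness(mutant, dist, demand, capacity, tau) < fitness(individual, dist, demand, capacity, tau):
        return mutant
    return individual
```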

3.5. Q-Learning-Based Operator Selection

Q-learning is a reinforcement learning (RL) method [40] that gradually improves decision-making by learning a state–action value function, known as the Q-function, to select the optimal action in a given state. Due to its scalability and ability to update decision policies through accumulated experience, Q-learning is well-suited to adaptively selecting among multiple neighborhood search operators.
In this study, each neighborhood search operator (NS) is treated as a distinct action. The action set is defined as A = {NS1, NS2, NS3}. The state space is simplified and defined as the index of the previously chosen operator,
$$S = \{0, 1, 2\} \qquad (8)$$
where each state corresponds to the last action taken. This concise state representation focuses learning on operator effectiveness transitions.
During each iteration, the Q-learning algorithm employs an ε-greedy strategy to select an action (a neighborhood operator) for local search perturbation, as shown in (9):
$$a_t = \begin{cases} \text{random action}, & \text{with probability } \varepsilon \\ \arg\max_{a} Q(s_t, a), & \text{with probability } 1 - \varepsilon \end{cases} \qquad (9)$$
If the Q-table is uninitialized or contains all zeros, the action is selected randomly to encourage exploration. After executing the selected operator on all individuals in the population (see algorithm implementation below), a cumulative reward is computed as the total fitness improvement:
$$r_t = \sum_{i=1}^{N} \left( f_{before}(i) - f_{after}(i) \right) \qquad (10)$$
where f b e f o r e ( i ) and f a f t e r ( i ) represent the fitness values of the i-th individual before and after applying the operator, respectively.
The environment then transitions to the next state st+1 = at, and the Q-table is updated using the classic Q-learning update rule (11):
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \qquad (11)$$
where α is the learning rate and γ is the discount factor that controls the importance of future rewards. Through this iterative learning and updating process, the Q-learning mechanism adaptively adjusts the probabilities of selecting each neighborhood operator, favoring those that consistently yield better solution improvements. The procedure is illustrated in Figure 4.
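The ε-greedy selection of Equation (9) and the update rule of Equation (11) can be written compactly as below. The three-state, three-action Q-table and the parameter values (α = 0.3, γ = 0.9, ε = 0.1) match this section and Section 4.2, while the function and variable names, and the example reward value, are ours.

```python
import random

ACTIONS = [0, 1, 2]          # indices of NS1, NS2, NS3

def select_action(q_table, state, eps=0.1):
    """Equation (9): explore with probability eps, otherwise pick the best-known operator."""
    row = q_table[state]
    if random.random() < eps or all(v == 0 for v in row):
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: row[a])

def update_q(q_table, state, action, reward, next_state, alpha=0.3, gamma=0.9):
    """Equation (11): one-step Q-learning update."""
    best_next = max(q_table[next_state])
    q_table[state][action] += alpha * (reward + gamma * best_next - q_table[state][action])

# Usage: a 3x3 zero-initialized table, as in the QEA setup
q_table = [[0.0] * 3 for _ in range(3)]
state = random.choice(ACTIONS)                  # initial state: a randomly chosen previous action
action = select_action(q_table, state, eps=0.1)
# ... apply neighborhood operator `action` to the population and measure the reward rt ...
update_q(q_table, state, action, reward=12.5, next_state=action)   # 12.5 is an example value
```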

3.6. Neighborhood Search

To improve both search efficiency and solution quality of the QEA across different search stages, three targeted neighborhood search operators are designed, each focusing on a distinct objective: global diversification, accelerated convergence, and local exploitation. These operators dynamically allocate search resources based on individual fitness values and guide the search direction by modifying path structures, thus enhancing the QEA’s overall performance. The pseudo-code of the QEA is shown in Algorithm 2.
Neighborhood Operator 1 (Global Perturbation):
This operator is designed to enhance the global search capability. It begins by randomly deleting t nodes from the solution, where t is computed using (12):
$$t = \min\left( \frac{indFit}{minFit} \times cusNum \times 0.1,\; cusNum \times 0.4 \right) \qquad (12)$$
Here, indFit denotes the fitness of the current individual, minFit is the best fitness in the population, and cusNum is the total number of customers. The key idea is to assign more search resources to lower-quality individuals. After removing the nodes, they are reinserted into positions that minimally increase the total fitness.
Algorithm 2 QEA
Input:
problem instances, α, γ, ε, population size N, MaxGeneration
Output:
best solution.
1: Begin
2:   initialize population P with N individuals
3:   evaluate fitness of each individual in P
4:   initialize Q(st, at) with values 0
5:   set Q-learning parameters: α, γ, ε
6:   for generation = 1 to MaxGeneration do
7:     P′ ← ∅
8:     while |P′| < N do
9:       select two parents p1, p2 from P using binary tournament
10:      offspring ← Insertion Crossover(p1, p2)
11:      offspring ← Two-Swap Mutation(offspring)
12:      st ← discretize state of offspring
13:      at ← choose neighborhood operator using ε-greedy strategy
14:      offspring′ ← apply selected neighborhood operator to offspring
15:      evaluate fitness of offspring′
16:      compute reward rt based on fitness improvement
17:      update Q(st, at) using:
            Q(st, at) ← Q(st, at) + α[rt + γ·max Q(st+1, a′) − Q(st, at)]
18:      add offspring′ to P′
19:      end while
20:      P ← P’
21:   end for
22: End
Neighborhood Operator 2 (Accelerated Convergence):
This operator aims to speed up convergence by focusing on critical nodes. It deletes the t nodes with the largest travel distances in the current solution (t is also calculated via Equation (12)), assuming that these nodes have the greatest impact on fitness. The deleted nodes are then reinserted at positions that incur the smallest increase in total cost.
Neighborhood Operator 3 (Local Exploitation):
This operator focuses on intensifying the local search. It randomly selects a reference node and deletes t nodes that are most correlated with it, also based on Equation (12). Correlation is calculated using (13):
$$cor = \frac{1}{d_{ij} + r_{ij}} \qquad (13)$$
where dij is the distance between nodes i and j. After removing the most related nodes, they are reinserted into positions that minimally affect fitness.
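All three operators share the same remove-and-reinsert skeleton and differ only in how the t removed nodes are chosen (randomly, by largest travel distance, or by relatedness to a reference node). The sketch below shows that shared skeleton with greedy cheapest reinsertion and the removal budget of Equation (12); it reuses the fitness helper from Section 3.3, leaves the three removal-selection rules abstract, and is an illustrative reading rather than the authors' code.

```python
def removal_budget(ind_fit, min_fit, cus_num):
    """Equation (12): more removals for worse individuals, capped at 40% of the customers."""
    return int(min(ind_fit / min_fit * cus_num * 0.1, cus_num * 0.4))

def remove_and_reinsert(individual, to_remove, dist, demand, capacity, tau=20.0):
    """Shared operator skeleton: delete the selected customers, then greedily reinsert each
    at the position that increases the penalized fitness the least."""
    current = [v for v in individual if v not in to_remove]
    for node in to_remove:
        best, best_fit = None, float("inf")
        for pos in range(1, len(current)):        # never insert before the leading depot
            cand = current[:pos] + [node] + current[pos:]
            f = fitness(cand, dist, demand, capacity, tau)
            if f < best_fit:
                best, best_fit = cand, f
        current = best
    return current
```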

3.7. Computational Complexity

Before the evolution process begins, the initialization of the Q-table and the population requires a time complexity of O(N), where N is the population size. In each generation, the QEA consists of several main components: fitness evaluation, binary tournament selection, insertion-based crossover, two-swap mutation, and Q-learning-guided neighborhood search.
For each individual, the fitness evaluation’s complexity is O(n), binary tournament selection is O(1), insertion crossover is O(n2), two-swap mutation is O(n), Q-learning-guided neighborhood search is O(n2), and Q-table update is O(1). Therefore, the total computational complexity for all individuals in one generation is O(N×n2), which is dominated by the crossover and neighborhood search operations.

4. Experiments

4.1. Problem Instances and Performance Metric

To evaluate the performance of the proposed QEA, this study adopts the standard CVRP benchmark instances proposed by Augerat et al. [41]. These benchmark instances are widely recognized in the literature and cover a diverse range of problem scales and complexities, making them suitable for comprehensive performance assessment. The instances differ in the number of customers, customer locations, number of vehicles, and vehicle capacity constraints. Such diversity allows for testing the algorithm’s robustness and adaptability across different types of scenarios. All datasets are publicly available at http://vrp.atd-lab.inf.puc-rio.br/index.php/en/, accessed on 13 September 2024.
To assess algorithmic performance, we use the Relative Deviation (RD) between the Obtained Solution (OS) and the Current Best Known Solution (CS), calculated using Equation (14) as follows:
$$RD = \frac{OS - CS}{CS} \times 100 \qquad (14)$$
A lower RD value indicates that the algorithm produces solutions closer to the best-known results, thereby reflecting a superior optimization performance. By evaluating RD across various instances, we can assess not only the average effectiveness of the proposed method but also its consistency and stability in solving the CVRP under different levels of difficulty.
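For completeness, Equation (14) in code form (a one-line helper; the names are ours):

```python
def relative_deviation(obtained, best_known):
    """Equation (14): percentage gap between the obtained and best-known solution values."""
    return (obtained - best_known) / best_known * 100.0

print(round(relative_deviation(796, 784), 2))   # 1.53, as reported for CVRP-FA on A-n32-k5
```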

4.2. Parameter Setting of QEA

The configuration of algorithm parameters significantly affects the performance and convergence behavior of the solution process. In the proposed QEA, three key hyperparameters play crucial roles in controlling the learning dynamics and the exploration–exploitation balance:
(1)
Learning rate α: which determines how quickly the Q-values are updated based on new experiences;
(2)
Discount factor γ: which controls the importance of future rewards relative to immediate ones;
(3)
Greedy factor ε: which balances the trade-off between exploration and exploitation.
In addition to these hyperparameters, the QEA introduces a fixed penalty coefficient τ to penalize infeasible solutions that violate vehicle capacity constraints. To evaluate the impact of this parameter, a case study is conducted on the A-n32-k5 instance. As shown in Figure 5, when τ = 10, the QEA generates routes with a slightly shorter total distance, but one of the routes exceeds the vehicle’s capacity limit, violating the constraint. In contrast, when τ = 20, all routes strictly adhere to the capacity constraint. This demonstrates that a small penalty fails to sufficiently discourage infeasible solutions, while a larger penalty effectively enforces feasibility. Furthermore, we have verified that setting τ = 20 enables the QEA to consistently produce feasible solutions that satisfy capacity constraints across all tested CVRP benchmark instances. Therefore, τ = 20 is adopted as the fixed penalty coefficient in all experiments.
To identify the optimal combination of α, γ, and ε, the Taguchi method [42] is adopted as an efficient experimental design approach. This method systematically reduces the number of experimental trials while ensuring that the main effects of each parameter are thoroughly explored. As shown in Table 1, each parameter is assigned three representative levels, and a series of controlled experiments are conducted.
For each combination of parameter values, the QEA is independently executed ten times on representative CVRP instances. The average best solution value across the runs is then recorded to mitigate the impact of stochastic variation and provide a reliable basis for comparison. Figure 6 illustrates the main effect plots for the three key parameters. Based on the experimental results and a comprehensive analysis, the optimal parameter configuration is determined to be α = 0.3, γ = 0.9, ε = 0.1.
Furthermore, the Q-table is initialized to zero for all state–action pairs, indicating that the algorithm starts without any prior knowledge of the utility of different neighborhood search operators. The initial state is defined by selecting a random action (one of the neighborhood search operators is randomly chosen at the beginning of each run). The population size in the QEA is set to 100, and the maximum number of fitness evaluations is set to 1 × 10^5 for all benchmark datasets. Table 2 summarizes all initialization settings and hyperparameters used in the implementation of the QEA.

4.3. Compared Algorithms and Experimental Setting

For performance evaluation, a comparative analysis was conducted against four representative algorithms that have demonstrated competitive results in solving the CVRP, including LNSi [18], a large neighborhood search algorithm that only accepts improving solutions; LNSa, which applies an acceptance criterion similar to that proposed by Ropke and Pisinger [43]; LNS-ACO [18], which combines Ant Colony Optimization with neighborhood search mechanisms; and CVRP-FA [15], a novel hybrid metaheuristic integrating two local search operators with the Firefly Algorithm.
To ensure a fair comparison, all algorithms were configured with the same population size of 100 and executed independently five times on each benchmark instance to reduce statistical bias. In each run, the maximum number of fitness evaluations was set to 1 × 10^5, and the other algorithm-specific parameters were configured according to their respective original studies. Furthermore, the Wilcoxon rank-sum test with a significance level of α = 0.05 was used to evaluate the statistical significance of performance differences between algorithms.
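The rank-sum comparison reported in Table 4 can be reproduced in outline with SciPy. The arrays below are placeholder result vectors (a few objective values copied from Table 3 for illustration), not the per-run data used in the paper.

```python
from scipy.stats import ranksums

# Placeholder result vectors, one value per run/instance, for two algorithms
qea_results  = [1166, 1314, 744, 554, 695]
lnsi_results = [1194, 1335, 761, 569, 708]

stat, p_value = ranksums(qea_results, lnsi_results)
print(f"p-value = {p_value:.4f}, significant at alpha = 0.05: {p_value < 0.05}")
```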

4.4. Experimental Results

As shown in Table 3, the proposed QEA achieves a performance that is either equal to or closer to the best-known solutions in the majority of benchmark instances. Notably, in instances such as A-n32-k5, B-n31-k5, and P-n50-k7, the QEA is capable of reaching or maintaining the current best-known solution, indicating the algorithm’s strong global search ability and effective convergence behavior. Moreover, across most B-class datasets and medium-scale P-class instances, QEA exhibits consistently low RD values. This outcome reflects the algorithm’s high level of robustness, as well as its ability to generalize across different problem characteristics.
However, in a few complex or large-scale instances, specifically A-n80-k10, B-n78-k10, and P-n101-k4, the QEA’s performance is slightly surpassed by certain comparison algorithms, most notably the CVRP-FA. This discrepancy may stem from the substantial increase in the search space and the structural complexity of large-scale CVRP instances. In such scenarios, the QEA’s Q-learning-driven operator scheduling mechanism might face difficulty in rapidly adapting to diverse and intricate local search landscapes, which can hinder the full exploitation of promising solution regions. This limitation underscores the need for more dynamic or hierarchical control strategies to maintain efficient exploration and exploitation under scaling complexity.
Despite these minor limitations, the QEA demonstrates strong overall competitiveness. Out of 20 benchmark instances, the QEA achieves a superior performance compared to LNSi in 15 instances, LNSa in 14, LNS-ACO in 11, and CVRP-FA in 11 cases. These statistics not only highlight the algorithm’s broad applicability but also reinforce its effectiveness across a wide range of CVRP scenarios.
Furthermore, the Wilcoxon rank-sum test results, as reported in Table 4, reveal that the QEA’s performance advantages over the baseline methods are statistically significant at the 5% significance level in most cases. This suggests that the observed performance gains are not due to random chance, but rather stem from the effectiveness of the algorithm’s design, including its adaptive operator selection strategy, insertion crossover mechanism, and diversified neighborhood search operators.
To better illustrate the stability and distribution of algorithmic performance, we additionally computed standard deviations from 10 independent runs and generated boxplots for representative benchmark instances. Figure 7 presents these boxplots, which visually depict the distribution of solution quality for the QEA and competing algorithms. The boxplots highlight the median performance, variability, and presence of outliers, providing further insights into the robustness and consistency of the compared methods. Overall, the QEA demonstrates relatively lower variance and fewer extreme deviations, confirming its stable search behavior across diverse test cases. Corresponding standard deviation values are also summarized in Table 5 to complement these graphical analyses.
In summary, the QEA demonstrates a superior search efficiency, robustness, and solution quality in small- and medium-sized CVRP instances. While it already competes favorably with advanced metaheuristic algorithms, its performance on large-scale problems can be further enhanced. Future work could explore the incorporation of dynamic depth adjustment mechanisms, hierarchical reinforcement learning frameworks, or hybrid memory-based strategies to strengthen the algorithm’s scalability and convergence in high-dimensional solution spaces.

4.5. Component Analysis

To comprehensively assess the contribution of each core component within the proposed QEA framework, four algorithmic variants were constructed and tested: QEA-RND, QEA-SR, QEA-wo-IC, and QEA-wo-NS. Each variant was designed to isolate and examine the impact of a specific module in the QEA.
To evaluate the effectiveness of the Q-learning-based adaptive operator selection strategy, two variants were developed. QEA-RND randomly selects neighborhood search operators without any learning guidance, serving as a baseline for non-adaptive operator scheduling. QEA-SR employs a deterministic strategy based on the historical success rate of each operator, defined as the frequency with which an operator improves the best solution in the current population.
In addition, QEA-wo-IC refers to the version in which the proposed insertion crossover operator is omitted, and QEA-wo-NS disables all neighborhood search operations. These ablation variants were designed to investigate the role of each operator class in driving search quality, convergence speed, and solution diversity. To ensure a fair comparison, all QEA variants share the same parameter configuration and initialization strategy as the original QEA, as described in Section 4.2.
Table 6 summarizes the comparative performance of the QEA and its four variants across all benchmark instances. The Wilcoxon signed-rank test, conducted at a significance level of α = 0.05, reveals that the QEA significantly outperforms QEA-RND, QEA-SR, QEA-wo-IC, and QEA-wo-NS on 14, 12, 16, and 16 problem instances, respectively. In all test cases, the QEA demonstrates a superior or equal performance compared to its simplified counterparts.
These results clearly validate the necessity and effectiveness of each component integrated into the QEA. The adaptive operator selection mechanism enabled by Q-learning contributes to smarter decision-making during the search, while the insertion crossover and neighborhood search modules collectively enhance the algorithm’s ability to explore the solution space and refine high-quality routes. The consistent advantage over all variants confirms that the synergy among these components is critical to achieving a high performance and robustness when solving the Capacitated Vehicle Routing Problem.

5. Conclusions

This paper proposes a Q-learning-assisted evolutionary algorithm to solve the CVRP, called the QEA. Firstly, the QEA integrates a reinforcement learning mechanism into the evolutionary framework to adaptively schedule search operators according to the fitness of individuals, effectively overcoming the limitations of fixed operator strategies. Secondly, a novel insertion crossover operator is developed, which recombines complete route segments instead of individual nodes, thereby preserving feasible route structures and enhancing population diversity. Thirdly, the QEA designs three neighborhood search operators that target global exploration, convergence acceleration, and local refinement, respectively. These operators dynamically adjust their behaviors based on individual fitness to improve search efficiency. The experimental results show that the QEA achieves a competitive performance on standard CVRP benchmark instances and performs particularly well on small- and medium-scale problems.
Despite its promising performance, the proposed approach also has certain limitations. Specifically, the QEA shows a relatively slower convergence on large-scale and highly complex instances, indicating a limited search depth under certain configurations. Furthermore, the current Q-learning design relies on relatively simple state features, which may restrict the learning granularity. To address these issues, future work will explore more expressive representations of the state space and integrate deep reinforcement learning techniques to enable better generalization and scalability. We also plan to extend the QEA to handle multi-objective and dynamic variants of the CVRP, and we will investigate its applicability to other combinatorial optimization problems.

Author Contributions

W.Z.: Writing—original draft preparation, methodology; Z.Z.: Investigation, formal analysis, resources; H.Z.: Writing—review and editing, supervision; X.B.: Conceptualization, supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code used in this study and datasets are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Viu-Roig, M.; Alvarez-Palau, E.J. The Impact of E-Commerce-Related Last-Mile Logistics on Cities: A Systematic Literature Review. Sustainability 2020, 12, 6492. [Google Scholar] [CrossRef]
  2. Dantzig, G.B.; Ramser, J.H. The truck dispatching problem. Manag. Sci. 1959, 6, 80–91. [Google Scholar] [CrossRef]
  3. Lenstra, J.K.; Kan, A.R. Complexity of vehicle routing and scheduling problems. Networks 1981, 11, 221–227. [Google Scholar] [CrossRef]
  4. Clarke, G.; Wright, J.W. Scheduling of vehicles from a central depot to a number of delivery points. Oper. Res. 1964, 12, 568–581. [Google Scholar] [CrossRef]
  5. Laporte, G. The vehicle routing problem: An overview of exact and approximate algorithms. Eur. J. Oper. Res. 1992, 59, 345–358. [Google Scholar] [CrossRef]
  6. Gillett, B.E.; Miller, L.R. A heuristic algorithm for the vehicle-dispatch problem. Oper. Res. 1974, 22, 340–349. [Google Scholar] [CrossRef]
  7. Vidal, T. Hybrid Genetic Search for the CVRP: Open-Source Implementation and SWAP* Neighborhood. Comput. Oper. Res. 2021, 140, 105643. [Google Scholar] [CrossRef]
  8. Teoh, B.E.; Ponnambalam, S.G.; Kanagaraj, G. Differential evolution algorithm with local search for capacitated vehicle routing problem. Int. J. Bio-Inspired Comput. 2015, 7, 321–342. [Google Scholar] [CrossRef]
  9. Ahmed, Z.H.; Hameed, A.S.; Mutar, M.L.; Haron, H. An Enhanced Ant Colony System Algorithm Based on Subpaths for Solving the Capacitated Vehicle Routing Problem. Symmetry 2023, 15, 2020. [Google Scholar] [CrossRef]
  10. Ai, T.J.; Kachitvichyanukul, V. Particle swarm optimization and two solution representations for solving the capacitated vehicle routing problem. Comput. Ind. Eng. 2007, 56, 380–387. [Google Scholar] [CrossRef]
  11. Pisinger, D.; Ropke, S. A general heuristic for vehicle routing problems. Comput. Oper. Res. 2007, 34, 2403–2435. [Google Scholar] [CrossRef]
  12. Christiaens, L.; De Boeck, L. Adaptive large neighborhood search for the vehicle routing problem with time windows and stochastic travel times. Transp. Res. Part C Emerg. Technol. 2020, 120, 102784. [Google Scholar]
  13. İlhan, İ. An improved simulated annealing algorithm with crossover operator for capacitated vehicle routing problem. Swarm Evol. Comput. 2021, 64, 100911. [Google Scholar] [CrossRef]
  14. Li, J.; Liu, R.; Wang, R. Handling dynamic capacitated vehicle routing problems based on adaptive genetic algorithm with elastic strategy. Swarm Evol. Comput. 2024, 86, 101529. [Google Scholar] [CrossRef]
  15. Altabeeb, A.M.; Mohsen, A.M.; Ghallab, A. An improved hybrid firefly algorithm for capacitated vehicle routing problem. Appl. Soft Comput. 2019, 84, 105728. [Google Scholar] [CrossRef]
  16. Souza, I.P.; Boeres, M.C.S.; Moraes, R.E.N. A robust algorithm based on differential evolution with local search for the capacitated vehicle routing problem. Swarm Evol. Comput. 2023, 77, 101245. [Google Scholar] [CrossRef]
  17. Xiao, Y.; Zhao, Q.; Kaku, I.; Mladenovic, N. Variable neighbourhood simulated annealing algorithm for capacitated vehicle routing problems. Eng. Optim. 2014, 46, 562–579. [Google Scholar] [CrossRef]
  18. Akpinar, S. Hybrid large neighbourhood search algorithm for capacitated vehicle routing problem. Expert Syst. Appl. 2016, 61, 28–38. [Google Scholar] [CrossRef]
  19. Queiroga, E.; Sadykov, R.; Uchoa, E. A POPMUSIC matheuristic for the capacitated vehicle routing problem. Comput. Oper. Res. 2021, 136, 105475. [Google Scholar] [CrossRef]
  20. Zacharia, P.; Drosos, C.; Piromalis, D.; Papoutsidakis, M. The Vehicle Routing Problem with Fuzzy Payloads considering Fuel Consumption. Appl. Artif. Intell. 2021, 35, 1755–1776. [Google Scholar] [CrossRef]
  21. Yang, T.; Wang, W.; Wu, Q. Fuzzy Demand Vehicle Routing Problem with Soft Time Windows. Sustainability 2022, 14, 5658. [Google Scholar] [CrossRef]
  22. Abdulatif, N.; Shalaby, M.A.W.; Kassem, S.S.; Khalil, T. Fuzzy demand electric vehicle routing problem with soft time windows. Fuzzy Optim. Decis. Mak. 2025, 24, 457–481. [Google Scholar] [CrossRef]
  23. Zou, Y.; Hao, J.K.; Wu, Q. RP-DQN: An application of Q-Learning to Vehicle Routing Problems. Comput. Oper. Res. 2024, 170, 106758. [Google Scholar] [CrossRef]
  24. Li, R.; Gong, W.; Lu, C. A reinforcement learning based RMOEA/D for bi-objective fuzzy flexible job shop scheduling. Expert Syst. Appl. 2022, 203, 117380. [Google Scholar] [CrossRef]
  25. Karimi, M.; Mohammadi, M.; Dullaert, W.; Vigo, D.; Pirayesh, A. Dynamic operator management in meta-heuristics using reinforcement learning: An application to permutation flowshop scheduling problems. arXiv 2024, arXiv:2408.14864. [Google Scholar] [CrossRef]
  26. Qi, R.; Li, J.-Q.; Wang, J.; Jin, H.; Han, Y.-Y. QMOEA: A Q-learning-based multiobjective evolutionary algorithm for solving time-dependent green vehicle routing problems with time windows. Inf. Sci. 2022, 608, 178–201. [Google Scholar] [CrossRef]
  27. Irtiza, M.; Kalaria, R.; Kayes, A.S.M. A Reinforcement Learning-Assisted Evolutionary Computing Approach to Capacitated Vehicle Routing with Time Windows. In Proceedings of the GECCO ‘24 Companion: Proceedings of the Genetic and Evolutionary Computation Conference Companion, Melbourne, Australia, 14–18 July 2024. [Google Scholar]
  28. Costa, P.D.; Rhuggenaath, J.; Zhang, Y. Learning 2-Opt Heuristics for Routing Problems via Deep Reinforcement Learning. SN Comput. Sci. 2021, 2, 388. [Google Scholar] [CrossRef]
  29. Kalatzantonakis, P.; Sifaleras, A.; Samaras, N. A reinforcement learning-variable neighborhood search method for the capacitated vehicle routing problem. Expert Syst. Appl. 2023, 213, 118812. [Google Scholar] [CrossRef]
  30. Zong, Z.; Tong, X.; Zheng, M.; Li, Y. Reinforcement learning for solving multiple vehicle routing problem with time window. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–19. [Google Scholar] [CrossRef]
  31. Zhang, Z.; Tang, Q.; Chica, M.; Li, Z. Reinforcement learning-based multiobjective evolutionary algorithm for mixed-model multimanned assembly line balancing under uncertain demand. IEEE Trans. Cybern. 2024, 54, 2914–2927. [Google Scholar] [CrossRef] [PubMed]
  32. Song, Y.; Wu, Y.; Guo, Y.; Yan, R.; Suganthan, P.N.; Zhang, Y.; Pedrycz, W.; Das, S.; Mallipeddi, R.; Ajani, O.S.; et al. Reinforcement learning-assisted evolutionary algorithm: A survey and research opportunities. Swarm Evol. Comput. 2024, 86, 101517. [Google Scholar] [CrossRef]
  33. Xu, K.; Cao, Z.; Zheng, C.; Liu, L. Learning to search for vehicle routing with multiple time windows. arXiv 2025, arXiv:2505.23098. [Google Scholar] [CrossRef]
  34. Laporte, G. Fifty years of vehicle routing. Transp. Sci. 2009, 43, 408–416. [Google Scholar] [CrossRef]
  35. Chen, A.; Yang, G.; Wu, Z. Hybrid discrete particle swarm optimization algorithm for capacitated vehicle routing problem. J. Zhejiang Univ. Sci. A 2006, 7, 607–614. [Google Scholar] [CrossRef]
  36. Miller, B.L.; Goldberg, D.E. Genetic algorithms, tournament selection, and the effects of noise. Complex Syst. 1995, 9, 193–212. [Google Scholar]
  37. Davis, L. Applying adaptive algorithms to epistatic domains. In Proceedings of the International Joint Conference on Artificial Intelligence, Los Angeles, CA, USA, 18–23 August 1985; pp. 162–164. [Google Scholar]
  38. Goldberg, D.E.; Lingle, R. Alleles, loci, and the traveling salesman problem. In Proceedings of the First International Conference on Genetic Algorithms and Their Applications; Psychology Press: London, UK, 1985; pp. 154–159. [Google Scholar]
  39. Reeves, C.R. A genetic algorithm for flowshop sequencing. Comput. Oper. Res. 1995, 22, 5–13. [Google Scholar] [CrossRef]
  40. Zhang, Z.; Wu, Z.; Zhang, H.; Wang, J. Meta-learning-based deep reinforcement learning for multiobjective optimization problems. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 5835–5849. [Google Scholar] [CrossRef]
  41. Augerat, P.; Belenguer, J.M.; Benavent, E.; Corberan, A.; Rinaldi, G. Computational Results with a Branch and Cut Code for the Capacitated Vehicle Routing Problem; Rapport de Recherche-IMAG; No. 495; Institut National Polytechnique: Toulouse, France, 1995; pp. 1–21. [Google Scholar]
  42. Van Nostrand, R.C. Design of experiments using the taguchi approach: 16 steps to product and process improvement. Technometrics 2002, 44, 289. [Google Scholar] [CrossRef]
  43. Ropke, S.; Pisinger, D. An adaptive large neighborhood search heuristic for the pickup and delivery problem with time windows. Transp. Sci. 2006, 40, 455–472. [Google Scholar] [CrossRef]
Figure 1. The overall framework of QEA.
Figure 2. A CVRP solution and the individual encoding representation. Each color represents a single route, while white represents the depot used as a separator between routes.
Figure 3. The insertion crossover procedure. In this figure, each color except white represents a route in the optimal solution of the CVRP, white represents the depot serving as a separator between routes, and the numbers indicate the labels of the customer nodes.
Figure 4. The Q-learning-based operator selection procedure.
Figure 5. The performance of QEA on instance A-n32-k5. (a): τ = 10; (b) τ = 20. Blue dots indicate customer nodes, orange dots indicate the depot.
Figure 6. Main effects plots of OS. The dotted lines represent the average values of the experimental results.
Figure 7. Boxplots of solution quality distributions on four representative CVRP instances: (a) A-n69-k9, (b) B-n50-k8, (c) B-n66-k9, and (d) P-n51-k10. The orange line in each boxplot represents the median value of the solution quality.
Table 1. Parameters and their levels.
| Parameter | Level 1 | Level 2 | Level 3 |
|---|---|---|---|
| α | 0.1 | 0.2 | 0.3 |
| γ | 0.7 | 0.8 | 0.9 |
| ε | 0.05 | 0.1 | 0.2 |
Table 2. QEA hyperparameters and initialization settings.
| Parameter | Value |
|---|---|
| Q(s, a) (initial) | 0 |
| α | 0.3 |
| γ | 0.9 |
| ε | 0.1 |
| τ | 20 |
| Population size | 100 |
| Max fitness evaluations | 1 × 10^5 |
Table 3. The comparison between QEA and the other four algorithms (bold values indicate the optimal solutions for the test instances).
| Instance | CS | LNSi OS | LNSi RD | LNSa OS | LNSa RD | LNS-ACO OS | LNS-ACO RD | CVRP-FA OS | CVRP-FA RD | QEA OS | QEA RD |
|---|---|---|---|---|---|---|---|---|---|---|---|
| A-n32-k5 | 784 | 784 | 0.00 | 784 | 0.00 | 784 | 0.00 | 796 | 1.53 | 784 | 0.00 |
| A-n36-k5 | 799 | 805 | 0.70 | 799 | 0.00 | 799 | 0.00 | 799 | 0.00 | 799 | 0.00 |
| A-n44-k6 | 937 | 952 | 1.49 | 940 | 0.32 | 937 | 0.00 | 937 | 0.00 | 937 | 0.00 |
| A-n60-k9 | 1354 | 1380 | 1.92 | 1360 | 0.44 | 1355 | 0.07 | 1355 | 0.07 | 1354 | 0.00 |
| A-n61-k9 | 1034 | 1085 | 4.93 | 1072 | 3.68 | 1067 | 3.19 | 1045 | 1.06 | 1039 | 1.37 |
| A-n69-k9 | 1159 | 1194 | 3.02 | 1179 | 1.81 | 1172 | 0.96 | 1168 | 0.78 | 1166 | 0.60 |
| A-n80-k10 | 1763 | 1846 | 4.71 | 1823 | 3.40 | 1815 | 2.95 | 1773 | 0.57 | 1796 | 1.87 |
| B-n31-k5 | 672 | 672 | 0.00 | 672 | 0.00 | 672 | 0.00 | 672 | 0.00 | 672 | 0.00 |
| B-n34-k5 | 788 | 788 | 0.00 | 788 | 0.00 | 788 | 0.00 | 788 | 0.00 | 788 | 0.00 |
| B-n50-k8 | 1312 | 1335 | 1.75 | 1320 | 0.61 | 1319 | 0.53 | 1318 | 0.45 | 1314 | 0.15 |
| B-n63-k10 | 1496 | 1532 | 2.41 | 1520 | 1.60 | 1514 | 1.20 | 1518 | 1.47 | 1514 | 1.20 |
| B-n64-k9 | 861 | 892 | 3.60 | 880 | 2.21 | 874 | 1.51 | 862 | 0.12 | 861 | 0.00 |
| B-n66-k9 | 1316 | 1348 | 2.43 | 1336 | 1.52 | 1330 | 1.06 | 1324 | 0.61 | 1322 | 0.45 |
| B-n78-k10 | 1221 | 1241 | 1.64 | 1234 | 1.06 | 1228 | 0.57 | 1234 | 1.06 | 1235 | 1.15 |
| P-n16-k8 | 450 | 450 | 0.00 | 450 | 0.00 | 450 | 0.00 | 450 | 0.00 | 450 | 0.00 |
| P-n22-k2 | 216 | 216 | 0.00 | 216 | 0.00 | 216 | 0.00 | 216 | 0.00 | 216 | 0.00 |
| P-n50-k7 | 554 | 569 | 2.71 | 561 | 1.26 | 559 | 0.90 | 557 | 0.54 | 554 | 0.00 |
| P-n51-k10 | 741 | 761 | 2.70 | 752 | 1.48 | 747 | 0.81 | 750 | 1.21 | 744 | 0.40 |
| P-n55-k10 | 694 | 708 | 2.02 | 696 | 0.29 | 696 | 0.29 | 698 | 0.58 | 695 | 0.14 |
| P-n101-k4 | 681 | 735 | 7.93 | 731 | 7.34 | 722 | 6.02 | 685 | 0.59 | 714 | 4.85 |
| +/−/= | – | 15/0/5 | | 14/0/6 | | 11/1/8 | | 11/3/6 | | – | |
Table 4. Wilcoxon rank-sum test results between QEA and the other four algorithms.
| Statistical Item | LNSi | LNSa | LNS-ACO | CVRP-FA |
|---|---|---|---|---|
| p-value | 0.0050 | 0.0013 | 0.0120 | 0.1394 |
| Significance (α = 0.05) | Significant | Significant | Significant | Not Significant |
Table 5. Algorithm performance comparison based on standard deviation metrics on selected CVRP instances. (The bold font indicates the value of the standard deviation for the algorithm that performs optimally across different instances).
| Algorithm | A-n69-k9 | B-n50-k8 | B-n66-k9 | P-n51-k10 |
|---|---|---|---|---|
| LNSi | 4.59 | 3.31 | 3.68 | 3.86 |
| LNSa | 4.39 | 2.09 | 1.35 | 1.10 |
| LNS-ACO | 2.53 | 0.53 | 1.55 | 1.51 |
| CVRP-FA | 1.29 | 1.34 | 0.67 | 1.26 |
| QEA | 1.06 | 1.07 | 0.97 | 0.87 |
Table 6. Comparisons between QEA variants based on hypervolume for component analysis across all benchmark instances.
| Statistical Item | QEA-RND | QEA-SR | QEA-wo-IC | QEA-wo-NS |
|---|---|---|---|---|
| QEA is significantly better | 14 | 12 | 16 | 16 |
| Two algorithms are similar | 6 | 8 | 4 | 4 |
| QEA is significantly worse | 0 | 0 | 0 | 0 |
