1. Introduction
Ensuring environmental sustainability in logistics is a critical challenge, as the sector contributes significantly to global carbon emissions [1]. Cold chain logistics, essential for food and pharmaceutical safety, requires substantial energy for refrigeration and temperature control. The resulting operational costs and carbon footprint highlight the urgent need for greener distribution solutions.
To address these challenges in cold chain logistics, one key approach lies in optimizing delivery routes, which can significantly reduce both operational costs and carbon emissions without compromising service quality. Vehicle routing problems (VRPs) are a class of classical combinatorial optimization problems in logistics and transportation [2,3]. Cold chain distribution in practice often exhibits several complex features: shipments may require multiple temperature conditions (e.g., 0–4 °C for fresh produce; below −18 °C for frozen goods), fleets consist of vehicles with different capacities and energy efficiencies, and there is an increasing emphasis on reducing carbon emissions from both transportation and refrigeration. To address these practical challenges, various VRP variants that account for such characteristics have been proposed. The multi-compartment VRP (MCVRP) [4,5] addresses the need for vehicles with separate compartments. The heterogeneous VRP (HVRP) [6,7] considers the use of fleets with diverse vehicle types, while the low-carbon VRP (LCVRP) [8,9] incorporates carbon emission costs into the routing objective. Most existing studies address these factors in isolation. In reality, however, all of these aspects often arise together in cold chain logistics. To address this gap, we investigate the low-carbon heterogeneous multi-compartment vehicle routing problem (LC-HMCVRP).
Approaches for solving VRPs can be broadly categorized into exact techniques [10,11,12], heuristic algorithms [13,14,15], and deep-learning-based methods [16,17,18]. Exact methods guarantee optimal solutions in theory, although their practical applicability is often limited by computational complexity. In recent years, some researchers have developed branch-and-cut algorithms and their variants to solve the MCVRP, HVRP, and LCVRP. For the MCVRP, Henke et al. [19] introduced a branch-and-cut algorithm capable of solving instances with up to 50 nodes and 9 compartments. Subsequently, Heßler [20] proposed two exact algorithms: a branch-and-cut algorithm based on a two-index formulation and a branch-price-and-cut algorithm based on a route-indexed formulation, which successfully solved an MCVRP with up to 50 nodes and 10 compartments. For the HVRP, Şahin and Yaman [21] proposed a branch-and-price algorithm capable of solving instances with up to 40 customers and 2 vehicle types. Li et al. [22] developed a branch-and-Benders decomposition method, which can handle up to 40 customers and 4 vehicle types. In the context of the LCVRP, Liu et al. [11] and Luo et al. [23] proposed branch-and-cut and branch-price-and-cut algorithms, respectively, and demonstrated their effectiveness on instances with up to 100 customers. However, the exponential growth in computational complexity as problem size and constraints increase restricts the practical use of exact algorithms for large-scale or complex problems.
Given the computational limitations of exact methods, researchers have extensively explored heuristic algorithms for solving the MCVRP, HVRP, and LCVRP. Chen et al. [24] utilized an adaptive large neighborhood search algorithm to solve an MCVRP with 100 nodes and two compartments, incorporating fuel consumption costs. Ouertani et al. [25] proposed an adaptive genetic algorithm to solve MCVRPs with up to 199 nodes and 2 compartments. Chen et al. [26] further considered refrigeration costs and carbon emissions, proposing a variable neighborhood search algorithm to solve an MCVRP with time windows, with up to 50 nodes and 2 temperature-controlled compartments. Guo et al. [5] proposed a three-dimensional ant colony optimization algorithm to solve an MCVRP with up to 288 nodes and 3 compartments. For the HVRP, Stavropoulou [27] proposed a hierarchical tabu search framework capable of addressing instances with up to 199 customers and 3 vehicle types, while Máximo et al. [28] proposed an adaptive iterated local search for instances with up to 360 customers and 6 vehicle types. For the LCVRP, Wang et al. [29] proposed an improved ant colony optimization algorithm for instances with up to 50 customers, and Qi et al. [30] developed a Q-learning-based evolutionary algorithm for instances with up to 100 customers. While heuristic algorithms improve computational efficiency compared to exact methods, they often face substantial challenges on the LC-HMCVRP. This is primarily due to the increased combinatorial complexity introduced by multiple temperature-controlled compartments and heterogeneous vehicle fleets. The intricate cross-constraints—such as matching product temperature requirements to specific compartments, managing compartment-specific capacities, and accounting for vehicle-specific cost and emission profiles—make the solution space highly constrained and irregular. Furthermore, metaheuristic algorithms are often sensitive to parameter tuning and may struggle to explore the vast and complex solution space of the LC-HMCVRP effectively, which increases the risk of yielding suboptimal solutions.
Deep-learning-based approaches, particularly deep reinforcement learning (DRL), have emerged as a promising direction for solving VRPs. Some researchers have developed improvement-based DRL models for various VRPs, similar to improvement heuristics that iteratively refine solutions from an initial solution [31,32,33]. Improvement-based DRL models, while effective, often rely on a predefined initial solution, which can limit their adaptability and efficiency. To address these issues, more researchers have focused on constructive DRL models that incrementally build vehicle routes from scratch by sequentially adding customer nodes, offering faster and more flexible solutions. Earlier construction-based DRL models [34,35,36] for solving VRPs were usually based on long short-term memory (LSTM) networks [37]. LSTM networks process each element in an input sequence sequentially, as the computation at each step depends on the hidden state from the previous step. This inherent sequential dependency prevents LSTM networks from efficiently parallelizing computations across the entire sequence, resulting in lower training and inference efficiency. Furthermore, the stepwise nature of LSTM architectures makes it difficult to capture the complex spatial relationships among nodes, which is important for VRPs, where accurately modeling these relationships is critical for finding high-quality solutions. As a result, LSTM-based DRL models encounter significant limitations in both computational efficiency and optimization performance when applied to large-scale or complex problems.
In contrast, the transformer model [38], which employs multi-head attention (MHA) mechanisms, can encode all elements in the input sequence simultaneously through self-attention, enabling efficient parallelization, batch processing, and richer modeling of complex spatial relationships within the data. Based on this model, Kool et al. [39] proposed the attention model (AM), which demonstrated superior performance compared to several benchmark models (e.g., OR-Tools) across various routing problems. Building on this work, Bono et al. [40] introduced a multi-agent routing model using deep attention mechanisms (MARDAM), enabling parallel construction of multiple vehicle routes by incorporating fleet state representations. Li et al. [41] tackled the HVRP by proposing a transformer-based DRL model, which incorporates a vehicle selection decoder to handle heterogeneous fleet constraints and a node selection decoder for route construction. Zou et al. [42] proposed an enhanced transformer model that integrates MHA and attention-to-attention mechanisms to address the low-carbon multi-depot vehicle routing problem. However, to the best of our knowledge, no transformer-based deep reinforcement learning method has been developed for the LC-HMCVRP. The existing models are not specifically designed for the LC-HMCVRP, which may result in difficulties capturing the complex constraints of heterogeneous fleet composition, compartment allocation, and low-carbon requirements, and may limit their optimum-seeking ability. Therefore, exploring an effective DRL model for solving the LC-HMCVRP remains a meaningful challenge.
To address the LC-HMCVRP, we propose a novel DRL model called the dynamic vehicle state attention model (DVS-AM). Building upon the AM [39], our model introduces two key enhancements. First, we design a vehicle state encoder (VSE) that captures the heterogeneous fleet features, including vehicle capacities, compartment configurations, and locations. This encoder generates comprehensive vehicle embeddings that reflect both the physical constraints and environmental impacts of different vehicle types. Second, we propose a dynamic vehicle state update mechanism (DVSU), which dynamically recalibrates the vehicle and fleet embeddings during the decoding process. By iteratively updating vehicle states and masking completed vehicles, DVSU enables the model to account for dynamic fleet conditions during the routing decision-making process.
The main contributions of this paper are summarized as follows:
We are the first to propose a DRL model that can effectively solve the LC-HMCVRP.
We propose a vehicle state encoder and a dynamic vehicle state update mechanism, which together enable comprehensive extraction of heterogeneous vehicle features and continuous updating of vehicle and fleet states during the decoding process.
Extensive computational experiments demonstrate that our approach outperforms two representative heuristics and two state-of-the-art DRL models in terms of solution quality, computation time, and generalization performance.
The remainder of this paper is organized as follows. Section 2 presents the problem formulation and mathematical model. Section 3 describes the proposed DRL-based model in detail. Section 4 outlines the experimental settings. Section 5 presents the experimental results and comparative analysis. Finally, Section 6 concludes the paper and discusses future research directions.
2. Problem Formulation
2.1. Problem Description
In an urban cold chain logistics network, multiple distribution centers (DCs) within the city depend on fresh products (e.g., fruits, vegetables, and other perishable foods) being delivered from a central depot. A heterogeneous fleet of refrigerated vehicles with different capacities operating from the central depot delivers these products to DCs spread across various areas of the city. Each vehicle is equipped with multiple temperature-controlled compartments, each with its own capacity. The vehicles depart from the central depot at time 0 to deliver fresh products to their assigned DCs and then return to the central depot.
Each vehicle must not exceed its compartment capacity limits when loading fresh products. Taking into account the limited fuel availability or the maximum working hours of the driver, all vehicles have the same maximum travel distance limit. The location of each DC is known, along with its specific demand for products in different compartments. The temperature control systems operate throughout the delivery process, consuming energy and producing carbon emissions that depend on the ambient temperature, compartment temperature settings, vehicle load, and travel distance.
To facilitate the formulation of the mathematical model, the following assumptions are made:
All vehicles depart from the central depot at time 0.
The demand at each DC for a specific category of products does not exceed the capacity of the corresponding temperature-controlled compartment in any vehicle.
Each DC is served by one vehicle only and exactly once.
The transportation (refrigeration) costs per unit travel distance (time) are the same for all vehicles.
The carbon emissions per unit travel distance (time) are constant across vehicles of the same type but depend on the vehicle load and compartment temperature settings.
All vehicles have the same preservation capability and traveling speed. The number of compartments in each vehicle corresponds to the number of product categories, and it is assumed that the cost of product loss during delivery is negligible.
2.2. Mathematical Model
The problem described in Section 2.1 can be modeled as a low-carbon heterogeneous multi-compartment vehicle routing problem (LC-HMCVRP). Let $G = (V, E)$ be an undirected graph, where $V = \{0\} \cup N$ represents the set of all nodes, with node 0 denoting the central depot and $N = \{1, 2, \dots, n\}$ representing the set of DCs. The set $E = \{(i, j) \mid i, j \in V,\ i \neq j\}$ represents all possible edges connecting these nodes. Each edge $(i, j) \in E$ is associated with a length $d_{ij}$. Let $C$ denote the set of product categories classified by temperature requirements, where each category $c \in C$ corresponds to a temperature-controlled compartment in the vehicles. Let $q_{ic}$ denote the demand at node $i$ for product category $c \in C$, where $q_{0c} = 0$ for the depot. Let $K$ denote the vehicle fleet. Each vehicle $k \in K$ has compartment capacities $Q_{kc}$. Let $u_{ijk} = \sum_{c \in C} u_{ijkc}$ denote the total remaining capacity (load) of vehicle $k$ before traveling from node $i$ to node $j$, where $u_{ijkc}$ represents the remaining capacity of compartment $c$ in vehicle $k$. The distance traveled by each vehicle cannot exceed $L_{\max}$. A binary decision variable $x_{ijk}$ takes the value of 1 if vehicle $k$ travels from node $i$ to node $j$ and is 0 otherwise.
The optimization objective of the LC-HMCVRP is to minimize the total operation cost $Z$, which consists of three parts: the transportation cost $Z_1$, the refrigeration cost $Z_2$, and the carbon emission cost $Z_3$. The transportation cost represents the basic operational expenses of vehicles, which is proportional to the travel distance. It is formulated as
$$Z_1 = \sum_{k \in K} \sum_{(i, j) \in E} c_k\, d_{ij}\, x_{ijk},$$
where $c_k$ represents the transportation cost per unit travel distance for vehicle $k$ with rated load capacity $Q_k$. In this study, we consider three types of vehicles with progressively increasing transportation costs (including fuel consumption, driver wages, depreciation expenses, and more). From Table A1, these vehicles have rated capacities of 1950 kg, 4750 kg, and 7355 kg, which we approximate as 2, 5, and 7.5 tons, respectively. The values of $c_k$ are set as follows:
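As a concrete sketch, the transportation cost term can be computed from a set of routes and a distance matrix as follows; the routes, distances, and unit costs below are hypothetical placeholders (the actual per-vehicle values are tabulated in the paper):

```python
# Illustrative sketch (not the paper's code) of the transportation cost:
# Z1 = sum over vehicles k and traversed edges (i, j) of c_k * d_ij.
# All numbers below are toy placeholders.

def transportation_cost(routes, dist, unit_cost):
    """routes: {vehicle: [0, n1, ..., 0]}; dist: {(i, j): km};
    unit_cost: {vehicle: CNY per km}."""
    total = 0.0
    for k, route in routes.items():
        for i, j in zip(route, route[1:]):
            total += unit_cost[k] * dist[(i, j)]
    return total

dist = {(0, 1): 4.0, (1, 0): 4.0, (0, 2): 6.0, (2, 0): 6.0}
routes = {"small": [0, 1, 0], "large": [0, 2, 0]}
unit_cost = {"small": 2.0, "large": 3.5}  # hypothetical CNY/km
print(transportation_cost(routes, dist, unit_cost))  # 2*8 + 3.5*12 = 58.0
```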
The refrigeration cost is incurred for maintaining different temperature levels in compartments during transportation. It is defined as
$$Z_2 = \sum_{k \in K} \sum_{(i, j) \in E} \sum_{c \in C} e_{kc}\, t_{ij}\, x_{ijk},$$
where $e_{kc}$ represents the refrigeration cost per unit travel time for compartment $c$ of vehicle $k$, and $t_{ij}$ is the travel time between nodes $i$ and $j$, calculated as $t_{ij} = d_{ij}/v$, with $d_{ij}$ being the distance and $v$ the average speed of vehicles. In this study, $v$ is set to a constant value, a reasonable estimate based on typical operating conditions in urban and suburban transportation. Three temperature zones are considered: frozen (−18 °C), chilled (0–5 °C), and ambient (room temperature). Since refrigeration energy consumption mainly depends on the temperature difference between the compartment and the ambient environment, and assuming consistent insulation performance across vehicles, the impact of compartment size on refrigeration costs is negligible. The refrigeration cost $e_{kc}$ is set to 15, 8, and 0 CNY/h for frozen, chilled, and ambient compartments, respectively, based on their temperature requirements and energy consumption [43].
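A minimal sketch of this computation, using the stated rates of 15, 8, and 0 CNY/h; the average speed below is an assumed placeholder, not the paper's value:

```python
# Sketch of the refrigeration cost for one vehicle: sum of per-compartment
# hourly rates (from the text) times travel time t = d / v. The speed of
# 40 km/h is an assumed placeholder.

RATES = {"frozen": 15.0, "chilled": 8.0, "ambient": 0.0}  # CNY per hour

def refrigeration_cost(leg_distances_km, speed_kmh, compartments):
    hours = sum(leg_distances_km) / speed_kmh
    return sum(RATES[c] for c in compartments) * hours

# One vehicle with all three compartments driving 40 km at an assumed
# 40 km/h: 1 hour * (15 + 8 + 0) = 23 CNY.
print(refrigeration_cost([25.0, 15.0], 40.0, ["frozen", "chilled", "ambient"]))
```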
The carbon emission cost is related to the fuel consumption during transportation, which includes two parts: the emissions from refrigeration units and from vehicle engines. The total carbon emission cost can be calculated as
$$Z_3 = c_e\, \rho \sum_{k \in K} \sum_{(i, j) \in E} \Big( FR_{ijk}\, d_{ij} + \sum_{c \in C} \frac{P_{kc}\, t_{ij}}{E} \Big) x_{ijk},$$
where $c_e$ is the unit carbon emission cost (0.25 CNY/kg) [44], $\rho$ is the carbon emission coefficient (2.63 kg/L), and $E$ is the energy generation per liter of diesel fuel (4 kWh/L). $P_{kc}$ denotes the power requirement of the independent refrigeration unit for compartment $c$ of vehicle $k$ (4 kW for frozen, 2 kW for chilled, and 0 for ambient compartments) [45]. The comprehensive modal emission model (CMEM) [46,47] provides a robust framework for analyzing fuel consumption and emissions based on real-world vehicle dynamics. The variable $FR_{ijk}$ represents the fuel consumption rate for vehicle $k$ with rated load capacity $Q_k$ traveling from node $i$ to node $j$, as determined by the CMEM. The calculation method is provided in Equation (A2) and detailed in Appendix A.
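The two emission components can be sketched as follows, using the constants stated above; the engine fuel consumption rate `fr_l_per_km` is a hypothetical placeholder standing in for the CMEM value from Appendix A:

```python
# Sketch of the carbon emission cost: engine fuel in liters plus the
# refrigeration units' diesel-equivalent liters (power * time / E), both
# converted to kg of CO2 via rho and priced at c_e. Constants are from the
# text; the FR value below is a hypothetical placeholder.

C_E, RHO, E_GEN = 0.25, 2.63, 4.0        # CNY/kg, kg/L, kWh/L
POWER_KW = {"frozen": 4.0, "chilled": 2.0, "ambient": 0.0}

def carbon_cost(dist_km, speed_kmh, fr_l_per_km, compartments):
    engine_litres = fr_l_per_km * dist_km
    hours = dist_km / speed_kmh
    fridge_litres = sum(POWER_KW[c] for c in compartments) * hours / E_GEN
    return C_E * RHO * (engine_litres + fridge_litres)

# 40 km at 40 km/h with a hypothetical FR of 0.2 L/km, frozen + chilled
# units: engine 8 L, fridge (4 + 2) * 1 / 4 = 1.5 L.
print(round(carbon_cost(40.0, 40.0, 0.2, ["frozen", "chilled"]), 3))  # -> 6.246
```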
The mathematical model of the LC-HMCVRP is formulated as follows:
$$\min\ Z = Z_1 + Z_2 + Z_3 \tag{6}$$
s.t.
$$\sum_{k \in K} \sum_{i \in V} x_{ijk} = 1, \quad \forall j \in N \tag{7}$$
$$\sum_{k \in K} \sum_{j \in V} x_{ijk} = 1, \quad \forall i \in N \tag{8}$$
$$\sum_{j \in N} x_{0jk} \le 1,\ \ \sum_{i \in N} x_{i0k} \le 1, \quad \forall k \in K \tag{9}$$
$$\sum_{i \in V} x_{ihk} = \sum_{j \in V} x_{hjk}, \quad \forall h \in N,\ k \in K \tag{10}$$
$$\sum_{i \in V} u_{ijkc} - \sum_{h \in V} u_{jhkc} = q_{jc} \sum_{i \in V} x_{ijk}, \quad \forall j \in N,\ k \in K,\ c \in C \tag{11}$$
$$q_{jc}\, x_{ijk} \le u_{ijkc} \le (Q_{kc} - q_{ic})\, x_{ijk}, \quad \forall i \in V,\ j \in N,\ k \in K,\ c \in C \tag{12}$$
$$u_{0jkc} \le Q_{kc}\, x_{0jk}, \quad \forall j \in N,\ k \in K,\ c \in C \tag{13}$$
$$\sum_{(i, j) \in E} d_{ij}\, x_{ijk} \le L_{\max}, \quad \forall k \in K \tag{14}$$
$$\sum_{k \in K} \sum_{j \in N} x_{0jk} \le |K| \tag{15}$$
$$x_{ijk} \in \{0, 1\}, \quad \forall (i, j) \in E,\ k \in K \tag{16}$$
The objective function (6) minimizes the total operation cost consisting of driving cost, refrigeration cost, and carbon emission cost. Constraints (7) and (8) ensure that each DC is visited exactly once. Constraint (9) guarantees that each vehicle starts from and returns to the depot. Constraint (10) ensures that each route is completed by the same vehicle. Constraint (11) updates the remaining capacity after serving each customer, requiring that the remaining load in each compartment decreases by the delivered demand as the vehicle travels between nodes. Constraint (12) restricts the remaining capacity at every step to be no less than the demand of the next customer and no greater than the available compartment capacity after serving the current customer. Constraint (13) explicitly sets the initial condition for the remaining capacity of each compartment of each vehicle at the depot, ensuring that the capacity flow is properly initialized at the start of each route. Taken together, these three constraints ensure that the capacity flow can only be maintained on routes that originate from and return to the depot, thereby preventing the formation of subtours not connected to the depot and guaranteeing that all vehicle routes are feasible and closed. Constraint (14) restricts the maximum travel distance of each vehicle. Constraint (15) ensures that the number of vehicles used does not exceed the fleet size. Constraint (16) defines the binary decision variables.
2.3. Reformulation as a Markov Decision Process
To solve the LC-HMCVRP using reinforcement learning, we formulate it as a Markov decision process (MDP):
State: Each state $s_t$ at decision step $t$ consists of two components: vehicle states and node states. In the routing process, decisions are made sequentially, where each decision step $t$ represents one routing decision for selecting the next destination node for one of the vehicles. The vehicle states include the current positions and remaining compartment capacities of all vehicles. The node state for node $i$ is represented by a feature vector that contains its coordinates and its demand $q_{ic}$ for each product category $c$.
Action: At each decision step $t$, the action space consists of selecting a vehicle $k_t$ and its next visiting node $n_t$, subject to the capacity constraints and the maximum travel distance constraint.
Transition: After executing action $a_t = (k_t, n_t)$, state $s_t$ transitions to $s_{t+1}$ by updating the position of vehicle $k_t$ to node $n_t$, reducing its remaining compartment capacities by the demands $q_{n_t c}$, and marking node $n_t$ as served (Equations (17)–(19)). Only the selected vehicle and the visited node are updated at each step.
Reward: The total reward $R$ is defined as the sum of the immediate rewards over all decision steps. At each decision step $t$, the immediate reward $r_t$ is calculated as
$$r_t = -\Big( c_{k_t}\, d_t + \sum_{c \in C} e_{k_t c}\, \frac{d_t}{v} + c_e\, \rho\, FR_{k_t}\, d_t + c_e\, \rho \sum_{c \in C} \frac{P_{k_t c}\, d_t}{E\, v} \Big),$$
where $d_t$ is the distance traveled at step $t$; the first term represents the transportation cost, the second term accounts for the refrigeration cost, and the last two terms capture the carbon emission costs from the vehicle engine and refrigeration units, respectively.
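A per-step reward of this shape can be sketched as follows; all rates are illustrative placeholders except the carbon constants quoted in Section 2.2:

```python
# Hypothetical sketch of the immediate reward r_t: the negated sum of the
# step's transportation, refrigeration, and two carbon cost terms. The
# carbon constants are from the text; everything else is a placeholder.

C_E, RHO, E_GEN = 0.25, 2.63, 4.0  # CNY/kg, kg/L, kWh/L

def step_reward(d_km, speed_kmh, c_k, fridge_rate_cnyh, fr_l_per_km, fridge_kw):
    hours = d_km / speed_kmh
    transport = c_k * d_km                                   # first term
    refrigeration = fridge_rate_cnyh * hours                 # second term
    engine_co2 = C_E * RHO * fr_l_per_km * d_km              # engine emissions
    fridge_co2 = C_E * RHO * fridge_kw * hours / E_GEN       # unit emissions
    return -(transport + refrigeration + engine_co2 + fridge_co2)

# A 10 km leg at 40 km/h with toy rates: the reward is the negated cost.
print(step_reward(10.0, 40.0, 2.0, 23.0, 0.2, 6.0))
```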
To further clarify the above MDP formulation, Figure 1 presents a simple example involving two vehicles and three customer nodes. In this figure, each node represents the state $s_t$ at decision step $t$. The annotations above each state node summarize the status of each vehicle (such as its location and availability) and indicate which customer nodes are still pending or have been served. The directed edges between states correspond to possible actions at each step, that is, the agent’s selection of a vehicle $k_t$ and its next destination $n_t$, as defined in our action space. Blue and purple lines indicate decisions made by vehicle 1 and vehicle 2, respectively. The red dashed lines represent the rewards obtained after taking a certain action in a given state and transitioning to the next state. The feasibility of each action is determined by checking the current state against all problem constraints, such as remaining compartment capacities and the maximum travel distance, ensuring that only vehicles with feasible moves and customer nodes with unsatisfied demand are considered. If no feasible customer node exists for a given vehicle, the only available action is to return to the depot. When an action is executed, the system transitions to the next state according to the update rules in Equations (17)–(19). This process continues step by step, as shown by the branching structure in the figure, until all customer demands are satisfied and all vehicles have returned to the depot.
3. The Proposed DVS-AM Model
While DRL models like the AM [39] have demonstrated effectiveness in solving vehicle routing problems, they lack mechanisms to handle heterogeneous fleet characteristics and their interactions with routing decisions, affecting their optimization capability for the LC-HMCVRP. To address these limitations, we propose the DVS-AM, which extends the AM with two key innovations. The vehicle state encoder (VSE) captures heterogeneous fleet features, such as capacities, compartment configurations, and current locations. This encoder produces comprehensive vehicle embeddings that accurately represent the capacity limits, compartment structure, and up-to-date status of each vehicle. Additionally, the dynamic vehicle state update (DVSU) mechanism dynamically recalibrates the vehicle and fleet embeddings during the decoding process. By iteratively updating vehicle states and masking completed vehicles, DVSU enables the model to account for changing fleet conditions during sequential routing decisions. These innovations help our approach more effectively address the challenges of low-carbon cold chain logistics with multi-compartment and heterogeneous fleets.
3.1. Overview of DVS-AM
Figure 2 shows the architecture of DVS-AM. The encoder of this model consists of two parallel components: a node encoder and a vehicle state encoder. These components work together to generate comprehensive embeddings that capture both node relationships and vehicle states.
The node encoder adopts a transformer-based architecture [38] with three stacked blocks. As defined in Section 2.3, each node $i$ is associated with a feature vector that includes its coordinates and its demands for each product category. The input node feature matrix is formed by stacking these vectors for all nodes. This input is first passed through a feed-forward (FF) layer for linear transformation and then processed by three stacked blocks, each containing an MHA layer and an FF layer. Through the self-attention mechanism in the MHA layers, the node encoder captures the relationships among all nodes, generating node embeddings $h_0, h_1, \dots, h_n$, where $h_i$ is the embedding of node $i$. The global graph embedding is then obtained by applying mean pooling over all node embeddings.
At decision step $t$, the vehicle state encoder takes the vehicle features as input. The feature of each vehicle $k$ at decision step $t$ contains its current node position and the remaining capacity of each compartment $c$. The vehicle features first go through an FF layer for linear transformation, followed by an MHA layer that captures the relationships among vehicles, generating the transformed vehicle features.
To effectively integrate vehicle-specific information and the global graph context, we introduce a gating mechanism that adaptively controls the information flow between the vehicle features and the global graph embedding $\bar{h}$, ensuring that the final fleet state embedding captures both local details and global context:
$$g_k = \sigma\big(W_g\,[\tilde{v}_k ; \bar{h}] + b_g\big), \qquad \hat{v}_k = g_k \odot \tilde{v}_k + (1 - g_k) \odot \bar{h},$$
where $W_g$ and $b_g$ are learnable parameters, $[\cdot\,;\cdot]$ denotes concatenation, and ⊙ represents element-wise multiplication. The fleet state embedding is then obtained by applying mean pooling over all gated vehicle embeddings. We chose the sigmoid activation because its output is strictly bounded between 0 and 1, which makes it well suited for gating and controlled information blending. In contrast, tanh outputs values in $[-1, 1]$, which can introduce negative weights and potentially reduce the contribution of certain features when blending information from different sources. ReLU outputs values in $[0, \infty)$, so the gate values can become arbitrarily large, potentially causing one information source to overwhelm the other and resulting in unstable model behavior. Using sigmoid ensures the gate always forms a convex combination of the two information sources, keeping the information flow stable and interpretable throughout the model.
The decoder consists of two main modules—a vehicle selection module and a node selection module—that work in conjunction with the dynamic vehicle state update mechanism (DVSU). The vehicle selection module first concatenates the fleet state embedding and the global graph embedding and transforms them through a feed-forward network. This transformed output and the vehicle embeddings are then fed into a mask attention layer, which calculates attention values following the probability computation approach described in Kool et al. [39]. The mask mechanism sets the attention values of currently used vehicles to zero. These attention values then pass through a softmax layer to obtain the selection probabilities for each available vehicle. Vehicle $k_t$ is selected by a sampling strategy that assigns higher selection chances to vehicles with higher probabilities.
After selecting vehicle $k_t$, its state embedding, the global graph embedding, and the embedding of the node at which vehicle $k_t$ is currently located are concatenated and linearly transformed to obtain the context query. This context query and the node embeddings are then fed into a mask MHA layer, which functions similarly to the MHA layer but includes a mask mechanism to prevent selecting invalid nodes (e.g., nodes that have been visited or would violate constraints). The context query and the output of the mask MHA layer pass through another mask attention layer and a softmax layer to obtain the node selection probabilities. Node $n_t$ is selected through the sampling strategy. Through the DVSU mechanism, once a node is selected, both the individual vehicle state embedding and the fleet embedding are immediately updated, ensuring that subsequent decisions are made based on the most current system state. This sequential decision-making continues until all customer nodes have been visited.
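The masking step shared by both decoder modules can be sketched as follows; this is an illustrative reimplementation, not the model's actual code:

```python
# Toy sketch of decoder masking: scores of infeasible choices (visited
# nodes, already-used vehicles) are set to -inf before the softmax, so they
# receive exactly zero selection probability.
import math

def masked_softmax(scores, feasible):
    masked = [s if ok else float("-inf") for s, ok in zip(scores, feasible)]
    m = max(masked)                                  # for numerical stability
    weights = [math.exp(s - m) for s in masked]      # exp(-inf) == 0.0
    total = sum(weights)
    return [w / total for w in weights]

# Node 1 is already visited, so its probability is exactly zero.
probs = masked_softmax([1.0, 2.0, 0.5], [True, False, True])
assert probs[1] == 0.0 and abs(sum(probs) - 1.0) < 1e-12
```

Sampling then draws from `probs` during training, while greedy decoding simply takes the argmax at test time.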
3.2. Model Training
We train our DVS-AM model, parameterized by $\theta$, using the REINFORCE algorithm [48] to learn the optimal vehicle route planning strategy. The model learns to maximize the expected reward (minimizing the total operation cost) through gradient-based optimization. To reduce the variance in gradient estimation, we introduce a baseline method based on Rollout [39].
During the model training process, we implement a sampling strategy (detailed in Section 3.1) to enhance the model's exploration of diverse route planning strategies, while utilizing a greedy strategy in Rollout to establish a stable baseline value. The greedy strategy always selects the customer with the highest probability at each decision step. The gradient of the loss function with respect to $\theta$ is defined as
$$\nabla_\theta \mathcal{L}(\theta \mid s) = -\,\mathbb{E}\big[\big(R(\pi^{s}) - R(\pi^{g})\big)\, \nabla_\theta \log p_\theta(\pi^{s} \mid s)\big],$$
where $R(\pi^{s})$ and $R(\pi^{g})$ represent the total rewards obtained by the sampling and greedy strategies, respectively, $\pi^{s}$ and $\pi^{g}$ denote the solutions generated by the sampling and greedy strategies, respectively, and $s$ represents the problem instance. The DVS-AM's parameters $\theta$ are updated using the Adam optimizer [49].
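A toy sketch of this advantage-weighted update, with hypothetical rewards and log-probabilities, looks as follows:

```python
# Illustrative REINFORCE-with-rollout-baseline loss: the advantage is the
# sampled solution's reward minus the greedy rollout's reward, scaling the
# log-likelihood of the sampled actions. All numbers are placeholders.
import math

def reinforce_loss(logps, reward_sample, reward_greedy):
    """logps: log-probabilities of the sampled actions along one solution."""
    advantage = reward_sample - reward_greedy
    # Minimizing this pushes up the probability of better-than-baseline
    # solutions (positive advantage) and down otherwise.
    return -advantage * sum(logps)

# Sampling beat the greedy baseline (-10 > -12), so the advantage is +2 and
# the loss rewards increasing the sampled actions' probabilities.
loss = reinforce_loss([math.log(0.5), math.log(0.25)], -10.0, -12.0)
print(loss)
```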
4. Experiment Settings
4.1. Instance Generation
The experiments are conducted on a real-world road network from Chengdu, China. The road network is extracted from OpenStreetMap, covering the area within the Fourth Ring Road of Chengdu. After processing and simplifying intersections, the network contains 1746 nodes and 4274 edges. Each edge represents a road segment with its actual travel distance. Based on this road network, we use Dijkstra’s algorithm to compute the shortest paths and corresponding travel distances between all pairs of nodes, transforming the original network into a complete undirected network. This preprocessing step generates a distance matrix that contains the minimum travel distances between any two nodes in the network.
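This preprocessing step can be sketched with a standard-library Dijkstra on a toy graph (the real Chengdu network has 1746 nodes and 4274 edges):

```python
# Illustrative preprocessing: all-pairs shortest distances on a small toy
# road graph via repeated Dijkstra (stdlib heapq), mirroring how the sparse
# network is turned into a complete distance matrix.
import heapq

def dijkstra(adj, src):
    dist = {v: float("inf") for v in adj}
    dist[src] = 0.0
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

# Toy undirected network: 0-1 (2 km), 1-2 (3 km), 0-2 (10 km).
adj = {0: [(1, 2.0), (2, 10.0)],
       1: [(0, 2.0), (2, 3.0)],
       2: [(1, 3.0), (0, 10.0)]}
matrix = {s: dijkstra(adj, s) for s in adj}
assert matrix[0][2] == 5.0  # the path 0->1->2 beats the direct edge
```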
To evaluate our approach under different problem scales, we consider three settings with 20, 50, and 100 DCs, respectively. For each problem instance, we randomly select one node from the road network as the central depot and several other nodes as DCs. Each DC has demands for three temperature-controlled products: frozen (−18 °C), chilled (0–5 °C), and ambient (room temperature) goods. The demand for each type of product is randomly generated from uniform distributions: frozen products from U (0.08, 0.15) tons, chilled products from U (0.10, 0.20) tons, and ambient products from U (0.15, 0.25) tons.
The heterogeneous fleet consists of three types of vehicles (small, medium, and large). For small vehicles, the capacities of frozen, chilled, and ambient compartments are set to 0.5, 0.7, and 0.8 tons, respectively. For medium vehicles, the corresponding compartment capacities are 1.2, 1.8, and 2.0 tons. For large vehicles, the compartment capacities are set to 1.8, 2.7, and 3.0 t.
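A sketch of the instance generator under the stated demand ranges, with a feasibility check against the smallest vehicle's compartment capacities (mirroring the assumption in Section 2.1 that every demand fits in any vehicle):

```python
# Sketch of instance generation: DC demands drawn from the uniform ranges
# in the text and checked against the stated small-vehicle compartment
# capacities, so every demand fits in any vehicle.
import random

DEMAND_RANGES = {"frozen": (0.08, 0.15),
                 "chilled": (0.10, 0.20),
                 "ambient": (0.15, 0.25)}                 # tons
SMALL_CAPACITY = {"frozen": 0.5, "chilled": 0.7, "ambient": 0.8}  # tons

def sample_dc_demand(rng):
    return {c: rng.uniform(lo, hi) for c, (lo, hi) in DEMAND_RANGES.items()}

rng = random.Random(42)
dcs = [sample_dc_demand(rng) for _ in range(20)]
# Every sampled demand fits even in the smallest vehicle's compartments.
assert all(d[c] <= SMALL_CAPACITY[c] for d in dcs for c in d)
```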
4.2. Implementation Details
For problem sizes of 20 and 50 DCs, we generate 1,280,000 instances with a batch size of 512, while for 100 DCs, we generate 640,000 instances with a batch size of 256. For validation and testing, we generate 1000 instances each. The model consists of an encoder and a decoder. The encoder includes a node encoder with three transformer blocks and a vehicle encoder with one transformer block. Both the node and vehicle embedding dimensions are set to 128, and the feed-forward networks use a hidden dimension of 512. These hyperparameters are determined through preliminary experiments using grid search, where we systematically evaluate different combinations of learning rate (0.0001, 0.00005, 0.001), embedding dimensions (64, 128, 256), number of transformer blocks ((2, 3, 4) for the node encoder, (1, 2) for the vehicle encoder), and attention heads (4, 8, 16) to optimize the model's performance while maintaining computational efficiency.
We train the model for 100 epochs using the Adam optimizer with a learning rate of 0.0001. During training, we apply tanh clipping with C = 10 to stabilize the learning process. To improve training efficiency, we implement a rollout baseline, which is updated per epoch if the performance improvement on the validation set is statistically significant according to a paired t-test. During testing, we adopt a greedy selection strategy in which the node with the maximum attention value is selected at each step.
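The tanh clipping can be sketched in one line; it squashes attention logits into [-C, C] before the softmax, which bounds the logit range and stabilizes training:

```python
# Sketch of tanh clipping with C = 10, as used on the attention logits:
# the output is always within [-C, C], no matter how large the raw score.
import math

def tanh_clip(score, C=10.0):
    return C * math.tanh(score)

assert abs(tanh_clip(100.0)) <= 10.0   # extreme scores are bounded
assert tanh_clip(0.0) == 0.0           # zero passes through unchanged
```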
The model is implemented in PyTorch (v2.5.1), and all training, testing, and deployment experiments were conducted on a server running Windows 11, equipped with an Intel i9-14900 CPU (24 cores, 32 threads), 128 GB of RAM, and an NVIDIA RTX 4090 GPU (24 GB VRAM).
4.3. Baselines
We compare our proposed approach with four baseline methods: