A Deep Reinforcement-Learning-Based Route Optimization Model for Multi-Compartment Cold Chain Distribution

Hu, Jingming; Wang, Chong

doi:10.3390/math13132039

Open AccessArticle

A Deep Reinforcement-Learning-Based Route Optimization Model for Multi-Compartment Cold Chain Distribution

by

Jingming Hu

¹ and

Chong Wang

^2,*

¹

School of Management, Sichuan Agricultural University, Chengdu 611130, China

²

School of Business and Tourism, Sichuan Agricultural University, Chengdu 611130, China

^*

Author to whom correspondence should be addressed.

Mathematics 2025, 13(13), 2039; https://doi.org/10.3390/math13132039

Submission received: 15 April 2025 / Revised: 14 June 2025 / Accepted: 17 June 2025 / Published: 20 June 2025

(This article belongs to the Special Issue Application of Neural Networks and Deep Learning)

Download

Browse Figures

Versions Notes

Abstract

Cold chain logistics is crucial in ensuring food quality and safety in modern supply chains. The required temperature control systems increase operational costs and environmental impacts compared to conventional logistics. To reduce these costs while maintaining service quality in real-world distribution scenarios, efficient route planning is essential, particularly when products with different temperature requirements need to be delivered together using multi-compartment refrigerated vehicles. This substantially increases the complexity of the routing process. We propose a novel deep reinforcement learning approach that incorporates a vehicle state encoder for capturing fleet characteristics and a dynamic vehicle state update mechanism for enabling real-time vehicle state updates during route planning. Extensive experiments on a real-world road network show that our proposed method significantly outperforms four representative methods. Compared to a recent ant colony optimization algorithm, it achieves up to a 6.32% reduction in costs while being up to 1637 times faster in computation.

Keywords:

multi-compartment cold chain distribution; low carbon; heterogeneous vehicle routing problem; deep reinforcement learning; vehicle state encoding; dynamic vehicle state update mechanism

MSC:

90B06

1. Introduction

Ensuring environmental sustainability in logistics is a critical challenge, as the sector significantly contributes to global carbon emissions [1]. Cold chain logistics, essential for food and pharmaceutical safety, requires substantial energy due to refrigeration and temperature control. The resulting operational costs and carbon footprint highlight the urgent need for greener distribution solutions.

To address these challenges in cold chain logistics, one key approach lies in optimizing delivery routes, which can significantly reduce both operational costs and carbon emissions without compromising service quality. Vehicle routing problems (VRPs) are a class of classical combinatorial optimization problems in logistics and transportation [2,3]. Cold chain distribution in practice often exhibits several complex features: shipments may require multiple temperature conditions (e.g., 0–4 °C for fresh produce; below −18 °C for frozen goods), fleets consist of vehicles with different capacities and energy efficiencies, and there is an increasing emphasis on reducing carbon emissions from both transportation and refrigeration. In order to address these practical challenges, various VRP variants that account for such characteristics have been proposed. The multi-compartment VRP (MCVRP) [4,5] addresses the need for vehicles with separate compartments. The heterogeneous VRP (HVRP) [6,7] considers the use of fleets with diverse vehicle types, while the low-carbon VRP (LCVRP) [8,9] incorporates carbon emission costs into the routing objective. Most of the existing studies address these factors in isolation. In reality, however, all of these aspects often arise together in cold chain logistics. To address this gap, we investigate the low-carbon heterogeneous multi-compartment vehicle routing problem (LC-HMCVRP).

Approaches for solving VRPs can be broadly categorized into exact techniques [10,11,12], heuristic algorithms [13,14,15], and deep-learning-based methods [16,17,18]. Exact methods are designed to guarantee optimal solutions in theory, although their practical applicability is often limited by computational complexity. In recent years, some researchers have developed branch-and-cut algorithms and their variants to solve the MCVRP, HVRP, and LCVRP. For the MCVRP, Henke et al. [19] introduced a branch-and-cut capable of solving MCVRP with up to 50 nodes and 9 compartments. Subsequently, Heßler [20] proposed two exact algorithms: a branch-and-cut algorithm based on a two-index formulation and a branch-price-and-cut algorithm based on a route-indexed formulation, which successfully solved a MCVRP with up to 50 nodes and 10 compartments. For the HVRP, Şahin and Yaman [21] proposed a branch-and-price algorithm capable of solving instances with up to 40 customers and 2 vehicle types. Li et al. [22] developed a branch-and-Benders decomposition method, which can handle up to 40 customers and 4 vehicle types. In the context of LCVRP, Liu et al. [11] and Luo et al. [23] proposed branch-and-cut and branch-price-and-cut algorithms respectively, and demonstrated their effectiveness on instances with up to 100 customers. However, the exponential growth in computational complexity as problem size and constraints increase restricts the practical use of exact algorithms in large-scale or complex problems.

Given the computational limitations of exact methods, some researchers have extensively explored heuristic algorithms for solving the MCVRP, HVRP, and LCVRP. Chen et al. [24] utilized an adaptive large neighborhood search algorithm to solve an MCVRP with 100 nodes and two compartments, incorporating fuel consumption costs. Ouertani et al. [25] proposed an adaptive genetic algorithm to solve MCVRP with up to 199 nodes and 2 compartments. Chen et al. [26] further considered refrigeration costs and carbon emissions, proposing a variable neighborhood search algorithm to solve a MCVRP with time windows, with up to 50 nodes and 2 temperature-controlled compartments. Guo et al. [5] proposed a three-dimensional ant colony optimization algorithm to solve a MCVRP with up to 288 nodes and 3 compartments. For the HCVRP, Stavropoulou [27] proposed a hierarchical Tabu Search framework capable of addressing instances with up to 199 customers and 3 vehicle types, while Máximo et al. [28] proposed an adaptive iterated local search for instances with up to 360 customers and 6 vehicle types. For the LCVRP, Wang et al. [29] proposed an improved ant colony optimization algorithm for instances with up to 50 customers, and Qi et al. [30] developed a Q-learning-based evolutionary algorithm for instances with up to 100 customers. While heuristic algorithms can improve computational efficiency compared to exact methods, they often face substantial challenges in addressing the LC-HMCVRP. This is primarily due to the increased combinatorial complexity introduced by multiple temperature-controlled compartments and heterogeneous vehicle fleets. The intricate cross-constraints—such as matching product temperature requirements to specific compartments, managing compartment-specific capacities, and accounting for vehicle-specific cost and emission profiles—make the solution space highly constrained and irregular. Furthermore, metaheuristic algorithms are often sensitive to parameter tuning and may struggle to effectively explore the vast and complex solution space of LC-HMCVRP, which increases the risk of yielding suboptimal solutions.

Deep-learning-based approaches, particularly deep reinforcement learning (DRL), have emerged as a promising direction for solving VRPs. Some researchers developed improvement-based DRL models for various VRPs, similar to improvement heuristics that iteratively refine solutions from an initial solution [31,32,33]. Improvement-based DRL models, while effective, often rely on a predefined initial solution, which can limit their adaptability and efficiency. To address these issues, more researchers have focused on constructive DRL models that incrementally build vehicle routes from scratch by sequentially adding customer nodes, offering faster and more flexible solutions. Earlier construction-based DRL models [34,35,36] for solving VRPs were usually based on long short-term memory (LSTM) networks [37]. The LSTM networks process each element in an input sequence sequentially, as the computation at each step depends on the hidden state from the previous step. This inherent sequential dependency prevents LSTM networks from efficiently parallelizing computations across the entire sequence, resulting in lower training and inference efficiency. Furthermore, the stepwise nature of LSTM architectures makes it difficult to capture the complex spatial relationships among nodes, which is important for VRPs, where accurately modeling these relationships is critical for finding high-quality solutions. As a result, LSTM-based DRL models encounter significant limitations in both computational efficiency and optimization performance when applied to large-scale or complex problems.

In contrast, the transformer model [38], which employs multi-head attention (MHA) mechanisms, can encode all elements in the input sequence simultaneously through self-attention mechanisms, enabling efficient parallelization, batch processing, and richer modeling of complex spatial relationships within the data. Based on this model, Kool et al. [39] proposed the attention model (AM) that demonstrated superior performance compared to several benchmarking models (e.g., OR-Tools) across various routing problems. Building on this work, Bono et al. [40] introduced a multi-agent routing model using deep attention mechanisms (MARDAM), enabling parallel construction of multiple vehicle routes by incorporating fleet state representations. Li et al. [41] tackled the HVRP by proposing a transformer-based DRL model, which incorporates a vehicle selection decoder to handle heterogeneous fleet constraints and a node selection decoder for route. Zou et al. [42] proposed an enhanced transformer model that integrates MHA and attention-to-attention mechanisms to address the low-carbon multi-depot vehicle routing problem. However, to the best of our knowledge, there are currently no transformer-based deep reinforcement learning methods developed for the LC-HMCVRP. The existing models are not specifically designed for solving LC-HMCVRP, which may result in difficulties capturing the complex constraints of heterogeneous fleet composition, compartment allocation, and low-carbon requirements and may limit their optimum-seeking ability. Therefore, exploring an effective DRL model for solving LC-HMCVRP remains a meaningful challenge.

To address the LC-HMCVRP, we propose a novel DRL model called the dynamic vehicle state attention model (DVS-AM). Building upon the AM [39], our model introduces two key enhancements. First, we design a vehicle state encoder (VSE) that captures the heterogeneous fleet features, including vehicle capacities, compartment configurations, and locations. This encoder generates comprehensive vehicle embeddings that reflect both the physical constraints and environmental impacts of different vehicle types. Second, we propose a dynamic vehicle state update mechanism (DVSU), which dynamically recalibrates the vehicle and fleet embeddings during the decoding process. By iteratively updating vehicle states and masking completed vehicles, DVSU enables the model to account for dynamic fleet conditions during the routing decision-making process.

The main contributions of this paper are summarized as follows:

We are the first to propose a DRL model that can effectively solve the LC-HMCVRP.
We propose a vehicle state encoder and a dynamic vehicle state update mechanism, which together enable comprehensive extraction of heterogeneous vehicle features and continuous updating of vehicle and fleet states during the decoding process.
Extensive computational experiments demonstrate that our approach outperforms two representative heuristics and two state-of-the-art DRL models in terms of solution quality, computation time, and generalization performance.

The remainder of this paper is organized as follows. Section 2 presents the problem formulation and mathematical model. Section 3 describes the proposed DRL-based model in detail. Section 4 outlines the experimental settings. Section 5 presents the experimental results and comparative analysis. Finally, Section 6 concludes the paper and discusses future research directions.

2. Problem Formulation

2.1. Problem Description

In an urban cold chain logistics network, multiple distribution centers (DCs) within the city depend on fresh products (e.g., fruits, vegetables, and other perishable foods) being delivered from a central depot. A heterogeneous fleet of refrigerated vehicles with different capacities operating from the central depot delivers these products to DCs spread across various areas of the city. Each vehicle is equipped with multiple temperature-controlled compartments, each with its own capacity. The vehicles depart from the central depot at time 0 to deliver fresh products to their assigned DCs and then return to the central depot.

Each vehicle must not exceed its compartment capacity limits when loading fresh products. Taking into account the limited fuel availability or the maximum working hours of the driver, all vehicles have the same maximum travel distance limit. The location of each DC is known, along with its specific demand for products in different compartments. The temperature control systems operate throughout the delivery process, consuming energy and producing carbon emissions that depend on the ambient temperature, compartment temperature settings, vehicle load, and travel distance.

To facilitate the formulation of the mathematical model, the following assumptions are made:

All vehicles depart from the central depot at time 0.
The demand at each DC for a specific category of products does not exceed the capacity of the corresponding temperature-controlled compartment in any vehicle.
Each DC is served by one vehicle only and exactly once.
The transportation (refrigeration) costs per unit travel distance (time) are the same for all vehicles.
The carbon emissions per unit travel distance (time) are constant across vehicles of the same type but depend on the vehicle load and compartment temperature settings.
All vehicles have the same preservation capability and traveling speed. The number of compartments in each vehicle corresponds to the number of product categories, and it is assumed that the cost of product loss during delivery is negligible.

2.2. Mathematical Model

The problem described in Section 2.1 can be modeled as a low-carbon heterogeneous multi-compartment vehicle routing problem (LC-HMCVRP). Let

G = (V, E)

be an undirected graph, where

V = 0 \cup N

represents the set of all nodes, with node 0 denoting the central depot and

N = {1, \dots, N}

representing the set of DCs. The set

E = {(i, j) | i, j \in V, i \neq j}

represents all possible edges connecting these nodes. Each edge

(i, j)

is associated with a length

l_{i j}

. Let

C = {1, \dots, C}

denote the set of product categories classified by temperature requirements, where each category

c \in C

corresponds to a temperature-controlled compartment in the vehicles. Let

D_{i}^{c}

denote the demand at node i for product category

c \in C

, where

D_{0}^{c} = 0, \forall c \in C

for the depot. Let

K = {1, \dots, K}

denote the vehicle fleet. Each vehicle

k \in K

has compartment capacities

{Q_{k}^{c} | c \in C}

. Let

R_{i j k} = \sum_{c \in C} R_{i j k}^{c}

denote the total remaining capacity (load) of vehicle k before traveling from node i to node j, where

R_{i j k}^{c}

represents the remaining capacity of compartment c in vehicle k. The distance traveled by each vehicle cannot exceed

L^{max}

. A binary decision variable

x_{i j k}

takes the value of 1 if vehicle k travels from node i to node j and is 0 otherwise.

The optimization objective of the LC-HMCVRP is to minimize the total operation cost

F_{t o t a l}

, which consists of three parts: the transportation cost

F_{1}

, the refrigeration cost

F_{2}

, and the carbon emission cost

F_{3}

. The transportation cost represents the basic operational expenses of vehicles, which is proportional to the travel distance. It is formulated as

F_{1} = \sum_{k \in K} \sum_{i \in V} \sum_{j \in V} α_{k} l_{i j} x_{i j k},

(1)

where

α_{k}

represents the transportation cost per unit travel distance for vehicle k with a rated load capacity

ψ_{k}

. In this study, we consider three types of vehicles with progressively increasing transportation costs (including fuel consumption, driver wages, depreciation expenses, and more). From the Table A1, these vehicles have rated capacities of 1950 kg, 4750 kg, and 7355 kg, which we approximate as 2, 5, and 7.5 tons, respectively. The values of

α_{k}

are set as follows:

α_{k} = \{\begin{matrix} 3 CNY / km & ψ_{k} = 2 tons \\ 5 CNY / km & ψ_{k} = 5 tons \\ 7 CNY / km & ψ_{k} = 7.5 tons \end{matrix},

(2)

The refrigeration cost is incurred for maintaining different temperature levels in compartments during transportation. It is defined as

F_{2} = \sum_{k \in K} \sum_{i \in V} \sum_{j \in V} \sum_{c \in C} β_{k}^{c} t_{i j} x_{i j k},

(3)

where

β_{k}^{c}

represents the refrigeration cost per unit travel time for compartment c of vehicle k, and

t_{i j}

is the travel time between nodes i and j, calculated as

t_{i j} = l_{i j} / v

with

l_{i j}

being the distance and v the average speed of vehicles. In this study, v is assumed to be

30 km / h

, a reasonable estimate based on typical operating conditions in urban and suburban transportation. Three temperature zones are considered: frozen (−18 °C), chilled (0∼5 °C), and ambient (room temperature). Since refrigeration energy consumption mainly depends on the temperature difference between the compartment and the ambient environment, and assuming consistent insulation performance across vehicles, the impact of compartment size on refrigeration costs is negligible. The refrigeration cost

β_{k}^{c}

is set as 15, 8, and 0 CNY/h for frozen, chilled, and ambient compartments, respectively, based on their temperature requirements and energy consumption [43].

The carbon emission cost is related to the fuel consumption during transportation, which includes two parts: the emissions from refrigeration units and from vehicle engines. The total carbon emission cost can be calculated as

F_{3} = c_{t} ω [\frac{1}{E} \sum_{k \in K} \sum_{i \in V} \sum_{j \in V} \sum_{c \in C} γ_{k}^{c} t_{i j} x_{i j k} + \sum_{k \in K} \sum_{i \in V} \sum_{j \in V ∖ {0}} δ_{i j k}^{ψ} l_{i j} x_{i j k}],

(4)

γ_{k}^{c} = \{\begin{matrix} 4 KW & β_{k}^{c} = 15, R_{i j k}^{c} > 0 \\ 2 KW & β_{k}^{c} = 8, R_{i j k}^{c} > 0 \\ 0 & β_{k}^{c} = 0 or R_{i j k}^{c} \leq 0 \end{matrix},

(5)

where

c_{t}

is the unit carbon emission cost (0.25 CNY/kg) [44],

ω

is the carbon emission coefficient (2.63 kg/L), and E is the energy generation per liter of diesel fuel (4 kWh/L).

γ_{k}^{c}

denotes the power requirement of the independent refrigeration unit for compartment c of vehicle k (4 kW for frozen, 2 kW for chilled, and 0 for ambient compartments) [45]. The comprehensive modal emission model (CMEM) [46,47] provides a robust framework for analyzing fuel consumption and emissions based on real-world vehicle dynamics. The variable

δ_{i j k}^{ψ_{k}}

represents the fuel consumption rate for vehicle k with rated load capacity

ψ_{k}

traveling from node i to node j, as determined by the CMEM. The calculation method is provided in Equation (A2) and detailed in Appendix A.

The mathematical model of LC-HMCVRP is formulated as follows:

min F_{t o t a l} = F_{1} + F_{2} + F_{3} .

(6)

s.t.

\sum_{k \in K} \sum_{j \in V} x_{i j k} = 1, \forall i \in N,

(7)

\sum_{k \in K} \sum_{i \in V} x_{i j k} = 1, \forall j \in N,

(8)

\sum_{j \in N} x_{0 j k} = \sum_{i \in N} x_{i 0 k} = 1, \forall k \in K,

(9)

\sum_{i \in V} x_{i h k} = \sum_{j \in V} x_{h j k}, \forall h \in N, k \in K,

(10)

R_{h i k}^{c} - R_{i j k}^{c} = D_{i}^{c}, \forall i, j, h \in V, k \in K, c \in C,

(11)

D_{j}^{c} \leq R_{i j k}^{c} \leq Q_{k}^{c} - D_{i}^{c}, \forall i, j \in V, k \in K, c \in C,

(12)

R_{0 j k}^{c} = Q_{k}^{c} x_{0 j k}, \forall j \in N, k \in K, c \in C,

(13)

\sum_{i \in V} \sum_{j \in V} l_{i j} x_{i j k} \leq L^{max}, \forall k \in K,

(14)

\sum_{k \in K} \sum_{j \in N} x_{0 j k} \leq K,

(15)

x_{i j k} \in {0, 1}, \forall i, j \in V, i \neq j, k \in K .

(16)

The objective function (6) minimizes the total operation cost consisting of driving cost, refrigeration cost, and carbon emission cost. Constraints (7) and (8) ensure that each DC is visited exactly once. Constraint (9) guarantees that each vehicle starts from and returns to the depot. Constraint (10) ensures that each route is completed by the same vehicle. Constraint (11) updates the remaining capacity after serving each customer, requiring that the remaining load in each compartment decreases by the delivered demand as the vehicle travels between nodes. Constraint (12) restricts the remaining capacity at every step to be no less than the demand of the next customer and no greater than the available compartment capacity after serving the current customer. Constraint (13) explicitly sets the initial condition for the remaining capacity of each compartment for each vehicle at the depot, ensuring that the capacity flow is properly initialized at the start of each route. Taken together, these three constraints ensure that the capacity flow can only be maintained on routes that originate from and return to the depot, thereby preventing the formation of subtours not connected to the depot and guaranteeing that all vehicle routes are feasible and closed. Constraint (14) restricts the maximum travel distance of each vehicle. Constraint (15) ensures the number of vehicles used does not exceed the fleet size. Constraint (16) defines the binary decision variables.

2.3. Reformulation as an Markov Decision Process

To solve LC-HMCVRP using reinforcement learning, we formulate it as a Markov decision process (MDP):

State: Each state

s_{t}

of decision step t consists of two components: vehicle states and node states. In the routing process, decisions are made sequentially, where each decision step t represents one routing decision for selecting the next destination node for one of the vehicles. The vehicle states include the current positions

p_{k, t}

and remaining compartment capacities

{R_{k, t}^{c} | c \in C}

for all vehicles

k \in K

. The node state for node i is represented by a feature vector

f_{i} = (h_{i}, v_{i}, D_{i}^{1}, \dots, D_{i}^{C})

, where

(h_{i}, v_{i})

are the coordinates and

D_{i}^{c}

the demand for product c at node i.

Action: At each decision step t, the action space consists of selecting vehicle

k_{t}

and its next visiting node

π_{k, t + 1}

, subject to capacity constraints and maximum travel distance constraint.

Transition: After executing action

(k_{t}, π_{k, t})

, state

s_{t}

transitions to

s_{t + 1}

by

p_{k, t + 1} = π_{k, t}

(17)

R_{k, t + 1}^{c} = R_{k, t}^{c} - D_{π_{k, t}}^{c}, \forall c \in C

(18)

D_{π_{k, t}}^{c} = 0, \forall c \in C

(19)

Only the selected vehicle and the visited node are updated at each step.

Reward: The total reward

R = - \sum_{t = 1}^{T} r_{t}

is defined as the sum of immediate rewards over all decision steps. At each decision step t, the immediate reward

r_{t}

is calculated as

r_{t} = - (α_{k_{t}} l_{p_{k, t}, π_{k, t + 1}} + \sum_{c \in C} β_{k_{t}}^{c} t_{p_{k, t}, π_{k, t + 1}} + γ_{k_{t}} l_{p_{k, t}, π_{k, t + 1}} + \sum_{c \in C} δ_{k_{t}}^{c} t_{p_{k, t}, π_{k, t + 1}})

(20)

where the first term represents the transportation cost, the second term accounts for the refrigeration cost, and the last two terms capture the carbon emission cost from vehicle engines and refrigeration units, respectively.

To further clarify the above MDP formulation, Figure 1 presents a simple example involving two vehicles and three customer nodes. In this figure, each node

S_{t}

represents the state at decision step t. The annotations above each state node summarize the status of each vehicle (such as its location and availability) and indicate which customer nodes are still pending or have been served. The directed edges between states correspond to possible actions at each step, that is, the agent’s selection of a vehicle

k_{t}

and its next destination

π_{k_{t}, t + 1}

, as defined in our action space. Blue and purple lines indicate decisions made by vehicle 1 and vehicle 2, respectively. The red dashed lines represent the rewards obtained after taking a certain action in a given state and transitioning to the next state. The feasibility of each action is determined by checking the current state against all problem constraints, such as remaining compartment capacities and maximum travel distance, ensuring that only vehicles with feasible moves and customer nodes with unsatisfied demand are considered. If no feasible customer node exists for a given vehicle, the only available action is to return to the depot. When an action is executed, the system transitions to the next state according to the update rules in Equations (17)–(19). This process continues step by step, as shown by the branching structure in the figure, until all customer demands are satisfied and all vehicles have returned to the depot.

3. The Proposed DVS-AM Model

While DRL models like AM [39] have demonstrated effectiveness in solving vehicle routing problems, they lack mechanisms to handle heterogeneous fleet characteristics and their interactions with routing decisions, affecting their optimization capability for the LC-HMCVRP. To address these limitations, we propose the DVS-AM, which extends the AM with two key innovations. The vehicle state encoder (VSE) captures heterogeneous fleet features, such as capacities, compartment configurations, and current locations. This encoder produces comprehensive vehicle embeddings that accurately represent the capacity limits, compartment structure, and up-to-date status of each vehicle. Additionally, the dynamic vehicle state update (DVSU) mechanism dynamically recalibrates the vehicle and fleet embeddings during the decoding process. By iteratively updating vehicle states and masking completed vehicles, DVSU enables the model to account for changing fleet conditions during sequential routing decisions. These innovations help our approach more effectively address the challenges of low-carbon cold chain logistics with multi-compartment and heterogeneous fleets.

3.1. Overview of DVS-AM

Figure 2 shows the architecture of DVS-AM. The encoder of this model consists of two parallel components: a node encoder and a vehicle state encoder. These components work together to generate comprehensive embeddings that capture both node relationships and vehicle states.

The node encoder adopts a transformer-based architecture [38] with three stacked blocks. As defined in Section 2.3, each node i is associated with a feature vector

x_{i}

that includes its coordinates and demands for each product category. The input node feature matrix

F = {f_{0}, \dots, f_{N}}

is then formed by stacking these vectors for all nodes. This input is first put through a feed-forward (FF) layer for linear transformation, then processed by three stacked blocks where each block contains a MHA layer and a FF layer. Through the self-attention mechanism in MHA layers, the node encoder captures the relationships among all nodes, generating node embeddings

H = {h_{0}, \dots, h_{N}}

, where

h_{i}

is the embedding of node

i \in V

. The global graph embedding

h^{avg}

is then obtained by applying mean pooling over all node embeddings.

At decision step t, the vehicle state encoder takes input vehicle features

V t = {v_{1, t}, v_{2, t}, \dots, v_{K, t}}

as input. For each vehicle k of decision step t, its feature

v_{k, t} = (p_{k, t}, R_{k, t}^{c})

contains its current node position

p_{k, t}

at decision step t and the remaining capacity

R_{k, t}^{c}

for each compartment c. The vehicle features first go through a FF layer for linear transformation, followed by an MHA layer that captures the relationships between vehicles, generating the transformed vehicle features

\tilde{V_{t}}

.

To effectively integrate vehicle-specific information and the global graph context through adaptive control of information flow between vehicle features and the global graph embedding

h^{avg}

, ensuring that the final fleet state embedding

E_{t} = {E_{1, t}, \dots, E_{K, t}}

captures both local details and global context, we introduce a gating mechanism:

g = sigmoid (W_{g} [\tilde{V_{t}}; h^{avg}] + b_{g})

(21)

E_{t} = g ⊙ h^{avg} + (1 - g) ⊙ V_{t}

(22)

where

W_{g}

and

b_{g}

are learnable parameters,

[\cdot; \cdot]

denotes concatenation, and ⊙ represents element-wise multiplication. The fleet state embedding

E_{t}^{avg}

is then obtained by applying mean pooling over all vehicle embeddings in

E_{t}

. We chose the sigmoid activation because its output is strictly bounded between 0 and 1, which makes it well suited for gating and controlled information blending. In contrast, tanh outputs values in

[- 1, 1]

, which can introduce negative weights and potentially reduce the contribution of certain features when blending information from different sources. ReLU outputs values in

[0, + \infty)

, so the gate values can become arbitrarily large, potentially causing one information source to overwhelm the other and resulting in unstable model behavior. Using sigmoid ensures the gate always forms a convex combination of the two information sources, keeping the information flow stable and interpretable throughout the model.

The decoder consists of two main modules—a vehicle selection module and a node selection module—that work in conjunction with a dynamic vehicle state update mechanism (DVSU). The vehicle selection module first concatenates

E_{t}^{avg}

and the global graph embedding

h^{avg}

and transforms it through a feed-forward network. This transformed output and

E_{t}

are then fed into a mask attention layer, which calculates attention values following the probability computation approach described in Kool et al. [39]. The mask mechanism sets the attention values of currently used vehicles to zero. These attention values then pass through a Softmax layer to obtain the selection probabilities for each available vehicle. Vehicle

k_{t}

is selected by a sampling strategy that assigns higher selection chances to vehicles with higher probabilities.

After selecting vehicle

k_{t}

, its state embedding

E_{k, t}

, the global graph embedding

h^{avg}

and the embedding of vehicle

k_{t}

’s currently located node

h_{k, t}

are concatenated and linearly transformed to obtain the context query

q^{context}

. This context query and node embeddings

h i

are then fed into a Mask MHA layer, which functions similarly to the MHA layer but includes a mask mechanism to prevent selecting invalid nodes (e.g., nodes that have been visited or would violate constraints). The context query

q^{context}

and the output from the Mask MHA layer pass through another mask attention layer and softmax layer to obtain node selection probabilities. Node

π_{k, t}

is selected through the sampling strategy. Through the DVSU mechanism, once a node is selected, both the individual vehicle state embedding

E_{k, t}

and the fleet embedding

E_{t}^{avg}

are immediately updated, ensuring that subsequent decisions are made based on the most current system state. This sequential decision-making continues until all customer nodes have been visited.

3.2. Model Training

We train our DVS-AM model, parameterized by

θ

, using the REINFORCE algorithm [48] to learn the optimal vehicle route planning strategy. The model learns to maximize the expected rewards (minimizing total operation cost) through gradient-based optimization. To reduce the variance in gradient estimation, we introduce a baseline method based on Rollout [39].

During the model training process, we implement a sampling strategy (detailed in Section 3.1) to enhance the model’s exploration of diverse route planning strategies, while utilizing greedy strategy in Rollout to establish a stable baseline value. The greedy strategy always selects the customer with the highest probability at each decision step. The gradient of the loss function with regard to

θ

is defined as

\nabla L_{R L} (θ | s) = E p_{θ} (π | s) (R (π^{s a m p l e} | s) - R (π^{g r e e d y} | s)) \nabla log p_{θ} (π^{s a m p l e} | s)

(23)

where

R (π^{s a m p l e} | s)

and

R (π^{g r e e d y} | s)

represent the total rewards obtained by sampling strategy and greedy strategy, respectively. The

π^{s a m p l e}

and

π^{g r e e d y}

denote the solution generated by sampling and greedy strategies, respectively, and s represents the problem instance. The DVS-AM’s parameters

θ

are updated using the Adam optimizer [49].

4. Experiment Settings

4.1. Instance Generation

The experiments are conducted on a real-world road network from Chengdu, China. The road network is extracted from OpenStreetMap, covering the area within the Fourth Ring Road of Chengdu. After processing and simplifying intersections, the network contains 1746 nodes and 4274 edges. Each edge represents a road segment with its actual travel distance. Based on this road network, we use Dijkstra’s algorithm to compute the shortest paths and corresponding travel distances between all pairs of nodes, transforming the original network into a complete undirected network. This preprocessing step generates a distance matrix that contains the minimum travel distances between any two nodes in the network.

To evaluate our approach under different problem scales, we consider three settings with 20, 50, and 100 DCs, respectively. For each problem instance, we randomly select one node from the road network as the central depot and several other nodes as DCs. Each DC has demands for three temperature-controlled products: frozen (−18 °C), chilled (0–5 °C), and ambient (room temperature) goods. The demand for each type of product is randomly generated from uniform distributions: frozen products from U (0.08, 0.15) tons, chilled products from U (0.10, 0.20) tons, and ambient products from U (0.15, 0.25) tons.

The heterogeneous fleet consists of three types of vehicles (small, medium, and large). For small vehicles, the capacities of frozen, chilled, and ambient compartments are set to 0.5, 0.7, and 0.8 tons, respectively. For medium vehicles, the corresponding compartment capacities are 1.2, 1.8, and 2.0 tons. For large vehicles, the compartment capacities are set to 1.8, 2.7, and 3.0 t.

4.2. Implementation Details

For problem sizes of 20 and 50 DCs, we generate 1,280,000 instances with a batch size of 512, while for 100 DCs, we generate 640,000 instances with a batch size of 256. For validation and testing, we generate 1000 instances each. The model consists of an encoder and a decoder. The encoder includes a node encoder with three transformer blocks and a vehicle encoder with one transformer block. Both the node and vehicle embedding dimensions are set to 128, and the feed-forward networks use hidden dimension 512. These hyperparameters are determined through preliminary experiments using the grid search method, where we systematically evaluate different combinations of learning rate (0.0001, 0.00005, 0.001), embedding dimensions (64, 128, 256), number of transformer blocks ((2, 3, 4) for node encoder, (1, 2) for vehicle encoder), and attention heads (4, 8, 16) to optimize the model’s performance while maintaining computational efficiency.

We train the model for 100 epochs using the Adam optimizer with a learning rate of 0.0001. During training, we apply a tanh clip with C = 10 to stabilize the learning process. To improve training efficiency, we implement a rollout baseline, which is updated per epoch if the performance improvement on the validation set is statistically significant according to a paired t-test (

p < 0.05

). During testing, we adopt a greedy selection strategy where nodes with maximum attention values are selected.

The model is implemented by PyTorch (v2.5.1), and all training, testing, and deployment experiments were conducted on a server running Windows 11, equipped with an Intel i9-14900 CPU (24 cores, 32 threads), 128 GB RAM and an NVIDIA RTX 4090 GPU (24 GB VRAM).

4.3. Baselines

We compare our proposed approach with four baseline methods:

TDACO: A three-dimensional ant colony optimization algorithm developed by Guo et al. [5] that considers carbon emissions in MCVRP. It introduces a novel three-dimensional pheromone concentration matrix and adaptive local search strategies to balance exploration and exploitation.
ALNS: The adaptive large neighborhood search algorithm proposed by [24] specifically tackles MCVRP in cold-chain distribution. It employs multiple destroy and repair operators, demonstrating strong performance in handling complex constraints.
AM: A learning-based method using transformer architecture for routing problems proposed by [39]. It leverages a self-attention mechanism to capture node features and relationships.
HVRP-DRL: A DRL model based on the AM for HVRP proposed by Li et al. [41], which extends AM with dedicated vehicle and node selection decoders to address heterogeneous fleet constraints.
Greedy: A simple construction heuristic that selects the nearest feasible customer in each step.

5. Experimental Results

5.1. Training Performances and Convergence

Figure 3 shows the change in average objective value on the training and validation sets over 100 training epochs for problem instances with 20, 50, and 100 DCs. The average objective value refers to the mean total operational cost, including transportation, refrigeration, and carbon emission costs, calculated over all instances in each dataset.

It can be found that the average objective value decreases rapidly during the initial epochs and gradually stabilizes as training progresses for all problem sizes. The training and validation curves remain closely aligned throughout the training process, with no significant divergence between them. This demonstrates that the model achieves stable convergence and no overfitting is observed. The consistent trends across different problem sizes further indicate the robustness of the proposed DVS-AM model.

5.2. Comparison with Baselines

Table 1 provides a detailed performance comparison between our DVR-AM model and four baselines. We evaluate all methods on three differently sized problems, involving 20, 50, and 100 DCs (referred to as 20-DC, 50-DC, and 100-DC, respectively). For each problem size, we generate 1000 test instances (as detailed in Section 4). During testing, the DVS-AM and AM models process instances in batches of 100. In Table 1, the “Cost” represents the average operational cost (comprising transportation cost, refrigeration cost, and carbon emission cost) across the 1000 test instances for each problem size. The “Gap” indicates the percentage difference between the cost of a given method and the cost achieved by TDACO, normalized by the TDACO cost. The “Time” column reports the average computational time required to solve a batch of 100 instances. The best values in each column are highlighted in bold for clear comparison.

It can be found from the Table 1 that (1) the proposed DVS-AM model significantly outperforms TDACO across all problem sizes, achieving negative gaps of −3.05%, −4.54%, and −6.32% for the 20-DC, 50-DC, and 100-DC problem sizes, respectively. The performance gap becomes larger as the problem size increases. (2) In terms of computational efficiency, DVS-AM demonstrates exceptional speed advantages over both TDACO and ALNS. For 20-DC problems, DVS-AM is approximately 685 times faster than TDACO. This speed advantage grows with problem size, reaching 689 and 1637 times faster than TDACO for 50-DC and 100-DC instances. (3) While the Greedy method is computationally fast, it produces significantly worse solutions, with gaps exceeding 20.13–26.67% across all problem sizes. (4) The AM method achieves acceptable results, with gaps ranging from 4.42% to 2.61%. However, it is clearly outperformed by the DVS-AM. (5) HVRP-DRL achieves better results than AM and ALNS, but is still outperformed by DVS-AM, with the gap ranging from 4.31% to 7.05%.

5.3. Ablation Study

To validate the effectiveness of our proposed enhancements to the AM model [39], we conduct ablation experiments. Our DVS-AM model introduces two key innovations: the vehicle state encoder (VSE) that captures heterogeneous fleet characteristics and the dynamic vehicle state update mechanism (DVSU) that dynamically recalibrates vehicle and fleet embeddings during the decoding process. Since DVSU is built upon VSE, we examine the performance of three models: AM, AM with only VSE (i.e., AM w/o DVSU), and our DVS-AM.

The results demonstrate that both VSE and DVSU contribute significantly to the overall performance improvement. It can be found from Table 2 that (1) The introduction of VSE alone (AM w/o DVSU) brings substantial improvements, reducing costs by 2.57–5.23% across different problem sizes compared to AM. This demonstrates the effectiveness of VSE in handling heterogeneous fleet characteristics. (2) When further incorporating DVSU (DVS-AM), we observe additional performance gains, achieving total cost reductions of 4.28–8.70%. This demonstrates that the dynamic state update mechanism effectively enhances the model’s capability in capturing evolving vehicle states during the decoding process. (3) The computational overhead introduced by these enhancements is minimal. Compared to the original AM, DVS-AM only requires an additional 0.03–0.07 s across different problem sizes, maintaining the model’s efficiency while significantly improving its effectiveness.

6. Conclusions

This study proposes DVS-AM, a deep reinforcement-learning-based route optimization model for multi-compartment cold chain distribution, aiming to reduce both operational costs and environmental impacts. The model incorporates a vehicle state encoder and a dynamic vehicle state update mechanism to handle fleet characteristics and real-time states during route optimization.

Our experimental results on the real-world road network validate the superiority of DVS-AM against several effective solution methods, including AM and three representative heuristic methods (TDACO, ALNS, and Greedy). Compared to TDACO, it achieves cost reductions of up to 6.32% while being up to 1637 times faster. The ablation study further validates that both VSE and DVSU contribute to these improvements, with VSE providing substantial gains in handling fleet heterogeneity and DVSU further enhancing performance through dynamic state updates.

Future research could focus on extending the current model to address more complex scenarios with dynamic customer demands and time-dependent travel times, which are common challenges in real-world logistics operations. Additionally, further experiments on various real-world road networks could be conducted to validate the generalizability and robustness of the proposed model across different urban environments.

Author Contributions

Conceptualization, J.H. and C.W.; methodology, J.H.; software, J.H.; validation, C.W.; investigation, J.H.; resources, C.W.; writing—original draft preparation, J.H.; writing—review and editing, C.W.; supervision, C.W.; project administration, C.W.; funding acquisition, C.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 71972136) and Sichuan Science and Technology Program (No. 2022JDTD0022).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Fuel Consumption Rate Estimation

In the Comprehensive Modal Emission Model (CMEM, [46,47]), when a vehicle travels a distance

l_{i j}

at a constant speed v over duration

t_{i j}

while carrying a load Q, the fuel consumption

F (t_{i j}, v, Q)

is calculated as

\begin{matrix} F (t_{i j}, v, R_{i j k}) & = λ f N_{e} V_{d} t_{i j} + λ τ σ v^{3} t_{i j} + λ τ χ l_{i j} (μ + R_{i j k}), \end{matrix}

(A1)

where

μ

is the vehicle curb-weight,

λ = ι / (κ θ)

,

τ = 1 / (1000 ι η)

,

χ = g s i n (ϕ) + g C_{r} c o s (ϕ)

, and

σ = 0.5 C_{d} A_{f r o n} ρ

. Table A1 summarizes the parameters and their corresponding values for three types of vehicles. By substituting the vehicle rated capacity

ψ_{k}

for Q in Equation (A1) and simulating each type of fully loaded vehicle traveling 100 km at a constant speed of 30 km/h, we calculated the fuel consumption rates for fully loaded small, medium, and large vehicles as 7.42 L/100 km, 9.09 L/100 km, and 13.97 L/100 km, respectively. The fuel consumption rate,

δ_{i j k}^{ψ_{k}}

, for vehicle k with a rated load capacity

ψ_{k}

traveling from i to j is defined in Equation (A2), where

Λ_{i j k} = λ τ χ l_{i j}

represents the weight module coefficient.

δ_{i j k}^{ψ_{k}} = \{\begin{matrix} [7.42 - Λ_{i j k} (ψ_{k} - R_{i j k})] L / 100 km, & if ψ_{k} = 2 t, \\ [9.09 - Λ_{i j k} (ψ_{k} - R_{i j k})] L / 100 km, & if ψ_{k} = 5 t, \\ [13.97 - Λ_{i j k} (ψ_{k} - R_{i j k})] L / 100 km, & if ψ_{k} = 7.5 t . \end{matrix}

(A2)

Table A1. Notation and Specifications of Small, Medium, and Large Vehicles.

Notation	Description	Type of Vehicle
Notation	Description	Small	Medium	Large
f	Engine friction factor (kJ/rev/L)	0.20	0.20	0.20
$N_{e}$	Engine speed (rev/s)	62.67	43.33	38.33
$V_{d}$	Engine displacement (L)	3.12	3.76	6.77
$A_{f r o n}$	Frontal surface area ( $m^{2}$ )	2.62	8.18	10.39
$μ$	Curb-weight (kg)	2120	5490	8245
$ψ$	Vehicle rated capacity (kg)	1950	4750	7355
$ι$	Vehicle drive train efficiency	0.4	0.4	0.4
$η$	Efficiency parameter for diesel engines	0.9	0.9	0.9
$ϕ$	Road angle	0	0	0
$C_{r}$	Coefficient of rolling resistance	0.01	0.01	0.01
$ξ$	Fuel-to-air mass ratio	1	1	1
$κ$	Heating value of a typical diesel fuel (kJ/g)	44	44	44
$θ$	Conversion factor (g/L)	737	737	737
$ρ$	Air density (kg/m³)	1.2041	1.2041	1.2041
g	Gravitational constant (m/s²)	9.81	9.81	9.81
$C_{d}$	Coefficient of aerodynamic drag	0.7	0.7	0.7

Appendix B. Example Solutions

Figure A1 shows example solutions generated by our method for instances with 20, 50, and 100 DCs. In each plot, the depot (red star), distribution centers (gray circles), and routes for different vehicle types are clearly distinguished using different markers and line styles.

Figure A1. Visualization of example solutions generated by for instances with 20, 50, and 100 DCs.

References

Mariano, E.B.; Gobbo, J.A., Jr.; de Castro Camioto, F.; do Nascimento Rebelatto, D.A. CO2 emissions and logistics performance: A composite index proposal. J. Clean. Prod. 2017, 163, 166–178. [Google Scholar] [CrossRef]
Dantzig, G.B.; Ramser, J.H. The truck dispatching problem. Manag. Sci. 1959, 6, 80–91. [Google Scholar] [CrossRef]
Laporte, G. Fifty years of vehicle routing. Transp. Sci. 2009, 43, 408–416. [Google Scholar] [CrossRef]
Ostermeier, M.; Henke, T.; Hübner, A.; Wäscher, G. Multi-compartment vehicle routing problems: State-of-the-art, modeling framework and future directions. Eur. J. Oper. Res. 2021, 292, 799–817. [Google Scholar] [CrossRef]
Guo, N.; Qian, B.; Na, J.; Hu, R.; Mao, J.L. A Three-Dimensional Ant Colony Optimization Algorithm for Multi-Compartment Vehicle Routing Problem Considering Carbon Emissions. Appl. Soft Comput. 2022, 127, 109326. [Google Scholar] [CrossRef]
Koç, Ç.; Bektaş, T.; Jabali, O.; Laporte, G. Thirty years of heterogeneous vehicle routing. Eur. J. Oper. Res. 2016, 249, 1–21. [Google Scholar] [CrossRef]
Wang, Y.; Hong, X.; Wang, Y.; Zhao, J.; Sun, G.; Qin, B. Token-based deep reinforcement learning for Heterogeneous VRP with Service Time Constraints. Knowl. Based Syst. 2024, 300, 112173. [Google Scholar] [CrossRef]
Fernández Gil, A.; Lalla-Ruiz, E.; Gómez Sánchez, M.; Castro, C. A review of heuristics and hybrid methods for green vehicle routing problems considering emissions. J. Adv. Transp. 2022, 2022, 5714991. [Google Scholar] [CrossRef]
Xiao, J.; Liu, X.; Zhang, H.; Cao, Z.; Kang, L.; Niu, Y. The low-carbon vehicle routing problem with dynamic speed on steep roads. Comput. Oper. Res. 2024, 169, 106736. [Google Scholar] [CrossRef]
Rostami, B.; Desaulniers, G.; Errico, F.; Lodi, A. Branch-price-and-cut algorithms for the vehicle routing problem with stochastic and correlated travel times. Oper. Res. 2021, 69, 436–455. [Google Scholar] [CrossRef]
Liu, Y.; Yu, Y.; Zhang, Y.; Baldacci, R.; Tang, J.; Luo, X.; Sun, W. Branch-cut-and-price for the time-dependent green vehicle routing problem with time windows. INFORMS J. Comput. 2023, 35, 14–30. [Google Scholar] [CrossRef]
Zhou, H.; Qin, H.; Cheng, C.; Rousseau, L.M. An exact algorithm for the two-echelon vehicle routing problem with drones. Transp. Res. Part B Methodol. 2023, 168, 124–150. [Google Scholar] [CrossRef]
Ma, W.; Zeng, L.; An, K. Dynamic vehicle routing problem for flexible buses considering stochastic requests. Transp. Res. Part Emerg. Technol. 2023, 148, 104030. [Google Scholar] [CrossRef]
Zhang, Z.; Che, Y.; Liang, Z. Split-demand multi-trip vehicle routing problem with simultaneous pickup and delivery in airport baggage transit. Eur. J. Oper. Res. 2024, 312, 996–1010. [Google Scholar] [CrossRef]
Ren, T.; Luo, T.; Jia, B.; Yang, B.; Wang, L.; Xing, L. Improved ant colony optimization for the vehicle routing problem with split pickup and split delivery. Swarm Evol. Comput. 2023, 77, 101228. [Google Scholar] [CrossRef]
Pan, W.; Liu, S.Q. Deep reinforcement learning for the dynamic and uncertain vehicle routing problem. Appl. Intell. 2023, 53, 405–422. [Google Scholar] [CrossRef]
Li, J.; Ma, Y.; Cao, Z.; Wu, Y.; Song, W.; Zhang, J.; Chee, Y.M. Learning feature embedding refiner for solving vehicle routing problems. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 15279–15291. [Google Scholar] [CrossRef]
Zhang, K.; Lin, X.; Li, M. Graph attention reinforcement learning with flexible matching policies for multi-depot vehicle routing problems. Phys. A Stat. Mech. Appl. 2023, 611, 128451. [Google Scholar] [CrossRef]
Henke, T.; Speranza, M.G.; Wäscher, G. A Branch-and-Cut Algorithm for the Multi-Compartment Vehicle Routing Problem with Flexible Compartment Sizes. Ann. Oper. Res. 2019, 275, 321–338. [Google Scholar] [CrossRef]
Heßler, K. Exact Algorithms for the Multi-Compartment Vehicle Routing Problem with Flexible Compartment Sizes. Eur. J. Oper. Res. 2021, 294, 188–205. [Google Scholar] [CrossRef]
Şahin, M.K.; Yaman, H. A branch and price algorithm for the heterogeneous fleet multi-depot multi-trip vehicle routing problem with time windows. Transp. Sci. 2022, 56, 1636–1657. [Google Scholar] [CrossRef]
Li, J.; Qin, H.; Shen, H.; Tong, X.; Xu, Z. Exact algorithms for the multiple depot vehicle scheduling problem with heterogeneous vehicles, split loads and toll-by-weight scheme. Comput. Ind. Eng. 2022, 168, 108137. [Google Scholar] [CrossRef]
Luo, H.; Dridi, M.; Grunder, O. A branch-price-and-cut algorithm for a time-dependent green vehicle routing problem with the consideration of traffic congestion. Comput. Ind. Eng. 2023, 177, 109093. [Google Scholar] [CrossRef]
Chen, L.; Liu, Y.; Langevin, A. A Multi-Compartment Vehicle Routing Problem in Cold-Chain Distribution. Comput. Oper. Res. 2019, 111, 58–66. [Google Scholar] [CrossRef]
Ouertani, N.; Ben-Romdhane, H.; Nouaouri, I.; Allaoui, H.; Krichen, S. A multi-compartment VRP model for the health care waste transportation problem. J. Comput. Sci. 2023, 72, 102104. [Google Scholar] [CrossRef]
Chen, J.; Dan, B.; Shi, J. A Variable Neighborhood Search Approach for the Multi-Compartment Vehicle Routing Problem with Time Windows Considering Carbon Emission. J. Clean. Prod. 2020, 277, 123932. [Google Scholar] [CrossRef]
Stavropoulou, F. The consistent vehicle routing problem with heterogeneous fleet. Comput. Oper. Res. 2022, 140, 105644. [Google Scholar] [CrossRef]
Máximo, V.R.; Cordeau, J.F.; Nascimento, M.C. An adaptive iterated local search heuristic for the Heterogeneous Fleet Vehicle Routing Problem. Comput. Oper. Res. 2022, 148, 105954. [Google Scholar] [CrossRef]
Wang, H.; Li, M.; Wang, Z.; Li, W.; Hou, T.; Yang, X.; Zhao, Z.; Wang, Z.; Sun, T. Heterogeneous fleets for green vehicle routing problem with traffic restrictions. IEEE Trans. Intell. Transp. Syst. 2022, 24, 8667–8676. [Google Scholar] [CrossRef]
Qi, R.; Li, J.q.; Wang, J.; Jin, H.; Han, Y. QMOEA: A Q-learning-based multiobjective evolutionary algorithm for solving time-dependent green vehicle routing problems with time windows. Inf. Sci. 2022, 608, 178–201. [Google Scholar] [CrossRef]
Lu, H.; Zhang, X.; Yang, S. A Learning-based Iterative Method for Solving Vehicle Routing Problems. In Proceedings of the Seventh International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
Chen, X.; Tian, Y. Learning to Perform Local Rewriting for Combinatorial Optimization. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32. [Google Scholar]
Hou, Q.; Yang, J.; Su, Y.; Wang, X.; Deng, Y. Generalize Learned Heuristics to Solve Large-scale Vehicle Routing Problems in Real-time. In Proceedings of the Eleventh International Conference on Learning Representations, Virtual Event, 25–29 April 2022. [Google Scholar]
Nazari, M.; Oroojlooy, A.; Snyder, L.; Takac, M. Reinforcement Learning for Solving the Vehicle Routing Problem. In Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2018; Volume 31. [Google Scholar]
Bello, I.; Pham, H.; Le, Q.V.; Norouzi, M.; Bengio, S. Neural Combinatorial Optimization with Reinforcement Learning. arXiv 2017, arXiv:cs.AI/1611.09940. [Google Scholar]
Yu, J.J.Q.; Yu, W.; Gu, J. Online Vehicle Routing with Neural Combinatorial Optimization and Deep Reinforcement Learning. IEEE Trans. Intell. Transp. Syst. 2019, 20, 3806–3817. [Google Scholar] [CrossRef]
Hochreiter, S. Long Short-term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:cs/1706.03762. [Google Scholar]
Kool, W.; van Hoof, H.; Welling, M. Attention, Learn to Solve Routing Problems! arXiv 2019, arXiv:cs.LG/1803.08475. [Google Scholar]
Bono, G.; Dibangoye, J.S.; Simonin, O.; Matignon, L.; Pereyron, F. Solving multi-agent routing problems using deep attention mechanisms. IEEE Trans. Intell. Transp. Syst. 2020, 22, 7804–7813. [Google Scholar] [CrossRef]
Li, J.; Ma, Y.; Gao, R.; Cao, Z.; Lim, A.; Song, W.; Zhang, J. Deep Reinforcement Learning for Solving the Heterogeneous Capacitated Vehicle Routing Problem. IEEE Trans. Cybern. 2022, 52, 13572–13585. [Google Scholar] [CrossRef]
Zou, Y.; Wu, H.; Yin, Y.; Dhamotharan, L.; Chen, D.; Tiwari, A.K. An improved transformer model with multi-head attention and attention to attention for low-carbon multi-depot vehicle routing problem. Ann. Oper. Res. 2024, 339, 517–536. [Google Scholar] [CrossRef]
Wang, S.; Tao, F.; Shi, Y.; Wen, H. Optimization of vehicle routing problem with time windows for cold chain logistics based on carbon tax. Sustainability 2017, 9, 694. [Google Scholar] [CrossRef]
Guo, W.; Wang, Q.; Liu, H.; Desire, W.A. Multi-energy collaborative optimization of park integrated energy system considering carbon emission and demand response. Energy Rep. 2023, 9, 3683–3694. [Google Scholar] [CrossRef]
Chen, J.; Liao, W.; Yu, C. Route optimization for cold chain logistics of front warehouses based on traffic congestion and carbon emission. Comput. Ind. Eng. 2021, 161, 107663. [Google Scholar] [CrossRef]
Barth, M.; Younglove, T.; Scora, G. Development of a Heavy-Duty Diesel Modal Emissions and Fuel Consumption Model; University of California: Berkeley, CA, USA, 2005. [Google Scholar]
Barth, M.; Boriboonsomsin, K. Energy and emissions impacts of a freeway-based dynamic eco-driving system. Transp. Res. Part D Transp. Environ. 2009, 14, 400–410. [Google Scholar] [CrossRef]
Williams, R.J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn. 1992, 8, 229–256. [Google Scholar] [CrossRef]
Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015; pp. 1–15. [Google Scholar]

Figure 1. Visualization of the MDP process for the LC-HMCVRP with two vehicles and three DCs.

Figure 2. Architectures of DVS-AM.

Figure 3. Training and validation average objective value curves of DVS-AM on 20-DC, 50-DC, and 100-DC instances over training epochs.

Table 1. Performance comparison of different methods.

Method	20-DC			50-DC			100-DC
Method	Cost	Gap (%)	Time (s)	Cost	Gap (%)	Time (s)	Cost	Gap (%)	Time (s)
TDACO	954.32	0.00	123.36	2265.45	0.00	282.57	4325.68	0.00	1015.42
Greedy	1146.45	20.13	0.08	2845.32	25.60	0.15	5478.92	26.67	0.31
ALNS	963.21	0.94	215.32	2285.64	0.89	478.45	4352.84	0.62	1574.36
AM	996.53	4.42	0.15	2344.75	3.50	0.39	4438.42	2.61	0.55
HVRP-DRL	966.82	1.31	0.17	2291.41	1.15	0.41	4359.76	0.79	0.60
DVS-AM	925.18	−3.05	0.18	2162.67	−4.54	0.41	4052.43	−6.32	0.62

Table 2. Results of ablation study.

Method	20-DC			50-DC			100-DC
Method	Cost	Gap (%)	Time (s)	Cost	Gap (%)	Time (s)	Cost	Gap (%)	Time (s)
AM	966.53	0.00%	0.15	2344.75	0.00%	0.39	4438.42	0.00%	0.55
AM w/o DVSU	941.72	−2.57%	0.17	2232.70	−4.78%	0.40	4206.32	−5.23%	0.60
DVS-AM	925.18	−4.28%	0.18	2162.67	−7.77%	0.41	4052.43	−8.70%	0.62

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Hu, J.; Wang, C. A Deep Reinforcement-Learning-Based Route Optimization Model for Multi-Compartment Cold Chain Distribution. Mathematics 2025, 13, 2039. https://doi.org/10.3390/math13132039

AMA Style

Hu J, Wang C. A Deep Reinforcement-Learning-Based Route Optimization Model for Multi-Compartment Cold Chain Distribution. Mathematics. 2025; 13(13):2039. https://doi.org/10.3390/math13132039

Chicago/Turabian Style

Hu, Jingming, and Chong Wang. 2025. "A Deep Reinforcement-Learning-Based Route Optimization Model for Multi-Compartment Cold Chain Distribution" Mathematics 13, no. 13: 2039. https://doi.org/10.3390/math13132039

APA Style

Hu, J., & Wang, C. (2025). A Deep Reinforcement-Learning-Based Route Optimization Model for Multi-Compartment Cold Chain Distribution. Mathematics, 13(13), 2039. https://doi.org/10.3390/math13132039

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Deep Reinforcement-Learning-Based Route Optimization Model for Multi-Compartment Cold Chain Distribution

Abstract

1. Introduction

2. Problem Formulation

2.1. Problem Description

2.2. Mathematical Model

2.3. Reformulation as an Markov Decision Process

3. The Proposed DVS-AM Model

3.1. Overview of DVS-AM

3.2. Model Training

4. Experiment Settings

4.1. Instance Generation

4.2. Implementation Details

4.3. Baselines

5. Experimental Results

5.1. Training Performances and Convergence

5.2. Comparison with Baselines

5.3. Ablation Study

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Appendix A. Fuel Consumption Rate Estimation

Appendix B. Example Solutions

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI