Building on prior research, this study addresses the core of the front-warehouse distribution optimization problem, which we treat as a multi-depot vehicle routing problem with soft time windows (MDVRPSTW). In this section, we present a mathematical model of the MDVRPSTW and then formulate it as a Markov decision process (MDP).
3.1. Problem Description
The optimization problem based on the MDVRPSTW can be formulated as follows. We model the road network as a connected graph G = (V,E), where V is the set of all nodes and E is the set of all edges. The distribution area contains M distribution centers, each equipped with Km delivery vehicles with a rated cargo capacity of Q units. The area contains N customers, each with a demand quantity qi and a known time window [Ei, Li]. A vehicle that arrives at customer i before Ei or after Li misses the customer's time window and incurs a penalty cost for early or late arrival. A vehicle that arrives within [Ei, Li] incurs no penalty cost. All penalty costs are converted into units of time and incorporated into the edge weights of E.
Given the location xi of each customer node, the time windows [Ei, Li], the demand quantities qi, the locations of the depots, the rated capacity Q of each vehicle, and the total number of vehicles K, the goal is to design routes that minimize the total delivery time cost.
The definition of mathematical symbols is shown in Table 2.
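The soft time-window rule described above can be illustrated with a minimal Python sketch. It assumes that the penalty grows linearly with the amount of earliness or lateness, scaled by an early-arrival factor alpha and a late-arrival factor beta; the function names and the linear form are illustrative assumptions, not the exact formulation of the model.

```python
def time_window_penalty(arrival, earliest, latest, alpha, beta):
    """Penalty for arriving at a customer with time window [earliest, latest].

    Assumption: the penalty is linear in the earliness or lateness.
    """
    early = max(earliest - arrival, 0.0)  # time units before E_i
    late = max(arrival - latest, 0.0)     # time units after L_i
    return alpha * early + beta * late


def penalized_edge_cost(travel_time, departure, earliest, latest, alpha, beta):
    """Edge weight: travel time plus the penalty incurred on arrival."""
    arrival = departure + travel_time
    return travel_time + time_window_penalty(arrival, earliest, latest, alpha, beta)
```

For example, a vehicle that leaves at time 1, travels for 4 time units, and reaches a customer whose window is [8, 12] arrives 3 units early; with alpha = 0.5 the edge then costs 4 + 1.5 = 5.5 time units.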
3.3. Overview of MDP
By modeling the route solution of the MDVRPSTW as an MDP, this paper transforms the problem into a sequential decision-making problem. The specific process is shown in Figure 1. The tuple M = {S, A, τ, r} serves as the primary definition of the MDP. This section outlines the state space, action space, state transition, and reward function, respectively.
The state space consists of two components, the global static state Sg and the vehicle dynamic states Sd, and these two components together form the complete decision state.
Global Static State Sg: The inherent invariant parameters of the problem, including geographic coordinates (xi, yi) of all nodes, the demand qi of each customer node, customer service time windows [Ei, Li], vehicle rated load Q, number of depots M, vehicle allocation per depot Km, total vehicle count K, the travel cost Dij between each pair of nodes, early arrival penalty factor α, and late arrival penalty factor β.
Dynamic State Sd: The parameters updated in real time during the decision-making process, which represent the decision states of individual vehicles. Each vehicle’s dynamic state is denoted as Sdk = {(x0, d0), (x1, d1), …, (xt, dt)}, where xt denotes the customer or depot node that vehicle k is about to visit at time t, and dt represents the remaining load after vehicle k visits xt. The dynamic state also includes a global node mask at time t. Note that if all customer nodes in the global node mask are prohibited once the current action completes, the next action is forced to return all vehicles to the depots, after which the decision-making process terminates.
The overall state space S = Sg ∪ {Sd1, Sd2, …, SdK} represents the set of global static parameters and all vehicle dynamic states, where every state is valid.
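One possible in-memory layout of this state is sketched below in Python. The class and field names are hypothetical, and the sketch assumes that depot nodes are indexed before customer nodes and that every depot holds the same number of vehicles; the travel cost matrix Dij is omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class StaticState:
    """Global static state S_g: fixed for the whole episode."""
    coords: Tuple[Tuple[float, float], ...]   # (x_i, y_i); depots first (assumption)
    demand: Tuple[float, ...]                 # q_i, with 0 for depot nodes
    windows: Tuple[Tuple[float, float], ...]  # [E_i, L_i] per node
    capacity: float                           # rated load Q
    num_depots: int                           # M
    vehicles_per_depot: int                   # K_m (assumed equal for all depots)
    alpha: float                              # early-arrival penalty factor
    beta: float                               # late-arrival penalty factor


@dataclass
class VehicleState:
    """Dynamic state S_d^k: the (node, remaining load) history of vehicle k."""
    history: List[Tuple[int, float]] = field(default_factory=list)

    @property
    def node(self) -> int:
        return self.history[-1][0]

    @property
    def load(self) -> float:
        return self.history[-1][1]


def initial_state(static: StaticState):
    """Place every vehicle at its home depot with a full load."""
    vehicles = [
        VehicleState(history=[(depot, static.capacity)])
        for depot in range(static.num_depots)
        for _ in range(static.vehicles_per_depot)
    ]
    # Global node mask: True means the node may still be assigned.
    available = [True] * len(static.coords)
    return vehicles, available
```

Keeping the static part immutable (`frozen=True`) mirrors the requirement that Sg never changes, while each vehicle carries its own growing history of (xt, dt) pairs.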
Action Space: The purpose of an action is to assign the next node to visit to each vehicle currently in a valid decision state, enabling synchronized path decisions among multiple vehicles. A single decision takes the form at = {xt1, xt2, …, xtK}, where K denotes the total number of vehicles and xtk represents the next node assigned to the k-th vehicle at step t (either a customer node or a depot node; if no movement occurs, the vehicle remains at its node from time t − 1). To ensure the validity of the action space A, a vehicle capacity mask is computed from dt before node selection (the node occupied at time t − 1 is always available). The global mask is then replicated, AND-ed with the vehicle capacity mask, and the current node is set as accessible before the action probabilities are calculated. Once a vehicle selects its node, the global node mask is immediately updated.
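The masking procedure for a single vehicle can be sketched as follows. This is a minimal Python illustration in which True means "accessible"; the function names are hypothetical, and the masked softmax over raw scores is an assumption about how action probabilities are obtained, since the policy network is not specified here.

```python
import math


def action_mask(global_mask, demand, remaining_load, current_node):
    """Combine the global node mask with the vehicle's capacity mask."""
    # Capacity mask: a node is feasible if its demand fits the remaining load.
    capacity_mask = [q <= remaining_load for q in demand]
    # Work on a copy of the global mask and AND it with the capacity mask.
    mask = [g and c for g, c in zip(global_mask, capacity_mask)]
    # Staying at the current node is always allowed.
    mask[current_node] = True
    return mask


def masked_probabilities(scores, mask):
    """Softmax restricted to accessible nodes (assumed policy output)."""
    top = max(s for s, m in zip(scores, mask) if m)  # for numerical stability
    weights = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(weights)
    return [w / total for w in weights]
```

Because the current node is always set as accessible, the mask never becomes empty, so the normalization in `masked_probabilities` is always well defined.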
State transition: The state transition function τ: S × A→S maps a valid state st at time t and a synchronous decision action at to the valid state st+1 at time t + 1. The core logic of the state transition involves independent updates of the individual vehicle states, unified verification of global constraints, real-time mask synchronization, and strictly sequential progression. All update operations adhere to the constraints of the MDVRPSTW (vehicle load capacity, unique node access, number of vehicles departing from the depot, etc.). The specific update rules are as follows:
1. Global static state Sg: It remains unchanged. Its inherent parameters, such as node coordinates, demand quantities, time windows, and penalty factors, stay constant throughout the decision-making process and serve solely as constraints for state transitions.
2. Vehicle dynamic state Sd: Each vehicle k (k = 1, 2, …, K) is updated independently but in synchrony with the others, based on the next node xtk assigned to it in action at (where t is the t-th step assigned to the k-th vehicle). The node update appends the node to be visited at time t + 1 to the history, i.e., the core structure of Sdk changes from {(x0, d0), …, (xt, dt)} to {(x0, d0), …, (xt, dt), (xt+1, dt+1)}. The remaining load is updated as dt+1 = dt − qt+1, where qt+1 is the customer demand at xt+1; if xt+1 is the same node as at step t (no movement), dt+1 = dt; if xt+1 is a depot node, dt+1 = Q.
Global node mask updates are as follows: The mask is updated immediately after each vehicle selects its node. During action execution, once the assignment of a customer node xtk to vehicle k has been completed and validated, xtk is marked as “allocated”, so that subsequent vehicles select from the updated mask and no node is allocated twice. Time updates are as follows: The global decision-making time advances uniformly from t to t + 1, which keeps the decision timing of all vehicles synchronized and avoids constraint conflicts caused by temporal asynchrony.
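The update rules above can be summarized in a short Python sketch of one synchronous transition. The function name and argument layout are hypothetical, depot nodes are assumed to be indexed before customer nodes, and a True mask entry means that the node is still available.

```python
def step(histories, global_mask, action, demand, capacity, num_depots):
    """Apply one synchronous action and return the updated state.

    histories: one list of (node, remaining_load) pairs per vehicle.
    action:    next node for each vehicle, in vehicle order.
    """
    new_mask = list(global_mask)
    new_histories = []
    for hist, nxt in zip(histories, action):
        node, load = hist[-1]
        if nxt == node:
            new_load = load                  # no movement: load unchanged
        elif nxt < num_depots:
            new_load = capacity              # depot visit: reload to Q
        else:
            if not new_mask[nxt]:
                raise ValueError(f"node {nxt} is already allocated")
            if demand[nxt] > load:
                raise ValueError(f"node {nxt} exceeds the remaining load")
            new_load = load - demand[nxt]    # d_{t+1} = d_t - q_{t+1}
            new_mask[nxt] = False            # update mask before the next vehicle
        new_histories.append(hist + [(nxt, new_load)])
    return new_histories, new_mask
```

Updating `new_mask` inside the loop, rather than after it, is what prevents two vehicles from being assigned the same customer within a single action.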
Reward function: The goal of the MDVRPSTW is to minimize the total time needed to complete the delivery task. This study defines the reward function as the negative value of the objective function, so that maximizing the reward is equivalent to minimizing the sum of vehicle driving distance and penalty time.
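As a rough illustration, the reward of a complete solution could be computed as in the following Python sketch. It is not the exact objective of the model: it assumes that every vehicle departs at time 0, that service takes no time, that vehicles do not wait when they arrive early, and that penalties are linear in earliness and lateness; the function and parameter names are hypothetical.

```python
def episode_reward(routes, travel_time, windows, alpha, beta):
    """Negative of (total travel time + total time-window penalty).

    routes:      one node sequence per vehicle, e.g. [0, 1, 2, 0].
    travel_time: matrix with travel_time[i][j] for every node pair.
    windows:     dict mapping customer node -> (E_i, L_i).
    """
    total = 0.0
    for route in routes:
        clock = 0.0  # assumption: every vehicle departs at time 0
        for prev, nxt in zip(route, route[1:]):
            leg = travel_time[prev][nxt]
            clock += leg
            total += leg
            if nxt in windows:  # depots carry no time window here
                earliest, latest = windows[nxt]
                total += alpha * max(earliest - clock, 0.0)
                total += beta * max(clock - latest, 0.0)
    return -total
```

With travel times 0→1 = 2, 1→2 = 3, 2→0 = 4, windows {1: (3, 5), 2: (0, 4)}, alpha = 1, and beta = 2, the route [0, 1, 2, 0] accumulates 9 units of travel, 1 unit of early penalty, and 2 units of late penalty, giving a reward of −12.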