Dynamic Optimization of Multi-Echelon Supply Chain Inventory Policies Under Disruptive Scenarios: A Deep Reinforcement Learning Approach

Xiaonong Lu; Hongzhe Wang; Zhanglin Peng; Chen Liao; Chunyan Liu

doi:10.3390/sym17122078

,

and

School of Management, Hefei University of Technology, Hefei 230009, China

^*

Author to whom correspondence should be addressed.

Symmetry2025, 17(12), 2078;https://doi.org/10.3390/sym17122078

Version Notes

Order Reprints

Abstract

Addressing two types of supply chain disruptions—frequent short-duration disruptions (e.g., minor natural disasters) and infrequent long-duration disruptions (e.g., geopolitical conflicts, public health crises)—while considering their impact on logistics capacity, this paper proposes a multi-echelon inventory management optimization framework based on the Proximal Policy Optimization (PPO) algorithm. Unlike traditional inventory control models with simplistic assumptions, this study integrates factors such as the frequency, duration, and impact of disruptions into the inventory optimization process. It is designed to coordinate replenishment decisions at the warehouse while reacting to local retailer states. Since retailers share the same cost parameters and demand dynamics, their decision problems are structurally symmetric, which allows us to use a shared policy across retailers and thus keep the learning model compact and scalable. Numerical experiments compare the PPO policy with classical inventory heuristics under various network sizes and disruption types. The results show that PPO consistently achieves lower total costs than the benchmarks, and its relative advantage becomes more pronounced under severe or longer disruptions. These findings suggest that modern policy-gradient methods, combined with simple forms of structural symmetry, can provide an effective and scalable tool for managing disrupted multi-echelon supply chains.

Keywords:

inventory management; deep reinforcement learning; proximal policy optimization; multi-echelon inventory model; supply chain disruptions

1. Introduction

With the acceleration of globalization and the growing complexity of supply chain networks, supply chains have become indispensable to the stability of the global economy. However, supply chain disruptions (SCDs)—defined as unexpected interruptions or severe delays in the flow of goods, services, or information caused by unforeseen events—are occurring with increasing frequency, posing significant threats to the operational continuity of enterprises. Short-term, high-frequency disruptions, such as earthquakes, floods, and hurricanes, can directly damage logistics infrastructure, leading to production halts, inventory losses, and transportation breakdowns. In contrast, long-term, low-frequency disruptions—such as geopolitical conflicts and abrupt policy changes (e.g., trade disputes, wars, sanctions)—may disrupt cross-border flows of critical raw materials and finished goods, creating severe supply risks and cost escalations. Public health crises, exemplified by the COVID-19 pandemic, exacerbate these risks by causing labor shortages, factory shutdowns, and paralysis of logistics systems, ultimately leading to widespread supply shortages and volatile demand patterns. According to the International Maritime Organization (IMO), global freight costs surged by approximately 20% in 2020 due to port delays. In this context of heightened uncertainty, effective inventory management and improved supply chain resilience—enabling flexible responses to disruptions—have become crucial for safeguarding operational stability and supply chain security. Therefore, there is a clear practical necessity to revisit multi-echelon inventory policies under disruption risk: firms require decision rules that explicitly account for different disruption regimes and remain effective as the network scales, rather than relying on stationary safety-stock rules calibrated ex ante.

Classical inventory models such as the Economic Order Quantity (EOQ) model, the (s, S) policy, and base-stock policies provide important structural insights and have been widely used in practice, but they are typically developed for deterministic or mildly stochastic environments and rely on fixed parameters and stationary assumptions [1,2]. When supply and demand become highly volatile, and disruptions are frequent, especially in multi-echelon networks where upstream and downstream uncertainties interact, such static policies may no longer be effective, which has motivated the use of reinforcement learning (RL) to learn adaptive inventory policies from data [3]. However, most disruption-aware and DRL-based studies still model supply states in a simplified way and focus on single-node or two-echelon settings, so the scalability of RL policies to multi-echelon networks and the explicit role of heterogeneous disruption regimes in shaping inventory decisions remain underexplored [4,5].

To address the above challenges, this study is guided by the following research questions. RQ1: How do heterogeneous disruption patterns—distinguishing frequent short-duration shocks from infrequent long-duration events—affect multi-echelon inventory dynamics when both upstream arrival rates and downstream fulfillment rates are impacted? RQ2: Can a PPO-based replenishment policy coordinate multiple retailers more effectively than classical (s, S) and (S, T) policies in terms of total cost and resilience under different disruption regimes and network scales? RQ3: What managerial guidelines can be derived regarding safety stock configuration and replenishment timing for firms operating multi-echelon networks under disruption risk?

For addressing these questions, this paper studies a multi-echelon inventory system consisting of a central warehouse and multiple retailer nodes, where supply chain disruptions simultaneously affect the effective arrival ratio from the warehouse to retailers and the fraction of goods that can be successfully delivered to customers. Retailers make replenishment decisions in each period based on inventory levels, in-transit orders, realized demand, and disruption status. The problem is formulated as a Markov Decision Process (MDP), in which the state captures multi-echelon inventory and disruption information, actions correspond to continuous order quantities for each retailer, and one-period costs include holding, lost-sales, and logistics/disruption-related components. A PPO-based policy is then trained to minimize the discounted total cost of the overall system and is benchmarked against classical (s, S) and (S, T) policies.

The main contributions of this paper are threefold. First, we develop a unified modeling framework that integrates heterogeneous disruption regimes with multi-echelon inventory dynamics, explicitly allowing disruptions to simultaneously reduce upstream arrival ratios and downstream fulfillment fractions. This extends existing multi-echelon DRL frameworks by providing a richer representation of disruption heterogeneity and its effect on inventory decisions. Second, we employ PPO to learn a single policy that exploits structural symmetry among retailers to coordinate replenishment decisions across nodes, and we systematically compare its performance with classical (s, S) and (S, T) policies. Third, through numerical experiments, we show that PPO achieves lower total cost and higher resilience, especially under infrequent long disruptions and in larger networks, and we derive managerial implications: managers should proactively build higher pre-disruption safety stocks when facing long disruptions, and DRL-based replenishment policies such as PPO can serve as scalable decision-support tools that reduce manual parameter tuning and enhance operational robustness in practice.

The remainder of this paper is organized as follows: Section 2 reviews related literature on inventory management and RL applications in this field. Section 3 presents the dynamic modeling framework for the multi-echelon inventory system under disruptions. Section 4 formulates the MDP and details the PPO-based policy learning process. Section 5 reports experimental results comparing PPO with traditional policies. Section 6 concludes the paper and outlines future research directions.

2. Literature Review

2.1. Research on Inventory Management Problems

Inventory management, as a core component of supply chain management, has been extensively studied over the decades. Early work focused on deriving optimal policies under stylized conditions. The Economic Order Quantity (EOQ) model pioneered by Arrow, Harris, and Marschak [6] established the fundamental cost trade-off between ordering and holding costs, while the (s, S) policy formalized by Arrow [7] was shown to be structurally optimal for periodic review systems with stochastic demand and fixed ordering costs. Clark [8] further demonstrated the optimality of base-stock policies in serial systems with linear holding and backorder costs. Subsequent research extended these ideas to multi-echelon settings, for example, through inter-echelon cost transfer mechanisms [9] and dual-supplier structures with delayed ordering [10], laying the foundation for coordinated control in distribution networks.

As global supply chains became more complex and vulnerable, the literature began to explicitly incorporate disruption risks. Parlar and Berkin [11] introduced a two-state (normal/disrupted) supplier model extending EOQ to supply disruption scenarios with lost sales, and Parlar [12] developed semi-Markov models for ON/OFF supply processes. Schmitt [13] used discrete-time Markov chains to analyze base-stock (S, T) policies under supply disruptions, while Saithong [14] and Taleizadeh [15] refined (S, T)-type models by considering continuous disruption durations and optimizing base-stock levels. Other studies highlighted the multi-dimensional impact of disruptions on lead times and flows: Song and Zipkin [16] and Gupta [17] examined variable lead times and their interaction with Poisson demand; Saputro, Figueira, and Almada-Lobo [18] showed that transportation disruptions amplify inventory fluctuations; Pathy and Rahimian [19] emphasized the role of delivery failures and inventory losses in healthcare supply chains. These works collectively demonstrate that disruptions affect both upstream arrival processes and downstream service levels, but they typically rely on relatively low-dimensional, often binary, disruption representations.

Recent studies have further enriched the modeling of uncertainty using confidence-level and game-theoretic approaches, and Liu [20] considers the impact of cost uncertainty at different confidence levels on supply chain competition. De Giovanni [21] developed a dynamic supply chain game with vertical coordination and horizontal competition, showing how contractual mechanisms and strategic interactions shape inventory and pricing decisions over time. Gao and Hua [22] analyzed a green e-commerce supply chain with delivery-time-dependent demand and epistemic uncertainty, using confidence levels to shape robust decisions on pricing and carbon-emission efforts. Liu [23] introduced the Merton Jump Diffusion (MJD) model to simulate non-stationary demand fluctuations induced by disruptions. Compared with these studies, which focus on strategic or contract-level decisions under epistemic or cost uncertainty, our work concentrates on operational multi-echelon inventory control under stochastic disruption processes and uses DRL rather than analytical game equilibria to handle high-dimensional, state-dependent decisions.

In summary, the classical and disruption-aware inventory literature provides rich insights into optimal policies and risk mechanisms, but most models either rely on simplified two-state disruption structures or treat uncertainty via confidence levels and game-theoretic parameters. They do not explicitly represent heterogeneous disruption regimes (e.g., frequent short vs. infrequent long events) acting simultaneously on upstream arrival ratios and downstream fulfillment, nor do they leverage learning-based methods to adapt inventory policies in such environments.

2.2. Application of Deep Reinforcement Learning in Inventory Management

In parallel, Deep Reinforcement Learning (DRL) has shown strong potential for addressing complex inventory and supply chain problems. At the single-echelon level, Dittrich and Fohlmeister [24] applied DQN-type algorithms to inventory control under demand volatility, while Barat et al. [25] used actor–critic methods for joint inventory–transportation decisions. DRL has also been deployed for domain-specific applications such as perishable inventory management [26], shortage risk mitigation for critical medical supplies [27], and vendor-managed inventory in the semiconductor industry [28]. In the broader supply chain context, Aboutorabet al. [29] used DRL to identify disruption risks, while Zhao et al. [30] showed that DRL can enhance supply chain resilience against external shocks.

For multi-echelon systems, several recent works are particularly relevant. Geevers [31] designed DRL models for continuous-action replenishment in multi-echelon networks. Harsha et al. [32] integrated DRL with mathematical programming to coordinate inventory decisions, and Wang and Lin [33] explored multi-agent RL for collaborative decision-making. More recently, Meisheri [34] proposed a multi-agent DRL framework for multi-echelon inventory management, demonstrating the feasibility of learning coordinated policies across echelons. Babazadeh [4] developed a hybrid ANN–MILP model for agile recovery production planning for PPE under sharp demand spikes, while Wang [35] introduced a DRL-based dynamic replenishment approach for multi-echelon inventory systems. These studies confirm that DRL and related machine-learning techniques are powerful tools for multi-echelon and disruption-prone environments.

At the methodological level, the literature has compared different DRL algorithms and architectures. Foundational DQN variants are widely used for discrete decisions [36], whereas actor–critic methods, including A2C and other policy-gradient algorithms, are preferred for more complex or continuous control tasks [37]. Hybrid approaches combine DRL with mathematical programming [32], discrete-event simulation [38], or reward-shaping and architecture tuning [26]. However, most DRL applications either consider generic demand or cost uncertainty or treat disruptions implicitly, and many are limited to single-node or simple two-echelon structures without explicitly modeling heterogeneous disruption regimes.

2.3. Positioning and Methodological Novelty of This Study

Against this background, the studies reviewed in Section 2.1 and Section 2.2 can be viewed as the current state-of-the-art along two closely related strands. Our work contributes to the literature in two main directions. First, relative to classical and confidence-level-based disruption models [13,39], we propose a unified multi-echelon modeling framework that explicitly captures disruption heterogeneity along two dimensions—frequency and duration—and allows disruptions to simultaneously reduce the effective arrival ratio from the warehouse to retailers and the fraction of goods delivered to customers. This extends existing disruption modeling, which typically uses two-state or aggregate risk parameters, by embedding a richer Markovian disruption structure directly into the inventory dynamics and cost components.

Second, compared with current multi-echelon DRL frameworks [40,41,42,43], our study emphasizes the methodological integration of disruption heterogeneity and DRL. We formulate the problem as a high-dimensional MDP where the state includes multi-echelon inventories, in-transit orders, and disruption states, and the action space consists of continuous order quantities for multiple retailers. We adopt Proximal Policy Optimization (PPO) as the learning algorithm, leveraging its stable policy-gradient updates for long-horizon, non-stationary reward structures in disruption-prone systems, and show how a single PPO agent can exploit structural symmetry among retailers to coordinate replenishment decisions across nodes. Numerical experiments systematically compare the learned PPO policy with classical (s, S) and (S, T) policies under different disruption regimes and network scales, demonstrating not only cost advantages but also resilience improvements. In this sense, the methodological superiority of our approach lies in jointly modeling disruption heterogeneity, multi-echelon interactions, and continuous-action DRL within a coherent optimization framework.

3. Modeling the Multi-Echelon Inventory Problem Under Supply Chain Disruptions

3.1. Problem Description

This study focuses on managing multi-echelon inventory systems within supply chains subject to disruptions, as shown in Figure 1. The core system under investigation comprises a central warehouse supplying a network of M retailer nodes. System operation unfolds over discrete time periods (e.g., weeks). During each period, every retailer node must make real-time, online replenishment quantity decisions. These decisions are based on the prevailing inventory level, observed market demand, and the overall state of the supply chain, with particular emphasis on its disruption status.

Figure 1. Illustration of a multi-level supply chain.

To quantify the specific impact of disruptions on logistics capacity, this study refers to the framework of Pathy et al. [19] and introduces a key disruption-loss coefficient mechanism. The actual quantity of goods received by retailers is only a portion of their expected arrivals and is adjusted proportionally by

μ_{1} \in [0,1]

. Meanwhile, during the delivery process to the final customers, not all available inventory can be delivered intact. Only a proportion

μ_{2} \in [0,1]

of the goods can remain intact and be delivered to customers.

3.2. Premises and Assumptions

To accurately capture the inherent dynamics and uncertainties of multi-echelon inventory management and ensure model robustness and adherence to practical business logic, this study adopts the following key assumptions:

(1): The central warehouse possesses infinite storage capacity, whereas each retailer node operates under a finite storage capacity constraint, ensuring actual inventory levels never exceed prescribed maximum limits.
(2): Replenishment lead time is fixed and deterministic, constituting an integer multiple of the periodic review interval; retailers employ a cycle-based inventory review policy triggered at each discrete time period.
(3): Retailer nodes exhibit operational independence: each maintains a dedicated, isolated supply chain link and internal inventory; surplus inventory cannot be shared across nodes, and replenishment decisions are based solely on local inventory levels, in-transit orders, and realized customer demand.
(4): Customer demand per retailer per period is stochastic, modeled as a Poisson-distributed random variable; demand occurrences are mutually independent across retailers and time periods, and the distribution mean/variance parameters are dynamically adjustable to reflect heterogeneous market conditions.
(5): If the inventory is sufficient to meet demand, all remaining inventory will be carried over to the next period, which results in permanent sales loss.
(6): Replenishment order quantities are continuous non-negative real numbers, ensuring practical implementability.
(7): Supply chain disruptions manifest under two archetypal modes: frequent-short events and rare-prolonged events; these may occur at critical stages (e.g., warehouse supply, transport links, retailer operations), significantly degrading associated logistics capabilities.
(8): When the system enters a disruption state, the changes in $μ_{1}$ and $μ_{2}$ apply simultaneously to all retailers, rather than independent local disruptions at individual retailers.

These assumptions are consistent with mainstream models of multi-echelon inventory control and supply chain disruptions in the OR/MS literature. They abstract away from some operational details to keep the problem analytically and computationally tractable, while still reflecting key features of real distribution systems.

3.3. Definitions Related to the Issues

Table 1 lists the symbols and their explanations used for problem formulation. In inventory management problems, adjusting inventory levels based on factors such as current inventory, demand, replenishment quantity, and supply chain status is central. The inventory update rules are as follows:

Table 1. Mathematical symbols and their meanings.

Receiving Inventory:

{i l}_{m} (τ)

and

{I L}_{m} (τ)

are used to define the initial inventory and ending inventory of a retailer

m \in M

in period

τ \in [1, T]

at the beginning of the period, respectively. If there is an in-transit order arriving within the period

τ \in [1, T]

, then the initial inventory of the retailer

m \in M

is

{i l}_{m} (τ) = {I L}_{m} (τ - 1) + {i o}_{m} (τ)

(1)

If a disruption occurs, it is

{{i l}_{m} (τ) = I L}_{m} (τ - 1) + μ_{1} {i o}_{m} (τ)

(2)

Updating Inventory Facing Customer Demand:

D_{m} (τ)

is used to represent the actual demand of the retailer

m \in M

for the commodity in the period

τ \in [1, T]

.

D_{m}

is stochastic and follows a Poisson distribution. The ending inventory calculation formula is

{I L}_{m} (τ) = m a x (0, i l - D m (τ))

(3)

Loss Measurement:

q_{m}^{L S} (τ)

and

q_{m}^{L o s s} (τ)

are used to represent the sales loss and logistics loss quantity that occurred for the retailer

m \in M

in the period

τ \in [1, T]

, respectively. Sales loss occurs when supply falls short of demand; logistics loss only occurs in the event of a disruption. The calculation formulas are

q_{m}^{L S} (τ) = m a x (0, D m (τ) - i l)

(4)

q_{m}^{L o s s} (τ) = (1 - μ_{1}) * {i o}_{m} (τ) + (1 - μ_{2}) * m i n (D m (τ), {i l}_{m} (τ))

(5)

In-Transit Order Management: After meeting the current demand at

τ

, retailer

m \in M

needs to make the next order decision based on the inventory status. This can be represented as

O_{m} (τ)

after the order is dispatched, it is added to the in-transit list and arrives at

τ + L_{m}

.

3.4. Cost Composition and Optimization Objective

3.4.1. Cost Composition

In this paper, the costs incurred by a retailer in each time period mainly include contract fees, inventory costs, sales loss fees, and goods loss fees. Specific definitions are as follows:

(1): Contract fees $F_{m} (τ)$ are the agreement costs between the retailer and the supplier, including the fees for placing orders and the fixed fees stipulated in the transportation agreement. In this paper, the contract fee is a fixed amount determined by mutual agreement, and unit ordering costs are not considered.

F_{m} (τ) = \{\begin{matrix} f_{m} o r d e r \\ 0 n o o r d e r \end{matrix}

(6)

where

f_{m}

is the contract fee for retailer m ordering from the supplier.

(2): Inventory costs $H_{m} (τ)$ refer to the expenses the retailer must bear to maintain a certain inventory level, including inventory-holding costs and warehousing costs.

H_{m} (τ) = h_{m} \cdot {I L}_{m} (τ)

(7)

where

h_{m}

is unit inventory cost.

(3): Sales loss fees ${L S}_{m} (τ)$ occur when demand exceeds the current inventory, leading to an inability to meet customer requirements. In inventory management, sales loss fees reflect the negative impact of the inventory strategy’s failure to respond effectively to demand fluctuations and supply chain disruptions.

{L S}_{m} (τ) = {l s}_{m} \cdot q_{m}^{L S} (τ)

(8)

where

{l s}_{m}

is the unit sales loss cost.

(4): Goods loss ${L o s s}_{m} (τ)$ fees refer to the cost of commodity losses caused by reasons such as transportation disruptions, improper warehouse management, or supply chain disruptions. Especially under supply chain disruption scenarios, transportation delays and damage to goods will lead to increased losses.

{L o s s}_{m} (τ) = {{l o s s}_{m} \cdot q}_{m}^{L o s s} (τ)

(9)

where

{l o s s}_{m}

is the unit cost of goods loss.

3.4.2. Optimization Objective

By accurately calculating and optimizing these cost items, the goal is to achieve an optimal inventory management strategy and reduce the overall cost of the supply chain system. These costs can be calculated as follows:

Within

τ

period, the total cost of the retailer

M

can be expressed as follows:

c o s t (τ) = f_{m} (τ) + H_{m} (τ) + L S m (τ) + {L o s s}_{m} (τ)

(10)

The main objective of this paper is to design and optimize multilevel inventory management strategies in a complex, volatile, and uncertain supply chain environment, aiming to minimize the long-term total operating cost of the system. Uncertainty in the system stems from two primary sources: demand volatility, characterized by randomness and the difficulty of accurately forecasting retailer demand; and supply chain disruptions, which probabilistically reduce goods arrivals and delivery rates. Together, these factors render the cost structure highly dynamic and difficult to predict. The objective function is

\min_{τ \in [1, T], m \in M} \sum_{τ = 1}^{T} c o s t (τ) = \min_{τ \in [1, T], m \in M} \sum_{τ = 1}^{T} f_{m} (τ) + \min_{τ \in [1, T], m \in M} \sum_{τ = 1}^{T} H_{m} (τ) + \min_{τ \in [1, T], m \in M} \sum_{τ = 1}^{T} L S m (τ) + \min_{τ \in [1, T], m \in M} \sum_{τ = 1}^{T} {L o s s}_{m} (τ)

(11)

4. PPO-Based Multilevel Inventory Strategy Optimization

Traditional inventory methods, such as the Economic Order Quantity (EOQ) model and (s, S) policies, work in stable environments but are inadequate under frequent disruptions and demand volatility due to their fixed assumptions. This paper employs a Deep Reinforcement Learning (DRL) approach and Proximal Policy Optimization (PPO), which learns adaptive policies through agent-to-environment interaction. PPO offers greater flexibility and robustness, stabilizes training via a clipping mechanism, avoids gradient issues, and efficiently handles continuous action spaces. These features make PPO well-suited for dynamic, uncertain, multi-level inventory management, enabling adaptive strategy adjustments in complex supply chains.

4.1. Markov Decision Process Model

To better solve the optimization problem, this paper first models the inventory management problem as a Markov Decision Process (MDP). By defining states, actions, and rewards, the MDP provides structured support for inventory decisions, enabling them to cope with different scenarios of supply chain disruptions and demand changes. As illustrated in the figure, the specific decision-making process at each time step is as follows:

The agent observes the environment

S_{τ}

. The agent first perceives the environment by observing the current state

S_{τ}

. The state describes all the necessary information about the system at

τ

, including the current inventory levels of retailers, in-transit orders, market demand, and supply chain disruption status.

The agent selects an action

a_{τ}

. Based on the observed

S_{τ}

, the agent selects an action according to its policy

π

. The action represents the behavior the agent can take in the current state

S_{τ}

. In the problem structure of this paper, the action refers to the decision made by retailers at a specific point in time. Specifically, it is each retailer deciding how many orders to place for the next time period based on their current inventory status and demand forecast.

The environment transitions to a new state, and after the action occurs, the system transitions from

S_{τ}

to

S_{τ + 1}

. The transition triggers a reward.

R_{τ}

reflects the quality of the current decision. Lower inventory holding costs, fewer sales losses, and smaller logistics costs typically correspond to higher rewards (a smaller negative total cost).

The goal of the MDP is to find a policy minimizing the infinite sum of discounted costs, where a decision is a mapping from states to actions. Note that the objective of this paper is to minimize the long-term average cost. This means that, in practice,

τ

is chosen sufficiently high to reflect the infinite-time cost. This objective can therefore be expressed as shown in the equation:

π^{*} = a r g \underset{π \in Π}{minE} [\sum_{τ = 0}^{\infty} γ^{τ} \cdot c_{τ} (S_{τ}, a_{τ}^{π})]

(12)

where

γ

is the discount factor balancing the influence of future and current rewards.

a_{τ}^{π}

denotes the action taken by the policy

π

at

τ

.

The multilevel inventory management problem is one that requires multi-step, continuous decision-making. At each time step, the agent outputs the optimal decision, eventually forming a complete service strategy. The decision-making process at each time step is modeled as a Markov Decision Process.

4.2. Definition of States, Actions, and Rewards

MDPs can be defined as a tuple

(S, A, R)

, where S represents the set of environment states,

A

represents the set of actions available for the agent to choose, and R represents the set of reward values provided by the environment after each decision.

State: Includes the current inventory level of the retailer, outstanding orders, market demand, and the status of supply chain disruptions.

S_{τ} = [{I L}_{m} (τ), D_{m} (τ), {\hat{I O}}_{m} (τ), E t (τ)]

(13)

where

{I L}_{m} (τ) \in 0, {M a x I L}_{m}

. To track the replenishment process, an in-transit inventory list

{\hat{I O}}_{m} (τ)

is established. After a certain time, the order information will be recorded as

{i o}_{m} (τ)

in the in-transit inventory list at

τ

position, and once the goods arrive in the inventory (the goods ordered at

(τ - L

), they will be removed from the in-transit list.

E t (τ) = 0 o r 1

, represents whether there is an interruption.

Action: The action space is defined as the replenishment decision of each retailer at every decision epoch. Specifically, each retailer determines its order quantity for the next period based on the current state. Since the replenishment quantity is a continuous, non-negative variable in practice, it is discretized to form a finite action set. The continuous order quantity is first normalized into the interval

[0,1],

then discretized into evenly spaced grid points (e.g., rounded to two decimal places). This discretization ensures that the decision space remains computationally tractable while preserving the ability to approximate continuous order decisions.

\begin{matrix} A & = & {O_{1} (τ), O_{2} (τ) \dots O_{m} (τ)} \end{matrix}

(14)

After the agent selects an action, the action value is mapped to

O_{m} (τ)

to obtain

a_{τ}

, that is,

a_{τ} = [O_{m} (τ)] (O_{m} (τ) \in (0, {M a x I L}_{m}))

This processing method not only ensures that the values related to the order quantity do not exhibit excessive fluctuations, which better meets the requirements of the actual business scenario.

Reward: In the inventory problem considered in this paper, the environment is cost-minimizing, and the reward fed back to the agent is a negative cost rather than a normalized penalty. Specifically, at each decision epoch

τ

, we first compute the total cost

C_{τ}

as the sum of holding cost, shortage (lost-sales) cost, and logistics/disruption-related cost. The immediate reward is then defined as follows:

r_{τ} = - C_{τ}

(15)

Because all cost components are measured in the same monetary unit and the per-period cost magnitudes are moderate (see Table 2), we do not apply additional normalization or clipping to

r_{τ}

. This linear transformation preserves the optimal policy while keeping the scale of rewards within a numerically stable range for PPO training.

Table 2. Retailers’ parameter settings.

4.3. PPO Method Model Framework

After modeling the multi-echelon inventory management problem as a Markov Decision Process (MDP), this paper adopts the Proximal Policy Optimization (PPO) algorithm to optimize retailer replenishment decisions under disruption scenarios. PPO, a policy-gradient method proposed by Schulman et al. (2017) [44] within an actor–critic framework, stabilizes training via a clipping mechanism, improving convergence and robustness in dynamic environments. It achieves high sample efficiency through multiple mini-batch updates, reducing computational costs in simulations. PPO also offers strong flexibility and scalability, adapting to environmental changes and enabling efficient multi-node parallel decision-making, thus providing effective algorithmic support for complex supply chain inventory optimization.

The PPO algorithm training process is illustrated in Figure 2. At each time period

τ

, the agent first observes the environmental state

S_{τ}

. Based on the current state, the policy network

π_{θ} (a | s_{τ})

outputs a parameterized distribution over actions. By stochastically sampling from this distribution, the agent obtains the actual order quantity decision

a_{τ}

to execute. After executing the action, the agent receives an immediate reward

R_{τ}

(i.e., the negative total periodic cost) and the next state

S_{τ + 1}

from the environment. To evaluate the relative advantage of each action under the current policy, the Generalized Advantage Estimation (GAE) method is employed. The advantage value

\hat{A_{t}}

is calculated as follows:

δ_{t} = r_{t} + γ V (s_{t + 1}) - V (s_{t})

(16)

\hat{A_{t}} = \sum_{l = 0}^{T - t} {(γ λ)}^{l} δ_{t + l}

(17)

Figure 2. Training flowchart of the PPO algorithm.

Through GAE, the improvement potential of each action can be efficiently estimated while controlling variance, providing the basis for subsequent policy updates. Based on the collected trajectory data and computed advantage estimates, the PPO updates the policy network parameters

θ

by minimizing a clipped objective function to stably improve policy performance. The optimization objective is

L^{C L I P} (θ) = E_{t} [m i n (r_{t} (θ) A_{t}, c l i p (r_{t} (θ), 1 - ϵ, 1 + ϵ) A_{t})]

(18)

Simultaneously, the value network parameters

ϕ

are updated by minimizing the mean squared error loss to enhance the accuracy of value estimation. The training process utilizes mini-batch Stochastic Gradient Descent (SGD), performing multiple rounds of updates per batch of samples.

5. Experimental Setup

5.1. Simulation Parameter Settings

In inventory management, the problem scale, defined by the number of retailers, demand volatility, inventory capacity, and cost coefficients, directly affects decision complexity and optimization outcomes. Multi-echelon systems amplify these challenges due to increased coordination needs and resource allocation complexity. Demand and capacity fluctuations further impact stability and efficiency, heightening uncertainty. To assess the PPO performance across different scales, this study configures simulation parameters to reflect disparities between small and large retailers, as shown in Table 2.

Supply Interruption Modeling: To accurately simulate disruption scenarios triggered by unforeseen events within a supply chain, this paper adopts a two-state supply process for modeling. Snyder et al. (2016) [45] indicate that this process can effectively describe a wide range of interruption scenarios, spanning from frequent but brief disruptions to infrequent yet prolonged ones. The parameter α denotes the probability of the system transitioning from the normal state to the disrupted state, while the parameter β represents the probability of the system recovering from the disrupted state back to the normal state. The average duration in each state follows a geometric distribution. Within the stochastic process, the probability of the system being disrupted for exactly

τ

review periods is given

\frac{α}{α + β}

[45]. Consequently, the average duration of the disrupted state can be inferred as

\frac{α}{α + β}

, and the average duration of the normal state as

P = 1 - \frac{β}{α + β}

. The interruption parameters utilized in this paper are adopted from the settings defined by Czerniak et al. [46], as shown in Table 3.

Table 3. Disruption type and probability.

Hyperparameter Settings and Tuning: To select these hyperparameters, we followed a two-stage tuning procedure. First, we started with standard PPO defaults commonly used in the DRL literature and adjusted them to ensure numerically stable training in our multi-echelon inventory environment. Second, we performed a coarse grid search on a representative scenario with two retailers and disruption type 2 (infrequent, long disruptions). We then selected the configuration that achieved the lowest average total cost with smooth, non-divergent learning curves. This final configuration is fixed across all other experimental settings so that performance differences can be attributed to disruption patterns and network scales rather than case-specific hyperparameter retuning. The detailed hyperparameter settings are shown in Table 4.

Table 4. Hyperparameter settings.

Implementation details and training stability: We used a shared actor–critic multilayer perceptron with two hidden layers of 64 units and ReLU activations, optimized by Adam. The state variables and costs were scaled to moderate ranges, and no additional batch or layer normalization was applied. PPO was trained for a fixed maximum of 10,000 iterations; during development, we monitored the moving average of the discounted total cost and observed that learning curves plateaued well before this limit, so we did not employ early stopping. For reproducibility, we used a fixed global random seed for both the simulation environment and the PPO implementation, and we applied the same seed to all compared policies when computing the costs in Table 5, Table 6 and Table 7. We also tracked the PPO clipped objective, value-function loss, and policy entropy throughout training and did not observe divergence or collapse, indicating stable training behavior.

Table 5. Comparison of the results.

Table 6. Parameter sensitivity analysis.

Table 7. Impact of lead time L on the retailer’s cost.

5.2. Experimental Results Analysis

5.2.1. Comparative Experiments on Effectiveness

The (s, S) and (S, T) inventory policies are employed as baselines to evaluate the algorithm’s performance. For the effectiveness experiments, scenarios with two, three, and four independent retailers and a lead time of two were considered. Given that the retailers are mutually independent and exhibit no synergistic effects, the sum of the locally optimal policies for each retailer can be approximated as the global optimum in the absence of cross-node coordination [7]. Consequently, the baseline policies are constructed from the optimal policy for each individual retailer, determined using a grid-search method within the search range

s \in (0, {M a x I L}_{m}), S \in (0, {M a x I L}_{m})

. The average total cost over a 50-time-step horizon was computed via Monte Carlo simulation for each candidate (s, S) pair within the grid space; the optimal (s, S) combination was then selected as the solution yielding the minimum cost. An identical optimization procedure was employed for the (S, T) inventory policy.

The results shown in Figure 3 and Figure 4 are obtained when only two retailers are considered. It is clear that the PPO algorithm outperforms the other two baseline algorithms, especially in the face of infrequent and long-duration interruptions. Specifically, in the case of long and infrequent disruptions, the PPO algorithm can adjust the inventory policy more effectively, reduce the overall cost of the supply chain by 8%, and improve the resilience of the system. This shows that the PPO algorithm has stronger adaptability and optimization ability when dealing with longer supply chain disruptions. In contrast, in frequent and short-duration outage scenarios, the traditional baseline algorithm can quickly respond and adjust to achieve better performance, and the PPO algorithm can only improve by 2%.

Figure 3. Comparison of algorithms in frequent and short duration interruptions.

Figure 4. Comparison of algorithms in infrequent and long-duration interruptions.

It can be seen from Table 5 that with the increase in the number of retailers (M), the traditional algorithm is unable to cope with the scale expansion of the complex supply chain. Especially when the number of retailers increases, the traditional (S,s) and (S,T) strategies gradually lose efficiency in terms of inventory allocation and coordination. Due to the inability to effectively deal with the increasingly complex dependencies and dynamic changes between supply chain nodes, traditional algorithms often lead to uneven resource allocation and inventory instability, which affects the optimization effect of the overall system.

To better understand these cost improvements, we qualitatively examine the behavior of the learned PPO policy. Under frequent short disruptions (Type 1), PPO behaves close to a base–stock rule: inventory positions are kept around a moderate target, and order quantities fluctuate only mildly. In this regime, the (S, T) policy can also track the shallow shocks reasonably well, which explains why the relative cost advantage of PPO in Figure 3 and Figure 4 remains modest when disruptions are frequent and short.

Under infrequent long disruptions (Type 2), PPO adopts a more conservative, anticipatory behavior. During long normal periods, the agent gradually builds higher pre-disruption inventory than the (s, S) and (S, T) policies, and once a disruption occurs, it reduces both the frequency and size of orders, relying more on positioned stock. This anticipatory buffer pattern reduces sales loss and logistics loss costs during long disruption spells and leads to the more pronounced 8–10% cost reductions reported in Table 5 as the number of retailers increases. Moreover, when the network scales from two to four retailers, PPO exploits the symmetry of the system to coordinate replenishment across nodes and avoid extreme stockouts at individual retailers, while the static policies optimize each retailer in isolation and thus suffer from higher total costs.

5.2.2. Comparative Experiments on Generalizability

To further validate the algorithm’s robustness and performance, this paper conducts comparative experiments on generalization to assess the sensitivity of experimental results to variations in key parameters. The objective of these experiments is to examine the algorithm’s adaptability across diverse scenarios and conditions, ensuring stable operation and effective solutions in various environments. To this end, multiple key parameters were varied, and their impacts on algorithm performance were evaluated by comparing experimental outcomes. The generalization parameters used are detailed in Table 6. Specifically, the experiments encompass two primary components:

Variations in Environmental Uncertainty: Uncertainties inherent in the environment, such as supply chain disruptions, transportation delays, and demand forecast errors, represent common challenges in inventory management. Therefore, differing disruption levels were considered as they might affect algorithm performance. Different disruption severities were characterized by the goods arrival rate. Changes in the goods arrival rate reflect varying degrees of supply chain disruption severity, encompassing the following four scenarios: Complete disruption, Severe disruption, Minor disruption, No disruption. By adjusting parameters

μ_{1}

and

μ_{2}

, the impact of these varying disruption severities within the supply chain was simulated. For example, A configuration with

μ_{1} = 0.5

and

μ_{2} = 0.8

represents a severe disruption between the supplier and retailer, and a minor disruption between the retailer and customer. The corresponding results are presented in Figure 5 and Figure 6.

Figure 5. Cost comparison under infrequent long-duration disruptions. (a–d) represent

μ_{1} = 0, 0.5, 0.7, 1

, respectively.

Figure 6. Cost comparison under frequent short-duration disruptions; (a–d) represent

μ_{1} = 0, 0.5, 0.7, 1

, respectively.

Experimental results show that the PPO advantage grows with disruption severity. In severe, large-scale, and long-duration events, it effectively adjusts inventory policies, lowering total costs and enhancing stability and adaptability, demonstrating strong flexibility and robustness in complex, uncertain environments. Comparing traditional methods, the (S, T) policy generally outperforms (s, S) under most disruption conditions due to its ability to flexibly adjust order quantities, maintain optimal inventory levels, and minimize losses from overstock or stockouts. In contrast, (s, S) relies on fixed bounds and historical demand, limiting responsiveness to sudden interruptions. However, as disruptions diminish or remain minor, the performance gap narrows, and (s, S) regains the advantage. In stable, predictable environments, it leverages simplicity, efficiency, and lower operational complexity, making it more adaptive and cost-effective in a steady state.

Supply chain parameter variation: In supply chain management problems, different supply chain characteristics can significantly affect the performance of the algorithm. By tuning these parameters, it is possible to analyze how the algorithm adapts to supply chain environments of different sizes and complexities. We then tested the performance difference between the PPO algorithm and the baseline strategy under different lead times, thus assessing its generalization ability under multiple supply chain conditions.

The results shown in Table 7 indicate that the lead time

(L)

exerts a relatively limited influence on the performance of the PPO algorithm. The algorithm demonstrates minimal performance variation across different lead time settings. This suggests that the PPO algorithm exhibits robust stability and adaptability when managing supply chain scenarios characterized by varying lead times. In contrast, traditional algorithms often encounter response lag issues when faced with significant changes in lead time. This lag typically leads to a decrease in inventory management efficiency. Unlike these traditional approaches, the PPO algorithm, by leveraging adaptive strategies from Deep Reinforcement Learning, dynamically adjusts inventory management decisions. This capability allows it to effectively optimize synergistic effects between nodes, even in systems with a large number of retailers, thereby enhancing the overall efficiency of the supply chain. Consequently, the PPO algorithm demonstrates superior systemic adaptability and stability, enabling it to effectively balance demand and inventory across various nodes within complex and large-scale supply chain environments.

The sensitivity results in Figure 5 and Figure 6 show how PPO adapts to different disruption severities. As the arrival rates μ₁ and μ₂ decrease, all policies implicitly increase safety buffers, but PPO does so in a more state–dependent way: when disruption risk is high, it orders earlier and slightly more in the normal state, and then tapers orders once the system enters a sustained disruption. The fixed (s, S) and (S, T) rules, calibrated for a nominal environment, cannot dynamically adjust this trade–off, which leads to either pronounced stockouts or over–stocking when disruption severity changes.

A similar mechanism appears when varying the lead time L in Table 7. For longer lead times, PPO learns smoother and earlier replenishment actions to compensate for delayed arrivals, thereby mitigating both lost–sales costs and the need for sudden large orders when inventory becomes critically low. In contrast, the benchmark policies systematically lag behind changes in L because their parameters are fixed ex ante, so their total cost grows faster than that of PPO as L increases. Overall, PPO improves performance by learning disruption–aware policies that adjust safety–stock levels and replenishment timing over a wide range of operating conditions.

6. Conclusions

In this study, we proposed an inventory optimization framework based on the Proximal Policy Optimization (PPO) algorithm for managing multi-echelon supply chains under disruption scenarios. Numerical experiments show that leveraging the underlying structural symmetry of the multi-echelon system enables PPO to generalize policies across retailers, while still adapting to disruption-induced asymmetries in arrival and fulfillment rates.

Our experimental results demonstrate that PPO outperforms traditional inventory management models, particularly in reducing total supply chain costs during disruptions. Specifically, PPO dynamically adjusts replenishment strategies, improving the resilience of the supply chain while minimizing both inventory and sales loss costs. These findings have significant implications, both theoretically and practically.

This research contributes to the literature by applying reinforcement learning, specifically PPO, to the optimization of inventory management in disrupted supply chains. It challenges traditional inventory control models that rely on fixed rules and assumptions, highlighting the potential of adaptive learning methods to better respond to dynamic and uncertain environments. Our study provides valuable insights into how reinforcement learning can be used to optimize decision-making in multi-echelon systems, where conventional methods often struggle to achieve optimal performance.

From a practical standpoint, our study provides actionable insights for supply chain managers dealing with disruptions. PPO’s ability to adapt its replenishment policies in real-time offers a flexible and efficient solution for optimizing inventory management, especially in environments where supply chain disruptions are frequent. By adopting PPO, companies can enhance their decision-making processes, improving both cost efficiency and service levels during periods of uncertainty. This is particularly valuable for industries like retail and manufacturing, where traditional models fall short in handling disruptions, leading to stockouts and excessive inventory costs.

While this study demonstrates the potential of PPO for optimizing supply chain management, there are limitations to our work. Our model assumes simplified disruption types and does not incorporate certain real-world complexities, such as supplier diversity and transportation delays. Additionally, the scalability of the model to larger, multi-echelon supply chains requires further exploration.

Future research could explore more complex disruption scenarios and incorporate additional features such as demand forecasting and real-time data integration. Further studies could also extend the PPO model to multi-retailer and multi-echelon systems, addressing the scalability challenges and improving its practical applicability in real-world supply chains.

In conclusion, this research not only contributes to advancing reinforcement learning techniques in supply chain management but also provides practical solutions for real-world supply chain optimization in disrupted environments. The findings lay the foundation for further exploration of adaptive learning models in supply chains, ultimately helping organizations better navigate uncertainty and enhance operational efficiency.

Author Contributions

Conceptualization, X.L. and Z.P.; Methodology, X.L., H.W. and Z.P.; Software, H.W.; Validation, X.L. and H.W.; Formal analysis, H.W.; Investigation, H.W., C.L. (Chen Liao) and C.L. (Chunyan Liu); Resources, Z.P., C.L. (Chen Liao) and C.L. (Chunyan Liu); Data curation, H.W. and C.L. (Chen Liao); Writing—original draft, X.L. and H.W.; Writing—review and editing, H.W. and Z.P.; Visualization, H.W.; Supervision, X.L. and Z.P.; Project administration, Z.P.; Funding acquisition, X.L. and Z.P. All authors have read and agreed to the published version of the manuscript.

Funding

National Key Research and Development Program of China, No.2024YHB3311600. Natural Science Foundation of Anhui Province, No.2308085MG227.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

Jiang, C.; Sheng, Z. Case-based reinforcement learning for dynamic inventory control in a multi-agent supply-chain system. Expert Syst. Appl. 2009, 36, 6520–6526. [Google Scholar] [CrossRef]
Jain, S.; Raghavan, N.R.S. A queuing approach for inventory planning with batch ordering in multi-echelon supply chains. Cent. Eur. J. Oper. Res. 2009, 17, 95–110. [Google Scholar] [CrossRef]
Rolf, B.; Jackson, I.; Müller, M.; Lang, S.; Reggelin, T.; Ivanov, D. A review on reinforcement learning algorithms and applications in supply chain management. Int. J. Prod. Res. 2022, 61, 7151–7179. [Google Scholar] [CrossRef]
Babazadeh, R.; Taraghi Nazloo, H.; Kamran, M. A hybrid ANN–MILP model for agile recovery production planning for PPE products under sharp demands. Int. J. Prod. Res. 2025, 63, 758–778. [Google Scholar] [CrossRef]
Zhang, Y.; He, L.; Zheng, J. A Deep Reinforcement Learning-Based Dynamic Replenishment Approach for Multi-Echelon Inventory Considering Cost Optimization. Electronics 2025, 14, 66. [Google Scholar] [CrossRef]
Arrow, K.J.; Harris, T.; Marschak, J. Optimal inventory policy. Econometrica 1951, 19, 250–272. [Google Scholar] [CrossRef]
Arrow, K.J.; Karlin, S.; Scarf, H.E. Studies in the Mathematical Theory of Inventory and Production; Stanford University Press: Stanford, CA, USA, 1958. [Google Scholar]
Clark, A.J.; Scarf, H. Optimal policies for a multi-echelon inventory problem. Manag. Sci. 1960, 6, 475–490. [Google Scholar] [CrossRef]
Federgruen, A.; Zipkin, P.H. An efficient algorithm for computing optimal (s, S) policies. Oper. Res. 1984, 32, 1268–1285. [Google Scholar] [CrossRef]
Parlar, M.; Berkin, D. Future supply uncertainty in EOQ models. Nav. Res. Logist. 1991, 38, 107–121. [Google Scholar] [CrossRef]
Parlar, M.; Perry, D. Inventory models of future supply uncertainty with single and multiple suppliers. Nav. Res. Logist. 1996, 43, 191–210. [Google Scholar] [CrossRef]
Parlar, M. Continuous-review inventory problem with random supply interruptions. Eur. J. Oper. Res. 1997, 99, 366–385. [Google Scholar] [CrossRef]
Schmitt, A.J.; Snyder, L.V.; Shen, Z.-J.M. Centralization versus decentralization: Risk pooling, risk diversification, and supply uncertainty in a one-warehouse multiple-retailer system. Omega 2015, 52, 201–212. [Google Scholar] [CrossRef]
Saithong, C.; Lekhavat, S. Derivation of closed-form expression for optimal base stock level considering partial backorder, deterministic demand, and stochastic supply disruption. Cogent Eng. 2020, 7, 1767833. [Google Scholar] [CrossRef]
Taleizadeh, A.A.; Tafakkori, L.; Thaichon, P. Resilience toward supply disruptions: A stochastic inventory control model with partial backordering under the base stock policy. J. Retail. Consum. Serv. 2021, 58, 102291. [Google Scholar] [CrossRef]
Song, J.-S.; Zipkin, P.H. Inventory control with information about supply conditions. Manag. Sci. 1996, 42, 1409–1419. [Google Scholar] [CrossRef]
Gupta, D. The (Q, r) inventory system with an unreliable supplier. INFOR Inf. Syst. Oper. Res. 1996, 34, 59–76. [Google Scholar] [CrossRef]
Saputro, T.E.; Figueira, G.; Almada-Lobo, B. Integrating supplier selection with inventory management under supply disruptions. Int. J. Prod. Res. 2021, 59, 3304–3322. [Google Scholar] [CrossRef]
Pathy, S.R.; Rahimian, H. A resilient inventory management of pharmaceutical supply chains under demand disruption. Comput. Ind. Eng. 2023, 180, 109243. [Google Scholar] [CrossRef]
Liu, Z.; Zhou, C.; Chen, H.; Zhao, R. Impact of cost uncertainty on supply chain competition under different confidence levels. Int. Trans. Oper. Res. 2021, 28, 1465–1504. [Google Scholar] [CrossRef]
De Giovanni, P. A dynamic supply chain game with vertical coordination and horizontal competition. Int. Trans. Oper. Res. 2021, 28, 3117–3146. [Google Scholar] [CrossRef]
Gao, R.; Hua, K. Green e-commerce supply chain analysis considering delivery time under epistemic uncertainty based on confidence level. RAIRO-Oper. Res. 2025, 59, 701–724. [Google Scholar] [CrossRef]
Liu, X.; Hu, M.; Peng, Y.; Yang, Y. Multi-agent deep reinforcement learning for multi-echelon inventory management. Prod. Oper. Manag. 2025, 34, 1836–1856. [Google Scholar] [CrossRef]
Dittrich, M.-A.; Fohlmeister, S. A deep Q-learning-based optimization of the inventory control in a linear process chain. Prod. Eng. 2020, 15, 35–43. [Google Scholar] [CrossRef]
Barat, S.; Khadilkar, H.; Meisheri, H.; Kulkarni, V.; Baniwal, V.; Kumar, P.; Gajrani, M. Actor based simulation for closed loop control of supply chain using reinforcement learning. In Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS), Montreal, QC, Canada, 13–17 May 2019; Volume 3, pp. 1802–1804. [Google Scholar]
Moor, B.; Gijsbrechts, J.; Boute, R.N.; Slowinski, R.; Artalejo, J.; Billaut, J.C.; Dyson, R.; Peccati, L. Reward shaping to improve the performance of deep reinforcement learning in perishable inventory management. Eur. J. Oper. Res. 2022, 301, 535–545. [Google Scholar] [CrossRef]
Zwaida, T.A.; Pham, C.; Beauregard, Y. Optimization of inventory management to prevent drug shortages in the hospital supply chain. Appl. Sci. 2021, 11, 2726. [Google Scholar] [CrossRef]
Afridi, M.T.; Nieto-Isaza, S.; Ehm, H.; Ponsignon, T.; Hamed, A. A deep reinforcement learning approach for optimal replenishment policy in a vendor managed inventory setting for semiconductors. In Proceedings of the 2020 Winter Simulation Conference (WSC), Orlando, FL, USA, 14–18 December 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1753–1764. [Google Scholar] [CrossRef]
Aboutorab, H.; Hussain, O.K.; Saberi, M.; Hussain, F.K. A reinforcement learning-based framework for disruption risk identification in supply chains. Future Gener. Comput. Syst. 2022, 126, 110–122. [Google Scholar] [CrossRef]
Zhao, Y.; Hemberg, E.; Derbinsky, N.; Mata, G.; O’Reilly, U.-M. Simulating a logistics enterprise using an asymmetrical wargame simulation with Soar reinforcement learning and coevolutionary algorithms. In Proceedings of the Genetic and Evolutionary Computation Conference Companion (GECCO ’21), Lille, France, 10–14 July 2021; ACM: New York, NY, USA, 2021; pp. 1907–1915. [Google Scholar] [CrossRef]
Geevers, K.; van Hezewijk, L.; Mes, M.R.K. Multi-echelon inventory optimization using deep reinforcement learning. Cent. Eur. J. Oper. Res. 2024, 32, 653–683. [Google Scholar] [CrossRef]
Harsha, P.; Jagmohan, A.; Kalagnanam, J.R.; Quanz, B.; Singhvi, D. Deep policy iteration with integer programming for inventory management. Manuf. Serv. Oper. Manag. 2025, 27, 369–388. [Google Scholar] [CrossRef]
Wang, F.; Lin, L. Spare parts supply chain network modeling based on a novel scale-free network and replenishment path optimization with Q learning. Comput. Ind. Eng. 2021, 157, 107312. [Google Scholar] [CrossRef]
Meisheri, H.; Sultana, N.N.; Baranwal, M.; Baniwal, V.; Nath, S.; Verma, S.; Ravindran, B.; Khadilkar, H. Scalable multi-product inventory control with lead time constraints using reinforcement learning. Neural Comput. Appl. 2021, 34, 1735–1757. [Google Scholar] [CrossRef]
Wang, Q.; Peng, Y.; Yang, Y. Solving inventory management problems through deep reinforcement learning. J. Syst. Sci. Syst. Eng. 2022, 31, 677–689. [Google Scholar] [CrossRef]
Singi, S.; Gopal, S.; Auti, S.; Chaurasia, R. Reinforcement learning for inventory management. In Proceedings of International Conference on Intelligent Manufacturing and Automation; Springer: Singapore, 2020; pp. 317–326. [Google Scholar]
Barat, S.; Kumar, P.; Gajrani, M.; Khadilkar, H.; Meisheri, H.; Baniwal, V.; Kulkarni, V. Reinforcement learning of supply chain control policy using closed-loop multi-agent simulation. In International Workshop on Multi-Agent Systems and Agent-Based Simulation; Lecture Notes in Computer Science; Sichman, J.S., Paolucci, M., Verhagen, H., Eds.; Springer: Cham, Switzerland, 2020; Volume 12025, pp. 26–38. [Google Scholar] [CrossRef]
Hammler, P.; Riesterer, N.; Braun, T. Fully dynamic reorder policies with deep reinforcement learning for multi-echelon inventory management. Inform. Spektrum 2023, 46, 240–251. [Google Scholar] [CrossRef]
Konstantaras, I.; Skouri, K.; Lagodimos, A.G. EOQ with independent endogenous supply disruptions. Omega 2019, 83, 96–106. [Google Scholar] [CrossRef]
Demizu, T.; Fukazawa, Y.; Morita, H. Inventory management of new products in retailers using model-based deep reinforcement learning. Expert Syst. Appl. 2023, 229 Pt A, 120256. [Google Scholar] [CrossRef]
Gijsbrechts, J.; Boute, R.N.; Van Mieghem, J.; Zhang, D. Can deep reinforcement learning improve inventory management? Performance on dual sourcing, lost sales and multi-echelon problems. Manuf. Serv. Oper. Manag. 2021, 24, 1349–1368. [Google Scholar] [CrossRef]
Hachaichi, Y.; Chemingui, Y.; Affes, M. A policy gradient based reinforcement learning method for supply chain management. In Proceedings of the International Conference on Advanced Systems and Emergent Technologies (IC_ASET), Hammamet, Tunisia, 15–18 December 2020; IEEE: Tunis, Tunisia, 2020; pp. 135–140. [Google Scholar] [CrossRef]
Vanvuchelen, N.; Gijsbrechts, J.; Boute, R. Use of proximal policy optimization for the joint replenishment problem. Comput. Ind. 2020, 119, 103239. [Google Scholar] [CrossRef]
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. Available online: https://arxiv.org/abs/1707.06347 (accessed on 1 January 2024). [CrossRef]
Snyder, L.V.; Atan, Z.; Peng, P.; Rong, Y.; Schmitt, A.J.; Sinsoysal, B. OR/MS models for supply chain disruptions: A review. IIE Trans. 2016, 48, 89–109. [Google Scholar] [CrossRef]
Czerniak, L.L.; Daskin, M.S.; Lavieri, M.S.; Sweet, B.V.; Leja, J.; Tupps, M.A.; Renius, K. Closed-form (R,S) inventory policies for perishable inventory systems with supply chain disruptions. INFOR Inf. Syst. Oper. Res. 2023, 61, 327–367. [Google Scholar] [CrossRef]

Figure 1. Illustration of a multi-level supply chain.

Figure 2. Training flowchart of the PPO algorithm.

Figure 3. Comparison of algorithms in frequent and short duration interruptions.

Figure 4. Comparison of algorithms in infrequent and long-duration interruptions.

Figure 5. Cost comparison under infrequent long-duration disruptions. (a–d) represent

μ_{1} = 0, 0.5, 0.7, 1

, respectively.

Figure 6. Cost comparison under frequent short-duration disruptions; (a–d) represent

μ_{1} = 0, 0.5, 0.7, 1

, respectively.

Table 1. Mathematical symbols and their meanings.

Symbol	Explanation
$m$	Retailer ID
$τ$	Time period index
$L$	Lead time of retailers
$μ_{1}$	Percentage of goods from the warehouse to the retailer
$μ_{2}$	Percentage of goods from the retailer to the customer
${I L}_{m} (τ)$	Inventory level of the retailer at the end of $τ$ period
${i l}_{m} (τ)$	Initial inventory of $τ$ period
$D_{m} (τ)$	Customer demand faced by the retailer at $τ$
${\hat{I O}}_{m} (τ)$	$L i s t o f a l l o u t s t a n d i n g o r d e r s$ at $τ$
${i o}_{m} (τ)$	Elements in the list, indicating the expected arrival at $τ$
$E t (τ)$	Interrupt situation at $τ$
$O_{m} (τ)$	Orders sent at $τ$
$q_{m}^{L S} (τ)$	The quantity of sales loss
$q_{m}^{L o s s} (τ)$	The quantity of logistics loss
${M a x I L}_{m}$	Maximum inventory level
${l s}_{m}$	Unit cost of sales loss
${l o s s}_{m}$	Unit cost of logistics loss
$f_{m}$	Contract cost
$h_{m}$	Unit holding cost

Table 2. Retailers’ parameter settings.

Symbol	Retailer 1	Retailer 2	Retailer 3	Retailer 4
${I l}_{o}$	50	100	150	200
$f_{m}$	10	20	30	50
$h_{m}$	5	10	15	20
${l s}_{m}$	25	50	75	100
${l o s s}_{m}$	80	120	200	300
${M a x I L}_{m}$	200	450	700	1000

Table 3. Disruption type and probability.

Interrupt Type	$Interruption Probability α$	$Recovery Probability β$
Frequent but short	0.5	0.5
infrequent but long	0.01	0.01

Table 4. Hyperparameter settings.

Hyperparameter	Value
iterations	10,000
layers	2 hidden layers with (64, 64)
learning rate	linear from 5 × 10⁻⁴ to 1 × 10⁻⁵
batch size	64
n steps	2048
gamma	0.99
gae lambda	0.95
ent coef	0.01
n epochs	10
clip range (epsilon)	0.2

Table 5. Comparison of the results.

m	Retailers	Interrupt Type	(s, S)	(S, T)	Ours
2	1, 4	1	35,790	36,300	35,000
2	1, 4	2	34,000	34,500	31,500
3	1, 2, 4	1	54,302	55,020	51,500
3	1, 2, 4	2	47,302	48,200	46,850
4	1, 2, 3, 4	1	74,108	76,090	70,505
4	1, 2, 3, 4	2	63,108	64,200	62,300

Table 6. Parameter sensitivity analysis.

The Influencing Factors	Represent	The Sensitivity Analysis
Arrival rate from warehouse to retailer $μ_{1}$	0.7	0, 0.5, 0.7, 1
Retailer to customer goods arrival rate $μ_{2}$	0.8	0, 0.5, 0.8, 1
lead time $L$	2	2, 3, 4

Table 7. Impact of lead time L on the retailer’s cost.

L	m	Interrupt Type	(s, S)	(S, T)	Ours
3	2	1	40,000	41,000	36,050
	2	2	34,000	35,000	32,250
	3	1	58,000	60,000	54,000
	3	2	53,000	54,090	50,000
	4	1	77,000	80,000	73,800
	4	2	67,000	68,200	65,000
4	2	1	40,000	42,000	38,550
	2	2	36,000	38,000	33,500
	3	1	68,000	70,000	59,000
	3	2	55,000	59,000	52,000
	4	1	90,000	92,980	79,000
	4	2	74,000	79,000	68,000

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Article Metrics

Citations

Article Access Statistics

Journal Statistics

Article metric data becomes available approximately 24 hours after publication online.