Adaptive Supply Chain: Demand–Supply Synchronization Using Deep Reinforcement Learning

Abstract: Adaptive and highly synchronized supply chains can avoid a cascading rise-and-fall inventory dynamic and mitigate ripple effects caused by operational failures. This paper aims to demonstrate how a deep reinforcement learning agent based on the proximal policy optimization algorithm can synchronize inbound and outbound flows and support business continuity when operating in a stochastic and nonstationary environment, provided that end-to-end visibility is ensured. The deep reinforcement learning agent is built upon the proximal policy optimization algorithm, which requires neither a hardcoded action space nor exhaustive hyperparameter tuning. These features, complemented with a straightforward supply chain environment, give rise to a general and task-unspecific approach to adaptive control in multi-echelon supply chains. The proposed approach is compared with the base-stock policy, a well-known method in classic operations research and inventory control theory that is prevalent in continuous-review inventory systems. The paper concludes that the proposed solution can perform adaptive control in complex supply chains. The paper also postulates fully fledged supply chain digital twins as a necessary infrastructural condition for scalable real-world applications.


Introduction
The 21st century has begun under the aegis of growing quantities of stored data and processing power. This trend eventually resulted in the advent of a machine learning approach capable of leveraging the increased computational resources, widely known as deep learning [1]. Despite the temporary crisis in the semiconductor industry, this trend is expected to continue. Currently, a multitude of nontrivial tasks that are vitally important for logistics on the operational and managerial levels have been mastered by deep learning models within the supervised learning paradigm, under which deep learning models map data points to the associated labels. However, the potential of deep learning extends far beyond classical supervised learning. Deep reinforcement learning (DRL) takes advantage of deep artificial neural network architectures to explore possible sequences of actions and associate them with long-term rewards. This appears to be a powerful computational framework for experience-driven autonomous learning [2]. As a result, an adaptive controller based on this principle, also known as an RL agent, can learn how to operate in dynamic and frequently stochastic environments.
DRL has already demonstrated remarkable performance in the fields of autonomous aerial vehicles [3], road traffic navigation [4], autonomous vehicles [5], and robotics [6]. However, what served as the inspiration for this work is the ability of systems based on DRL to master computer games at a level significantly superior to that of a professional human player. Many skeptics may undervalue the astonishing success of AlphaZero in mastering such classic board games as chess, shogi, and go through self-play [7] by pointing out the relative simplicity of the rules and the determinism of the games. However, it appears that DRL is capable of mastering video games that partially capture and reflect the complexity of the real world. Notable examples include Dota 2 [8], StarCraft 2 [9], and Minecraft [10]. On the one hand, computer games are rich and challenging domains for testing DRL techniques. On the other hand, the complexity of the mentioned games can be characterized by long-term planning horizons, partial observability, imperfect information, and high dimensionality of state and action spaces. These traits often characterize complex supply chains and distribution systems as well. Given these facts, a central research question arises: "Can the fundamental principles behind DRL be applied to control problems in supply chain management?"

Modern supply chains are highly complex systems that continually make business-critical decisions to stay competitive and adaptive in dynamic environments. Adaptive control in such systems is applied to ensure delivery to end customers with minimal delays and interruptions and to avoid unnecessary costs. This objective cannot be achieved without supply chain synchronization, which is defined as the real-time coordination of production scheduling, inventory control, and delivery plans among a multitude of individual supply chain participants that are frequently spread across the globe.
Since inventory on hand is the difference between inbound and outbound commodity flows, proper inventory control is a pivotal element of supply chain synchronization. For example, higher inventory levels allow one to maintain higher customer service levels, but they are associated with extra costs that propagate across the supply chain to the end consumers in the form of higher prices [11]. Therefore, the full potential of a supply chain is unlocked if and only if it becomes synchronized, namely, all the critical stakeholders obtain accurate real-time data, identify weaknesses, streamline processes, and mitigate risks. In this regard, a synchronized supply chain is akin to gears and cogs that operate in harmony. If one gear suddenly stops turning, it can critically undermine the synchronization of the entire system, and eventually the other gears become ineffective.
Synchronized supply chains can avoid a cascading rise-and-fall inventory dynamic widely known as the "bullwhip effect" [12] and mitigate ripple effects caused by operational failures [13]. Indeed, a DRL agent can perform adaptive coordination along the whole supply chain only if end-to-end visibility is ensured. Shocks and disruptions amid the COVID-19 pandemic, along with post-pandemic recoveries, can become a catalyst for necessary changes in information transparency and global coordination [14]. This paper aims to demonstrate how DRL can synchronize inbound and outbound flows and support business continuity when operating in a stochastic environment, provided that end-to-end visibility is ensured.

Related Work and Novelty
Among the related works, it is worth highlighting [15], which demonstrated the efficiency of Q-learning for dynamic inventory control in a four-echelon supply chain model that included a retailer, a distributor, and a manufacturer. The supply chain simulation was distinguished by nonstationary demand following a Poisson distribution. Barat et al. presented an RL agent based on the A2C algorithm for closed-loop replenishment control in supply chains [16]. Zhao et al. adapted the Soar RL algorithm to reduce operational risk in resilient and agile logistics enterprises; the problem was modeled as an asymmetrical wargame environment [17]. Wang et al. applied a DRL agent to the supply chain synchronization problem under demand uncertainty [18]. In a recent paper, Perez et al. compared several techniques, including RL, on a single-product, multi-period centralized system under stochastic stationary consumer demand [11].
An RL agent capable of playing the beer distribution game was proposed in a recent paper [19]. The beer distribution game is widely used in supply chain management education to demonstrate the importance of supply chain coordination. The RL agent is built upon deep Q-learning and is not provided with any preliminary information on costs or other settings of the simulated environments. It is also imperative to highlight OR-Gym [20], an open-source library that contains common operations research problems in the form of RL environments, including the one used in this study.
In the numerical experiments presented further in this paper, the DRL agent is built upon the proximal policy optimization (PPO) algorithm, which does not require exhaustive hyperparameter tuning. Besides, the implementation of PPO used in this paper does not require a hardcoded action space. These features, complemented with a simple and straightforward supply chain model used as an RL environment, result in a general and task-unspecific approach to adaptive control in multi-echelon supply chains. It is essential to highlight that the numerical experiments are conducted using a simplified supply chain model, which is sufficient at the proof-of-concept stage. However, the simplifying assumptions cannot fully capture all the complexity of many real-world supply chains.

Methodology
This section introduces the fundamentals behind RL and the Markov decision process (MDP). After that, the supply chain environment (SCE) is described. Lastly, the section sheds light on the PPO algorithm.

Supply Chain Environment
SCE can be naturally formulated as an MDP. MDP serves as the mathematical foundation and a straightforward framework for goal-directed learning from interaction with a virtual environment. An RL agent interacts with an environment over multiple time steps (Figure 1). The RL agent and environment interact at each instance of a sequence of discrete time steps t ∈ T. At each time step t, the agent receives a representation of the environmental state St ∈ S and a numerical reward Rt ∈ R in response to a performed action At ∈ A. After that, the agent finds itself in a new state St+1. The sequence of states, actions, and rewards gives rise to a trajectory of the form S0, A0, R1, S1, A1, R2, S2, A2, R3, . . . , Sn, where Sn stands for the terminal state [21]. The agent conducts multiple trials (episodes) with the stochastic environment. The agent observes St and Rt while performing actions under a policy π that maps states to actions. In such settings, the agent's goal is to learn a policy that leads to the highest cumulative reward. In this regard, RL can be considered a stochastic optimization process [22].
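The interaction loop that generates such trajectories can be sketched as follows. Both the environment interface and the random placeholder policy are hypothetical illustrations, not the SCE or the PPO agent used in this paper; the sketch only shows how states, actions, and rewards are accumulated into a trajectory.

```python
import random

class RandomAgent:
    """A placeholder policy pi that maps any state to a random action."""
    def __init__(self, actions):
        self.actions = actions

    def act(self, state):
        return random.choice(self.actions)

def run_episode(env, agent):
    """Roll out one trajectory S0, A0, R1, S1, ... until the terminal state.

    The environment is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done), mirroring the usual
    RL-environment contract.
    """
    state = env.reset()
    trajectory, total_reward = [], 0.0
    done = False
    while not done:
        action = agent.act(state)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        total_reward += reward
        state = next_state
    return trajectory, total_reward
```

The cumulative reward returned here is the quantity the agent's learned policy seeks to maximize.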
The function p(·) describes the probability of an RL agent finding itself in a particular state S* and obtaining a particular reward R* when taking action A* in state St. These dynamics within the MDP framework can be described by the following equation:

p(S*, R* | St, A*) = Pr{St+1 = S*, Rt+1 = R* | St, At = A*}.

In SCE, the agent must decide how many goods to order for each stage m at each time step t. The action At is an integer value corresponding to the reorder quantity at each stage along the supply chain. The state St constitutes a vector that includes the inventory level for each stage and the previously taken actions. Thus, the RL agent attempts to synchronize the supply chain so that the total revenue over the time horizon is maximized (see Equation (7)).
SCE is based on the seminal work [23] and implemented within OR-Gym, an open-source collection of RL environments related to logistics and operations research [20]. SCE represents a single-product multi-echelon supply chain model. The model makes several assumptions. The tradable goods do not perish over time, and replenishment sizes are integers. Economic agents involved in the supply chain can be represented as stages M = {0, 1, . . . , m_end}. Stage 0 stands for a retailer that fulfills a customer's demand. Stage m_end stands for a raw material supplier. Stages from 1 to m_end−1 represent intermediaries involved in the product lifecycle, for example, wholesalers and manufacturers (Figure 2). One unit from the previous stage is transformed into one unit in the following stage until the final product is obtained. Replenishment lead times between stages are constant, measured in days, and include production and transportation times. Both production and inventory capacities are limited for all stages except the last one (raw material supply is assumed to be infinite). During the simulation, at each time step t ∈ T, the following sequence of events takes place:
1. All the stages except the raw material pool place replenishment orders.
2. Replenishment orders are satisfied according to available production and inventory capacity.
3. Demand arises and is satisfied according to the inventory on hand at stage 0 (retailer).
4. Unsatisfied demand and replenishment orders are lost (backlogging is not allowed).
5. Holding costs are charged for each unit of excess inventory.
∀ m ∈ M and ∀ t ∈ T, the SCE dynamics is governed by the following set of equations:

I^m_{t+1} = I^m_t + Q^m_{t−L^m} − ζ^m_t, (1)
V^m_{t+1} = V^m_t + Q^m_t − Q^m_{t−L^m}, (2)
Q^m_t = min(Q̂^m_t, c^{m+1}, I^{m+1}_t), (3)
ζ^0_t = min(I^0_t + Q^0_{t−L^0}, D_t); ζ^m_t = Q^{m−1}_t, m ≥ 1, (4)
U^0_t = D_t − ζ^0_t; U^m_t = Q̂^{m−1}_t − Q^{m−1}_t, m ≥ 1, (5)
NP_t = ∑_{m∈M} (ρ^m ζ^m_t − r^m Q^m_t − k^m U^m_t − h^m I^m_{t+1}), (6)
max ∑_{t∈T} NP_t, (7)

where at the beginning of each period t ∈ T at each stage m ∈ M, I denotes the inventory on hand, and V stands for the commodities that are ordered and on the way (pipeline inventory). Q corresponds to the accepted reorder quantity, and Q̂ is the requested reorder quantity. L stands for the replenishment lead time between stages. Demand D is a discrete random variable under a Poisson distribution. The sales ζ at each period equal the customer demand satisfied by the retailer at stage 0 and the accepted reorder quantities from the succeeding stages from 1 to m_end. U denotes unfulfilled demand and unfulfilled reorder requests. The net profit NP equals sales revenue minus procurement costs, the penalty for unfulfilled demand, and inventory holding costs. ρ, r, k, and h stand for the unit sales price, unit procurement cost, unit penalty for unfulfilled demand, and unit inventory holding cost, respectively. If the production capacity and the upstream inventory level are sufficient (the request does not exceed the capacity constraints c), Q equals Q̂. However, if capacities are insufficient, they impose an upper bound on the reorder quantity that can be accepted. It is assumed that the inventory of raw materials at stage m_end is infinite.
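The per-period bookkeeping described here (capacity-limited order acceptance, lost sales, and net-profit accounting) can be illustrated with a minimal single-retailer sketch. The function name and all parameter values are illustrative, and a zero lead time is assumed, unlike in the full SCE.

```python
def simulate_period(inv_on_hand, order_qty, demand, capacity,
                    price=2.0, cost=1.0, penalty=0.5, holding=0.1):
    """One toy SCE period for a single retailer (illustrative parameters).

    Follows the event sequence from the text: orders are capped by
    capacity, demand is filled from inventory on hand, unmet demand is
    lost, and holding costs are charged on leftover stock.
    """
    # 1-2. The replenishment request is satisfied up to the capacity limit.
    accepted = min(order_qty, capacity)
    inv_on_hand += accepted  # zero lead time assumed for this sketch
    # 3. Demand is satisfied from the inventory on hand at the retailer.
    sales = min(inv_on_hand, demand)
    inv_on_hand -= sales
    # 4. Unsatisfied demand is lost (no backlogging).
    unfulfilled = demand - sales
    # 5. Net profit: revenue minus procurement, penalty, and holding costs.
    net_profit = (price * sales - cost * accepted
                  - penalty * unfulfilled - holding * inv_on_hand)
    return inv_on_hand, net_profit
```

For example, with 5 units on hand, a requested order of 10 against a capacity of 8, and a demand of 12, only 8 units are accepted, all 12 units of demand are served, and one unit incurs a holding charge.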
It is important to emphasize that besides the non-perishability of goods and integer replenishment sizes, SCE makes a strong assumption regarding end-to-end visibility. Namely, information is assumed to be complete, accurate, and available along the supply chain. This assumption appears to be unrealistic in many real-world applications, given the current level of digitalization. The assumption can hold if a single company controls the vast majority of a supply chain, say, from farm to fork or from quartz sand to microchip. However, it is false for a large swath of supply chains. We primarily see an outsourcing trend, according to which companies focus on the activities in the supply chain where they have a distinct advantage and outsource everything else. Therefore, the pervasive adoption of fully fledged supply chain digital twins or other technologies capable of providing information transparency and end-to-end visibility is a necessary infrastructural condition for real-world implementation. Furthermore, assuming that information transparency is at least partially provided, SCE can serve as a proxy for real-world applications in supply chain management to gauge generalized algorithm performance.

Proximal Policy Optimization Algorithm
PPO is applied to train the RL agent. PPO was developed by the OpenAI team and is distinguished by simple implementation, generality, and low sample complexity (the number of training samples that the RL agent needs to learn a target function with a sufficient degree of accuracy). The vast majority of machine learning algorithms require hyperparameters that must be fine-tuned external to the learning algorithm itself. In this regard, an extra advantage of PPO is that it can provide solid results with either default parameters or relatively little hyperparameter tuning [24].
PPO is an actor-critic approach to DRL. It uses two deep artificial neural networks: one to generate actions at each time step (the actor) and one to predict the corresponding rewards (the critic). The actor learns a policy that produces a probability distribution over feasible actions. On the other hand, the critic learns to estimate the reward function for given states and actions. The difference between the reward predicted by the critic with parameters θ and the actual reward received from the environment is reflected in the loss function Loss(θ). PPO limits the parameter updates by clipping the loss function (Equation (8)). This loss function constitutes the distinguishing feature of PPO and leads to more stable learning compared to other state-of-the-art policy gradient methods across various benchmarks.
Loss(θ) = Ê_t[min(ρ_t(θ) Â_t, cl(ρ_t(θ), 1 − ε, 1 + ε) Â_t)], ρ_t(θ) = π_k/π_{k−1}, (8)

where π_{k−1} and π_k denote the previous policy and the new policy, and k stands for the number of updates to the policy since initialization. The function cl(·) imposes a constraint of the form 1 − ε ≤ π_k/π_{k−1} ≤ 1 + ε. ε is a tunable hyperparameter that prevents policy updates that are too radical. The advantage estimate of the state is the sum of the discounted prediction errors over T time steps, Â_t = ∑ γ^{T−t+1} δ_T, where δ_T is the difference between the actual and the estimated rewards provided by the critic artificial neural network, also known as the temporal difference error. γ stands for the discount rate. Algorithm 1 demonstrates the implementation of PPO in actor-critic style. The implemented PPO uses fixed-length trajectory segments. In each iteration, each of N actors collects T timesteps of data in parallel. Then, the surrogate loss function is obtained and can be further optimized with mini-batch stochastic gradient descent. Therefore, the algorithm aims to compute an update at each iteration that minimizes the loss function while ensuring a relatively small deviation from the previous policy. As a result, the algorithm strikes a balance between ease of implementation and tuning and sample complexity.
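The clipped surrogate loss can be sketched in plain Python as follows; the probability ratios and advantage estimates are illustrative inputs rather than values produced by actual actor and critic networks.

```python
def clip(x, lo, hi):
    """The cl(.) operator: constrain x to the interval [lo, hi]."""
    return max(lo, min(hi, x))

def ppo_clipped_loss(ratios, advantages, eps=0.3):
    """Mean clipped surrogate loss over a batch of sampled actions.

    ratios     -- probability ratios pi_new / pi_old per action
    advantages -- advantage estimates per action
    eps        -- the clipping hyperparameter (0.3 in the experiments)
    """
    terms = []
    for r, a in zip(ratios, advantages):
        unclipped = r * a
        clipped = clip(r, 1.0 - eps, 1.0 + eps) * a
        # Taking the pessimistic (minimum) bound removes the incentive
        # to push the ratio far outside [1 - eps, 1 + eps].
        terms.append(min(unclipped, clipped))
    # Negated so that maximizing the surrogate = minimizing the loss.
    return -sum(terms) / len(terms)
```

With a ratio of 2.0 and a positive advantage, the clipped term caps the contribution at 1 + ε times the advantage, which is what keeps each policy update small.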

Numerical Experiment
This section describes three numerical experiments with different environment settings. In each experiment, the simulation lasts for 30 periods (modeling days). Furthermore, in all experiments, a DRL agent based on PPO is compared with the base-stock policy, a common approach in classic inventory control theory.

Implementation and Configurations
The first numerical experiment is conducted with a four-stage supply chain. Table 1 contains the parameters of SCE used in the experiment. The second and third numerical experiments are conducted with larger four-stage supply chains using longer lead times and generally more challenging environment settings (Tables 2 and 3). The PPO algorithm is implemented in parallel using the Ray framework. The framework performs actor-based computations orchestrated by a single dynamic execution engine. In addition, Ray is distinguished by a distributed scheduler and a fault-tolerant storage framework. The framework has demonstrated scaling beyond 1.8 million tasks per second and better performance than alternative solutions for several challenging RL problems [25]. In the numerical experiments, a feed-forward architecture with one hidden layer of 256 neurons is used for the actor and critic artificial neural networks. An exponential linear unit is used as the activation function, ε equals 0.3, and the learning rate is set to 10^−5.
The PPO algorithm is compared with the base-stock policy, a well-known approach in classic operations research and inventory control theory. The base-stock policy is especially common in continuous-review inventory systems. Even though demand is assumed to be generated by a stationary and independent Poisson process, the base-stock policy can perform well even if this assumption does not hold [26]. The numerical experiments are conducted in a fully reproducible manner using Google Colaboratory, a hosted version of Jupyter Notebooks; thus, the results can be reproduced and verified by anyone [27]. Table 4 provides details on the available computational resources. Teraflops refers to the capability of calculating one trillion floating-point operations per second, a standard GPU performance metric.
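For reference, the base-stock benchmark follows an order-up-to rule: whenever the inventory position (on-hand plus pipeline inventory) falls below the base-stock level, the shortfall is reordered. The sketch below is a generic formulation, with an illustrative base-stock level rather than the calibrated values from the experiments.

```python
def base_stock_order(on_hand, pipeline, base_stock_level):
    """Order-up-to rule of a continuous-review base-stock policy.

    The inventory position is on-hand stock plus pipeline inventory;
    whenever it drops below the base-stock level, the policy orders
    exactly the shortfall, and nothing otherwise.
    """
    inventory_position = on_hand + pipeline
    return max(0, base_stock_level - inventory_position)
```

For example, with 4 units on hand, 3 units in the pipeline, and a base-stock level of 10, the policy orders 3 units; once the inventory position reaches the base-stock level, it orders nothing.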

Results
In the first experiment, the base-stock policy used as a benchmark achieved an average reward of 414.3 abstract monetary units with a standard deviation of 26.5. In 66,000 training episodes, the PPO agent derived a policy that led to an average reward of 425.6 abstract monetary units, which is 2.7% higher, while also having a smaller standard deviation of 19.4. The training procedure was conducted on the GPU and took 97 min on the hardware mentioned above.
The learning procedure and the comparison against the benchmark are demonstrated in Figure 3.

In the second experiment, the PPO agent decisively outcompeted the base-stock policy when operating in more challenging environment settings. The training procedure took 79 min (Figure 6).
During the third experiment, unusual behavior was observed. The PPO agent took advantage of a policy that stops ordering additional inventory closer to the end of the simulation run. This strategy can be explained by the fact that the simulation time is limited to 30 days, and the agent, based on its experience, expects that there will not be enough time to sell all the goods, so the holding costs associated with excessive inventory will not pay off. In other words, the policy learned by the agent takes into account the specifics of the SCE. During the training session, the PPO agent was capable of learning the policy, that is, a mapping between states and actions that entails the maximum expected reward. Figure 7 illustrates this unusual behavior under the control of the trained PPO agent. Indeed, such a problem may be circumvented by modifying the SCE, for example, by adding a "cool down" period equal to the sum of all the lead times to the initial time horizon. The tendency of an RL agent to exploit the mechanics of simulated environments in unexpected ways is quite common and well known in the domain of competitive esports. Therefore, this phenomenon deserves specific attention during the development of applications for real-world problems.
Nevertheless, even with such an unpredictable solution, the PPO agent managed to adapt and find a policy that entails a positive net profit in the specified environmental setup. The training procedure took 76 min. At the same time, the base-stock policy did not lead to a profitable outcome (Figure 8). Table 5 summarizes the results of all three numerical experiments. The table contains mean rewards, standard deviations, coefficients of variation, and 95% confidence intervals. The values are reported for the fully trained PPO agent and the base-stock policy.
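The summary statistics of this kind (mean, standard deviation, coefficient of variation, and a 95% confidence interval) can be computed as in the following sketch; the rewards list is illustrative, and a normal-approximation z-value of 1.96 is assumed for the interval.

```python
import math

def summarize_rewards(rewards, z=1.96):
    """Mean, sample standard deviation, coefficient of variation,
    and a normal-approximation confidence interval for the mean."""
    n = len(rewards)
    mean = sum(rewards) / n
    # Sample variance uses the n - 1 (Bessel-corrected) denominator.
    var = sum((x - mean) ** 2 for x in rewards) / (n - 1)
    std = math.sqrt(var)
    cv = std / mean
    half_width = z * std / math.sqrt(n)
    return mean, std, cv, (mean - half_width, mean + half_width)
```

A smaller coefficient of variation at a similar mean, as reported for the PPO agent in the first experiment, indicates a more stable reward across episodes.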

Discussion
The PPO algorithm has demonstrated a decisive capability to perform adaptive control in multi-stage supply chains. Among the discovered benefits, it is essential to highlight that the PPO algorithm is general and task-unspecific and does not rely on prior knowledge about the system. Moreover, the labor-intensive procedure of exhaustive hyperparameter tuning is not required.
Concerning the technical aspects, it is also worth mentioning that PPO implementation and training are straightforward within contemporary frameworks. However, it is essential to keep in mind that RL agents learn to map the state of the environment to a choice of action that entails the highest reward. Therefore, their reliance on simulated environments requires specific attention. In this regard, the simulated environment should not be oversimplified and must capture all the pivotal relations among the elements of a real-world supply chain. The tendency of the learning agent to abuse and exploit simulation mechanics was observed in this research and deserves primary consideration in development and deployment.

It is also worth emphasizing that a multitude of advanced supply chain models exists in the form of discrete event simulations (DES) [28]. Currently, the vast majority of RL research follows the standard prescribed by the OpenAI Gym framework [29]. In short, an environment has to be a Python class inherited from gym.Env and containing at least the methods reset and step. The step method takes an action as input and returns a new state and reward value. Thus, it can be considered the fundamental mechanism that advances time. Unlike in DES models, the time by default advances with a constant time step. Therefore, in order to produce a standardized RL environment out of a DES, the time has to become part of the state, an environment variable observable by the agent. On the other hand, since it is assumed that no change in the system takes place between consecutive events, the step method has to execute and reschedule the next event in the list. This problem is addressed by Schuderer et al. [30], and if a standardized and straightforward way of transforming DES models into RL environments is developed, a plethora of models developed over the past 30 years will become available to RL applications.
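A minimal illustration of this interface follows. The class mirrors the gym.Env contract (reset and step returning state, reward, done flag, and an info dictionary) without importing the library, and its one-product inventory dynamics and cost parameters are purely illustrative.

```python
class ToyInventoryEnv:
    """Follows the gym.Env contract (reset/step) without inheriting from
    it, so the sketch stays dependency-free. The state is the pair
    (day, inventory on hand); time is part of the state and advances by
    one fixed step per call, as described in the text."""

    def __init__(self, horizon=5, demand=2, capacity=10):
        self.horizon, self.demand, self.capacity = horizon, demand, capacity

    def reset(self):
        self.day, self.inventory = 0, 0
        return (self.day, self.inventory)

    def step(self, order_qty):
        # Replenishment arrives instantly, capped by inventory capacity.
        self.inventory = min(self.inventory + order_qty, self.capacity)
        # Constant demand is filled from inventory on hand; excess is lost.
        sales = min(self.inventory, self.demand)
        self.inventory -= sales
        # Illustrative reward: revenue minus procurement and holding costs.
        reward = 2.0 * sales - 1.0 * order_qty - 0.1 * self.inventory
        self.day += 1
        done = self.day >= self.horizon
        return (self.day, self.inventory), reward, done, {}
```

Wrapping a DES model in this shape would additionally require step to execute and reschedule the next event rather than advance a constant time step.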
Indeed, a DRL agent of this kind can synchronize the supply chain if and only if end-to-end visibility is provided. Moreover, planning and real-time control require constant data availability on replenishments, inventory, demand, and capacities [31]. The ultimate fulfillment of these requirements is the advent of a digital replica of a supply chain that represents the system state at any given moment in real time, providing complete end-to-end visibility and the possibility to test contingency plans. This concept is known as a digital supply chain twin [14]. Incorporating RL agents into supply chain digital twins can be considered one of the most promising trajectories for future research.
The capability of a DRL agent to transfer knowledge across multiple environments is considered a critical aspect of intelligent agents and a potential path toward artificial general intelligence [32]. The PPO agent demonstrated the ability to learn in different SCEs with the same hyperparameters, which can be considered sufficient at the proof-of-concept stage. This is, however, a different type of generalization from the ability of a trained agent to perform well in unseen environments: in its current implementation, the agent cannot perform well in a previously unseen SCE configuration. In this regard, performance generalization techniques such as multi-task learning [33] and policy distillation [34] can be postulated as an additional direction for future research.
In addition, in order to study how stochasticity in SCE inputs affects the output, future work should include a sensitivity analysis of the parameters of the model and the algorithm.

Conclusions
The conducted numerical experiments demonstrated that the proposed solution can outperform the base-stock policy in adaptive control over stochastic supply chains. In the three-stage supply chain environment, the base-stock policy used as a benchmark achieved an average reward of 414.3 abstract monetary units with a standard deviation of 26.5. The PPO agent, after 66,000 training episodes, derived a policy that leads to an average reward of 425.6 abstract monetary units, which is 2.7% higher, with a smaller standard deviation. In more challenging environments with more extensive supply chains and longer lead times, the PPO agent decisively outperformed the base-stock policy by deriving a policy that entails a positive net profit.
However, the sequence of experiments also revealed the tendency of the learning agent to exploit simulation mechanics in unexpected ways, a phenomenon well known in the domain of competitive cybersport. It deserves specific consideration in real-world implementations: the simulated environment should not be oversimplified and must capture all the pivotal relations among the elements of a real-world supply chain or logistics system.
Among the technical advantages, it is essential to highlight that the PPO algorithm is general, task unspecific, does not rely on prior knowledge about the system, and requires neither an explicitly defined action space nor exhaustive hyperparameter tuning. Additionally, its implementation and training are straightforward within contemporary frameworks. It is worth keeping in mind that RL agents learn to map the state of the environment to the choice of action that entails the highest reward; their reliance on simulated environments therefore requires specific attention. From the applicability point of view, the assumption of provided end-to-end visibility may be unrealistic in many real-world applications given the current digitalization level, and pervasive adoption of fully fledged supply chain digital twins is a necessary infrastructural condition. Nevertheless, if information visibility and transparency are achieved, the SCE can serve as a proxy for real-world applications in supply chain management to gauge generalized algorithm performance.