A Supply Chain Inventory Management Method for Civil Aircraft Manufacturing Based on Multi-Agent Reinforcement Learning
Abstract
1. Introduction
1.1. Inventory Management Methods with Value-Based RL
1.2. Inventory Management Methods with Policy-Based RL
1.3. Inventory Management Methods with Actor-Critic RL
- Demand modeling is oversimplified and fails to capture the correlation between materials and process flows, as well as the dependencies between different process flows;
- Different participants play different roles in the supply-demand relationship and have different goals and expectations for inventory management; single-agent RL algorithms or homogeneous MARL algorithms offer limited room for further efficiency improvement;
- As the number of supply chain participants and the types of materials increase, the state space and action space of the environment grow explosively, requiring more efficient exploration of the policy space;
- As the number of supply chain participants increases, so does the need for collaborative cooperation within the supply chain; more effective methods are needed to prevent coupling when supply chain participants update their policies.
- We formalize the main manufacturer-supplier supply chain inventory management problem as a partially observable Markov decision process (POMDP) model.
- Based on this model, we propose a MARL method for supply chain inventory management in which a dual-policy and information transmission mechanism is designed to help each supply chain participant make better use of global supply chain information and coordinate more efficiently with the other participants.
- We train and evaluate the proposed method in an environment abstracted from actual civil aircraft manufacturing. Experimental results show that our method achieves an improvement of about 45% over current RL-based supply chain inventory management methods.
2. POMDP Modeling
2.1. General Assumptions
- All participants in the supply chain make rational decisions and are not affected by uncontrollable factors such as politics;
- The time granularity of the model is weekly;
- The time required for material storage, withdrawal, and inventory counting is ignored in order to emphasize the critical aspects of the problem;
- The circulation cost between warehouses within a single party is ignored in order to highlight the interactions between the parties involved in the supply chain;
- In line with the actual production situation, the supplier selection problem is not considered; each material is provided by a unique supplier;
- The material requirements that the production line submits to the inventory department are specified per process type rather than for the entire aircraft;
- The start times for preparing the materials of each flight in the annual plan are evenly distributed throughout the year (an illustrative configuration sketch follows this list).
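To make these assumptions concrete, the following minimal sketch shows one way the environment configuration could be encoded. The class, field names, and numeric defaults (e.g., the number of flights in the annual plan, the example supplier mapping) are illustrative assumptions, not values from the paper.

```python
from dataclasses import dataclass

@dataclass
class SupplyChainConfig:
    """Hypothetical configuration object encoding the general assumptions above."""
    time_step: str = "week"        # weekly time granularity
    n_flights_per_year: int = 20   # assumed size of the annual plan
    weeks_per_year: int = 52

    def preparation_start_weeks(self) -> list:
        """Start weeks for material preparation, spread evenly over the year."""
        step = self.weeks_per_year / self.n_flights_per_year
        return [round(k * step) for k in range(self.n_flights_per_year)]

# Each material is provided by a unique supplier (no supplier selection),
# modelled here as a simple material -> supplier mapping (example names only).
SUPPLIER_OF = {
    "structural component": "supplier_1",
    "system component": "supplier_2",
    "standard component A": "supplier_3",
}
```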
2.2. Demand Modeling
2.3. POMDP Specification
- $N$: represents the set of all agents involved in the problem; $n = |N|$ is the number of agents.
- $S$: represents the state space of global states in the problem. $s_t \in S$ represents the global state at time $t$. It is a vector composed of the observation information of each agent, $s_t = (o_t^1, o_t^2, \ldots, o_t^n)$.
- $A$: represents the joint action space of the agents in the problem, composed of the action spaces of the individual agents, $A = A_1 \times A_2 \times \cdots \times A_n$. Considering that different materials have different demand levels, we use the action to represent a ratio relative to the production capacity and inventory level of the agent. For the action space of agent $i$, assuming $m_i$ is the number of material types involved in the agent's inventory, the action space is $A_i = [0, 1]^{m_i}$, and at time $t$ the action of agent $i$ is $a_t^i \in A_i$. It may seem natural to use continuous variables directly as the actions, but such a continuous action space is inconvenient in actual training. On the one hand, it makes the action space too large to explore efficiently. On the other hand, because of the legality constraint on actions, samples drawn outside the interval $[0, 1]$ must be clipped, which causes non-differentiable spikes at the two endpoints of the feasible action domain. Therefore, the action space is discretized into a finite set of ratio levels in $[0, 1]$ (an illustrative sketch follows this list).
- $T$: represents the state transition function of the global state $S$ in the problem.
- $R$: represents the reward obtained by the agents in the environment. Each agent can receive two types of reward information: an individual reward and a team reward. The individual reward is computed from the agent's local observation information and represents the cost of the agent itself. The team reward is computed from the global state and represents the total cost of the supply chain; minimizing this total cost is the common optimization target of all agents. The team reward is the sum of the individual rewards of all agents in the supply chain, $r_t^{\text{team}} = \sum_{i=1}^{n} r_t^{i}$. The individual reward $r_t^{i}$ of agent $i$ is composed of the following three parts:
- Inventory cost: the cost incurred to maintain inventory, including storage costs, storage management costs, etc. At time $t$, the inventory ratio of material $j$ in the inventory of agent $i$ is $I_t^{i,j}$, and the inventory cost of the agent is $c_I \sum_{j} I_t^{i,j}$, where $c_I$ is the unit inventory-cost coefficient.
- Shortage cost: the cost incurred when due material demand cannot be met because of insufficient inventory. At time $t$, the proportion of material $j$ in the unfulfilled due orders of agent $i$ is $B_t^{i,j}$, and the shortage cost of the agent is $c_B \sum_{j} B_t^{i,j}$, where $c_B$ is the unit shortage-cost coefficient.
- Overflow cost: the additional storage cost incurred due to excess inventory. At time $t$, the proportion of material $j$ that exceeds the upper limit of agent $i$'s inventory capacity is $V_t^{i,j}$, and the overflow cost of the agent is $c_V \sum_{j} V_t^{i,j}$, where $c_V$ is the unit overflow-cost coefficient.
- $O$: represents the joint observation space of the agents in the problem, composed of the observation spaces of the individual agents, $O = O_1 \times O_2 \times \cdots \times O_n$. At time $t$, the observation information of agent $i$ is $o_t^i$. The observation information of each agent includes its inventory information and demand information. In addition, the observation of the main manufacturer agent includes its logistics information, and the observation of a supplier agent includes its production information.
- The inventory observation information represents the proportions of the various materials held by the agent. At time $t$, for agent $i$ it is $o_{I,t}^{i} = (I_t^{i,1}, I_t^{i,2}, \ldots, I_t^{i,m_i})$.
- The demand observation information represents the order demand information for the various materials of the agent. At time $t$, the demand observation information for material $j$ of agent $i$ is $o_{D,t}^{i,j} = (d_{t,1}^{i,j}, d_{t,2}^{i,j}, \ldots, d_{t,w}^{i,j})$, where $w$ is the maximum waiting time for material $j$ and $d_{t,k}^{i,j}$ is the ratio of the absolute quantity of material $j$ demanded with remaining order duration $k$ to the maximum inventory limit of material $j$ for agent $i$ at time $t$.
- The logistics observation information represents the transportation information for the various materials of the main manufacturer. At time $t$, the logistics observation information for material $j$ of the main manufacturer is $o_{L,t}^{j} = (e_{t,1}^{j}, e_{t,2}^{j}, \ldots, e_{t,l}^{j})$, where $l$ is the maximum transportation time for the material and $e_{t,k}^{j}$ is the ratio of the absolute quantity of material $j$ in transit with remaining transportation time $k$ to the maximum inventory limit of material $j$ for the main manufacturer at time $t$.
- The production observation information represents the production information for the corresponding material of the supplier. At time $t$, the production observation information for material $j$ of supplier $i$ is $o_{P,t}^{i,j} = (p_{t,1}^{i,j}, p_{t,2}^{i,j}, \ldots, p_{t,u}^{i,j})$, where $u$ is the maximum production time for the material and $p_{t,k}^{i,j}$ is the ratio of the absolute quantity of material $j$ in production with remaining production time $k$ to the maximum inventory limit of material $j$ for supplier $i$ at time $t$.
- $\gamma$: is the discount factor. It represents the importance of future rewards compared to immediate rewards for the agents.
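As a rough illustration of the quantities defined above, the sketch below assembles an agent's observation vector, discretizes a continuous order ratio, and computes an individual reward as a negative weighted cost. The number of discrete action levels and the cost coefficients are placeholder assumptions rather than the paper's actual values.

```python
import numpy as np

N_LEVELS = 5  # assumed number of discrete ratio levels in [0, 1]
ACTION_LEVELS = np.linspace(0.0, 1.0, N_LEVELS)

def discretize(ratio: float) -> float:
    """Map a continuous order ratio in [0, 1] to the nearest discrete level."""
    return float(ACTION_LEVELS[np.argmin(np.abs(ACTION_LEVELS - ratio))])

def observation(inv_ratios, demand_ratios, extra_ratios):
    """Concatenate inventory, demand, and logistics/production ratios into o_t^i."""
    return np.concatenate([inv_ratios, demand_ratios.ravel(), extra_ratios.ravel()])

def individual_reward(inv_ratios, shortage_ratios, overflow_ratios,
                      c_inv=1.0, c_short=5.0, c_over=2.0):
    """Negative cost: inventory + shortage + overflow terms summed over materials."""
    return -(c_inv * inv_ratios.sum()
             + c_short * shortage_ratios.sum()
             + c_over * overflow_ratios.sum())

def team_reward(individual_rewards):
    """Team reward: the sum of the individual rewards over all agents."""
    return float(np.sum(individual_rewards))
```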
3. Proposed Supply Chain Inventory Management Method
3.1. Method Overview
3.2. Dual Policy Model with Information Transmission
4. Experiment Results
4.1. Experiment Environment
4.2. Compared Methods
4.3. Results and Analysis
4.3.1. The Macro-Performance Evaluation
4.3.2. The Micro-Performance Evaluation
5. Conclusions and Future Work
5.1. Conclusions
5.2. Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Slimani, I.; Farissi, I.E.; Achchab, S. Configuration and implementation of a daily artificial neural network-based forecasting system using real supermarket data. Int. J. Logist. Syst. Manag. 2017, 28, 144–163. [Google Scholar] [CrossRef]
- Kim, M.; Lee, J.; Lee, C.; Jeong, J. Framework of 2d kde and lstm-based forecasting for cost-effective inventory management in smart manufacturing. Appl. Sci. 2022, 12, 2380. [Google Scholar] [CrossRef]
- Rajesh, R. A grey-layered ANP based decision support model for analyzing strategies of resilience in electronic supply chains. Eng. Appl. Artif. Intell. 2020, 87, 103338. [Google Scholar] [CrossRef]
- Mokhtarinejad, M.; Ahmadi, A.; Karimi, B.; Rahmati, S.H.A. A novel learning based approach for a new integrated location-routing and scheduling problem within cross-docking considering direct shipment. Appl. Soft Comput. 2015, 34, 274–285. [Google Scholar] [CrossRef]
- Cantini, A.; Peron, M.; De Carlo, F.; Sgarbossa, F. A decision support system for configuring spare parts supply chains considering different manufacturing technologies. Int. J. Prod. Res. 2022, 1–21. [Google Scholar] [CrossRef]
- Taboada, H.; Davizón, Y.A.; Espíritu, J.F.; Sánchez-Leal, J. Mathematical Modeling and Optimal Control for a Class of Dynamic Supply Chain: A Systems Theory Approach. Appl. Sci. 2022, 12, 5347. [Google Scholar] [CrossRef]
- Afsar, H.M.; Ben-Ammar, O.; Dolgui, A.; Hnaien, F. Supplier replacement model in a one-level assembly system under lead-time uncertainty. Appl. Sci. 2020, 10, 3366. [Google Scholar] [CrossRef]
- Fallahpour, A.; Wong, K.Y.; Olugu, E.U.; Musa, S.N. A predictive integrated genetic-based model for supplier evaluation and selection. Int. J. Fuzzy Syst. 2017, 19, 1041–1057. [Google Scholar] [CrossRef]
- De Moor, B.J.; Gijsbrechts, J.; Boute, R.N. Reward shaping to improve the performance of deep reinforcement learning in perishable inventory management. Eur. J. Oper. Res. 2022, 301, 535–545. [Google Scholar] [CrossRef]
- Zhang, C.; Xie, Y.; Bai, H.; Yu, B.; Li, W.; Gao, Y. A survey on federated learning. Knowl.-Based Syst. 2021, 216, 106775. [Google Scholar] [CrossRef]
- Giannoccaro, I.; Pontrandolfo, P. Inventory management in supply chains: A reinforcement learning approach. Int. J. Prod. Econ. 2002, 78, 153–161. [Google Scholar] [CrossRef]
- Cuartas, C.; Aguilar, J. Hybrid algorithm based on reinforcement learning for smart inventory management. J. Intell. Manuf. 2023, 34, 123–149. [Google Scholar] [CrossRef]
- Nurkasanah, I. Reinforcement learning approach for efficient inventory policy in multi-echelon supply chain under various assumptions and constraints. J. Inf. Syst. Eng. Bus. Intell. 2021, 7, 138–148. [Google Scholar] [CrossRef]
- Jiang, C.; Sheng, Z. Case-based reinforcement learning for dynamic inventory control in a multi-agent supply-chain system. Expert Syst. Appl. 2009, 36, 6520–6526. [Google Scholar] [CrossRef]
- Oroojlooy, A. Applications of Machine Learning in Supply Chains. Ph.D. Thesis, Lehigh University, Bethlehem, PA, USA, 2019. [Google Scholar]
- Kemmer, L.; von Kleist, H.; de Rochebouët, D.; Tziortziotis, N.; Read, J. Reinforcement learning for supply chain optimization. In Proceedings of the European Workshop on Reinforcement Learning, Lille, France, 1–3 October 2018; Volume 14. [Google Scholar]
- Hutse, V.; Verleysen, A.; Wyffels, F. Reinforcement Learning for Inventory Optimisation in Multi-Echelon Supply Chains; Master in Business Engineering—Ghent University: Gent, Belgium, 2019. [Google Scholar]
- Alves, J.C.; Mateus, G.R. Deep reinforcement learning and optimization approach for multi-echelon supply chain with uncertain demands. In Proceedings of the International Conference on Computational Logistics, Enschede, The Netherlands, 28–30 September 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 584–599. [Google Scholar]
- Alves, J.C.; Silva, D.M.d.; Mateus, G.R. Applying and comparing policy gradient methods to multi-echelon supply chains with uncertain demands and lead times. In Proceedings of the International Conference on Artificial Intelligence and Soft Computing, Virtual Event, 21–23 June 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 229–239. [Google Scholar]
- Wang, H.; Tao, J.; Peng, T.; Brintrup, A.; Kosasih, E.E.; Lu, Y.; Tang, R.; Hu, L. Dynamic inventory replenishment strategy for aerospace manufacturing supply chain: Combining reinforcement learning and multi-agent simulation. Int. J. Prod. Res. 2022, 60, 4117–4136. [Google Scholar] [CrossRef]
- Kara, A.; Dogan, I. Reinforcement learning approaches for specifying ordering policies of perishable inventory systems. Expert Syst. Appl. 2018, 91, 150–158. [Google Scholar] [CrossRef]
- Abu Zwaida, T.; Pham, C.; Beauregard, Y. Optimization of inventory management to prevent drug shortages in the hospital supply chain. Appl. Sci. 2021, 11, 2726. [Google Scholar] [CrossRef]
- Mortazavi, A.; Khamseh, A.A.; Azimi, P. Designing of an intelligent self-adaptive model for supply chain ordering management system. Eng. Appl. Artif. Intell. 2015, 37, 207–220. [Google Scholar] [CrossRef]
- Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971. [Google Scholar]
- Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870. [Google Scholar]
- Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
- Barat, S.; Khadilkar, H.; Meisheri, H.; Kulkarni, V.; Baniwal, V.; Kumar, P.; Gajrani, M. Actor based simulation for closed loop control of supply chain using reinforcement learning. In Proceedings of the 18th International Conference on Autonomous Agents and Multiagent Systems, Montreal, QC, Canada, 13–17 May 2019; pp. 1802–1804. [Google Scholar]
- Sultana, N.N.; Meisheri, H.; Baniwal, V.; Nath, S.; Ravindran, B.; Khadilkar, H. Reinforcement learning for multi-product multi-node inventory management in supply chains. arXiv 2020, arXiv:2006.04037. [Google Scholar]
- Demizu, T.; Fukazawa, Y.; Morita, H. Inventory management of new products in retailers using model-based deep reinforcement learning. Expert Syst. Appl. 2023, 229, 120256. [Google Scholar] [CrossRef]
- Jullien, S.; Ariannezhad, M.; Groth, P.; de Rijke, M. A simulation environment and reinforcement learning method for waste reduction. Trans. Mach. Learn. Res. 2023; in press. [Google Scholar]
- Yu, C.; Velu, A.; Vinitsky, E.; Wang, Y.; Bayen, A.; Wu, Y. The surprising effectiveness of ppo in cooperative, multi-agent games. arXiv 2021, arXiv:2103.01955. [Google Scholar]
- Foerster, J.; Farquhar, G.; Afouras, T.; Nardelli, N.; Whiteson, S. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
- Son, K.; Kim, D.; Kang, W.J.; Hostallero, D.E.; Yi, Y. Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 5887–5896. [Google Scholar]
- Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.; Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv 2015, arXiv:1506.02438. [Google Scholar]
| Contributors | Contributed Content |
|---|---|
| Giannoccaro, I. [11] | Applied RL to supply chain inventory management |
| Cuartas, C. [12] | Discussed reward function definition based on Q-learning |
| Nurkasanah, I. [13] | Set minimizing costs as the optimization target |
| Jiang, C. [14] | Set maintaining a specific target inventory level as the objective |
| Oroojlooy, A. [15] | Combined DQN and heuristic algorithms in the beer game |
| Kemmer, L. [16] | Used SARSA for the multi-node, single-item supply chain inventory management problem |
| Hutse, V. [17] | Added lead time in delivery |
| Alves, J.C. [18] | Constructed a four-level, single-material supply chain environment |
| Alves, J.C. [19] | Compared the performance of several RL algorithms |
| Wang, H. [20] | Studied aviation inventory management of multiple materials under a single node |
| Assembling Process | Material Category | Demand Mean | Demand Std |
|---|---|---|---|
| assembling the center wing | structural component | 25 | 2 |
| | system component | 150 | 5 |
| | standard component A | 1500 | 50 |
| assembling the center fuselage | structural component | 50 | 2 |
| | system component | 300 | 5 |
| | standard component B | 2000 | 50 |
| assembling the aft fuselage | structural component | 50 | 2 |
| | system component | 300 | 5 |
| | standard component A | 2000 | 50 |
| assembling the nose | structural component | 50 | 2 |
| | system component | 300 | 5 |
| | standard component B | 2000 | 50 |
| joining the fuselage | structural component | 40 | 2 |
| | system component | 500 | 5 |
| | standard component A | 3000 | 50 |
| assembling the tailfin | structural component | 25 | 2 |
| | system component | 150 | 5 |
| | standard component B | 1500 | 50 |
| installing the wing stabilizer | structural component | 25 | 2 |
| | system component | 150 | 5 |
| | standard component A | 1500 | 50 |
| joining the entire aircraft | structural component | 40 | 2 |
| | system component | 100 | 5 |
| | standard component B | 2000 | 50 |
| installing system components | structural component | 25 | 2 |
| | system component | 200 | 5 |
| | standard component A | 1000 | 50 |
| installing engines | structural component | 20 | 2 |
| | system component | 150 | 5 |
| | standard component B | 2000 | 50 |
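For illustration only, the following sketch samples one week of process-level demand from normal distributions parameterized as in the table above (mean, standard deviation), truncating draws at zero. The dictionary keys and the truncation rule are our assumptions, and only the first process is spelled out.

```python
import numpy as np

rng = np.random.default_rng(0)

# (process, material) -> (demand mean, demand std); remaining pairs follow the table above.
DEMAND_PARAMS = {
    ("assembling the center wing", "structural component"): (25, 2),
    ("assembling the center wing", "system component"): (150, 5),
    ("assembling the center wing", "standard component A"): (1500, 50),
}

def sample_weekly_demand():
    """One week's demand per (process, material) pair, as non-negative integers."""
    return {key: max(0, int(round(rng.normal(mean, std))))
            for key, (mean, std) in DEMAND_PARAMS.items()}
```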
| Parameters | Value |
|---|---|
| number of agents | 23 |
| length of trajectories | 400 |
| number of PPO epochs | 10 |
| learning rate | 7 × 10 |
| discount factor | 0.99 |
| clip | 0.2 |
| clip | 0.2 |
| clip | 0.2 |
| GAE λ | 0.95 |
| KL coefficient | 0.0 → 1.0 |
| KL coefficient | 1.0 → 0.0 |
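The table above parameterizes standard PPO machinery. As a generic reminder (not the authors' training code), the sketch below shows how the discount factor, GAE λ, and the clip threshold enter advantage estimation and the clipped surrogate objective.

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory.

    Assumes `values` holds one value estimate per step and the trajectory
    terminates at the end (bootstrap value of 0 past the last step).
    """
    advantages = np.zeros(len(rewards))
    last_adv = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last_adv = delta + gamma * lam * last_adv
        advantages[t] = last_adv
    return advantages

def ppo_clip_loss(ratio, advantage, clip=0.2):
    """Clipped surrogate objective for a single sample (to be maximized)."""
    return float(np.minimum(ratio * advantage,
                            np.clip(ratio, 1 - clip, 1 + clip) * advantage))
```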
| Sub-Network | Parameters | Critic Network Setting | Policy Network Setting |
|---|---|---|---|
| MLP (input) | input feature size | size of the global state $s_t$ | size of the observation $o_t^i$ |
| | number of hidden layers | 4 | 4 |
| | hidden size | 64 | 64 |
| | output feature size | 64 | 64 |
| RNN | input feature size | 64 | 64 |
| | number of recurrent layers | 2 | 2 |
| | output feature size | 64 | 64 |
| MLP (output) | input feature size | 64 | 64 |
| | number of hidden layers | 4 | 4 |
| | hidden size | 64 | 64 |
| | output feature size | 1 | size of the action space $A_i$ |
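The sub-network sizes in the table above suggest an MLP-RNN-MLP backbone shared by the critic and policy networks. The PyTorch sketch below follows those sizes (two 4-layer MLPs with hidden size 64 around a 2-layer recurrent block); the choice of GRU cells, ReLU activations, and the exact input/output dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64, layers=4):
    """MLP with `layers` hidden layers of width `hidden` and a linear output."""
    mods, d = [], in_dim
    for _ in range(layers):
        mods += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    mods.append(nn.Linear(d, out_dim))
    return nn.Sequential(*mods)

class ActorOrCritic(nn.Module):
    """MLP (input) -> RNN -> MLP (output), as in the architecture table."""
    def __init__(self, input_size, output_size):
        super().__init__()
        self.encoder = mlp(input_size, 64)                           # MLP (input)
        self.rnn = nn.GRU(64, 64, num_layers=2, batch_first=True)    # RNN
        self.head = mlp(64, output_size)                             # MLP (output)

    def forward(self, x, h=None):
        # x: (batch, time, input_size); returns per-step outputs and hidden state.
        z = self.encoder(x)
        z, h = self.rnn(z, h)
        return self.head(z), h

# Example: critic = ActorOrCritic(state_size, 1); actor = ActorOrCritic(obs_size, n_actions)
```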