Dynamic Scheduling Method for Job-Shop Manufacturing Systems by Deep Reinforcement Learning with Proximal Policy Optimization
Abstract
1. Introduction
2. Preliminary
2.1. Markov Decision Process
2.2. Policy Gradient Theorem
3. Proposed Methods
3.1. Dynamic Simulation of Production Environment
 Machines with similar order-processing capabilities are assigned to the same group.
 Orders are released by the sources and are processed according to a specific probability.
 The processing time of each order is drawn from a predefined probability distribution.
 Machine failures are considered in this environment, and any of the machines can break down. The failure events are triggered randomly based on the mean time between failures (MTBF) and the mean time offline (MTOL).
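The failure mechanism in the last item can be sketched as alternating uptime and downtime periods for one machine. This is only an illustrative sketch: the exponential distribution, the function name `sample_failure_windows`, and the fixed seed are assumptions, since the text states only that failures are triggered randomly from the MTBF and MTOL.

```python
import random


def sample_failure_windows(mtbf, mtol, horizon, seed=0):
    """Return (start, end) downtime windows for one machine up to `horizon`.

    Assumption: time-to-failure and offline durations are exponentially
    distributed with means `mtbf` and `mtol`. The source only says that
    failures are random and based on MTBF and MTOL.
    """
    rng = random.Random(seed)
    windows = []
    t = 0.0
    while True:
        # The machine runs until its next failure.
        t += rng.expovariate(1.0 / mtbf)
        if t >= horizon:
            break
        # The machine then stays offline for a random repair duration.
        down = rng.expovariate(1.0 / mtol)
        windows.append((t, min(t + down, horizon)))
        t += down
    return windows


# Values 1000 and 200 mirror the MTBF and MTOL settings listed for the scenarios.
windows = sample_failure_windows(mtbf=1000.0, mtol=200.0, horizon=10_000.0)
```

A simulation would mark the machine as failed whenever the current time falls inside one of the returned windows.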
3.1.1. Action Module
3.1.2. States
 Firstly, the state of action ${S}_{a{s}_{i}}$ shows whether the current action is valid or not, and it is defined as:$${S}_{a{s}_{i}}=\begin{cases}1 & {a}_{i}\in {A}_{valid}\\ 0 & \text{otherwise}\end{cases}$$
 Machine breakdowns are included in the simulation, so the failure state of each machine ${S}_{m{f}_{i}}$ is also considered, which is defined as:$${S}_{m{f}_{i}}=\begin{cases}1 & \text{if } {M}_{i} \text{ has a failure}\\ 0 & \text{otherwise}\end{cases}$$
 The remaining processing time of each machine ${M}_{i}$ is defined as:$${S}_{rp{t}_{i}}=\frac{{T}_{rp{t}_{i}}}{{T}_{ap{t}_{i}}}$$
 ${S}_{be{n}_{i}}$ indicates the remaining free buffer space in the entry buffer of each machine ${M}_{i}$:$${S}_{be{n}_{i}}=1-\frac{{N}_{oc{c}_{i}^{en}}}{{N}_{ca{p}_{i}^{en}}}$$
 ${S}_{be{x}_{i}}$ indicates the remaining free buffer space in the exit buffer of each machine ${M}_{i}$:$${S}_{be{x}_{i}}=1-\frac{{N}_{oc{c}_{i}^{ex}}}{{N}_{ca{p}_{i}^{ex}}}$$
 ${S}_{w{t}_{i}}$ indicates the waiting times of orders waiting for transport:$${S}_{w{t}_{i}}=\frac{{T}_{w{t}_{i}^{max}}-{T}_{w{t}_{i}^{mean}}}{{T}_{w{t}_{i}^{std}}}$$
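The state components listed above can be assembled into a per-machine feature vector plus an action-validity mask. The sketch below is one possible reading, not the authors' implementation: the `MachineSnapshot` container, its field names, and the zero-variance guard are hypothetical, and the buffer states are read as one minus the occupancy ratio, in line with their description as remaining free space.

```python
from dataclasses import dataclass


@dataclass
class MachineSnapshot:
    """Hypothetical container for the raw quantities behind the state features."""
    failed: bool
    t_remaining: float    # T_rpt: remaining processing time
    t_average: float      # T_apt: denominator used to normalise T_rpt
    entry_occupied: int   # N_occ^en
    entry_capacity: int   # N_cap^en
    exit_occupied: int    # N_occ^ex
    exit_capacity: int    # N_cap^ex
    wait_max: float       # T_wt^max
    wait_mean: float      # T_wt^mean
    wait_std: float       # T_wt^std


def machine_features(m):
    """Return [S_mf, S_rpt, S_ben, S_bex, S_wt] for one machine."""
    s_mf = 1.0 if m.failed else 0.0
    s_rpt = m.t_remaining / m.t_average
    s_ben = 1.0 - m.entry_occupied / m.entry_capacity
    s_bex = 1.0 - m.exit_occupied / m.exit_capacity
    # Assumption: fall back to 0 when the waiting times have no spread.
    s_wt = (m.wait_max - m.wait_mean) / m.wait_std if m.wait_std > 0 else 0.0
    return [s_mf, s_rpt, s_ben, s_bex, s_wt]


def action_mask(n_actions, valid_actions):
    """Return S_as: 1 for every valid action index, 0 otherwise."""
    return [1.0 if a in valid_actions else 0.0 for a in range(n_actions)]


snapshot = MachineSnapshot(False, 5.0, 10.0, 1, 4, 2, 4, 30.0, 20.0, 5.0)
features = machine_features(snapshot)
mask = action_mask(3, {0, 2})
```

Concatenating the mask with the feature vectors of all machines would give a flat observation for the agent, although the exact ordering used in the paper is not specified here.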
3.2. Deep Reinforcement Learning for Dynamic Scheduling
3.2.1. Optimization Objectives
 The constant reward ${R}_{const}$ rewards a valid action with value ${\omega}_{1}$ for ${A}_{S\to M}$ and ${\omega}_{2}$ for ${A}_{M\to S}$, and it is defined as:$${R}_{const}({S}_{t},{A}_{t})=\begin{cases}{\omega}_{1} & {A}_{t}\in {A}_{S\to M}\\ {\omega}_{2} & {A}_{t}\in {A}_{M\to S}\\ 0 & \text{otherwise}\end{cases}$$
 To promote the average utilization $U$, ${R}_{uti}$ is designed as an exponential function of $U$ that is granted when the agent provides a valid action. The purpose of this reward function is to maximize utilization, and it is defined as:$${R}_{uti}({S}_{t},{A}_{t})=\begin{cases}\exp\left(\frac{U}{1.5}\right)-1 & {A}_{t}\in {A}_{valid}\\ 0 & \text{otherwise}\end{cases}$$
 To shorten the waiting time $WT$ of orders, ${R}_{wt}$ is designed to reward the valid action determined by the agent. This reward function is also exponential, decaying with $WT$ so that orders leave the system sooner, and it is defined as:$${R}_{wt}({S}_{t},{A}_{t})=\begin{cases}\exp\left(-0.1\,WT\right)-0.5 & {A}_{t}\in {A}_{valid}\\ 0 & \text{otherwise}\end{cases}$$
 Combining ${R}_{const}$ with ${R}_{uti}$ and ${R}_{wt}$, two complex reward functions are designed as follows:$${R}_{\omega uti}({S}_{t},{A}_{t})=\begin{cases}{\omega}_{1}{R}_{uti}({S}_{t},{A}_{t}) & {A}_{t}\in {A}_{S\to M}\\ {\omega}_{2}{R}_{uti}({S}_{t},{A}_{t}) & {A}_{t}\in {A}_{M\to S}\\ 0 & \text{otherwise}\end{cases}$$
$${R}_{\omega wt}({S}_{t},{A}_{t})=\begin{cases}{\omega}_{1}{R}_{wt}({S}_{t},{A}_{t}) & {A}_{t}\in {A}_{S\to M}\\ {\omega}_{2}{R}_{wt}({S}_{t},{A}_{t}) & {A}_{t}\in {A}_{M\to S}\\ 0 & \text{otherwise}\end{cases}$$
 To implement multi-objective optimization, the hybrid reward function combining ${R}_{uti}$ and ${R}_{wt}$ is defined as:$${R}_{hybrid}({S}_{t},{A}_{t})={w}_{1}{R}_{uti}+{w}_{2}{R}_{wt}$$
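The reward functions above translate into a few lines of code. The following is a minimal sketch under stated assumptions: the action-type labels and function names are hypothetical, $U$ is taken as a fraction in $[0, 1]$, the exponent of the waiting-time reward is read as negative so that the reward decays as $WT$ grows, and the time unit of $WT$ is left to the caller.

```python
import math

# Hypothetical labels for the three action types.
S_TO_M, M_TO_S, WAITING = "S_TO_M", "M_TO_S", "WAITING"


def r_const(action, valid, omega1=0.5, omega2=0.5):
    """Constant reward: omega1 for S->M, omega2 for M->S, 0 otherwise."""
    if not valid:
        return 0.0
    return {S_TO_M: omega1, M_TO_S: omega2}.get(action, 0.0)


def r_uti(utilization, valid):
    """Utilization reward exp(U / 1.5) - 1 for valid actions (U assumed in [0, 1])."""
    return math.exp(utilization / 1.5) - 1.0 if valid else 0.0


def r_wt(waiting_time, valid):
    """Waiting-time reward exp(-0.1 * WT) - 0.5 for valid actions (sign is an assumption)."""
    return math.exp(-0.1 * waiting_time) - 0.5 if valid else 0.0


def weighted(base_reward, action, omega1=0.5, omega2=0.5):
    """Scale a base reward by omega1 (S->M) or omega2 (M->S); 0 for other actions."""
    return {S_TO_M: omega1, M_TO_S: omega2}.get(action, 0.0) * base_reward


def r_hybrid(utilization, waiting_time, valid, w1=0.5, w2=0.5):
    """Hybrid reward w1 * R_uti + w2 * R_wt."""
    return w1 * r_uti(utilization, valid) + w2 * r_wt(waiting_time, valid)


# R_omega_uti for a valid source-to-machine transport at 60% utilization.
example = weighted(r_uti(0.6, True), S_TO_M)
```

Because both base rewards return 0 for invalid actions, the weighted and hybrid variants inherit the same behaviour without an extra validity check.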
3.2.2. Proximal Policy Optimization
Algorithm 1 DRL with PPO for dynamic scheduling 
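As a generic illustration of the clipped surrogate objective at the core of PPO, and not a reproduction of Algorithm 1, the sketch below computes the batch-averaged objective from log-probabilities and advantage estimates. The function name and the plain-list interface are assumptions; the default `epsilon` of 0.01 mirrors the clipping value listed among the hyperparameters.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, epsilon=0.01):
    """Mean PPO clipped surrogate objective (to be maximised) over a batch.

    logp_new / logp_old: log-probabilities of the taken actions under the
    current and the data-collecting policy; advantages: advantage estimates.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # probability ratio r_t
        clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        # Take the pessimistic (smaller) of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# A ratio of 2 with a positive advantage is capped at 1 + epsilon.
capped = clipped_surrogate([math.log(2.0)], [0.0], [1.0])
```

In a training loop the negative of this value would be minimised with a gradient-based optimiser, which requires an automatic-differentiation framework rather than plain floats.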

4. Experiments
4.1. Case Description
4.2. Implementation Details
4.3. Results and Analysis
5. Discussion and Conclusions
 The optimal policy is learned solely from a large amount of interaction data with the production environment. Expert knowledge could be incorporated to further improve efficiency and performance.
 In the current simulation, only one dispatcher is used as the transport agent. A dynamic simulation environment with multiple transport agents should be developed in future studies, and the proposed deep reinforcement learning framework needs to be extended to multi-agent situations.
 For the dynamic job-shop scheduling problem, other well-known algorithms, such as GA, PSO, and TLBO, will be implemented and compared with the deep reinforcement learning framework. Once the corresponding benchmark problems are developed, we will validate all the algorithms within the dynamic environment.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
| Action Type | Description |
| --- | --- |
| ${A}_{waiting}$ | Dispatcher waits at its current position. |
| ${A}_{S\to M}$ | Dispatcher takes an unprocessed order from a source to a machine. |
| ${A}_{M\to S}$ | Dispatcher takes a processed order from a machine to a sink. |

| Parameter | Value |
| --- | --- |
| Learning rate | 0.001 |
| Batch size | 128 |
| Epoch number | 5 |
| Gamma $\gamma$ | 0.9 |
| Lambda $\lambda$ | 0.95 |
| Clipping $\epsilon$ | 0.01 |
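
To show how the discount factor and the lambda value listed above typically enter a PPO update, the sketch below computes generalized advantage estimates for a single rollout. It assumes that $\lambda$ is the GAE parameter, which the table does not state, and the function name and list-based interface are hypothetical. Episode termination inside the rollout is not handled.

```python
def gae_advantages(rewards, values, gamma=0.9, lam=0.95):
    """Generalized advantage estimates for one rollout.

    `values` must hold len(rewards) + 1 entries; the last one is the
    bootstrap value of the state reached after the final reward.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error at step t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


adv = gae_advantages([1.0, 1.0], [0.0, 0.0, 0.0])
```

With zero value estimates, the second step has advantage 1 and the first step adds the discounted contribution $\gamma\lambda \cdot 1$ on top of its own reward.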

| Parameter | Default Scenario | Scenario 1 | Scenario 2 |
| --- | --- | --- | --- |
| Dispatcher speed factor | 1 | 0.3 | 1 |
| Machine buffer factor | 6 | 0.5 | 1 |
| MTBF $\beta$ | 1000 | 1000 | 1000 |
| MTOL $\beta$ | 200 | 200 | 200 |
| ${\omega}_{1}$ | 0.5 | 0.5 | 0.5 |
| ${\omega}_{2}$ | 0.5 | 0.5 | 0.5 |

| Scenario | Heuristic | $U(\%)$ | $WT\left(s\right)$ | $\alpha$ |
| --- | --- | --- | --- | --- |
| Scenario 1 | Random | $38.93\pm 8.28$ | $203.76\pm 54.76$ | $5.21\pm 3.40$ |
| Scenario 1 | FIFO | $46.15\pm 3.68$ | $182.58\pm 17.44$ | $2.94\pm 0.78$ |
| Scenario 1 | NJF | $50.84\pm 5.29$ | $196.48\pm 19.07$ | $2.57\pm 0.87$ |
| Scenario 2 | Random | $54.86\pm 10.71$ | $138.79\pm 57.54$ | $1.57\pm 1.10$ |
| Scenario 2 | FIFO | $70.72\pm 6.82$ | $125.18\pm 22.51$ | $0.48\pm 0.16$ |
| Scenario 2 | NJF | $72.99\pm 7.35$ | $125.68\pm 23.57$ | $0.38\pm 0.11$ |

| Scenario | PPO reward function | $U(\%)$ | $WT\left(s\right)$ | $\alpha$ |
| --- | --- | --- | --- | --- |
| Scenario 1 | ${R}_{const}$ | $43.20\pm 3.72$ | $119.30\pm 11.04$ | $2.30\pm 0.63$ |
| Scenario 1 | ${R}_{\omega uti}$ | $44.21\pm 3.60$ | $130.65\pm 11.51$ | $2.37\pm 0.59$ |
| Scenario 1 | ${R}_{\omega wt}$ | $43.68\pm 4.11$ | $126.61\pm 12.02$ | $2.38\pm 0.71$ |
| Scenario 1 | ${R}_{hybrid}$ | $43.35\pm 3.67$ | $124.53\pm 19.15$ | $2.32\pm 0.62$ |
| Scenario 2 | ${R}_{const}$ | $62.29\pm 5.02$ | $80.79\pm 14.87$ | $0.56\pm 0.15$ |
| Scenario 2 | ${R}_{\omega uti}$ | $66.31\pm 7.09$ | $99.87\pm 20.55$ | $0.54\pm 0.18$ |
| Scenario 2 | ${R}_{\omega wt}$ | $62.03\pm 5.98$ | $80.10\pm 15.63$ | $0.57\pm 0.18$ |
| Scenario 2 | ${R}_{hybrid}$ | $62.75\pm 6.99$ | $80.56\pm 17.12$ | $0.54\pm 0.19$ |

| Weights | $U(\%)$ | $WT\left(s\right)$ | $\alpha$ |
| --- | --- | --- | --- |
| ${\omega}_{1}=0.1$, ${\omega}_{2}=0.9$ | $61.89\pm 5.81$ | $80.99\pm 16.14$ | $0.57\pm 0.16$ |
| ${\omega}_{1}=0.25$, ${\omega}_{2}=0.75$ | $62.30\pm 6.08$ | $80.35\pm 14.69$ | $0.56\pm 0.17$ |
| ${\omega}_{1}=0.5$, ${\omega}_{2}=0.5$ | $62.75\pm 6.99$ | $80.56\pm 17.12$ | $0.54\pm 0.19$ |
| ${\omega}_{1}=0.75$, ${\omega}_{2}=0.25$ | $68.46\pm 7.02$ | $106.22\pm 19.30$ | $0.48\pm 0.16$ |
| ${\omega}_{1}=0.9$, ${\omega}_{2}=0.1$ | $69.79\pm 7.16$ | $104.88\pm 20.29$ | $0.44\pm 0.16$ |

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Zhang, M.; Lu, Y.; Hu, Y.; Amaitik, N.; Xu, Y. Dynamic Scheduling Method for Job-Shop Manufacturing Systems by Deep Reinforcement Learning with Proximal Policy Optimization. Sustainability 2022, 14, 5177. https://doi.org/10.3390/su14095177