Large Language Model-Assisted Deep Reinforcement Learning from Human Feedback for Job Shop Scheduling
Abstract
1. Introduction
2. Literature Review
2.1. Job Shop Scheduling Problem
2.2. Deep Reinforcement Learning
2.3. Reward Function Design
2.4. Kolmogorov–Arnold Network
3. The HFLLMDRL Framework for the JSSP Based on Disjunctive Graphs
3.1. Constructing the Framework for HFLLMDRL
3.2. Modeling the Disjunctive Graphs for the JSSP
3.3. Generating a Reward Function Based on ROSES
- Captures the adjacency matrix of priority constraints between tasks;
- Indicates the task-to-machine mapping, i.e., the machine assigned to each task;
- Captures task durations normalized by the maximum processing time;
- The reward function must take these state characteristics into account and align with the specific characteristics of job shop scheduling, such as the need to manage task priorities and minimize delays throughout the production process. This context ensures that the generated reward functions are specific to the scheduling domain.
- Use only the environment variables explicitly defined in the JSSP environment class;
- Be compatible with TorchScript, requiring the use of torch.Tensor for all variables and ensuring device consistency between tensors;
- Return a single output, the total reward, formatted as a string of Python code. Rewards may include penalties for task delays, incentives for the early completion of tasks, and components for balancing machine workloads. For example, transformations such as torch.exp can be applied to normalize rewards or emphasize specific scheduling priorities.
- Identify key variables from the environment, such as task dependencies, processing time, and machine allocation;
- Define mathematical expressions for the reward components, ensuring that they capture trade-offs inherent in job shop scheduling (for example, penalizing idle machine time or task completion delays);
- Normalize and scale rewards where necessary, introducing temperature parameters to control the sensitivity of the transformed reward components;
- Format the final reward function as a TorchScript-compatible string of Python code and ensure that all implementation constraints are met.
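To make these constraints and steps concrete, the following is a minimal sketch of the kind of TorchScript-compatible reward function the LLM is asked to produce. The tensor names (completion_times, processing_times, machine_loads), the weighting, and the temperature value are illustrative assumptions rather than the actual variables of the JSSP environment class.

```python
import torch

@torch.jit.script
def jssp_reward(completion_times: torch.Tensor,
                processing_times: torch.Tensor,
                machine_loads: torch.Tensor,
                temperature: float = 0.1) -> torch.Tensor:
    # Penalize the current makespan (largest completion time so far),
    # normalized by the maximum processing time.
    makespan = torch.max(completion_times)
    makespan_penalty = -makespan / torch.max(processing_times).clamp(min=1e-6)

    # Encourage balanced machine workloads: the negative load variance is
    # squashed with torch.exp so the component stays in (0, 1].
    load_balance = torch.exp(-temperature * torch.var(machine_loads))

    # Single scalar output: the total reward.
    return makespan_penalty + load_balance
```

Because every intermediate value is derived from the input tensors, the result stays on the same device as the inputs, which satisfies the device-consistency constraint above.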
3.4. Evaluating and Selecting Reward Functions
- Correctness verification via human-guided rules: Human feedback is initially used to establish rules for checking the correctness of the reward function. These rules define acceptable syntax, logical structure, and environment-specific constraints. Automated tools, such as Python’s ast library, are then used to verify that the generated code obeys these rules. Any reward function that fails syntactic or logical validation is excluded from further evaluation.
- Operational testing based on human-defined criteria: Reward functions that pass syntactic and logical checks are executed in the environment. We use human feedback to define correctness evaluation criteria, such as whether a reward function meets the intended goal or produces a meaningful output. At this stage, functions that cause runtime exceptions or deviations from human-defined correctness criteria are discarded.
- Objective metrics defined by human goals: In the job shop scheduling problem (JSSP), minimizing the total completion time (makespan) is the main objective. During reinforcement learning, human feedback informs the selection of key metrics, such as maximum, minimum, and average completion times, to evaluate the effectiveness of the reward function in achieving this goal.
- Reward quality: While reward metrics (such as maximum, minimum, and average rewards) are recorded, they are not directly used for ranking as they are themselves influenced by the reward function. Human feedback helps to identify this limitation and prioritize objective metrics.
- Correctness filtering: Human feedback is embedded in the correctness verification process to ensure that only valid and executable reward functions, with correctness = 0, are considered for further evaluation.
- Ranking based on human-defined goals: The reward functions are ranked according to the objective metrics defined by human goals. Functions that achieve smaller makespan values are preferred.
- Elite feedback mechanism: The top N selected elite reward functions are fed back into the LLM framework to guide subsequent iterations. This iterative refinement ensures that the new reward function is consistent with the human-defined goals and builds on previous successful designs.
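A minimal sketch of how this filtering and ranking pipeline might look in code is shown below. Here, evaluate_makespan is a hypothetical callable standing in for the operational test (training or rolling out the agent with a candidate reward function and returning the resulting makespan), and the elite count is illustrative.

```python
import ast
from typing import Callable, List

def passes_syntax_check(reward_code: str) -> bool:
    """Correctness filtering: reject candidates that are not valid Python.
    Further human-defined structural rules would be checked here as well."""
    try:
        ast.parse(reward_code)
        return True
    except SyntaxError:
        return False

def select_elites(candidates: List[str],
                  evaluate_makespan: Callable[[str], float],
                  n_elite: int = 3) -> List[str]:
    """Keep valid candidates, test them in the environment, and return the
    n_elite reward functions that achieve the smallest makespan."""
    scored = []
    for code in filter(passes_syntax_check, candidates):
        try:
            makespan = evaluate_makespan(code)   # operational test
        except Exception:
            continue                             # discard functions that fail at runtime
        scored.append((makespan, code))
    scored.sort(key=lambda pair: pair[0])        # smaller makespan is better
    return [code for _, code in scored[:n_elite]]
```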
3.5. Establishing the Human Feedback Iteration Mechanism
- If the makespan values consistently evaluate to “NaN” (not a number, an undefined or unrepresentable value), the feedback suggests rewriting the entire reward function;
- If the value of a reward component changes very little, the LLM is guided to perform the following steps:
- Adjust the scale or temperature parameters of the components;
- Rewrite or replace invalid components;
- Discard a component if it contributes little to the optimization.
- For reward components with magnitudes significantly larger than those of the other components, the LLM is instructed to rescale them to an appropriate range. These actionable insights guide the LLM in refining its approach to reward function design.
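These rules can be applied automatically before a human reviews the summary. The sketch below turns training statistics into textual hints for the next prompt; the statistic names (std, mean_abs) and the thresholds are illustrative assumptions, not values from the paper.

```python
import math
from typing import Dict, List

def build_feedback(makespans: List[float],
                   component_stats: Dict[str, Dict[str, float]]) -> List[str]:
    """Translate observed training statistics into the textual hints that are
    appended to the next LLM prompt (thresholds are illustrative)."""
    hints: List[str] = []
    if any(math.isnan(m) for m in makespans):
        return ["The makespan evaluates to NaN: rewrite the entire reward function."]

    magnitudes = sorted(s["mean_abs"] for s in component_stats.values())
    typical = magnitudes[len(magnitudes) // 2] if magnitudes else 0.0
    for name, stats in component_stats.items():
        if stats["std"] < 1e-3:
            hints.append(f"Component '{name}' barely varies: adjust its scale or "
                         f"temperature parameter, rewrite it, or discard it.")
        if typical > 0 and stats["mean_abs"] > 10 * typical:
            hints.append(f"Component '{name}' dominates the other components: rescale it.")
    return hints
```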
3.6. Designing DRL Based on KANs and LLMs
Algorithm 1. HFLLMDRL Framework
Input:
Output:
1. Initialize: task context TC, human-defined criteria HC, and human goals HG.
2. while … do:
3.    Reward generation phase (Section 3.3): prompt engineering; candidate generation; validation.
4.    Elite selection phase (Section 3.4) with HG.
5.    …
6.    DRL training phase (Section 3.6): for … do: state representation; action selection; environment transition; experience storage; value estimation; parameter update.
7.    Human feedback loop (Section 3.5): Equations (6) and (7).
8. return
Computational Complexity
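The loop of Algorithm 1 can be summarized by the following structural sketch. The three callables (generate_candidates, evaluate_makespan, build_feedback) are hypothetical stand-ins for the reward generation, selection, and human feedback phases, with the DRL training of Section 3.6 folded into the evaluation callable; the iteration count and elite size are illustrative, not the paper’s settings.

```python
from typing import Callable, List

def hfllmdrl_loop(generate_candidates: Callable[[List[str]], List[str]],
                  evaluate_makespan: Callable[[str], float],
                  build_feedback: Callable[[str], List[str]],
                  iterations: int = 5,
                  n_elite: int = 3) -> str:
    """Structural sketch of Algorithm 1 (not the authors' implementation)."""
    elites: List[str] = []
    feedback: List[str] = []
    best = ""
    for _ in range(iterations):
        # Reward generation phase (Section 3.3): the LLM proposes candidate
        # reward functions conditioned on the current elites and feedback.
        candidates = generate_candidates(elites + feedback)

        # Elite selection phase (Section 3.4): rank candidates by the makespan
        # obtained when a DRL agent is trained with them (Section 3.6).
        ranked = sorted(candidates, key=evaluate_makespan)
        if not ranked:
            continue                      # no valid candidate this round
        elites = ranked[:n_elite]
        best = elites[0]

        # Human feedback loop (Section 3.5): derive hints for the next prompt.
        feedback = build_feedback(best)
    return best
```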
4. Case Study
4.1. Experimental Setup
4.2. The Convergence of HFLLMDRL with Different Instances
4.3. The Convergence of HFLLMDRL with Different Reward Functions
4.4. Comparison with Different Selection Policies
4.5. Visualization and Performance of the Actor Network
- Node features: start time, end time, process time, and scheduling status.
- Edge features: start node, end node, weight, and arc type.
- Graph features: the critical path and the distance of the critical path.
- The ninth feature (critical path of the disjunctive graph) is represented by differing symbols in the top and bottom subgraphs. This inconsistency arises because a disjunctive graph can contain multiple critical paths, resulting in varied symbolic representations for this feature;
- The tenth feature (distance of the critical path) does not generate a symbol. This is because once the critical path is identified, the distance becomes a fixed value, leaving no variability to be symbolically represented.
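The critical path and its distance (the ninth and tenth features above) can be recovered from the disjunctive graph of a complete schedule as its longest weighted path, whose length equals the makespan. The following is a minimal sketch on a toy graph; the nodes, arcs, and weights are illustrative and are not taken from the benchmark instances.

```python
import networkx as nx

# Toy disjunctive graph after all disjunctive arcs have been oriented:
# a weighted DAG whose edge weights are processing times (illustrative values).
g = nx.DiGraph()
g.add_weighted_edges_from([
    ("S", "O11", 0), ("O11", "O12", 3), ("O12", "T", 2),
    ("S", "O21", 0), ("O21", "O12", 4), ("O21", "O22", 4), ("O22", "T", 5),
])

# The critical path is the longest weighted path from source S to sink T;
# its distance equals the makespan of the corresponding schedule.
critical_path = nx.dag_longest_path(g, weight="weight")
distance = nx.dag_longest_path_length(g, weight="weight")
print(critical_path, distance)   # ['S', 'O21', 'O22', 'T'] 9
```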
- For small-scale problems, the performance differences between the KAN and the MLP are negligible;
- For large-scale problems, the KAN exhibits a clear advantage, achieving optimal solutions with fewer iterations. This efficiency in complex scheduling problems highlights the robustness of the KAN in handling higher-dimensional feature spaces.
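As a concrete point of reference for this comparison, the following is a simplified KAN-style layer in which every input–output edge applies a learnable univariate function. Gaussian radial basis functions stand in for the B-splines of the original KAN formulation, and the layer widths in the usage example are illustrative rather than the actual actor-network configuration used in this work.

```python
import torch
import torch.nn as nn

class SimpleKANLayer(nn.Module):
    """Each input-output edge applies a learnable univariate function,
    parameterized as a linear combination of Gaussian radial basis functions
    (a simplified stand-in for the B-splines used in the original KAN)."""
    def __init__(self, in_dim: int, out_dim: int, num_basis: int = 8,
                 x_min: float = -2.0, x_max: float = 2.0):
        super().__init__()
        self.register_buffer("centers", torch.linspace(x_min, x_max, num_basis))
        self.width = (x_max - x_min) / (num_basis - 1)
        # one coefficient per (output unit, input feature, basis function)
        self.coeffs = nn.Parameter(torch.randn(out_dim, in_dim, num_basis) * 0.1)
        # residual linear term, as in common KAN variants, to stabilize training
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: (batch, in_dim)
        # evaluate every basis function at every input coordinate
        phi = torch.exp(-((x.unsqueeze(-1) - self.centers) / self.width) ** 2)
        # sum the per-edge univariate functions into each output unit
        spline_out = torch.einsum("bik,oik->bo", phi, self.coeffs)
        return spline_out + self.linear(x)

# Usage: a tiny actor over the ten node/edge/graph features described above,
# producing a distribution over four candidate actions (sizes are illustrative).
actor = nn.Sequential(SimpleKANLayer(10, 32), SimpleKANLayer(32, 4), nn.Softmax(dim=-1))
probs = actor(torch.randn(5, 10))       # five states -> five action distributions
```

Swapping SimpleKANLayer for nn.Linear plus a fixed activation recovers an MLP baseline of the kind used in the comparison above.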
4.6. Computational Performance Analysis
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Full term |
|---|---|
| LLMs | Large language models |
| JSSP | Job shop scheduling problem |
| KAN | Kolmogorov–Arnold network |
| DRL | Deep reinforcement learning |
| HFLLMDRL | Human feedback-based, large language model-assisted deep reinforcement learning framework |
Appendix A
- (1) If the makespan values are always near NaN, then you must rewrite the entire reward function.
- (2) If the values for a certain reward component are near identical throughout, then this means RL is not able to optimize this component as it is written. You may consider:
  - (a) changing its scale or the value of its temperature parameter;
  - (b) re-writing the reward component;
  - (c) discarding the reward component.
- (3) If some reward components’ magnitudes are significantly larger, then you must re-scale their values to a proper range.
References
- Xiong, H.; Shi, S.; Ren, D.; Hu, J. A Survey of Job Shop Scheduling Problem: The Types and Models. Comput. Oper. Res. 2022, 142, 105731.
- Chen, R.; Li, W.; Yang, H. A Deep Reinforcement Learning Framework Based on an Attention Mechanism and Disjunctive Graph Embedding for the Job-Shop Scheduling Problem. IEEE Trans. Ind. Inf. 2023, 19, 1322–1331.
- Wang, X.; Wang, S.; Liang, X.; Zhao, D.; Huang, J.; Xu, X.; Dai, B.; Miao, Q. Deep Reinforcement Learning: A Survey. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 5064–5078.
- Eschmann, J. Reward Function Design in Reinforcement Learning. In Reinforcement Learning Algorithms: Analysis and Applications; Belousov, B., Abdulsamad, H., Klink, P., Parisi, S., Peters, J., Eds.; Springer International Publishing: Cham, Switzerland, 2021; pp. 25–33. ISBN 978-3-030-41188-6.
- Sutton, R.S.; Barto, A.G. Reinforcement Learning, Second Edition: An Introduction; MIT Press: Cambridge, MA, USA, 2018; ISBN 978-0-262-35270-3.
- Yuan, E.; Wang, L.; Cheng, S.; Song, S.; Fan, W.; Li, Y. Solving Flexible Job Shop Scheduling Problems via Deep Reinforcement Learning. Expert Syst. Appl. 2024, 245, 123019.
- Parker-Holder, J.; Rajan, R.; Song, X.; Biedenkapp, A.; Miao, Y.; Eimer, T.; Zhang, B.; Nguyen, V.; Calandra, R.; Faust, A.; et al. Automated Reinforcement Learning (AutoRL): A Survey and Open Problems. J. Artif. Intell. Res. 2022, 74, 517–568.
- Chiang, H.-T.L.; Faust, A.; Fiser, M.; Francis, A. Learning Navigation Behaviors End-to-End with AutoRL. IEEE Rob. Autom. Lett. 2019, 4, 2007–2014.
- Wang, Y.; Pan, Y.; Yan, M.; Su, Z.; Luan, T.H. A Survey on ChatGPT: AI-Generated Contents, Challenges, and Solutions. IEEE Open J. Comput. Soc. 2023, 4, 280–302.
- Zhao, Z.; Lee, W.S.; Hsu, D. Large Language Models as Commonsense Knowledge for Large-Scale Task Planning. Adv. Neural Inf. Process. Syst. 2023, 36, 31967–31987.
- Otter, D.W.; Medina, J.R.; Kalita, J.K. A Survey of the Usages of Deep Learning for Natural Language Processing. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 604–624.
- Romera-Paredes, B.; Barekatain, M.; Novikov, A.; Balog, M.; Kumar, M.P.; Dupont, E.; Ruiz, F.J.R.; Ellenberg, J.S.; Wang, P.; Fawzi, O.; et al. Mathematical Discoveries from Program Search with Large Language Models. Nature 2024, 625, 468–475.
- Kwon, M.; Xie, S.M.; Bullard, K.; Sadigh, D. Reward Design with Language Models. arXiv 2023, arXiv:2303.00001.
- Song, W.; Chen, X.; Li, Q.; Cao, Z. Flexible Job-Shop Scheduling via Graph Neural Network and Deep Reinforcement Learning. IEEE Trans. Ind. Inf. 2023, 19, 1600–1610.
- Seifi, C.; Schulze, M.; Zimmermann, J. A New Mathematical Formulation for a Potash-Mine Shift Scheduling Problem with a Simultaneous Assignment of Machines and Workers. Eur. J. Oper. Res. 2021, 292, 27–42.
- Ozolins, A. Bounded Dynamic Programming Algorithm for the Job Shop Problem with Sequence Dependent Setup Times. Oper. Res. 2020, 20, 1701–1728.
- Gui, L.; Li, X.; Zhang, Q.; Gao, L. Domain Knowledge Used in Meta-Heuristic Algorithms for the Job-Shop Scheduling Problem: Review and Analysis. Tsinghua Sci. Technol. 2024, 29, 1368–1389.
- Syarif, A.; Pamungkas, A.; Kumar, R.; Gen, M. Performance Evaluation of Various Heuristic Algorithms to Solve Job Shop Scheduling Problem (JSSP). Int. J. Intell. Eng. Syst. 2021, 14, 334–343.
- Kotary, J.; Fioretto, F.; Hentenryck, P.V. Fast Approximations for Job Shop Scheduling: A Lagrangian Dual Deep Learning Method. Proc. AAAI Conf. Artif. Intell. 2022, 36, 7239–7246.
- Mazyavkina, N.; Sviridov, S.; Ivanov, S.; Burnaev, E. Reinforcement Learning for Combinatorial Optimization: A Survey. Comput. Oper. Res. 2021, 134, 105400.
- Kallestad, J.; Hasibi, R.; Hemmati, A.; Sörensen, K. A General Deep Reinforcement Learning Hyperheuristic Framework for Solving Combinatorial Optimization Problems. Eur. J. Oper. Res. 2023, 309, 446–468.
- Yuan, E.; Cheng, S.; Wang, L.; Song, S.; Wu, F. Solving Job Shop Scheduling Problems via Deep Reinforcement Learning. Appl. Soft Comput. 2023, 143, 110436.
- Serrano-Ruiz, J.C.; Mula, J.; Poler, R. Job Shop Smart Manufacturing Scheduling by Deep Reinforcement Learning. J. Ind. Inf. Integr. 2024, 38, 100582.
- Liu, R.; Piplani, R.; Toro, C. A Deep Multi-Agent Reinforcement Learning Approach to Solve Dynamic Job Shop Scheduling Problem. Comput. Oper. Res. 2023, 159, 106294.
- Ma, Y.J.; Liang, W.; Wang, G.; Huang, D.-A.; Bastani, O.; Jayaraman, D.; Zhu, Y.; Fan, L.; Anandkumar, A. Eureka: Human-Level Reward Design via Coding Large Language Models. arXiv 2023, arXiv:2310.12931.
- Singh, S.; Lewis, R.L.; Barto, A.G. Where Do Rewards Come From? In Proceedings of the International Symposium on AI-Inspired Biology, Leicester, UK, 29 March–1 April 2010; pp. 111–116.
- Wirth, C.; Akrour, R.; Neumann, G.; Fürnkranz, J. A Survey of Preference-Based Reinforcement Learning Methods. J. Mach. Learn. Res. 2017, 18, 4945–4990.
- Arora, S.; Doshi, P. A Survey of Inverse Reinforcement Learning: Challenges, Methods and Progress. Artif. Intell. 2021, 297, 103500.
- Cao, Y.; Zhao, H.; Cheng, Y.; Shu, T.; Liu, G.; Liang, G.; Zhao, J.; Li, Y. Survey on Large Language Model-Enhanced Reinforcement Learning: Concept, Taxonomy, and Methods. IEEE Trans. Neural Netw. Learn. Syst. 2024.
- Carta, T.; Oudeyer, P.-Y.; Sigaud, O.; Lamprier, S. EAGER: Asking and Answering Questions for Automatic Reward Shaping in Language-Guided RL. Adv. Neural Inf. Process. Syst. 2022, 35, 12478–12490.
- Hu, H.; Sadigh, D. Language Instructed Reinforcement Learning for Human-AI Coordination. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; Volume 202, pp. 13584–13598.
- Yan, Z.; Ge, J.; Wu, Y.; Li, L.; Li, T. Automatic Virtual Network Embedding: A Deep Reinforcement Learning Approach with Graph Convolutional Networks. IEEE J. Sel. Areas Commun. 2020, 38, 1040–1057.
- Li, S.E. Deep Reinforcement Learning. In Reinforcement Learning for Sequential Decision and Optimal Control; Li, S.E., Ed.; Springer Nature: Singapore, 2023; pp. 365–402. ISBN 978-981-19-7784-8.
- Li, K.; Chen, J.; Yu, D.; Dajun, T.; Qiu, X.; Lian, J.; Ji, R.; Zhang, S.; Wan, Z.; Sun, B.; et al. Deep Reinforcement Learning-Based Obstacle Avoidance for Robot Movement in Warehouse Environments. In Proceedings of the 2024 IEEE 6th International Conference on Civil Aviation Safety and Information Technology (ICCASIT), Hangzhou, China, 23–25 October 2024.
- Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljačić, M.; Hou, T.Y.; Tegmark, M. KAN: Kolmogorov-Arnold Networks. arXiv 2024, arXiv:2404.19756.
- Schmidt-Hieber, J. The Kolmogorov–Arnold Representation Theorem Revisited. Neural Netw. 2021, 137, 119–126.
- Vaca-Rubio, C.J.; Blanco, L.; Pereira, R.; Caus, M. Kolmogorov-Arnold Networks (KANs) for Time Series Analysis. arXiv 2024, arXiv:2405.08790.
- Zhang, C.; Song, W.; Cao, Z.; Zhang, J.; Tan, P.S.; Chi, X. Learning to Dispatch for Job Shop Scheduling via Deep Reinforcement Learning. Adv. Neural Inf. Process. Syst. 2020, 33, 1621–1632.
- Nasuta, A.; Kemmerling, M.; Lütticke, D.; Schmitt, R.H. Reward Shaping for Job Shop Scheduling. In Proceedings of the Machine Learning, Optimization, and Data Science, Grasmere, UK, 22–26 September 2023; Nicosia, G., Ojha, V., La Malfa, E., La Malfa, G., Pardalos, P.M., Umeton, R., Eds.; Springer Nature: Cham, Switzerland, 2024; pp. 197–211.
- Samsonov, V.; Kemmerling, M.; Paegert, M.; Lütticke, D.; Sauermann, F.; Gützlaff, A.; Schuh, G.; Meisen, T. Manufacturing Control in Job Shop Environments with Reinforcement Learning. In Proceedings of the 13th International Conference on Agents and Artificial Intelligence, Online, 4–6 February 2021; pp. 589–597.
- Tassel, P.; Kovács, B.; Gebser, M.; Schekotihin, K.; Stöckermann, P.; Seidel, G. Semiconductor Fab Scheduling with Self-Supervised and Reinforcement Learning. In Proceedings of the 2023 Winter Simulation Conference (WSC), San Antonio, TX, USA, 10–13 December 2023.
- Zhu, Y.; Zhao, D. Online Minimax Q Network Learning for Two-Player Zero-Sum Markov Games. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 1228–1241.
- Zhou, K.; Oh, S.-K.; Qiu, J.; Pedrycz, W.; Seo, K.; Yoon, J.H. Design of Hierarchical Neural Networks Using Deep LSTM and Self-Organizing Dynamical Fuzzy-Neural Network Architecture. IEEE Trans. Fuzzy Syst. 2024, 32, 2915–2929.
| Case | ft06 | ft10 | ft20 | swv19 | swv20 | ta51 | ta71 |
|---|---|---|---|---|---|---|---|
| n | 6 | 10 | 20 | 50 | 50 | 50 | 100 |
| m | 6 | 10 | 5 | 10 | 10 | 15 | 20 |
| Optimal solution | 55 | 930 | 1165 | 2843 | 2823 | 3394 | 6098 |
| Case | Reward Function | Description |
|---|---|---|
| nasuta | | : the current completion time at t. |
| zhang | | |
| tassel | | at t. m: the number of machines. |
| samsonov | | : the optimal completion time, equal to the “Optimal Solution”. |