Article

A Monte Carlo Tree Search with Reinforcement Learning and Graph Relational Attention Network for Dynamic Flexible Job Shop Scheduling Problem

by Yu Jia *, Rui Yang and Qiuyu Zhang *
School of Computer Science and Artificial Intelligence, Lanzhou University of Technology, Lanzhou 730050, China
* Authors to whom correspondence should be addressed.
Big Data Cogn. Comput. 2026, 10(1), 9; https://doi.org/10.3390/bdcc10010009
Submission received: 16 October 2025 / Revised: 17 December 2025 / Accepted: 24 December 2025 / Published: 26 December 2025
(This article belongs to the Topic Generative AI and Interdisciplinary Applications)

Abstract

The dynamic flexible job shop scheduling problem (DFJSP) with machine faults, considering the recovery condition and variable processing time, is studied to determine the rescheduling scheme when machine faults occur in real time. To address this problem, a Monte Carlo Tree Search (MCTS) algorithm with reinforcement learning and a relational-enhanced graph attention network (MGRL) is presented. The MCTS with the skip-node restart strategy, which utilizes local optimal solutions found during the Monte Carlo sampling process, is designed to enhance the optimization efficiency of MCTS in real time. A relational graph attention network (RGAT), a relational-enhanced and transformer-integrated graph network in the MGRL, is designed to analyze the scheduling disjunctive graph, guide the Monte Carlo sampling method to improve sampling efficiency, and enhance the quality of MCTS optimization decisions. Experimental results demonstrate the effectiveness of the RGAT and the skip-node restart strategy. Further application analysis results show that the MGRL performs best among all compared methods on the DFJSP.

1. Introduction

The operation of flexible job shops is impeded by numerous uncertain events, which significantly reduce production efficiency and prolong the completion time. Particularly in the Intelligent Manufacturing System for Aluminum Profiles, machine faults, considering machine recovery and variable processing time, pose a significant challenge. In practical scenarios, flexible production lines require additional time for machine replacement and maintenance following such machine faults. In addition, the machine processing time frequently varies due to wear and tear during the production of aluminum profiles. The dynamic flexible job shop scheduling problem (DFJSP) [1] with machine faults considering recovery conditions and variable processing time (MFRVT-DFJSP) therefore becomes particularly important. Problems of this type are collectively referred to as the DFJSP, which is a variation of the flexible job shop problem (FJSP) [2]. The DFJSP, the FJSP [3], and the job shop problem (JSP) [4] are NP-hard combinatorial optimization problems in the fields of computer science and operations research, presenting significant challenges.
The DFJSP rescheduling scheme must be generated in real time based on incomplete job and machine information after dynamic events [1] occur. There are three types of methods for generating DFJSP rescheduling schemes: exact methods, meta-heuristic methods, and heuristic methods. Exact methods, such as integer linear programming [5], guarantee the optimal solution. However, they cannot solve the DFJSP within a reasonable time, which makes exact algorithms impractical for such problems. Meta-heuristic methods, such as the genetic algorithm [6], particle swarm optimization [7], and tabu search algorithm [8], obtain approximate solutions in a feasible time. However, these meta-heuristic methods have difficulty generating high-quality DFJSP solutions in real time. Heuristic methods, represented by the priority dispatching rule (PDR) [9], generate the rescheduling scheme of the DFJSP in real time, but they have difficulty producing scheduling schemes that meet actual quality requirements. The Monte Carlo Tree Search algorithm (MCTS) [10] employs the Monte Carlo method to build search trees and find approximate solutions in a feasible time. The MCTS combines efficiently with reinforcement learning, providing a way to generate high-quality DFJSP rescheduling schemes in real time. Therefore, this research focuses on an MCTS algorithm framework combining relational graph attention networks (RGAT) [11] and reinforcement learning [12] (MGRL) to generate high-quality DFJSP rescheduling schemes in real time.
Obtaining high-quality scheduling knowledge is a key challenge in solving the DFJSP in real time. Scheduling knowledge is integrated into various optimization methods through different forms. For example, PDR methods rely on pre-defined scheduling knowledge by experts and employ scheduling knowledge in the form of rules. The pre-defined scheduling knowledge of experts in the form of rules does not accurately reflect the optimal mapping relationship between job processes and machines, resulting in low-quality scheduling schemes generated by PDR. Accurately analyzing the scheduling disjunctive graph using traditional Graph Neural Networks (GNNs) and acquiring high-quality suggestions for optimizing the scheduling disjunctive graph by applying scheduling knowledge are challenging tasks. This is because the optimization of the makespan objective in the scheduling disjunctive graph is a global graph task rather than a traditional local graph task. Traditional GNNs, which aggregate neighborhood nodes of the target node in the graph, are unsuitable for addressing global graph tasks, especially analyzing the scheduling disjunctive graph, due to the over-squashing [13] problem of information being compressed or distorted while passing among distant nodes. To address this problem, the MGRL employs the RGAT that integrates attention-based transformer models [14] and relational-enhanced graph encoders to analyze the complex global relationships in the scheduling disjunctive graph [15], propose suggestions for optimizing the scheduling disjunctive graph [16], and obtain high-quality scheduling schemes. In particular, the relational-enhanced graph encoder in the RGAT is used to enhance the graph structure representation and strengthen the correlation modeling between pairs of nodes in the transformer model, thereby improving the quality of suggestions for optimizing the scheduling disjunctive graph. 
Experiments show that the MGRL with RGAT is effective in improving the quality of scheduling schemes in real time.
When solving the DFJSP in real time, enhancing the search capability of MCTS by efficiently utilizing high-quality scheduling knowledge learned through reinforcement learning and improving its running efficiency are key challenges. The MCTS is adept at making high-quality next-step optimization decisions for the current scheduling scheme and solving the DFJSP. In the MCTS algorithm with a learning mechanism-assisted Monte Carlo sampling method [17], the Monte Carlo sampling process frequently wastes a lot of time and makes only one move, although the Monte Carlo sampling process improves the quality of next-step optimization decisions for the current scheduling scheme by understanding the neighborhood region. At the same time, a key observation is that the Monte Carlo sampling process frequently identifies potential solutions that are better than the current solution. How to utilize these potential solutions found during the Monte Carlo sampling process to enable the MCTS to move towards the global optimum more efficiently by skipping some nodes, while avoiding excessive reliance on local optima that could trap the algorithm in a local optimal region, is a challenging problem. To address this problem, the MGRL employs a skip-node restart strategy that allows MCTS to skip some nodes when conditions are satisfied and directly select the local optimal node to move, which significantly reduces the search times of MCTS and improves its local search ability. At the same time, the skip-node restart strategy employs a restart method [18] to prevent the MCTS from getting trapped in local optimal regions due to a decline in the quality of optimization decisions. The experiments show that the MGRL with the skip-node restart strategy is effective in improving the quality of scheduling schemes, indicating that the skip-node restart strategy significantly enhances the ability to utilize high-quality scheduling knowledge obtained through reinforcement learning.
In summary, solving the DFJSP with machine faults, considering the recovery condition and the variable processing time, is key to improving production efficiency. The focus of this paper is on the design of an efficient MCTS and the proposal of the MGRL algorithm framework for generating real-time rescheduling schemes to minimize the makespan. The main research problems and contributions are as follows:
  • To address the problem of requiring high-quality scheduling knowledge and globally analyzing the scheduling disjunctive graph, the MGRL employs the RGAT that integrates the attention-based transformer model and the relational-enhanced graph encoder.
  • To address the problem of efficiently utilizing high-quality scheduling knowledge and improving the running efficiency of MCTS by leveraging local optimal solutions found during the Monte Carlo sampling process, the MGRL employs the skip-node restart strategy to skip some nodes, directly select the optimal node so that high-quality scheduling schemes are found faster, and prevent the MGRL from becoming trapped in a local optimal region due to excessive use of local optima.
  • A transformer-integrated and constraint-enhanced RGAT is designed to analyze the scheduling disjunctive graph, guide the Monte Carlo sampling method to improve sampling efficiency, and enhance the quality of MCTS optimization decisions.
  • The relational-enhanced graph encoder in the RGAT is designed to further improve the ability of RGAT to acquire high-quality scheduling knowledge.
  • A skip-node restart strategy that utilizes local optimal solutions found during the Monte Carlo sampling process is designed to enhance the optimization efficiency of the MCTS in real time.
The remaining sections of this paper are structured as follows: Section 2 discusses the existing literature review. An overview of the DFJSP is provided in Section 3. A detailed description of the proposed method is presented in Section 4. Experimental results are offered in Section 5. Further discussions are provided in Section 6. Finally, the paper is concluded in Section 7.

2. Literature Review

2.1. MFRVT-DFJSP and MGRL

MFRVT-DFJSP Motivation. Handling machine faults [19] with recovery conditions and variable processing times [20] in real time, while generating high-quality rescheduling solutions to the DFJSP, is a complex task. The existing literature has considered a variety of heuristics to solve the DFJSP. Feng et al. [19] focus on a multi-objective DFJSP considering a machine fault. However, these methods focus on static scenarios and cannot dynamically adapt to real-time machine faults. Wang et al. [21] and Zhang et al. [20] focus on the DFJSP under a variable processing time. High-quality rescheduling under variable processing times and recovery constraints remains challenging. Thus, this paper focuses on the MFRVT-DFJSP.
DFJSP Research Background. Among the existing literature for DFJSP, the PDR is a widely employed method. In practical applications, the PDR algorithm selects appropriate scheduling rules to assign a job to a machine. Common scheduling rules include FIFO (first in, first out), SPT (shortest processing time), EDD (earliest due date), ATC (apparent tardiness cost), WINQ (work in queue), and so on. Fan et al. [9] collected and summarized 113 different rules. The application of PDR in addressing the FJSP [22], however, fails to achieve the optimal solution quality. The reason is that the knowledge embedded in the rules is not optimally suited for addressing the specific problem. Acquiring and applying such knowledge is a challenging task. Optimal rule selection requires considering the current operational status of the workshop based on the available knowledge and establishing the relationship between this status and the appropriate rule. Rami et al. [23] employed a Q-learning approach for rule selection to address the challenge of selecting optimal rules. While this method enables knowledge acquisition through reinforcement learning, it does not design novel rules tailored to specific problem characteristics, so the further improvement in solution quality is limited. PDR and Q-learning are effective in solving job shop problems because they efficiently utilize scheduling knowledge. However, obtaining high-quality scheduling knowledge is a hard issue in solving the DFJSP. Kexin et al. [24] employed an effective MCTS algorithm to minimize the makespan and obtained promising optimization results in the DFJSP. The MCTS is an optimization algorithm that combines well with neural networks [17].
Novelties of MGRL. However, traditional MCTS has limitations in handling dynamic events in real time and lacks the ability to fully utilize the high-quality scheduling knowledge obtained through reinforcement learning. How to collaboratively design MCTS with a neural network for analyzing the scheduling disjunctive graph, in order to obtain high-quality scheduling knowledge and efficiently utilize it, has become the focus of this paper. To address these limitations, the proposed MGRL framework integrates the MCTS with reinforcement learning and a relational-enhanced graph attention network, which can dynamically adapt to real-time machine faults and variable processing times, and efficiently utilize the high-quality scheduling knowledge learned through reinforcement learning to generate high-quality rescheduling solutions.

2.2. GNNs for Scheduling Disjunctive Graph and a Novel RGAT

GNNs Research Background. Achieving high-quality scheduling knowledge needs comprehensive job shop information that accurately reflects the problem characteristics. The current limitation lies in the fact that employing machine and job state metrics as the state of reinforcement learning makes it hard to acquire high-quality knowledge. Wen et al. [25] addressed the FJSP by proposing a heterogeneous graph structure to represent the disjunctive graph. They employed a graph neural network to encode this heterogeneous graph structure, capturing the intricate relationship between operations and machines, which yielded promising results. Wan et al. [26] proposed an end-to-end deep reinforcement learning method based on heterogeneous graph neural networks with meta-paths to solve the FJSP, thereby improving the accuracy and efficiency of production scheduling in intelligent manufacturing. Liu et al. [15] used a reinforcement learning approach with an integrated graph attention mechanism [27] to solve the dynamic job shop scheduling problem.
Transformer-based GNNs Research Background. Accurately analyzing the scheduling disjunctive graph using traditional GNNs such as Wen et al. [25], Liu et al. [15], Wan et al. [26], and Chien et al. [28] is hard, because the optimization of the makespan objective in the scheduling disjunctive graph is a global graph task rather than a traditional local graph task. Traditional GNNs, which aggregate neighborhood nodes of the target node in the graph, are unsuitable for addressing global graph tasks. This limitation causes an over-squashing [13] problem when analyzing scheduling disjunctive graphs: distant node information becomes compressed and distorted. The attention-based transformer model is a new method that excels at handling global tasks and aggregating information from distant nodes [29].
Novelties of the RGAT. Overall, the attention-based transformer model [14] struggles with convergence when analyzing scheduling disjunctive graphs and learning high-quality scheduling knowledge. To address this challenge, MGRL integrates the attention-based transformer models and the constraint relational-enhanced encoder into a novel framework termed RGAT. The key innovation of RGAT lies in leveraging constraint relationship information to enhance the correlation between the scheduling disjunctive graph and the improved actions of the scheduling disjunctive graph. This approach improves the pattern analysis and knowledge learning capabilities by capturing complex global relationships within the scheduling disjunctive graph. Consequently, it enhances the convergence of the attention-based transformer model. This RGAT not only improves the transformer model’s ability to analyze scheduling disjunctive graphs but also overcomes the limitations of the over-squashing [13] problem in handling global graph tasks, significantly enhancing scheduling decision quality.
Differences of the RGAT. Chien et al. [28] utilized the traditional graph attention network to represent the scheduling disjunctive graph. However, this method struggles to overcome the over-squashing problem [13]. In contrast, RGAT employs the transformer model to represent the scheduling disjunctive graph. By calculating attention scores between all node pairs and modeling their correlations, the transformer model enables distant node aggregation, thereby mitigating the over-squashing issue. Chen et al. [30] proposed a graph embedding deep reinforcement learning framework based on the transformer model and node2vec to effectively solve the job shop scheduling problem in a sequence-to-sequence manner, thereby improving the operational efficiency of manufacturing systems. However, this method has limitations. Firstly, it is unable to iteratively and continuously optimize the scheduling disjunctive graph. Secondly, it does not provide an explanatory theory to elucidate the reasons for the performance improvement of the transformer. RGAT considers utilizing constraint relationships in the scheduling disjunctive graph to enhance the capability of the transformer model and has the ability to continuously optimize scheduling solutions. Zhang et al. [31] proposed a solution to the large-scale fuzzy job shop scheduling problem based on the transformer model and an approximate policy optimization algorithm. This method focuses on reducing the complexity of the transformer model to achieve faster convergence in the scheduling disjunctive graph analysis task. However, it does not deeply integrate the problem characteristics of graph-based combinatorial optimization problems in the algorithm design. The design of RGAT focuses on addressing the issue of insufficient constraint understanding when the transformer is used to handle graph-based combinatorial optimization problems.
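To make the idea of all-pairs attention with relational information concrete, the following is a minimal sketch and not the RGAT architecture itself. It assumes, purely for illustration, that each node pair carries a relation label (for example, 0 for no constraint and 1 for a constraint arc in the disjunctive graph) and that this label adds a scalar bias to the scaled dot-product score. The function name `relation_biased_attention`, the bias table `rel_bias`, and the absence of learned projections are all assumptions of this sketch.

```python
import math
from typing import Dict, List, Tuple


def relation_biased_attention(
    x: List[List[float]],       # node features, one row per node
    rel: List[List[int]],       # rel[u][v]: hypothetical relation label of pair (u, v)
    rel_bias: Dict[int, float]  # hypothetical scalar bias per relation label
) -> Tuple[List[List[float]], List[List[float]]]:
    """Attention over ALL node pairs, with an additive relation bias.

    Returns (weights, outputs). Learned query/key/value projections are
    omitted (identity), which is a simplification of this sketch.
    """
    n, d = len(x), len(x[0])
    weights, outputs = [], []
    for u in range(n):
        # Scaled dot-product score for every pair (u, v), plus the relation bias.
        scores = [
            sum(a * b for a, b in zip(x[u], x[v])) / math.sqrt(d) + rel_bias[rel[u][v]]
            for v in range(n)
        ]
        # Numerically stable softmax over all nodes v.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        w = [e / total for e in exps]
        weights.append(w)
        # Aggregate features from every node, including distant ones.
        outputs.append([sum(w[v] * x[v][k] for v in range(n)) for k in range(d)])
    return weights, outputs
```

Because every node attends to every other node in a single step, information from distant nodes does not have to pass through many neighborhood aggregations, which is the property the text associates with mitigating over-squashing.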

2.3. Restart Method of Heuristic Algorithms and a Novel Skip-Node Restart Strategy

Restart Operation Research Background. The heuristic algorithm [32] is a class of algorithms used to solve complex optimization problems by employing empirical rules and knowledge to find approximate optimal solutions. Restart operations are important in heuristic optimization algorithms because they strengthen the search and help it escape local optima. Liu et al. [18] proposed a multi-restart iterative greedy algorithm for the multi-AGV scheduling problem in automated manufacturing factories to reduce total costs and improve scheduling efficiency. Li et al. [33] proposed a restart local search algorithm to solve the minimum k-dominating set problem, which improved the solution quality and computational efficiency. Liu et al. [18] and Li et al. [33] used iterative greedy algorithms and local search algorithms, respectively, combining the characteristics of heuristic algorithms with restart strategies to improve the solution quality and computational efficiency for application problems. Therefore, how to adjust the restart positions according to the characteristics of the algorithm and effectively utilize historical information to accelerate convergence and improve the quality of solutions is a key trend in the design of restart operations for heuristic optimization algorithms.
MCTS Research Background. The MCTS, as a heuristic algorithm, is a method that combines the Monte Carlo method to search for the optimal action. After the MCTS integrating RL was introduced by Coulom et al. [34] in 2010, the MCTS has been widely adopted in various domains such as sorting [10], gaming [35], and others. Kexin et al. [24] employed an effective MCTS algorithm to minimize the makespan in dynamic flexible job-shop scheduling problems, considering four common dynamic events: changes in the processing time of operations, the arrival of new jobs, machine faults, and job cancellations. The fundamental idea behind the MCTS is to simulate subsequent situations through the Monte Carlo sampling method and to decide the optimal next step. When solving combinatorial optimization problems, the MCTS algorithm starts from an initial solution and gradually approaches an approximate optimal solution in the solution space. In the work of Coulom et al. [34], the Monte Carlo sampling method is guided by neural networks. This neural network-based sampling method achieves importance sampling, allowing for the exploration of more desirable neighboring nodes with fewer samples.
Novelties of Skip-node Restart Strategies. The Monte Carlo sampling process combined with neural networks is the key to the success of the MCTS algorithm proposed by Coulom et al. [34] and Daniel et al. [10]. This sampling process significantly improves the decision quality of the next action. Although the neural network-based method for deciding the next action enhances the decision quality, it also significantly increases the decision time for the next action. Given the same action space design and neural network model with the same decision quality, if the MCTS can make more moves within a limited time, it has the potential to achieve better search performance within that time frame. In this skip-node restart strategy, skip-node operations that are multiple steps away from the current node enable more steps to be executed within a limited time. Moreover, due to the sampling of nodes guided by the trained neural network in the MCTS [10], these nodes are closer to the approximate optimal solution. Executing the skip-node operation on these sampled nodes allows for more movements in the solution space within a limited time, based on the guarantee of relatively high decision quality, thereby accelerating the search process. However, the skip-node operation risks converging to local optima due to its inferior decision quality relative to traditional child-node selection. To address this limitation, the restart operation is introduced to escape local optima. Consequently, the proposed skip-node restart strategy integrates three key components: skip-node operations for improved efficiency, child-node operations for high-quality decisions, and restart operations for escaping local optima.
Differences in Skip-node Restart Strategies. The skip-node restart strategy consists of the skip-node operation and the restart operation. Studies including Coulom et al. [34] and Kexin et al. [24] successfully employed MCTS to enhance the quality of the next-step decision. However, they do not consider the problem that the Monte Carlo sampling process frequently consumes a lot of time yet makes only one move. The skip-node restart strategy utilizes the skip-node operation not only to move more steps in limited time but also to maintain reasonable quality of the next-step decision. Jia et al. [36] proposed an improved reptile search algorithm, which effectively enhanced the performance and robustness in solving global optimization problems by introducing ghost antisymmetric learning and restart strategies. Song et al. [37] proposed a new competition-guided multi-neighborhood local search algorithm for solving the curriculum-based course timetabling problem. The algorithm innovates by selecting neighborhoods, determining neighborhood selection probabilities, and designing restart strategies. Jia et al. [36] and Song et al. [37] do not employ MCTS, which excels at combining with deep reinforcement learning. To our knowledge, there are few studies on MCTS with restart operations for solving combinatorial optimization problems. Additionally, Song et al. [37] do not consider designing a restart operation that restarts not only at multi-neighborhood points but also at global points. The skip-node restart strategy focuses on designing an algorithm that restarts in three kinds of regions: global points, multi-step points in neighborhoods, and single-step points in neighborhoods.
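The interplay of the three operations described above can be illustrated with a small control loop. This is a hedged sketch rather than the strategy as implemented in the MGRL: the triggering conditions are not specified in this section, so the sketch assumes that a skip-node move is taken whenever a sampled state improves on the current one, that the child-node move takes the first state of the sampled path, and that a restart is triggered after `patience` iterations without a new best solution. The names `rollout`, `restart`, and `patience` are hypothetical.

```python
from typing import Callable, List, TypeVar

S = TypeVar("S")


def skip_node_restart(
    start: S,
    cost: Callable[[S], float],          # e.g. the makespan of a scheduling scheme
    rollout: Callable[[S], List[S]],     # sampled path: [one step away, two steps away, ...]
    restart: Callable[[S], S],           # hypothetical rule producing a restart point
    iterations: int,
    patience: int,
) -> S:
    """Toy control loop combining skip-node, child-node and restart moves."""
    current = best = start
    stall = 0
    for _ in range(iterations):
        path = rollout(current)
        candidate = min(path, key=cost)
        if cost(candidate) < cost(current):
            # Skip-node move: jump directly to the best sampled state,
            # which may lie several steps away from the current node.
            current = candidate
        else:
            # Child-node move: advance by a single step only.
            current = path[0]
        if cost(current) < cost(best):
            best, stall = current, 0
        else:
            stall += 1
        if stall >= patience:
            # Restart move: leave the current region (assumed trigger).
            current, stall = restart(best), 0
    return best


# Toy usage: states are integers, the optimum is at 7.
result = skip_node_restart(
    start=0,
    cost=lambda s: abs(s - 7),
    rollout=lambda s: [s + 1, s + 2, s + 3],
    restart=lambda b: b - 3,
    iterations=6,
    patience=2,
)
```

In this toy run the optimum is reached after three skip-node moves, whereas moving one child at a time would need seven moves, which mirrors the efficiency argument made in the text.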

3. MFRVT-DFJSP Formulation

3.1. Optimization Objective

The job in a flexible job shop is composed of a sequential set of operations, where each operation exclusively requires one machine from a pool of candidates. Each operation is executed on a single machine at any given time, ensuring that each machine handles only one operation concurrently.
  • $n$: total number of jobs.
  • $m$: total number of machines.
  • $O$: the total set of operations of jobs.
  • $\Omega$: the total set of machines.
  • $i$: the machine serial number, $i = 1, 2, 3, \ldots, m$.
  • $j$: the job serial number, $j = 1, 2, 3, \ldots, n$.
  • $n_j$: the total number of operations for the job $j$.
  • $m_i$: the total number of operations for the machine $i$.
  • $\Omega_{jh}$: an optional set of processing machines for the operation $h$ of the job $j$.
  • $m_{jh}$: the number of optional processing machines for the operation $h$ of the job $j$.
  • $O_j$: the operation set of the job $j$.
  • $O_{jh}$: the operation $h$ of the job $j$.
  • $\Omega_i$: the set of the machine $i$.
  • $OM_{ik}$: the operation processed at position $k$ of the machine $i$.
  • $O_{ijh}$: the operation $h$ of the job $j$ is processed on the machine $i$.
  • $T_0$: the start time of the scheduling scheme.
  • $T_{\mathrm{end}}$: the end time of the scheduling scheme.
  • $Tm_i$: the current processing end time of the machine $i$.
  • $p_{ijh}$: the processing time of the operation $h$ of the job $j$ on machine $i$.
  • $L$: a sufficiently large positive number.
  • $C_j$: the completion time of each job.
  • $C_{\max}$: the maximum completion time.
  • $s_{jh}$: the processing start time of the operation $h$ of the job $j$.
  • $c_{jh}$: the processing completion time of the operation $h$ of the job $j$.
  • $x_{ijh} = \begin{cases} 1, & \text{if the operation } O_{jh} \text{ selects machine } i \\ 0, & \text{otherwise.} \end{cases}$
  • $y_{ijhkl} = \begin{cases} 1, & \text{if } O_{ijh} \text{ precedes } O_{ikl} \text{ in processing} \\ 0, & \text{otherwise.} \end{cases}$
With the above notations and assumptions, the FJSP whose objective is the makespan is formulated as follows.
$$f = \min\left(\max_{1 \le j \le n} (C_j)\right),$$
such that
$$s_{jh} + x_{ijh} \times p_{ijh} \le c_{jh},$$
$$c_{jh} \le s_{j(h+1)},$$
$$c_{j h_j} \le C_{\max},$$
$$s_{jh} + p_{ijh} \le s_{kl} + L \times (1 - y_{ijhkl}),$$
$$c_{jh} \le s_{j(h+1)} + L \times (1 - y_{iklj(h+1)}),$$
$$\sum_{i=1}^{m_{jh}} x_{ijh} = 1,$$
$$\sum_{j=1}^{n} \sum_{h=1}^{h_j} y_{ijhkl} = x_{ikl},$$
$$\sum_{k=1}^{n} \sum_{l=1}^{h_j} y_{ijhkl} = x_{ijh},$$
$$s_{jh} \ge 0, \quad c_{jh} \ge 0.$$
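As an illustration of how the objective and constraints can be checked for a concrete schedule, the following minimal sketch computes the completion times from the start times and the chosen machines, verifies the precedence and machine-capacity constraints, and returns $C_{\max}$. The dictionary-based encoding and the function name `evaluate_schedule` are assumptions of this sketch, not part of the formulation.

```python
from typing import Dict, List, Tuple

Op = Tuple[int, int]  # (job j, operation h); hypothetical encoding for this sketch


def evaluate_schedule(
    proc_time: Dict[Tuple[int, int, int], float],  # p_ijh keyed by (i, j, h)
    assign: Dict[Op, int],                         # machine i with x_ijh = 1
    start: Dict[Op, float],                        # s_jh
) -> float:
    """Return C_max for a feasible schedule; raise ValueError otherwise."""
    comp: Dict[Op, float] = {}
    for (j, h), i in assign.items():
        if (i, j, h) not in proc_time:
            raise ValueError(f"machine {i} is not a candidate for O_{j}{h}")
        if start[(j, h)] < 0:
            raise ValueError("start times must be non-negative")
        # c_jh = s_jh + p_ijh on the selected machine.
        comp[(j, h)] = start[(j, h)] + proc_time[(i, j, h)]

    # Precedence within a job: c_jh <= s_j(h+1).
    for (j, h), c in comp.items():
        nxt = (j, h + 1)
        if nxt in start and c > start[nxt]:
            raise ValueError(f"precedence violated in job {j}")

    # Machine capacity: operations on the same machine must not overlap.
    by_machine: Dict[int, List[Tuple[float, float]]] = {}
    for op, i in assign.items():
        by_machine.setdefault(i, []).append((start[op], comp[op]))
    for i, intervals in by_machine.items():
        intervals.sort()
        for (_, c1), (s2, _) in zip(intervals, intervals[1:]):
            if c1 > s2:
                raise ValueError(f"operations overlap on machine {i}")

    # Objective value: the maximum completion time.
    return max(comp.values())
```

The big-$L$ disjunctive constraints of the model are replaced here by a direct overlap check on each machine, which is equivalent for a schedule whose sequencing decisions are already fixed.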

3.2. Rescheduling Scheme for the Machine Fault Considering the Recovery and the Variable Processing Time

Machine faults are common and unexpected events in job shop operations. When a machine fault occurs, the initial scheduling scheme becomes infeasible, and a new rescheduling scheme is required to continue processing jobs. The recovery of the machine plays a crucial role in devising the rescheduling scheme following a machine failure. In practice, if the new rescheduling scheme does not consider the machine recovery time, executing it does not guarantee the completion of all jobs following a machine fault. Therefore, it is crucial for the rescheduling scheme to consider the machine recovery time and prioritize scheduling the recovered machine for immediate production. During the production process, the processing time of all operations is likely to change. These variations may be caused by factors such as equipment wear and tear, operator error, or environmental conditions. The variable processing time leads to a decline in the scheduling quality of the previous scheduling scheme when it is applied to the production line with the changed processing time.
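The role of the recovery time can be illustrated with a small sketch. It assumes, for illustration only, that the failed machine becomes available again at the fault time plus a known recovery duration, that other machines become available no earlier than the fault time, and that an operation is placed on the candidate machine with the earliest completion time. The function names and these rules are assumptions of the sketch, not the rescheduling procedure of the MGRL.

```python
from typing import Dict, Tuple


def machine_available_times(
    busy_until: Dict[int, float],  # current processing end time of each machine
    failed: int,                   # index of the failed machine
    t0: float,                     # fault time
    recovery: float,               # assumed known recovery duration
) -> Dict[int, float]:
    """Earliest time each machine can accept new work after the fault."""
    avail = {i: max(t, t0) for i, t in busy_until.items()}
    # The interrupted operation is dropped from the failed machine, which is
    # usable again only once the recovery has finished.
    avail[failed] = t0 + recovery
    return avail


def pick_machine(
    avail: Dict[int, float],
    proc_time: Dict[int, float],   # updated processing time per candidate machine
    job_ready: float,              # completion time of the job's previous operation
) -> Tuple[int, float, float]:
    """Choose the candidate machine with the earliest completion time."""
    best = min(proc_time, key=lambda i: (max(avail[i], job_ready) + proc_time[i], i))
    begin = max(avail[best], job_ready)
    return best, begin, begin + proc_time[best]


avail = machine_available_times({1: 10.0, 2: 4.0, 3: 12.0}, failed=1, t0=6.0, recovery=5.0)
choice = pick_machine(avail, {1: 2.0, 2: 8.0}, job_ready=6.0)
```

In this example, waiting for the recovered machine 1 (completion at 13) is better than starting immediately on machine 2 (completion at 14), which is only visible when the recovery time is part of the rescheduling decision.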

4. The MGRL for MFRVT-DFJSP

4.1. Algorithm Framework of the MGRL

The MGRL algorithm is designed to solve the MFRVT-DFJSP in real time. Subsequent to a machine failure on machine $m_i$ at time $t_0$, the reordering operation is enforced by the MGRL, thereby resuming the terminated operation and utilizing the updated processing time parameters. The operation of the MGRL is divided into an online part and an offline part. As shown in Figure 1, the online part includes five steps:
  • Extracting a real-time disjunctive graph from the flexible job shop: In the scenario of aluminum alloy profile processing, when machine $m_i$ fails at time $t_0$, a real-time disjunctive graph is extracted from the flexible job shop based on the unfinished jobs and the updated processing time parameters of the jobs.
  • Acquiring initial rescheduling scheme and building MCTS: The initial scheduling scheme is obtained based on real-time job shop information, which includes the current status of jobs and machines. This initial scheme is then transformed into a disjunctive graph representation. The disjunctive graph captures the constraints and relationships between jobs and machines, providing a structured format for further optimization.
  • Executing the MGRL: The MCTS is initialized using the disjunctive graph as the starting point. The MCTS constructs a search tree, where each node represents a complete scheduling scheme, and each edge represents an optimizing action that modifies the job sequence and machine allocation. The MCTS algorithm iteratively explores the search space by expanding nodes and evaluating potential scheduling schemes. During each iteration, the MCTS assesses the quality of different scheduling schemes. These assessments are guided by the RGAT network, which provides estimates of the makespan for each potential scheme. The MCTS uses a combination of selection, expansion, evaluation, and backpropagation operations to navigate the search tree and identify the optimal scheduling scheme.
  • Decoding the optimal rescheduling scheme as a disjunctive graph: Once the optimal scheduling scheme is identified, it is decoded back into a disjunctive graph representation. This step ensures that the optimized scheduling scheme is in a format that can be easily applied to the job shop environment.
  • Resuming the operation of the stopped job shop by applying the rescheduling scheme: The optimized rescheduling scheme is applied to the job shop, allowing the production process to resume efficiently. The application of the rescheduling scheme ensures that the job shop operates with minimal disruption, even in the presence of dynamic events such as machine failures and variable processing times.
In the offline part, the RL [38] algorithm is executed using the online collected data to train the RGAT. The MGRL employs the MCTS to iteratively solve the DFJSP. The RL employed by MGRL is the Deep Q-Network (DQN) [39], and the neural network is modified. The key steps in the offline part include the following:
  • Data collection: Data from the online phase, including states, actions, and rewards, are collected and stored in a dataset. This dataset provides the necessary information for training the RL algorithm. The specific details of the training data are presented in Section 4.3.2.
  • Training by the RL algorithm: The collected data are used to train the RL algorithm, specifically the DQN. The DQN learns to predict the expected returns of different actions in various states, enabling the MCTS to make more informed decisions during the online phase. The training process involves maximizing the reward, which is computed based on the optimization objective and obtained during the online phase.
The MGRL algorithm integrates the MCTS with the RGAT, which enhances the scheduling disjunctive graph representation through relational information and attention mechanisms. The RGAT guides the MCTS search and is continuously updated via RL training, so that it provides accurate estimates of actions and guides the MCTS towards optimal scheduling decisions. By combining the real-time optimization of the online part with the learning-based improvement of the offline part, the MGRL algorithm addresses the complexities of the DFJSP and provides robust and efficient rescheduling solutions in dynamic environments.

4.2. MCTS with Skip-Node Restart Strategies

4.2.1. Tree Structure

The MCTS with skip-node restart strategies builds a search tree representing subsequent search situations $\{s_t, a_t, \ldots, s_{t+n}\}$ and evaluates them to improve the quality of action $a_t$. Consequently, it generates high-quality scheduling schemes in real time. Each node in the search tree represents a scheduling scheme $s_t = \{Seq_{\mathrm{job}}, Seq_{\mathrm{mac}}\}$. The root node is the initial scheduling scheme $s_0$, which is generated randomly. A branch $a_t$ connects two nodes $\{s_t, s_{t+1}\}$: the starting node $s_t$ is the sub-root node, and the ending node $s_{t+1}$ is the child node. A branch represents an optimizing operation $a_t$ applied to the scheduling scheme of its sub-root node. Executing a branch swaps two positions in the job sequence and reallocates the machines of the operations at these two positions, choosing among the feasible machines according to the rule of minimizing the total processing time on the machines. As shown in Figure 2e,f, this operation aims to find a locally optimal scheduling scheme. The branches are the same as the actions of the RL in the MGRL, which are discussed in Section 4.3.1.
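The tree structure described above can be sketched as a small data class. This is a minimal illustration, not the authors' implementation: the class name `TreeNode`, its field names, and the `add_child` helper are hypothetical, and the job sequence is assumed to be stored as a list of job identifiers.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)  # identity-based equality avoids recursive parent/child comparison
class TreeNode:
    """One search-tree node: a complete scheduling scheme (Seq_job, Seq_mac)."""

    seq_job: List[int]                    # job sequence (hypothetical encoding)
    seq_mac: List[int]                    # machine allocated at each position
    parent: Optional["TreeNode"] = None   # sub-root node of the incoming branch
    action: Optional[int] = None          # branch a_t that produced this node
    children: Dict[int, "TreeNode"] = field(default_factory=dict)
    n_visit: int = 0                      # visit count used by the selection score
    reward_sum: float = 0.0               # accumulated reward from backpropagation

    @property
    def depth(self) -> int:
        """Number of branches between this node and the root."""
        d, node = 0, self
        while node.parent is not None:
            d, node = d + 1, node.parent
        return d

    def add_child(self, action: int, seq_job: List[int], seq_mac: List[int]) -> "TreeNode":
        """Attach the child node reached by executing branch `action`."""
        child = TreeNode(seq_job, seq_mac, parent=self, action=action)
        self.children[action] = child
        return child
```

Keeping the visit count and reward sum on the node lets the selection and backpropagation operations work directly on the tree.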

4.2.2. The MGRL

Algorithm 1 presents the MGRL with the skip-node restart strategy for solving the MFRVT-DFJSP. The algorithm begins by initializing its parameters: the restart node count $restartN$, the branch number $branchN$, and the branch length $branchLen$. It also initializes an initial optimal scheduling scheme $node_{global}$, a restart node pool $node_{restartpool}$, and an initial sub-root node $node_{subroot}$, which is selected from $node_{restartpool}$. The main step loop runs until the current time $time_{cur}$ exceeds the time limit $time_{end}$. Each step of the loop consists of a simulation stage, which samples neighborhood nodes, and a step stage, which moves the current node to the next node. As shown in Algorithm 1 (3) and (4), the design purposes of these two stages are as follows.
  • Simulation stage: In the simulation stage, the MGRL executes the Monte Carlo sampling process guided by the RGAT; this combination enables an accurate assessment of the quality of the next-step action and samples important nodes with fewer sampling counts. During sampling, the learning mechanism increases the probability of sampling potential solutions. The core idea of the skip-node restart strategy is to use these potential solutions to move more steps in a limited time and to improve the optimization efficiency of the selection operation. The specific details of the simulation stage are described in Section 4.2.3.
  • Step stage: The step stage determines the next action $a_t$ and the next state $s_{t+1}$. It is composed of the skip-node decision operation and the restart operation, both of which are core operations of the skip-node restart strategy. The skip-node decision operation generates $node_{child}$ and $node_{skip}$. The restart operation decides whether the next action $a_t$ should be a child-step action, which steps to $node_{child}$; a skip-step action, which steps to $node_{skip}$; or a restart action. The specific details of the restart strategy are described in Section 4.2.4.
Algorithm 1 The MGRL with the skip-node restart strategy solving DFJSP
Input: $\{restartN, branchN, branchLen\}$
Output: $X_{global}$
(1) Initialization parameters
$node_{restartpool} \leftarrow \mathrm{RandomTreeNode}(restartN)$
$node_{subroot} \leftarrow \mathrm{Choosing}(node_{restartpool})$
$node_{global} \leftarrow \mathrm{Updating}(node_{restartpool})$
(2) Start the step loop
while $time_{cur} \le time_{end}$ do
    (3) Executing the simulation stage
    $node_{local} \leftarrow node_{subroot}$
    $node_{simulationpool} \leftarrow node_{subroot}$
    for $i = 1$ to $branchN$ do
        $node_{temp} \leftarrow \mathrm{Selection}(node_{subroot})$
        $node_{simulationpool} \leftarrow \mathrm{Adding}(node_{temp})$
        $node_{simulationpool} \leftarrow \mathrm{Evaluation}(node_{simulationpool})$
        $node_{simulationpool} \leftarrow \mathrm{Expansion}(node_{simulationpool})$
        $\mathrm{Updating}(node_{simulationpool}) \leftarrow \mathrm{Backpropagation}(node_{simulationpool})$
        $node_{global} \leftarrow \mathrm{Updating}(node_{simulationpool})$
    end for
    (4) Executing the step stage
    $node_{child}, node_{skip} \leftarrow \text{Skip-node decision}(node_{subroot}, node_{simulationpool})$
    $node_{restartpool} \leftarrow \mathrm{Adding}(node_{child}, node_{skip})$
    $node_{subroot} \leftarrow \mathrm{Max}(Score_{restart})$
    $node_{restartpool} \leftarrow \mathrm{Choosing}_{random}(Score_{restart})$
end while
$X_{global} \leftarrow node_{global}$
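The control flow of Algorithm 1 can be sketched as a generic loop with pluggable stages. This is a hedged skeleton rather than the published code: the function `mgrl_loop` and all of its callable parameters are hypothetical names, the pool is trimmed by fitness as an assumed reading of the pool-maintenance step, and a lower fitness (makespan) is assumed to be better.

```python
import random
import time


def mgrl_loop(random_node, fitness, simulate, skip_node_decision,
              restart_score, restart_n, time_limit_s, rng):
    """Skeleton of the step loop; every stage is supplied by the caller."""
    # (1) Initialization: restart pool, sub-root node, best-so-far node.
    pool = [random_node(rng) for _ in range(restart_n)]
    subroot = rng.choice(pool)
    best = min(pool, key=fitness)

    # (2) Step loop: run until the time limit is exceeded.
    deadline = time.monotonic() + time_limit_s
    while time.monotonic() <= deadline:
        # (3) Simulation stage: sample neighborhood nodes from the sub-root.
        sim_pool = simulate(subroot, rng)
        best = min([best] + list(sim_pool), key=fitness)

        # (4) Step stage: child/skip candidates, then restart selection.
        child, skip = skip_node_decision(subroot, sim_pool)
        pool.extend([child, skip])
        subroot = max(pool, key=restart_score)
        # Assumption: keep the restart_n nodes with the lowest makespan.
        pool = sorted(pool, key=fitness)[:restart_n]
    return best
```

Passing the stages as callables keeps the skeleton independent of how selection, evaluation, expansion, and backpropagation are implemented inside `simulate`.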

4.2.3. The Simulation Stage in the MGRL

As shown in Algorithm 1 (3), in the simulation stage, the algorithm performs $branchN$ Monte Carlo sampling step decisions guided by the RGAT to produce $node_{skip}$ and $node_{child}$. The simulation stage of the MGRL consists of four operations: selection, evaluation, expansion, and backpropagation. These operations, depicted in Figure 2a–d, follow [10]. The key details of the four operations are as follows.
  • Selection: The selection operation begins from $node_{subroot}$ and proceeds up to a depth of $branchLen$, selecting among nodes that have not yet been created.
  • Evaluation: The probability of each child node of the current node is calculated by the RGAT network; the details of the RGAT are given in Section 4.3.3.
  • Expansion: If the selected node has not been created, the MCTS-driven MGRL creates it. Each step decision of the MCTS-driven MGRL requires expanding $branchN$ branches, as illustrated in Figure 2g. The maximum depth in a step decision from the sub-root node to child nodes is limited to $branchLen$. At each step decision, a maximum of $branchN \times branchLen$ nodes are expanded.
  • Backpropagation: The tree is updated by backtracking $Reward_{sum,i,t+1}$ from the node with the maximum fitness of the round, as shown in Equation (11).
    $$Reward_{sum,i,t+1} = Reward_{sum,i,t} + \alpha^{D(i,j)} \cdot Reward_j, \tag{11}$$
    where $D(i,j)$ is the depth between node $i$ and node $j$ on the search tree, $Reward_j$ is the value defined in the Markov process of the MGRL, and $\alpha$ is the discount factor.
For more background knowledge on the MCTS, please refer to the literature [10].
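The discounted update in Equation (11) can be illustrated with a short sketch. It assumes that the update walks from node $j$ up to the root, that node $j$ itself receives the undiscounted reward (depth gap 0), and that each visited node's visit count is incremented; the class `Node` and the function `backpropagate` are hypothetical names.

```python
class Node:
    """Minimal tree node holding the statistics used by backpropagation."""

    def __init__(self, parent=None):
        self.parent = parent
        self.n_visit = 0
        self.reward_sum = 0.0


def backpropagate(node_j, reward_j, alpha=0.9):
    """Add alpha ** D(i, j) * reward_j to every node i on the path from j to the root."""
    depth_gap = 0          # D(i, j): number of branches between node i and node j
    node_i = node_j
    while node_i is not None:
        node_i.reward_sum += (alpha ** depth_gap) * reward_j
        node_i.n_visit += 1    # assumption: one visit is recorded per update
        node_i = node_i.parent
        depth_gap += 1
```

With a discount factor below 1, ancestors far from the rewarded node receive a smaller share of the reward than its direct parent.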

4.2.4. The Step Stage in the MGRL

The step stage in the MGRL algorithm determines the next action and updates the restart node pool. As shown in Algorithm 1 (4), the skip-node decision operation first selects $node_{child}$ and $node_{skip}$. Subsequently, the algorithm updates the restart node pool by adding $node_{child}$ and $node_{skip}$. It then selects the next sub-root node using $Score_{restart}$, which is computed by a Hybrid Three-Evaluation (HTE) method that includes the fitness metric, the node reward difference metric, and the node reward mean metric. Finally, it maintains the restart pool size. The algorithm terminates when the time limit is reached and returns the optimal scheduling scheme $X_{global}$.
The skip-node decision operation. To our knowledge, the skip-node decision operation is a unique design specifically incorporated into the MGRL. It differs from the traditional MCTS, which selects only child nodes as the next nodes. The skip-node decision selects not only a $node_{child}$ that is a child node of the previous sub-root node, which corresponds to a child-step action, but also a $node_{skip}$ that is a more distant descendant of the previous sub-root node, which corresponds to a skip-step action. The $node_{child}$ is selected according to $Score_{childnode}$ in Equations (14) and (15), which is the same as in [10] in deciding the next node. The $node_{skip}$ is selected according to $Score_{skipnode}$ in Equations (12) and (13). A crucial question arises: how to design $Score_{skipnode}$ to avoid local optima while enabling more movements in the solution space. The mean reward of a node is a crucial metric for evaluating the potential of the node to serve as a sub-root node and to find near-optimal solutions through search. However, how well the mean reward reflects this search potential depends on the number of samples: the fewer samples a node has, the more difficult it is to judge its search potential as a sub-root node from its mean reward. Therefore, based on the skip-node decision operation, nodes are probabilistically selected as the starting points for the recursive selection operation to increase the sampling frequency of local optimal nodes, thereby improving the accuracy of the node reward mean as an indicator of the search potential of that node.
S c o r e s k i p n o d e = F i t n e s s R e w a r d m e a n ,
$$node_{\mathrm{skip}} = \arg\min_i \{ Score_{skipnode,i} \}, \tag{13}$$
where $Reward_{mean}$ is the average reward of the node, and $Fitness$ is the makespan value of the node. $Score_{skipnode}$ helps in deciding whether to skip for sampling. The detailed design of the next node selection operation is as follows.
$$Reward_{mean} = \frac{Reward_{sum}}{N_{visit}}, \tag{14}$$
$$Score_{childnode} = Reward_{mean} + \alpha \cdot \sqrt{\frac{\log(N_{parent\_visit})}{N_{visit}}} \cdot prob_q, \tag{15}$$
where $N_{visit}$ is the number of visits of the current node on the search tree, $Reward_{sum}$ is the total sum of the reward values of the child nodes obtained through this node, $N_{parent\_visit}$ is the number of visits to the parent node of this node, and $prob_q$ is the probability value normalized from the Q-value calculated by the DQN. $Score_{childnode}$ is computed by Equation (15) to evaluate the value of state–action pairs. The selection operation randomly selects nodes with probabilities obtained by normalizing $Score_{childnode}$.
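The child-node score and the probabilistic selection can be sketched as follows. The sketch assumes the square-root exploration term of the standard UCT form and a simple shift-to-non-negative normalization of the scores; neither detail is confirmed by the text, and the function names `reward_mean`, `score_child`, and `sample_child` are hypothetical.

```python
import math
import random


def reward_mean(reward_sum, n_visit):
    """Mean reward of a node; 0.0 for an unvisited node (assumed convention)."""
    return reward_sum / n_visit if n_visit > 0 else 0.0


def score_child(reward_sum, n_visit, n_parent_visit, prob_q, alpha=1.0):
    """Mean reward plus an exploration bonus weighted by the DQN prior prob_q.

    Assumes n_visit >= 1 and n_parent_visit >= 1, and a UCT-style square root.
    """
    explore = math.sqrt(math.log(n_parent_visit) / n_visit)
    return reward_mean(reward_sum, n_visit) + alpha * explore * prob_q


def sample_child(scores, rng=random):
    """Sample a child index with probability proportional to its shifted score."""
    low = min(scores)
    # Assumption: shift scores so that all sampling weights are positive.
    weights = [s - low + 1e-9 for s in scores]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]
```

Sampling instead of taking the maximum keeps lower-scored children reachable, which matches the random selection described in the text.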
The restart operation. How to choose the sub-root node for the next step decision from the restart node pool is a key issue. The essence of the action decision operation is to avoid the problem of falling into local optima due to frequent use of the skip-node exploration strategy. Therefore, this HTE has been designed to select the sub-root node for the next step decision from the restart node pool. Equations (16)–(19) together constitute this HTE. The three node evaluation metrics include the fitness metric in Equation (17), the node reward difference metric in Equation (18), and the node reward mean metric in Equation (19). The node reward difference metric judges the potential for continued search based on the node, according to the optimization quality from the node’s parent to the node itself. The node reward mean metric judges the potential for a continued search based on the node, according to the quality of the sampled nodes from that node. Sampled nodes from each step decision are added to the restart node pool. The restart node pool uses the fitness metric to select nodes and maintain a fixed number of nodes in the pool.
$$Score_{restart} = \{ f_{restart}^{fitness}(Node_{restartpool}),\; f_{restart}^{rewarderr}(Node_{restartpool}),\; f_{restart}^{rewardmean}(Node_{restartpool}) \}_c, \tag{16}$$
$$f_{restart}^{fitness}(Node_{restartpool}) = \frac{1}{f_{makespan}(Node_{restartpool})}, \tag{17}$$
$$f_{restart}^{rewarderr}(Node_{restartpool}) = f_{norm}\left( f_{makespan}^{parent}(Node_{restartpool}) - f_{makespan}(Node_{restartpool}) \right), \tag{18}$$
$$f_{restart}^{rewardmean}(Node_{restartpool}) = f_{norm}\left( f_{rewardmean}(Node_{restartpool}) \right), \tag{19}$$
where $f_{makespan}$ calculates the makespan of the nodes, $f_{norm}$ normalizes the reward values, and the subscript $c$ in $\{\cdot\}_c$ indicates the condition for selecting one metric from the set, which is determined by a random number drawn from a normal distribution.
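The HTE selection can be sketched as below. The text does not specify how the normally distributed random number maps to one of the three metrics, so the thresholds in `choose_metric` are purely illustrative assumptions, as are the min-max form of `f_norm`, the dictionary layout of the pool entries, and all function names.

```python
import random


def f_norm(values):
    """Min-max normalization to [0, 1] (assumed form of f_norm)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def hte_scores(pool):
    """Compute the three metrics for every node in the restart pool."""
    return {
        "fitness": [1.0 / n["makespan"] for n in pool],
        "reward_err": f_norm([n["parent_makespan"] - n["makespan"] for n in pool]),
        "reward_mean": f_norm([n["reward_mean"] for n in pool]),
    }


def choose_metric(rng):
    """Pick one metric from a normal draw; the thresholds are hypothetical."""
    z = abs(rng.gauss(0.0, 1.0))
    if z < 0.5:
        return "fitness"
    if z < 1.5:
        return "reward_err"
    return "reward_mean"


def select_subroot(pool, rng):
    """Return the pool node with the highest score under the chosen metric."""
    metric = choose_metric(rng)
    scores = hte_scores(pool)[metric]
    best = max(range(len(pool)), key=scores.__getitem__)
    return pool[best], metric


def trim_pool(pool, restart_n):
    """Keep the restart_n nodes with the smallest makespan."""
    return sorted(pool, key=lambda n: n["makespan"])[:restart_n]
```

Switching between metrics at random means that the pool is not always exploited greedily by makespan, which is the stated purpose of the restart operation.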

4.3. Reinforcement Learning of the MGRL

4.3.1. Markov Decision Process

Reinforcement learning enables an agent to learn an optimal policy by interacting with its environment. In the offline phase, the agent in the MGRL interacts with its environment and continuously adjusts its policy to maximize the rewards by using the DQN method. The specific Markov Decision Process (MDP) is defined as follows:
  • State: The state $S = \{X, A, A_{\mathrm{type}}\} = f_G(\{Seq_{\mathrm{job}}, Seq_{\mathrm{mac}}\})$ contains the scheduling disjunctive graph information, as shown in Figure 3. $Seq_{\mathrm{job}}$ denotes the job sequence, and $Seq_{\mathrm{mac}}$ denotes the machine allocation sequence. $f_G$ transforms the scheduling scheme $\{Seq_{\mathrm{job}}, Seq_{\mathrm{mac}}\}$ into the disjunctive graph $G = (V, E)$, with a set of nodes $V = \{v_1, \ldots, v_N\}$ and a set of constraint relational edges $E \subseteq V \times V$. The node information includes the job number, operation number, allocated machine number, executing number on the allocated machine, and executing time on the allocated machine. The node features are organized in a compact matrix $X \in \mathbb{R}^{N \times D}$, with each row representing the feature vector of one node. The relational information comprises job-constrained relationships $e_{i,j} \in A_{\mathrm{job}}$ and machine-constrained relationships $e_{i,j} \in A_{\mathrm{mac}}$. $A_{\mathrm{job}} \in \mathbb{R}^{N \times N}$ is the adjacency matrix in which $e_{ij} = 1$ if there is a job-constrained relationship from node $i$ to node $j$, and 0 otherwise; $A_{\mathrm{mac}} \in \mathbb{R}^{N \times N}$ is the adjacency matrix in which $e_{ij} = 1$ if there is a machine-constrained relationship from node $i$ to node $j$, and 0 otherwise. Machine-constrained relationships $A_{\mathrm{mac}}$ represent sequence constraints between connected nodes on a single machine, while job-constrained relationships $A_{\mathrm{job}}$ indicate the execution sequence between connected nodes.
  • Action: The action alters the job sequence and machine assignments in the scheduling scheme, as illustrated in Figure 2e. Specifically, the action swaps two adjacent operations at a specific position in the job sequence and reallocates the execution machines for these two operations from their respective sets of feasible machines. Since the job sequence adjustment only changes the order of adjacent operations, it does not generate an infeasible job sequence [40]. The machine adjustment only selects from the feasible machines of each operation, so it does not generate an infeasible machine assignment. Therefore, the action does not produce an infeasible solution. Let $S_t = G(S_{\mathrm{job}}, S_{\mathrm{mac}})$ denote that the job sequence $S_{\mathrm{job}}$ and machine allocation sequence $S_{\mathrm{mac}}$ are transformed into the disjunctive graph by $G(\cdot)$. The action includes changing the job ordering $f_{\mathrm{js}}(\cdot)$ and changing the machine assignment $f_{\mathrm{ms}}(\cdot)$, as shown in Equation (20).
    $$S_{t+1} = f_{\mathrm{ms}}(f_{\mathrm{js}}(S_t, a_t)), \tag{20}$$
    $$O_{j_1 h_1}, O_{j_2 h_2}, S_t = f_{\mathrm{js}}(S_t, a_t) = f_{\mathrm{js}}(f_G(\{Seq_{\mathrm{job}}, Seq_{\mathrm{mac}}\}_t), a_t) = f_{\mathrm{swap}}(Seq_{\mathrm{job}}, a_t), \tag{21}$$
    $$O_{j_1 h_1}, O_{j_2 h_2}, S_t = f_{\mathrm{js}}(S_t, a_t), \tag{22}$$
    $$S_{t+1} = f_{\mathrm{ms}}(O_{j_1 h_1}, O_{j_2 h_2}, S_t) = f_{\mathrm{mac\text{-}ass}}(f_{\mathrm{mac\text{-}ass}}(S_t, O_{j_1 h_1}), O_{j_2 h_2}), \tag{23}$$
    $$S_{O_{ijh}} = f_{\mathrm{mac\text{-}ass}}(S_t, O_{jh}) = \arg\min_{i} \left\{ p_i = \sum_{j=1}^{n} \sum_{h=1}^{n_j} p_{ijh} \right\}, \quad i \in \Omega_{jh}, \tag{24}$$
    where $a_t$ denotes the index in the job sequence $S_{\mathrm{job}}$, $f_{\mathrm{js}}(S_t, a_t)$ denotes that the job sequence $S_{\mathrm{job}}$ is changed by swapping the two adjacent operations at $a_t$, $f_{\mathrm{ms}}(\cdot)$ reallocates the machines of the operations $O_{j_1 h_1}$ and $O_{j_2 h_2}$ in $S_t$, and $f_{\mathrm{mac\text{-}ass}}(\cdot)$ selects, for operation $O_{jh}$, the available machine in $\Omega_{jh}$ whose processing time $p_i$ is the smallest.
  • Reward: The MGRL employs the reward function of Liu et al. [28], which is defined as the difference between the completion time of the previous state and that of the current state.
    $$\mathrm{Reward}_{t+1} = \mathrm{fitness}(S_t) - \mathrm{fitness}(S_{t+1}), \tag{25}$$
    where $\mathrm{fitness}(\cdot)$ denotes computing the makespan.
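The action and reward defined above can be sketched on a toy instance. The sketch assumes an operation-based job sequence (the $k$-th occurrence of a job identifier denotes its $k$-th operation) and reads the machine rule as "choose the feasible machine with the smallest total load including the operation itself"; both are assumptions, and the names `decode_ops`, `makespan`, `apply_action`, and `reward` are hypothetical.

```python
def decode_ops(seq_job):
    """Map an operation-based job sequence to (job, operation index) pairs."""
    seen, ops = {}, []
    for j in seq_job:
        h = seen.get(j, 0)
        ops.append((j, h))
        seen[j] = h + 1
    return ops


def makespan(seq_job, mac, proc):
    """Completion time of the schedule; proc[(j, h)][m] is a processing time."""
    job_ready, mac_ready = {}, {}
    for j, h in decode_ops(seq_job):
        m = mac[(j, h)]
        start = max(job_ready.get(j, 0), mac_ready.get(m, 0))
        job_ready[j] = mac_ready[m] = start + proc[(j, h)][m]
    return max(job_ready.values())


def machine_load(mac, proc, skip):
    """Total processing time per machine, ignoring operation `skip`."""
    load = {}
    for op, m in mac.items():
        if op != skip:
            load[m] = load.get(m, 0) + proc[op][m]
    return load


def apply_action(seq_job, mac, proc, a_t):
    """Swap positions a_t and a_t + 1, then reassign machines for both operations."""
    new_seq = list(seq_job)
    new_seq[a_t], new_seq[a_t + 1] = new_seq[a_t + 1], new_seq[a_t]  # job ordering change
    new_mac = dict(mac)
    ops = decode_ops(new_seq)
    for op in (ops[a_t], ops[a_t + 1]):                              # machine assignment change
        load = machine_load(new_mac, proc, skip=op)
        # Only feasible machines (keys of proc[op]) are candidates.
        new_mac[op] = min(proc[op], key=lambda m: (load.get(m, 0) + proc[op][m], m))
    return new_seq, new_mac


def reward(prev, nxt, proc):
    """Makespan of the previous state minus makespan of the next state."""
    return makespan(*prev, proc) - makespan(*nxt, proc)
```

Because the candidate machines are taken from the feasible set of each operation and only adjacent positions are swapped, every state produced by `apply_action` remains a feasible schedule, in line with the argument in the text.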

4.3.2. The Training Process in the Offline

During the offline training phase, the agent in the MGRL algorithm learns an optimal policy by interacting with the environment, which is the job shop scenario with machine faults, considering the recovery condition and variable processing time. Specifically, the agent first initializes the policy parameters $\theta$ and randomly generates a batch of initial scheduling schemes for the MFRVT-DFJSP, $\{Seq_{job}, Seq_{mac}\} \in Dataset_{seq}$, which are transformed into the initial disjunctive graph state representations $S_0 = f_G(\{Seq_{\mathrm{job}}, Seq_{\mathrm{mac}}\}) \in Dataset_G$ for reinforcement learning. Subsequently, based on the current policy $\pi(a \mid s; \theta)$, the agent selects a series of actions $a_t$, which adjust the job sequence and machine allocation sequence to optimize the scheduling disjunctive graph. After executing each action $a_t$, the agent observes the new state $S_{t+1} = f_{\mathrm{ms}}(f_{\mathrm{js}}(S_t, a_t))$ and the corresponding reward $R_{t+1} = \mathrm{fitness}(S_t) - \mathrm{fitness}(S_{t+1})$, where $\mathrm{fitness}(\cdot)$ computes the makespan of the scheduling scheme. The agent stores these state transitions $(S_t, a_t, R_{t+1}, S_{t+1})$ in the experience replay buffer $\mathcal{D}$. Once a sufficient amount of experience has been accumulated, the agent uses it to update the policy parameters with the DQN method. The DQN randomly samples a mini-batch of experiences $\{(S_i, a_i, R_{i+1}, S_{i+1})\}$ from the experience replay buffer $\mathcal{D}$ to construct the training dataset. The network parameters $\theta$ are optimized using backpropagation so that the Q-value network better predicts the expected returns of each action in different states, thereby gradually approaching the optimal policy. This process is repeated until the Q-value network converges, meaning that the agent can stably generate high-quality scheduling schemes in the job shop environment and maximize the cumulative reward. This completes the offline training process and provides an effective Q-value network for solving the dynamic flexible job shop scheduling problem online.
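The replay-and-update cycle described above can be sketched with a small replay buffer and the standard DQN regression target. The text does not state the target formula or a discount factor, so `td_targets` uses the textbook form $r + \gamma \max_{a'} Q(s', a')$ as an assumption; `ReplayBuffer`, `td_targets`, and the `q_values` callable are hypothetical names.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (S_t, a_t, R_{t+1}, S_{t+1}) transitions."""

    def __init__(self, capacity, seed=0):
        self.data = deque(maxlen=capacity)   # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state):
        self.data.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Uniformly sample a mini-batch without replacement."""
        k = min(batch_size, len(self.data))
        return self.rng.sample(list(self.data), k)


def td_targets(batch, q_values, gamma=0.99):
    """Standard DQN targets; q_values(state) returns one Q-value per action."""
    return [r + gamma * max(q_values(s_next)) for (_, _, r, s_next) in batch]
```

In a full training loop, the Q-value network would be fitted to these targets for the sampled state-action pairs, and the process would repeat until the network converges.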

4.3.3. The Relational-Enhanced Graph Attention Network

The core concept of our RGAT is to integrate the constraint relational information with the strength of transformer models to handle global graph tasks more effectively. This constraint relational information, which enhances the ability to model the correlations between all pairs of nodes, is implemented in the relational-enhanced encoder of the MGRL. The relational-enhanced encoder $f_{\mathrm{relational\text{-}enhanced}}(\cdot)$ improves the capability of the transformer model to analyze scheduling disjunctive graphs and addresses the over-squashing issue. By enhancing the representation of structural characteristics in the scheduling disjunctive graph, the RGAT provides high-quality scheduling decisions. The relational-enhanced encoder is the core design that leverages constraint relationships to enhance the transformer's global graph task analysis capabilities. The architecture of the RGAT is shown in Equation (26).
$$\mathrm{Prob}_{a_t} = f_{\mathrm{RGAT}}(S_t) = f_{\mathrm{value}}\Big( f_{\mathrm{transformer}}\big( f_{\mathrm{cat}}\big( f_{\mathrm{node}}(X_t),\; f_{\mathrm{relational\text{-}enhanced}}(X_t, A, A_{\mathrm{type}}) \big) \big) \Big), \tag{26}$$
where $\mathrm{Prob}_{a_t}$ denotes the probability distribution of performing an action; $f_{\mathrm{RGAT}}(\cdot)$ denotes the neural network, which consists of the relational-enhanced encoder $f_{\mathrm{relational\text{-}enhanced}}(\cdot)$ and the node encoder $f_{\mathrm{node}}(\cdot)$; $f_{\mathrm{cat}}(\cdot)$ integrates the outputs of the relational-enhanced encoder and the node encoder by combining concatenation, layer normalization, and the ReLU activation function; $f_{\mathrm{transformer}}(\cdot)$ denotes the transformer encoding model, which contains residual connections, layer normalization, and ReLU activation functions; and $f_{\mathrm{value}}$ denotes the value network, which is a fully connected neural network. Specifically, the node encoder $f_{\mathrm{node}}(\cdot)$ embeds the attributes of each node into a vector space as follows:
$$h_i^{(0)} = E_{\mathrm{job}}(X_{i,0}) + E_{\mathrm{op}}(X_{i,1}) + E_{\mathrm{mac}}(X_{i,2}) + E_{\mathrm{exec}}(X_{i,3}) + E_{\mathrm{time}}(X_{i,4}) + E_{\mathrm{node}}(X_{i,5}), \tag{27}$$
where $E$ denotes the embedding layer for different attributes, and $h_i^{(0)}$ represents the initial embedding of node $i$.
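The node encoder can be illustrated as a sum of per-attribute embedding lookups. The sketch uses plain Python lists instead of a deep learning framework and assumes that every attribute, including the executing time, is an integer index into its own table; `make_embedding` and `encode_nodes` are hypothetical names.

```python
import random


def make_embedding(num_rows, dim, rng):
    """Randomly initialized embedding table with num_rows vectors of size dim."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]


def encode_nodes(X, tables):
    """Sum the embeddings of all attribute columns for each node.

    X[i][k] is the integer value of attribute k of node i, and tables[k]
    is the embedding table of attribute k (job, op, mac, exec, time, node).
    """
    dim = len(tables[0][0])
    encoded = []
    for row in X:
        h = [0.0] * dim
        for k, value in enumerate(row):
            h = [a + b for a, b in zip(h, tables[k][value])]
        encoded.append(h)
    return encoded
```

Summing the attribute embeddings keeps the node representation at a fixed width regardless of how many attributes describe a node.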
The relational-enhanced encoder $f_{\mathrm{relational\text{-}enhanced}}(\cdot)$ performs convolution operations on the different types of relational information separately. The convolution results are quantized to integers, and the integer values are embedded to form the output of the relational-enhanced encoder. This approach encodes the structural characteristics of the scheduling disjunctive graph.
$$X_{\mathrm{relational\text{-}enhanced}} = f_{\mathrm{relational\text{-}enhanced}}(X_t, A, A_{\mathrm{type}}) = \mathrm{Embedding}\big(\mathrm{Long}(A_{\mathrm{job}} \cdot X)\big) + \mathrm{Embedding}\big(\mathrm{Long}(A_{\mathrm{mac}} \cdot X)\big), \tag{28}$$
where $\mathrm{Embedding}$ denotes the vector embedding operation, $A_{\mathrm{job}}$ represents the job-type relationships in the scheduling disjunctive graph, $A_{\mathrm{mac}}$ represents the machine-type relationships, $X$ represents the node attributes, and $\mathrm{Long}(\cdot)$ denotes conversion to integers.
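The relational-enhanced encoder can be sketched in plain Python as follows. The construction of $A_{\mathrm{job}}$ and $A_{\mathrm{mac}}$ from a scheduled operation order, the shared embedding table, the modulo used to keep integers inside the table, and the summation over attribute columns are all assumptions made for illustration; `adjacency`, `matmul`, and `relational_enhanced` are hypothetical names.

```python
def adjacency(ops, mac):
    """Build A_job and A_mac from operations listed in scheduled order.

    ops[i] is a (job, operation index) pair and mac[i] its machine. An edge
    links each operation to the next one of the same job or on the same machine.
    """
    n = len(ops)
    a_job = [[0] * n for _ in range(n)]
    a_mac = [[0] * n for _ in range(n)]
    last_job, last_mac = {}, {}
    for i, ((j, _), m) in enumerate(zip(ops, mac)):
        if j in last_job:
            a_job[last_job[j]][i] = 1
        if m in last_mac:
            a_mac[last_mac[m]][i] = 1
        last_job[j], last_mac[m] = i, i
    return a_job, a_mac


def matmul(a, x):
    """Dense product A · X on nested lists."""
    return [[sum(a[i][k] * x[k][d] for k in range(len(x))) for d in range(len(x[0]))]
            for i in range(len(a))]


def relational_enhanced(x, a_job, a_mac, table):
    """Embed the integer-valued aggregates of both relation types and add them."""
    out = []
    for row_job, row_mac in zip(matmul(a_job, x), matmul(a_mac, x)):
        h = [0.0] * len(table[0])
        for v in list(row_job) + list(row_mac):
            e = table[int(v) % len(table)]   # Long(.) then lookup; modulo is an assumption
            h = [p + q for p, q in zip(h, e)]
        out.append(h)
    return out
```

Aggregating along job edges and machine edges separately keeps the two constraint types distinguishable before their embeddings are added.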

5. Experimental Results and Analysis

5.1. Experimental Setup

Computer experiments and an analysis of the results were conducted to evaluate the performance of the proposed MGRL approach for solving Flexible Job Shop Problems. This section analyzes (1) the RGAT convergence, (2) the MGRL hyperparameters, (3) the effectiveness of the skip-node restart strategy, and (4) simulation experiments on aluminum profile manufacturing systems. In the comparison experiment, the MGRL was compared against four well-known families of scheduling scheme generation algorithms: rule-based, Q-learning-based, MCTS-based, and meta-heuristics-based algorithms. All algorithms were implemented in Python 3.10 (Torch 2.3.0 + cu121) and executed on a machine equipped with an Intel i5-10400F 2.90 GHz CPU and an Nvidia RTX 3060 GPU running the Ubuntu 22.04 64-bit OS. The code is available at https://gitee.com/polar-power/mgrl.git (accessed on 23 December 2025).

5.2. Metric

Evaluation metrics are employed to evaluate the efficiency difference between our algorithm and others.
  • Gap: The Gap evaluates the performance of a scheduling algorithm as the relative distance between the scheduling solution generated by the algorithm and the best solution. The smaller the Gap, the closer the algorithm is to the best solution. As shown in Equation (29), $AM_{min}$ is the minimum average makespan, and $AM_k$ is one of the values of the average makespan.
    $$gap = \frac{AM_k - AM_{min}}{AM_{min}} \tag{29}$$
  • Win Rate (WR): The WR is defined as the ratio of the number of evaluations in which the scheduling solution generated by the algorithm wins to the total number of evaluations. A higher WR indicates a better performance of the algorithm. As shown in Equation (30), Number of wins represents the number of evaluations in which the scheduling scheme generated by the algorithm outperforms the others, and Total number of evaluations represents the overall number of evaluations.
    $$\mathrm{WR} = \frac{\text{Number of wins}}{\text{Total number of evaluations}} \tag{30}$$
  • Average Reward: The average reward is defined as the mean of the rewards of the scheduling schemes output by the RGAT network over a batch. A higher average reward indicates a better performance of the algorithm, as shown in Equation (31).
    $$\text{Average Reward} = \frac{\sum_{i=1}^{BatchSize} reward_i}{\mathrm{BatchSize}} \tag{31}$$
  • Effective Optimization Ratio: The effective optimization ratio is defined as the ratio of the number of scheduling schemes whose makespan becomes smaller after applying the action with the highest probability output by the RGAT network to the number of scheduling schemes input to the RGAT network. A higher effective optimization ratio indicates a better performance of the algorithm. As shown in Equation (32), $N_{smaller}$ represents the number of scheduling schemes whose makespan becomes smaller after applying the action with the highest probability output by the RGAT network, and BatchSize represents the total number of scheduling schemes input to the RGAT network.
    $$\text{Effective optimization ratio} = \frac{N_{smaller}}{\mathrm{BatchSize}} \tag{32}$$
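The four metrics above translate directly into short functions. This is a minimal sketch; the function names and the choice to pass makespans before and after the action as two lists are illustrative assumptions.

```python
def gap(am_k, am_min):
    """Relative distance of an average makespan from the minimum average makespan."""
    return (am_k - am_min) / am_min


def win_rate(number_of_wins, total_evaluations):
    """Share of evaluations in which the algorithm wins."""
    return number_of_wins / total_evaluations


def average_reward(rewards):
    """Mean reward over a batch of scheduling schemes."""
    return sum(rewards) / len(rewards)


def effective_optimization_ratio(makespans_before, makespans_after):
    """Share of schemes whose makespan decreases after the top-probability action."""
    n_smaller = sum(1 for before, after in zip(makespans_before, makespans_after)
                    if after < before)
    return n_smaller / len(makespans_before)
```

For example, an average makespan of 110 against a minimum of 100 gives a Gap of 0.1, and two improved schemes out of a batch of four give an effective optimization ratio of 0.5.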

5.3. Baseline Algorithm

Three types of algorithms were employed in the comparison experiment to assess the performance of the MGRL, including the rule-based heuristic algorithm, the learning-based heuristic algorithm, and the meta-heuristic algorithm. These algorithms are listed as follows:
  • The rule-based heuristic algorithm. Generating scheduling schemes based on scheduling rules is a classical heuristic method. In this experiment, SPT (the job with the shortest processing time) and LPT (the job with the longest processing time) are employed. The machine allocation scheduling rules include the MFF (the machine that finishes fastest).
  • The GNN-RL algorithm: The method of Liu [15] is a recent state-of-the-art approach that solves the dynamic job shop scheduling problem by combining RL and GNNs. The MFF is employed as its machine allocation rule in this paper. It is selected as a comparison algorithm because it addresses job shop scheduling with machine failures and random job arrivals using GNN and RL methods, which are also used in the MGRL.
  • The MCTS Algorithm: The experiment chooses the MCTS algorithm as the comparison algorithm, because the MGRL algorithm employs the MCTS algorithm to integrate the evaluation network. Kexin et al. [24] apply MCTS to solve the dynamic job-shop scheduling problem with four kinds of dynamic events.
  • The Genetic Algorithm (GA): The GA is one of the most representative meta-heuristics for solving the DFJSP. The mutation rate and crossover rate are key parameters determining solution quality in genetic algorithms. Guo et al. [40] proposed a method for dynamically adjusting genetic operators to further enhance the convergence performance and search capability of genetic algorithms. The population size of the GA is set to 100.
  • The Memetic Algorithm (MA): Li et al. [3] proposed a learning-based reference vector MA, a meta-heuristic, for the flexible job shop scheduling problem with fuzzy processing times to minimize the makespan. The MA is chosen as a comparison algorithm because the MGRL solves the DFJSP with variable processing time events. The population size of the MA is set to 100.
  • The Double Deep Q-Network Algorithm (DDQN): The double deep Q-network algorithm is one of the classic RL methods. Renke et al. [41] proposed a method using the double deep Q-network algorithm to solve dynamic flexible job-shop scheduling problems, achieving real-time scheduling decisions and improving the learning efficiency and scheduling effectiveness. To validate the performance of MGRL, this algorithm was selected as a comparison algorithm.
  • The Transformer-based Reinforcement Learning Algorithm (Transformer-RL): Chen et al. [30] proposed an end-to-end deep reinforcement learning framework based on transformer and graph embedding to solve job-shop scheduling problems. Since the MGRL employs the transformer model, this algorithm was chosen as a comparison algorithm.
  • The Hyper-Heuristic Algorithm (HA): Lara et al. [42] proposed a hyper-heuristic reinforcement learning algorithm for solving job-shop scheduling problems. This algorithm was selected as a comparison algorithm.

5.4. Convergence Analysis

To verify the convergence of the RGAT integrated into the MGRL, an experiment analyzing the convergence of the RGAT training process was designed and is presented in this section. The experimental parameters are shown in Table 1. The experiment was repeated five times for embedding dimensions 8, 16, and 32. The training convergence process is shown in Figure 4 and Figure 5. Figure 4 shows the average reward during training, and Figure 5 shows the effective optimization ratio based on the makespan. Both curves show an increasing trend followed by stabilization. The average reward and the effective optimization ratio at termination are higher than their initial levels, indicating that the RGAT has converged after 4000 iterations.

5.5. Hyperparameter Analysis of the MGRL

To verify the effectiveness of the MGRL hyperparameter settings in subsequent experiments and provide references for readers to set MGRL hyperparameters according to actual needs, experiments were designed for the MCTS hyperparameters of the MGRL integrated with RGAT and the embedding dimension hyperparameters of the RGAT. The experiment employed problem sizes of 40 × 20, 50 × 25, and 70 × 35, with 20 instances for each size. Under the condition of a 60 s runtime, 50 time runs were executed to calculate the average time, verifying the reasonable values of the embedding dimensions, b r a n c h N , and b r a n c h L e n hyperparameters. As shown in Figure 6, the heatmap has the horizontal axis representing the b r a n c h L e n values and the vertical axis representing the b r a n c h N values, with the values indicating the win rate of 60 instances of different problem sizes. The sum of the values on the heat map is not 1, because the same instance shows the same makespan value after calculation with different values of b r a n c h L e n and b r a n c h N . Since the smaller b r a n c h L e n and b r a n c h N have less computing time for one iteration, the smaller b r a n c h L e n and b r a n c h N should be selected when the win rate is close. The optimal combination is b r a n c h N 30 and b r a n c h L e n 5; so, subsequent experiments are based on this parameter combination. Readers can repeat the experiments according to the scale of factory instances to determine the hyperparameters of the MGRL algorithm. The experimental results recorded in Table 2 are generated when the MGRL with integrated embedding dimension 32 and the MGRL with integrated embedding dimension 32 are executed in problem sizes of 40 × 20, 50 × 25, and 70 × 35. Since in Table 2, the MGRL with an integrated embedding dimension of 32 has the highest win rate, the embedding dimension parameter is selected as 32. The upper limit of r e s t a r t N is set to a fixed value of 50. 
The optimal hyperparameters are shown in Table 3.
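The selection rule described in this subsection (prefer the smaller settings when win rates are close) can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name, the `tolerance` threshold that stands in for "close", the cost proxy, and the grid values are all assumptions.

```python
from typing import Dict, Tuple


def select_branch_params(
    win_rates: Dict[Tuple[int, int], float],
    tolerance: float = 0.02,  # assumed meaning of "close"; not stated in the paper
) -> Tuple[int, int]:
    """Return the cheapest (branchN, branchLen) pair whose win rate is
    within `tolerance` of the best win rate in the grid."""
    if not win_rates:
        raise ValueError("win_rates must not be empty")
    best = max(win_rates.values())
    candidates = [combo for combo, wr in win_rates.items() if best - wr <= tolerance]
    # Cost proxy (assumption): one iteration samples branchN branches of
    # length branchLen, so their product approximates the work per iteration.
    return min(candidates, key=lambda c: (c[0] * c[1], c[0], c[1]))


# Illustrative numbers only; these are not the values from Figure 6.
example_grid = {
    (30, 5): 0.41,
    (30, 10): 0.42,
    (50, 5): 0.39,
    (50, 10): 0.35,
}
chosen = select_branch_params(example_grid)
print(chosen)
```

With these illustrative numbers, the pair (30, 10) has the highest win rate, but (30, 5) is within the tolerance and cheaper per iteration, so it is the one returned.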

5.6. Effectiveness Analysis of the Skip-Node Restart Strategy and RGAT in the MGRL

To analyze the effectiveness of the key components integrated in the MGRL, ablation experiments were conducted focusing on the skip-node restart strategy and the RGAT. The specific ablation plans are detailed as follows:
  • Ablation-RSNN: To verify the effectiveness of the skip-node restart strategy, the Ablation-RSNN experiment was designed by removing the skip-node restart strategy from the model.
  • Ablation-TRM: The RGAT integrates a transformer and a relational-enhanced encoder to analyze the scheduling disjunctive graph and provide high-quality decisions for optimization. To verify the effectiveness of the transformer within RGAT, the Ablation-TRM experiment was designed by removing the transformer component.
  • Ablation-RHE: To further verify the effectiveness of the relational-enhanced encoder within RGAT, the Ablation-RHE experiment was designed by removing this specific encoder component.
The experimental parameter settings are shown in Table 3. For each of the problem sizes 40 × 20, 50 × 25, and 70 × 35, 20 instances were used. Each instance generated 100 dynamic events, and each dynamic event was executed 50 times to obtain the mean value. The runtime was set to 5, 30, and 60 s, respectively. The win rate is calculated as the ratio of the number of times the MGRL outperforms its ablation versions to the total number of evaluations, as shown in Equation (30).
In the heatmap of Figure 7, the horizontal axis gives the algorithm runtime, the vertical axis gives the ablation algorithms, and each value is the win rate of the MGRL over the 60 instances of different scales. A win rate above 0.5 indicates that the MGRL is superior to the corresponding ablation version. Reading Figure 7 from left to right, the win rate of the MGRL rises as the runtime increases, which indicates the effectiveness of the skip-node restart strategy and the RGAT in the MGRL.
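As a small illustration of the win rate described here, the sketch below counts the evaluations in which the MGRL obtains a strictly smaller makespan than a comparison method. The function name and the sample makespans are hypothetical, and treating ties as non-wins is an assumption rather than something taken from Equation (30).

```python
from typing import Sequence


def win_rate(mgrl_makespans: Sequence[float], other_makespans: Sequence[float]) -> float:
    """Fraction of paired evaluations in which the MGRL makespan is strictly smaller."""
    if len(mgrl_makespans) != len(other_makespans):
        raise ValueError("both sequences must hold one value per evaluation")
    if not mgrl_makespans:
        raise ValueError("at least one evaluation is required")
    # Ties are counted as non-wins (assumption).
    wins = sum(1 for a, b in zip(mgrl_makespans, other_makespans) if a < b)
    return wins / len(mgrl_makespans)


# Hypothetical mean makespans for four evaluations.
mgrl = [100.0, 95.0, 110.0, 90.0]
ablation = [105.0, 95.0, 108.0, 97.0]
print(win_rate(mgrl, ablation))  # 0.5
```

In this example the MGRL wins the first and last evaluations, ties the second, and loses the third, so the win rate is 2/4.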
To further validate the statistical significance of these results, paired-samples t-tests were conducted with the significance level set to α = 0.05. Table 4 presents the results of the paired-samples t-tests, including the mean difference, standard deviation, standard error mean, and p-value. For Ablation-RHE, the mean difference is 0.73276 with a standard deviation of 4.32676. The 95% confidence interval ranges from 0.27658 to 1.18894, and the t-test statistic is 3.159 with a p-value of 0.02, suggesting a statistically significant difference. For Ablation-TRM, the mean difference is 0.93391 with a standard deviation of 4.17536. The confidence interval ranges from 0.49369 to 1.37413, and the t-test statistic is 4.173 with a p-value of 0.01, indicating a highly significant difference. For Ablation-RSNN, the mean difference is 1.52874 with a standard deviation of 4.8148. The confidence interval ranges from 0.50256 to 2.55491, and the t-test statistic is 2.962 with a p-value of 0.04, which is also statistically significant. Overall, these results suggest that removing any of the three components significantly changes the outcome relative to the full MGRL.
To analyze the effectiveness and rationality of the node encoder design, ablation experiments were conducted by removing each of the five components of the encoder. In Figure 8, Ablation-None represents the MGRL using the node encoder without any component ablation. Ablation-JobId, Ablation-OpId, Ablation-MacId, Ablation-ExecId, and Ablation-Time indicate the ablation of the components encoding the JobId, OpId, MacId, ExecId, and processing time information in the node encoder, respectively. For each of the problem sizes 40 × 20, 50 × 25, and 70 × 35, 20 instances were used. Each instance generated 100 dynamic events, and each dynamic event was executed 50 times to obtain the mean value. The runtime was set to 5, 30, and 60 s, respectively. The win rate is calculated as the ratio of the number of times the MGRL outperforms its ablation versions to the total number of evaluations, as shown in Equation (30). As shown in Figure 8, the win rate of Ablation-None is the highest at 5 s, 30 s, and 60 s. This result demonstrates the effectiveness of each component in the node encoder of the MGRL.
To analyze the effectiveness of the skip-node restart strategy, ablation experiments were conducted by removing each of its two operations. Ablation-None represents the skip-node restart strategy without any component ablation. Ablation-SK indicates the skip-node restart strategy without the skip-node operation. Ablation-RS indicates the skip-node restart strategy without the restart operation. For each of the problem sizes 40 × 20, 50 × 25, and 70 × 35, 20 instances were used. Each instance generated 100 dynamic events, and each dynamic event was executed 50 times to obtain the mean value. The runtime was set to 5, 30, and 60 s, respectively. As shown in Figure 9, the win rate of Ablation-None is the highest at 5 s, 30 s, and 60 s. This result demonstrates the effectiveness of the skip-node operation and the restart operation in the MGRL.
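For readers who want to reproduce the kind of quantities reported in a paired-samples t-test table, the following sketch computes the mean difference, the standard deviation of the differences, the standard error, and the t statistic with the Python standard library. The function name and the toy makespans are hypothetical and are not the data behind Table 4.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float, float]:
    """Return (mean difference, std of differences, standard error, t) for paired samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    diffs = [a - b for a, b in zip(x, y)]
    mean_diff = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    se = sd / math.sqrt(len(diffs))
    return mean_diff, sd, se, mean_diff / se


# Toy makespans (hypothetical): ablation version vs. full method on five events.
ablation_runs = [102.0, 98.0, 110.0, 105.0, 99.0]
full_runs = [100.0, 97.0, 108.0, 101.0, 99.0]
mean_diff, sd, se, t = paired_t_statistic(ablation_runs, full_runs)
print(round(mean_diff, 3), round(sd, 3), round(se, 3), round(t, 3))
```

The p-value additionally requires the t distribution with n − 1 degrees of freedom, which the standard library does not provide; `scipy.stats.ttest_rel` is one option if SciPy is available.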

5.7. Simulation Experiment for MGRL in the Intelligent Manufacturing System for Aluminum Profiles

To verify the efficiency of the MGRL, a comparative experiment was conducted in a simulation environment for aluminum profile processing that accounts for machine faults, machine recovery, and variable processing time. As shown in Figure 10, which is simplified from a real aluminum profile processing job shop in Northwest China, aluminum profile processing involves three stages: pressure casting, cooling, and post-processing. The post-processing stage executes a different number of operations depending on the design of each aluminum profile. The scheduling process aims to optimize the production sequence and resource allocation across these stages to enhance efficiency and quality. The settings for the aluminum profile processing simulation environment are as follows:
  • Instance: Hyperparameters for generating instances are shown in Table 5 and Table 6. The simulation environment is primarily based on the three main stages of simplified aluminum profile processing, namely pressure casting, cooling, and post-processing. Table 6 displays the machine types and processing times for the post-processing stage. Based on Table 6, the operations for the post-processing stage of each job in the instance, along with the feasible machines and processing times, are determined.
  • Machine fault events: In the actual production process of the workshop, the probability of machine failure is influenced by the busy time of each machine (BTM). Machines with higher usage time are more prone to failure. The machine fault formula proposed in the literature [43] was adopted in this study. The probability $p_k$ of machine $m_k$ failing is approximated using Equation (33). The stop time $\tau_k$ at which a machine fault occurs is determined by Equation (34).
    $$p_k = \frac{MBT_k}{MBT_{tot}}, \tag{33}$$
    $$\tau_k = [\alpha_1 MBT_k, \alpha_2 MBT_k], \tag{34}$$
    where $MBT_k$ is the busy time of machine $k$, $MBT_{tot}$ is the total busy time of all machines, and $\alpha_1$ and $\alpha_2$ are coefficients ranging from 0 to 1.
  • Machine recovery events: In this experiment, the recovery time follows a uniform distribution, as shown in Equation (35). The lower bound $a$ of the recovery time is the smallest time unit, 1, and the upper bound $b$ is the maximum machine processing time.
    $$\text{Recovery Time} \sim \text{Uniform}(a, b) \tag{35}$$
  • Variable processing time: In aluminum profile machining, the machine processing time changes due to wear or break-in, and a fluctuation of 10% in machine processing time is considered normal. The fluctuated processing time $p'_{ijh}$ is defined as follows.
    $$p'_{ijh} = p_{ijh} \times (1 + \epsilon_{ijh}),$$
    where $p_{ijh}$ is the original processing time, $p'_{ijh}$ is the fluctuated processing time, and $\epsilon_{ijh}$ is the rate of change in processing time, which follows a uniform distribution on the interval $[-0.1, 0.1]$. The time unit of $p_{ijh}$ is minutes.
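The three event models listed above can be sketched in a few lines of Python. This is a hedged illustration and not the authors' simulator: all function names and sample values are hypothetical, the fault probabilities are treated as a categorical distribution over machines, and the stop time is assumed to be drawn uniformly from the interval in Equation (34).

```python
import random
from typing import Dict


def sample_faulty_machine(busy_time: Dict[int, float], rng: random.Random) -> int:
    """Pick a machine k with probability p_k = MBT_k / MBT_tot (Equation (33))."""
    total = sum(busy_time.values())
    if total <= 0:
        raise ValueError("total busy time must be positive")
    machines = list(busy_time)
    weights = [busy_time[k] / total for k in machines]
    return rng.choices(machines, weights=weights, k=1)[0]


def sample_stop_time(mbt_k: float, alpha1: float, alpha2: float, rng: random.Random) -> float:
    """Draw the stop time from [alpha1 * MBT_k, alpha2 * MBT_k].
    A uniform draw inside the interval is an assumption."""
    low, high = sorted((alpha1 * mbt_k, alpha2 * mbt_k))
    return rng.uniform(low, high)


def sample_recovery_time(max_processing_time: float, rng: random.Random) -> float:
    """Uniform(a, b) with a = 1 (smallest time unit) and b = maximum processing time."""
    if max_processing_time < 1.0:
        raise ValueError("maximum processing time must be at least 1")
    return rng.uniform(1.0, max_processing_time)


def fluctuate_processing_time(p: float, rng: random.Random, max_rate: float = 0.1) -> float:
    """Apply p' = p * (1 + eps) with eps uniform on [-max_rate, max_rate]."""
    eps = rng.uniform(-max_rate, max_rate)
    return p * (1.0 + eps)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
busy_time = {0: 120.0, 1: 300.0, 2: 180.0}  # hypothetical busy times per machine
k = sample_faulty_machine(busy_time, rng)
tau = sample_stop_time(busy_time[k], 0.3, 0.9, rng)
recovery = sample_recovery_time(45.0, rng)
new_p = fluctuate_processing_time(20.0, rng)
print(k, round(tau, 2), round(recovery, 2), round(new_p, 2))
```

Passing an explicit `random.Random` instance keeps each generated event sequence reproducible, which matters when the same dynamic event has to be replayed for several algorithms.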
For the instance scales of 50 × 25 and 80 × 40, there are 20 instances per scale; each instance generates 100 specific dynamic events, and each specific dynamic event is executed 50 times to obtain the mean value. The running times of the algorithms are set to 5, 30, and 60 s, respectively. The comparison algorithms are the meta-heuristic algorithm MA, the latest graph-based reinforcement learning algorithm GNN-RL, the classic PDR, and the recent MCTS for solving the DFJSP. The experimental results are shown in Table 7, and their distribution is shown in Figure 11 and Figure 12. Table 7 shows that the MGRL ranks first in win rate under different problem scales and CPU time constraints. In Figure 11, under the 30 s and 60 s CPU time constraints, the MGRL maintains a lower median makespan than the other algorithms. This demonstrates its ability to use the available computational time efficiently to find better solutions.
To further analyze the performance of the MGRL, it was compared with three SOTA algorithms: DDQN, Transformer-RL, and HA. Figure 13 illustrates the results of these comparison experiments, in which the MGRL and the comparison algorithms were executed on the simulation benchmark. In Figure 13, the horizontal axis represents the problem size, and the vertical axis denotes the win rate (WR). A higher WR value indicates that the algorithm's performance is closer to the optimal solution. Figure 13 presents the complete results for CPU times of 5, 30, and 60 s. As depicted in Figure 13, the MGRL outperforms the other three algorithms on the 100 × 50 problem size when running for 60 s on the simulation benchmark. Although Transformer-RL surpasses the other three algorithms on the 20 × 10 problem size, the MGRL outperforms Transformer-RL on the larger 100 × 50 problem size when executed for 60 s.

6. Further Discussion

6.1. Discussion About Generalization for MGRL

The paper considers machine failures and recoveries as well as processing time variations, because changes in the number of available machines and in machine processing times underlie many common dynamic events in scheduling. The paper focuses on verifying the real-time optimization ability of the MGRL through classic and complex dynamic events, but the MGRL is not limited to these two types of dynamic events. To verify the generalization and real-time optimization ability of the MGRL, comparative experiments were conducted on the classic FJSP. The experimental parameter settings are shown in Table 3. The running times are set to 5, 30, and 60 s, respectively. The comparative experiments were carried out on the public benchmark MK and the simulation benchmark. The MK benchmark comes from Brandimarte [44] and includes the instances MK01 to MK10. The simulation benchmark includes the problem sizes 40 × 20, 50 × 25, and 70 × 35, with 20 instances for each size. Each instance generated 100 dynamic events, and each dynamic event was executed 50 times to obtain the mean value. The hyperparameters for the simulation dataset are shown in Table 5.
Figure 14 shows the results of the MGRL and the comparison algorithms executed on the public benchmark MK. In Figure 14, the horizontal axis is the name of the benchmark instance, MK01 to MK10, and the vertical axis is the Gap value. The smaller the Gap, the closer the performance of the algorithm is to the optimal solution. Figure 15 shows the results of the MGRL and the comparison algorithms executed on the simulation benchmark. In Figure 15, the horizontal axis is the problem size, and the vertical axis is the WR. The higher the WR, the closer the performance of the algorithm is to the optimal solution. Figure 14 and Figure 15 list all the results for CPU times of 5, 30, and 60 s.
As shown in Figure 14, the MGRL outperforms the MCTS algorithm in 10 out of 10 instances when running for 60 s on the public benchmark, which again supports the effectiveness of the improvements made by the MGRL. Since the maximum number of workpieces in the MK benchmark is 20, it is a small-scale dataset on which the ability of the neural-network-based MGRL to accelerate optimization with high-quality decisions is difficult to demonstrate. As shown in Figure 15, the MGRL outperforms all baseline algorithms in four out of five problem sizes when running for 60 s on the simulation benchmark, whose instances are larger than those of the MK benchmark. A single optimization iteration of a GA population takes less time than a single optimization iteration of the MGRL. Nevertheless, the MGRL achieves better optimization results than the GA within 60 s, because the Monte Carlo sampling method-based MGRL provides higher-quality optimization decisions than the GA. Thus, under reasonably scaled instances and within real-time constraints, the fact that the MGRL outperforms all comparison algorithms on the classical FJSP indicates that its improvement in optimization capability carries over to the FJSP.
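The Gap metric is defined elsewhere in the paper. As a rough illustration only, the sketch below assumes the commonly used relative-gap form, in which the makespan found by an algorithm is compared with a reference (for example, best-known) makespan. The function name, this formula, and the numbers are assumptions and are not taken from the MK benchmark results.

```python
def relative_gap(makespan: float, reference_makespan: float) -> float:
    """Relative gap in percent between a found makespan and a reference value.
    Assumed form: (C - C_ref) / C_ref * 100."""
    if reference_makespan <= 0:
        raise ValueError("reference makespan must be positive")
    return (makespan - reference_makespan) / reference_makespan * 100.0


# Hypothetical values: a found makespan of 50 against a reference of 40.
print(relative_gap(50.0, 40.0))  # 25.0
```

Under this assumed definition, a gap of 0 means the algorithm matched the reference makespan, which is consistent with the statement that smaller Gap values are better.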
Although the MGRL has demonstrated outstanding performance in solving the DFJSP, considering machine breakdowns and processing time variability, as well as the classic FJSP, its potential extends far beyond these applications. The framework of the MGRL, which combines MCTS with a skip-node restart strategy and RL, along with the Graph Attention Network design based on constraint relation enhancement and the transformer model, is not dependent on specific dynamic events. This makes it easily adaptable to DFJSPs, considering a wider range of dynamic events, thereby endowing the MGRL with strong adaptability and generalization ability. This capability enables the MGRL not only to efficiently handle the specific dynamic events involved in the current study but also to have the potential to be applied to more FJSP-derived problems and DFJSPs with more dynamic events.
Specifically, the results achieved by the MGRL in solving the FJSP and the DFJSP, considering machine breakdowns and processing time variability, have proven its efficiency and robustness in optimizing scheduling solutions. These achievements indicate that the MGRL can effectively utilize the high-quality scheduling knowledge obtained through reinforcement learning, conduct global analysis of complex scheduling problems via the relation-enhanced Graph Attention Network, and efficiently search for high-quality solutions in large-scale solution spaces through the Monte Carlo Tree Search. These characteristics enable the MGRL to quickly adapt and generate high-quality scheduling solutions when faced with other types of FJSP-derived problems, such as Distributed Flexible Job Shop Scheduling Problems, and different types of dynamic events, such as emergency order insertion.
Moreover, the application analysis of the MGRL in real-world industrial scenarios further validates its industrial value in complex dynamic environments. This demonstrates that the MGRL not only excels in theoretical research but also has the capability to solve complex scheduling problems in actual production settings. Therefore, the MGRL is expected to be applied to a broader range of FJSP-derived problems in future research, providing new ideas and methods for solving complex scheduling problems in actual production and further promoting the development of intelligent scheduling technologies.

6.2. Discussion About Device Difference

Generating high-quality rescheduling solutions in real time and solving the DFJSP are the core focuses of this study when dynamic events occur. In this paper, the upper limit for real-time computation is set at 60 s. Moreover, in Section 5, experiments validated the performance of MGRL within this time limit. To ensure fairness in the experiments, all comparative and ablation experiments in Section 5 were conducted on the same computing platform, which is the Intel I5-10400F and Nvidia RTX 3060/12 GB. This platform is referred to as I3060 in Table 8. However, the impact of different platforms on the execution of MGRL under the constraint of this time limit is not clear. Therefore, the following discussion focuses on the influence of different devices on the execution of the MGRL.
Based on 100 cases of size 40 × 20 and 100 cases of size 100 × 50 generated in Section 5.7, I3060 was compared with four common platforms from cloud service providers. The details of these four computing devices are shown in Table 8. The four devices were selected from the computing devices with an hourly cost of less than CNY 2 offered by AutoDL, one of the most popular cloud computing providers in China. This price point is suitable for online deployment by industrial enterprises. In particular, E5-3060 was chosen because its CPU differs from that of I3060, which illustrates the impact of CPU variations on the execution of the MGRL. In each of these 200 cases, 10 dynamic events were simulated, and each dynamic event was executed 100 times on each of the five devices. The number of MGRL iterations within the 60 s time limit was then recorded for I3060 and the other four devices. Finally, paired-samples t-tests were conducted to assess the significance of the differences between I3060 and the other four devices. The experimental results are shown in Table 9.
In Table 9, each row represents one set of experiments, for a total of four sets. The Device Name column lists the computing platforms involved in each set. The Mean column records the mean number of MGRL iterations within the 60 s time limit. The Std. Error Mean column records the standard error of the mean for the number of MGRL iterations. The 95% Confidence Interval column records the 95% confidence interval of the difference in the number of MGRL iterations. The Sig. (2-tailed) column records the significance level for the number of MGRL iterations. Judged against the 0.05 threshold, the significance levels indicate that there is no significant difference among the four sets of experiments. This result further demonstrates that the MGRL-related experimental data in Section 5 are practically meaningful for low-cost computing devices.

6.3. Discussion About Simulation Environment Difference

In the experiments designed to validate the algorithm’s performance, a simulation environment based on aluminum profile processing scenarios was employed. Discussing the extent to which this simulation environment replicates real-world conditions is crucial for further analyzing the algorithm’s applicability. Generating FJSP instances that conform to aluminum profile production scenarios is the main contribution of this simulation environment. To demonstrate that the generated FJSP instances can reasonably simulate real-world FJSP instances, statistical tests based on job-related metrics were conducted.
Data from a factory in a province in Northwest China, including 312 types of aluminum profile jobs and the processing quantities of each job, were used as the basis for FJSP instances of the real-world aluminum profile processing environment. The processing machines for these real-world job data are all listed in Table 6. The statistical experiments involved randomly sampling instances of 80 × 40, 100 × 50, and 200 × 100 problem sizes from the real-world job data and statistically comparing them with instances of the same problem sizes generated by the simulation environment. When sampling instances of these sizes from the real-world job data, the experiments used duplicate-type machines to make up for the insufficient number of machines. The job-related statistical metrics of an instance are the average maximum processing time per job $T_{max}$, the average minimum processing time per job $T_{min}$, and the average maximum number of operations per job $X_{operation}$.
The maximum processing time per job, the minimum processing time per job, and the remaining number of operations per job are key metrics for PDR-based job scheduling decisions, which also makes them crucial factors in the design of scheduling schemes. The metrics $T_{max}$, $T_{min}$, and $X_{operation}$ of an instance are calculated as the means of these three commonly used PDR metrics over all jobs within the instance. The specific calculation formulas for these metrics are as follows:
  • The average maximum processing time per job within an instance $T_{max}$:
    $$T_{max} = \frac{1}{n} \sum_{j=1}^{n} \sum_{h=1}^{n_j} \max(\Omega_{jh}).$$
    In this formula, $\Omega_{jh}$ represents the optional set of processing machines for operation $h$ of job $j$, and $\max(\Omega_{jh})$ indicates the maximum processing time within the optional set of processing machines.
  • The average minimum processing time per job within an instance $T_{min}$:
    $$T_{min} = \frac{1}{n} \sum_{j=1}^{n} \sum_{h=1}^{n_j} \min(\Omega_{jh}).$$
  • The average maximum number of operations per job within an instance $X_{operation}$:
    $$X_{operation} = \frac{1}{n} \sum_{j=1}^{n} n_j.$$
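The three job metrics can be computed directly from an instance description. The sketch below assumes that each metric sums the per-operation maximum (or minimum) processing time over all operations of a job and then averages over the n jobs, and that an instance is stored as a list of jobs whose operations map machine ids to processing times. The data layout, names, and sample values are hypothetical.

```python
from typing import Dict, List, Tuple

Operation = Dict[int, float]  # machine id -> processing time (the optional set for one operation)
Job = List[Operation]
Instance = List[Job]


def job_metrics(instance: Instance) -> Tuple[float, float, float]:
    """Return (T_max, T_min, X_operation) for one instance."""
    n = len(instance)
    if n == 0:
        raise ValueError("instance must contain at least one job")
    # Sum the slowest (fastest) machine choice over every operation, then average over jobs.
    t_max = sum(max(op.values()) for job in instance for op in job) / n
    t_min = sum(min(op.values()) for job in instance for op in job) / n
    # Average number of operations per job.
    x_operation = sum(len(job) for job in instance) / n
    return t_max, t_min, x_operation


# Hypothetical two-job instance.
toy_instance: Instance = [
    [{0: 3.0, 1: 5.0}, {1: 2.0}],
    [{0: 4.0}, {0: 6.0, 2: 1.0}, {2: 2.0, 1: 3.0}],
]
print(job_metrics(toy_instance))  # (10.0, 6.0, 2.5)
```

For the toy instance, the per-operation maxima sum to 20 and the minima to 12 across two jobs with five operations in total, which gives the printed values.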
Specifically, the simulation environment difference analysis consists of independent-samples t-tests between instances of 80 × 40, 100 × 50, and 200 × 100 problem sizes randomly sampled from the real-world job data mentioned above and instances of the same problem sizes generated by the simulation environment. For each problem size, 300 instances were generated. The independent-samples t-tests were performed under the assumption of equal variances. The results of these tests are shown in Table 10. They indicate that there are no significant differences in the classic job metrics $T_{max}$, $T_{min}$, and $X_{operation}$ between instances generated by the simulation environment and instances sampled from real-world scenarios. This conclusion further demonstrates that the instances generated by the simulation environment reasonably mimic those sampled from real-world scenarios. Conducting experiments and scheduling research based on instances generated by this simulation environment therefore holds practical significance.
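An independent-samples t-test under the equal-variance assumption uses a pooled variance estimate. The following standard-library sketch shows how the t statistic and degrees of freedom are obtained; the function name and the sample values are hypothetical and unrelated to Table 10.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def pooled_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees of freedom) for an equal-variance two-sample t-test."""
    n1, n2 = len(a), len(b)
    if n1 < 2 or n2 < 2:
        raise ValueError("each sample needs at least two observations")
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    se = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    if se == 0:
        raise ValueError("both samples are constant; t is undefined")
    return (mean(a) - mean(b)) / se, df


# Hypothetical metric values for simulated and sampled instances.
simulated = [10.0, 12.0, 11.0, 13.0]
sampled = [11.0, 12.0, 12.0, 14.0]
t_value, dof = pooled_t_statistic(simulated, sampled)
print(round(t_value, 4), dof)
```

The p-value then follows from the t distribution with the returned degrees of freedom; if SciPy is available, `scipy.stats.ttest_ind` with its default `equal_var=True` performs the same test.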

6.4. Limitations

Despite the performance improvements demonstrated by the MGRL in solving the DFJSP in real time, there are several limitations in the MGRL. First, MGRL’s performance on extremely large-scale problems requires further optimization. The key reason the MGRL outperforms the baseline algorithms is its superior decision-making efficiency. This means that the algorithm can make higher-quality optimization decisions within a limited time frame. The primary means by which the MGRL enhances the quality of optimization decisions is through the use of deep learning methods. Larger-scale instances pose greater challenges to the model’s convergence and online computation time. Second, the current implementation of MGRL focuses primarily on minimizing the makespan. Extending the optimization objective to consider multiple objectives, such as energy consumption, tardiness, and machine utilization, is an important direction for broader applicability. Additionally, the acquisition and efficient utilization of high-quality scheduling knowledge still need to be further addressed.
To address these limitations and further enhance the generalizability of the MGRL, several future research directions are proposed. First, the MCTS with GNN and RL should be optimized for extremely large-scale and distributed scheduling problems by exploring more efficient search strategies and parallel computing techniques. Second, the optimization objective should be extended so that the MGRL can address a wider range of practical scheduling scenarios, such as the cascading scheduling of a flow shop and a flexible job shop. Finally, future work will continue to explore efficient ways of integrating optimization algorithms with neural networks to acquire and utilize learned knowledge, thereby improving search capabilities.

7. Conclusions and Future Work

The MGRL algorithm, integrating MCTS with RL and the relational-enhanced graph attention network, has been proposed to address the DFJSP with machine faults, recovery conditions, and variable processing times. The RGAT enhances the representation of the scheduling disjunctive graph by incorporating constraint relational information and transformer models, thereby improving the quality of scheduling decisions. The skip-node restart strategy effectively utilizes solutions found during learning-based sampling to accelerate convergence towards the global optimum while avoiding local traps. Extensive experiments on public benchmarks and simulation datasets of aluminum profile processing scenarios demonstrate that the MGRL outperforms traditional rule-based, meta-heuristic, RL algorithms, and MCTS-based algorithms in terms of the win rate metric based on the makespan optimization objective when the execution time is set to 60 s and the problem size is larger than 50 × 25. However, further enhancement is required for the performance of the MGRL on extremely large-scale problems and its ability to handle multiple objectives. Future work will focus on optimizing the MGRL for the cascading scheduling problem of a flow shop and a flexible job shop, extending the optimization objectives to include energy consumption and machine utilization, and exploring more efficient integration methods between optimization algorithms and neural networks to improve real-time decision-making and optimization capabilities.

Author Contributions

Conceptualization, Y.J.; methodology, Y.J.; software, Y.J.; validation, Q.Z.; Formal analysis, Y.J. and R.Y.; Investigation, Y.J.; Resources, R.Y.; writing—original draft preparation, Y.J.; writing—review and editing, R.Y. and Q.Z.; Visualization, Y.J.; Supervision, Q.Z.; Project administration, Y.J., R.Y. and Q.Z.; Funding acquisition, Q.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 61862041 and in part by the Natural Science Foundation of Gansu Province under Grant 21JR7RA120.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available in MGRL at https://gitee.com/polar-power/mgrl.git (accessed on 23 December 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yang, X.; Han, Y.; Wang, Y.; Li, H.; Gao, K.; Liu, Y. From fluid relaxations to double deep Q-network for Dynamic Multiplicity Flexible Job-Shop Scheduling. Appl. Soft Comput. 2025, 177, 113231. [Google Scholar] [CrossRef]
  2. Gao, K.; Cao, Z.; Zhang, L.; Chen, Z.; Han, Y.; Pan, Q. A Review on Swarm Intelligence and Evolutionary Algorithms for Solving Flexible Job Shop Scheduling Problems. IEEE-CAA J. Autom. Sin. 2019, 6, 904–916. [Google Scholar] [CrossRef]
  3. Li, R.; Gong, W.; Lu, C.; Wang, L. A Learning-Based Memetic Algorithm for Energy-Efficient Flexible Job-Shop Scheduling With Type-2 Fuzzy Processing Time. IEEE Trans. Evol. Comput. 2023, 27, 610–620. [Google Scholar] [CrossRef]
  4. Zhang, J.; Ding, G.; Zou, Y.; Qin, S.; Fu, J. Review of job shop scheduling research and its new perspectives under Industry 4.0. J. Intell. Manuf. 2019, 30, 1809–1830. [Google Scholar] [CrossRef]
  5. Fatemi-Anaraki, S.; Tavakkoli-Moghaddam, R.; Foumani, M.; Vahedi-Nouri, B. Scheduling of Multi-Robot Job Shop Systems in Dynamic Environments: Mixed-Integer Linear Programming and Constraint Programming Approaches. Omega 2023, 115, 102770. [Google Scholar] [CrossRef]
  6. Jazmin Escamilla-Serna, N.; Carlos Seck-Tuoh-Mora, J.; Medina-Marin, J.; Barragan-Vite, I.; Ramon Corona-Armenta, J. A Hybrid Search Using Genetic Algorithms and Random-Restart Hill-Climbing for Flexible Job Shop Scheduling Instances with High Flexibility. Appl. Sci. 2022, 12, 8050. [Google Scholar] [CrossRef]
  7. Ding, H.; Gu, X. Improved particle swarm optimization algorithm based novel encoding and decoding schemes for flexible job shop scheduling problem. Comput. Oper. Res. 2020, 121, 106751. [Google Scholar] [CrossRef]
  8. Niroumandrad, N.; Lahrichi, N.; Lodi, A. Learning tabu search algorithms: A scheduling application. Comput. Oper. Res. 2024, 170, 106751. [Google Scholar] [CrossRef]
  9. Hua-li, F.; He-gen, X.; Guo-zhang, J.; Gong-fa, L. Survey of the Selection and Evaluation for Dispatching Rules in Dynamic Job Shop Scheduling Problem. In Proceedings of the 2015 Chinese Automation Congress (CAC), Wuhan, China, 27–29 November 2015; pp. 1926–1931. [Google Scholar]
  10. Mankowitz, D.J.; Michi, A.; Zhernov, A. Faster sorting algorithms discovered using deep reinforcement learning. Nature 2023, 618, 257–263. [Google Scholar] [CrossRef]
  11. Jiang, W.; Luo, J. Graph neural network for traffic forecasting: A survey. Expert Syst. Appl. 2022, 207, 117921. [Google Scholar] [CrossRef]
  12. Pal, C.V.; Leon, F. A Brief Survey of Model-Based Reinforcement Learning Techniques. In Proceedings of the 2020 24Th International Conference on System Theory, Control and Computing (ICSTCC), Sinaia, Romania, 8–10 October 2020; pp. 92–97. [Google Scholar]
  13. Giraldo, J.H.; Skianis, K.; Bouwmans, T.; Malliaros, F.D. On the Trade-off between Over-smoothing and Over-squashing in Deep Graph Neural Networks. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, CIKM 2023, Birmingham, UK, 15–21 October 2023; ACM Special Interest Grp Informat Retrieval; Association for Computing Machinery, ACM SIGWEB: New York, NY, USA, 2023; pp. 566–576. [Google Scholar] [CrossRef]
  14. Wang, J.; Yang, Y.; Yang, H.; Lian, C.; Xu, Z.; Sun, J. MD-GraphFormer: A Model-Driven Graph Transformer for Fast Multi-Contrast MR Imaging. IEEE Trans. Comput. Imaging 2023, 9, 1018–1030.
  15. Liu, C.L.; Tseng, C.J.; Weng, P.H. Dynamic Job-Shop Scheduling via Graph Attention Networks and Deep Reinforcement Learning. IEEE Trans. Ind. Inform. 2024, 20, 8662–8672.
  16. Zhang, C.; Cao, Z.; Song, W.; Wu, Y.; Zhang, J. Deep reinforcement learning guided improvement heuristic for job shop scheduling. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024.
  17. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489.
  18. Liu, Z.J.; Sang, H.Y.; Zheng, C.Z.; Chi, H.; Gao, K.Z.; Han, Y.Y. An Effective Multi-Restart Iterated Greedy Algorithm for Multi-Agvs Dispatching Problem in the Matrix Manufacturing Workshop. Expert Syst. Appl. 2024, 252, 124223.
  19. Feng, Y.; Lin, Y.; Yang, Z.; Xu, Y.; Li, D.; Li, X.; Yang, D. A Two-Stage Individual Feedback NSGA-III for Dynamic Many-Objective Flexible Job Shop Scheduling Problem. IEEE Trans. Autom. Sci. Eng. 2024, 22, 1673–1683.
  20. Zhang, L.; Feng, Y.; Xiao, Q.; Xu, Y.; Li, D.; Yang, D.; Yang, Z. Deep reinforcement learning for dynamic flexible job shop scheduling problem considering variable processing times. J. Manuf. Syst. 2023, 71, 257–273.
  21. Wang, J.; Liu, Y.; Ren, S.; Wang, C.; Ma, S. Edge computing-based real-time scheduling for digital twin flexible job shop with variable time window. Robot. Comput.-Integr. Manuf. 2023, 79, 102435.
  22. Li, R.; Gong, W.; Wang, L.; Lu, C.; Zhuang, X. Surprisingly Popular-Based Adaptive Memetic Algorithm for Energy-Efficient Distributed Flexible Job Shop Scheduling. IEEE Trans. Cybern. 2023, 53, 8013–8023.
  23. Naimi, R.; Nouiri, M.; Cardin, O. A Q-Learning Rescheduling Approach to the Flexible Job Shop Problem Combining Energy and Productivity Objectives. Sustainability 2021, 13, 13016.
  24. Li, K.; Deng, Q.; Zhang, L.; Fan, Q.; Gong, G.; Ding, S. An effective MCTS-based algorithm for minimizing the makespan in dynamic flexible job shop scheduling problem. Comput. Ind. Eng. 2021, 155, 107211.
  25. Song, W.; Chen, X.; Li, Q.; Cao, Z. Flexible Job-Shop Scheduling via Graph Neural Network and Deep Reinforcement Learning. IEEE Trans. Ind. Inform. 2023, 19, 1600–1610.
  26. Wan, L.; Fu, L.; Li, C.; Li, K. Flexible job shop scheduling via deep reinforcement learning with meta-path-based heterogeneous graph neural network. Knowl.-Based Syst. 2024, 296, 111940.
  27. Zhang, Z.; Cui, P.; Zhu, W. Deep Learning on Graphs: A Survey. IEEE Trans. Knowl. Data Eng. 2022, 34, 249–270.
  28. Liu, C.L.; Huang, T.H. Dynamic Job-Shop Scheduling Problems Using Graph Neural Network and Deep Reinforcement Learning. IEEE Trans. Syst. Man Cybern.-Syst. 2023, 53, 6836–6848.
  29. Wang, Y.; Long, H.; Zheng, L.; Shang, J. Graphformer: Adaptive graph correlation transformer for multivariate long sequence time series forecasting. Knowl.-Based Syst. 2024, 285, 111321.
  30. Chen, R.; Li, W.; Yang, H. A Deep Reinforcement Learning Framework Based on an Attention Mechanism and Disjunctive Graph Embedding for the Job-Shop Scheduling Problem. IEEE Trans. Ind. Inform. 2023, 19, 1322–1331.
  31. Zhang, W.; Zhao, F.; Feng, B.; Mei, X. A Reinforcement Learning Control Framework Based on Scalable Graph Transformer for Large-Scale Fuzzy Job Shop Scheduling Problems. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 16521–16533.
  32. Zhao, F.; Du, Y.; Zhuang, C.; Wang, L.; Yu, Y. An Iterative Greedy Algorithm for Solving a Multiobjective Distributed Assembly Flexible Job Shop Scheduling Problem with Fuzzy Processing Time. IEEE Trans. Cybern. 2025, 55, 2302–2315.
  33. Li, R.; Liu, S.; Wang, F.; Gao, J.; Liu, H.; Hu, S.; Yin, M. A Restart Local Search Algorithm with Relaxed Configuration Checking Strategy for the Minimum K-Dominating Set Problem. Knowl.-Based Syst. 2022, 254, 109619.
  34. Huang, S.C.; Coulom, R.; Lin, S.S. Time Management for Monte-Carlo Tree Search Applied to the Game of Go. In Proceedings of the International Conference on Technologies and Applications of Artificial Intelligence (TAAI 2010), Hsinchu, Taiwan, 18–20 November 2010; pp. 462–466.
  35. Vinyals, O.; Babuschkin, I.; Czarnecki, W.M. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 2019, 575, 350–354.
  36. Jia, H.; Lu, C.; Wu, D.; Wen, C.; Rao, H.; Abualigah, L. An Improved Reptile Search Algorithm with Ghost Opposition-based Learning for Global Optimization Problems. J. Comput. Des. Eng. 2023, 10, 1390–1422.
  37. Song, T.; Chen, M.; Xu, Y.; Wang, D.; Song, X.; Tang, X. Competition-guided Multi-Neighborhood Local Search Algorithm for the University Course Timetabling Problem. Appl. Soft Comput. 2021, 110, 107624.
  38. Zhang, M.; Lu, Y.; Hu, Y.; Amaitik, N.; Xu, Y. Dynamic Scheduling Method for Job-Shop Manufacturing Systems by Deep Reinforcement Learning with Proximal Policy Optimization. Sustainability 2022, 14, 5177.
  39. Yu, H.; Gao, K.Z.; Ma, Z.F.; Pan, Y.X. Improved meta-heuristics with Q-learning for solving distributed assembly permutation flowshop scheduling problems. Swarm Evol. Comput. 2023, 80, 101335.
  40. Guo, K.; Yang, M.; Zhu, H. Application research of improved genetic algorithm based on machine learning in production scheduling. Neural Comput. Appl. 2020, 32, 1857–1868.
  41. Liu, R.; Piplani, R.; Toro, C. Deep reinforcement learning for dynamic scheduling of a flexible job shop. Int. J. Prod. Res. 2022, 60, 4049–4069.
  42. Lara-Cárdenas, E.; Silva-Gálvez, A.; Ortiz-Bayliss, J.C.; Amaya, I.; Cruz-Duarte, J.M.; Terashima-Marín, H. Exploring Reward-based Hyper-heuristics for the Job-shop Scheduling Problem. In Proceedings of the 2020 IEEE Symposium Series on Computational Intelligence (SSCI), Canberra, ACT, Australia, 1–4 December 2020; pp. 3133–3140.
  43. Al-Hinai, N.; ElMekkawy, T.Y. Robust and stable flexible job shop scheduling with random machine breakdowns using a hybrid genetic algorithm. Int. J. Prod. Econ. 2011, 132, 279–291.
  44. Brandimarte, P. Routing and scheduling in a flexible job shop by tabu search. Ann. Oper. Res. 1993, 41, 157–183.
Figure 1. The execution process of the MGRL.
Figure 2. The execution process of the step decision in the MGRL.
Figure 3. The graph network architecture of the MGRL.
Figure 4. Convergence analysis by the average reward.
Figure 5. Convergence analysis by the effective optimization ratio.
Figure 6. Hyperparameter analysis of branchN and branchLen, shown as a heatmap.
Figure 7. Effectiveness analysis for MGRL.
Figure 8. Effectiveness analysis for the node encoder in the MGRL.
Figure 9. Effectiveness analysis for the skip-node restart strategy.
Figure 10. Aluminum profile processing scenarios.
Figure 11. Efficiency analysis for MGRL solving DFJSP in the 80 × 40 problem size.
Figure 12. Efficiency analysis for MGRL solving DFJSP in the 50 × 25 problem size.
Figure 13. Efficiency analysis for MGRL solving DFJSP by the win rate.
Figure 14. Efficiency analysis for MGRL solving FJSP in the public benchmark.
Figure 15. Efficiency analysis for MGRL solving FJSP in the simulation benchmark.
Table 1. Hyperparameters used in the convergence analysis of the RGAT.
Hyperparameters | Values
Number of transformer layers | 1
Learning rate | $1 \times 10^{-4}$
Number of training | 4000
Number of evaluations in the training process | 800
Batch size | 180
Discount factor | 0.9
Clipping ratio | 0.2
Optimizer | Adam
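As a reading aid, the Table 1 settings can be gathered into a small configuration object. The sketch below is hypothetical: the class name, the field names, and the `discounted_returns` helper are illustrative and are not taken from the authors' code. The learning rate is read as 1 × 10^-4. The helper only shows the standard way a discount factor of 0.9 weights future rewards; it does not reproduce the article's training loop.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RGATTrainingConfig:
    """Values copied from Table 1; the field names are illustrative."""
    transformer_layers: int = 1
    learning_rate: float = 1e-4          # read as 1 x 10^-4
    training_count: int = 4000           # "Number of training" in Table 1
    evaluations_during_training: int = 800
    batch_size: int = 180
    discount_factor: float = 0.9
    clipping_ratio: float = 0.2
    optimizer: str = "Adam"


def discounted_returns(rewards: List[float], gamma: float) -> List[float]:
    """Standard discounted return G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


cfg = RGATTrainingConfig()
print(discounted_returns([1.0, 1.0, 1.0], cfg.discount_factor))
```

Keeping the values in one frozen dataclass makes it harder to change a hyperparameter by accident between the training and evaluation runs.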
Table 2. Hyperparameter analysis of the dimension in the RGAT.
Problem Size | Hyperparameter | Win Rate
40 × 20 | 32 | $7.50 \times 10^{-1}$
50 × 25 | 32 | $8.75 \times 10^{-1}$
70 × 35 | 32 | $7.24 \times 10^{-1}$
Table 3. Optimal hyperparameters of the MGRL.
Hyperparameters | Values
Number of transformer layers | 1
Embedding dimension | 32
BranchLen | 5
BranchN | 30
RestartN | 50
Table 4. Paired samples t-test.
Ablation Type | Mean | Std. Deviation | Std. Error Mean | p-Value
Ablation-RHE | 0.73276 | 4.32676 | 0.23194 | 0.02 < 0.05
Ablation-TRM | 0.93391 | 4.17536 | 0.22382 | 0.01 < 0.05
Ablation-RSNN | 1.52874 | 4.8148 | 0.5162 | 0.04 < 0.05
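The columns of Table 4 are linked by the usual paired-samples identities: the standard error of the mean is the standard deviation divided by the square root of the number of pairs, and the test statistic is the mean difference divided by that standard error. The sketch below applies both identities to the tabulated values. The function names are illustrative, and the number of pairs is not stated in this section, so the sample size printed here is only what the two columns imply.

```python
# (mean difference, std. deviation, std. error of the mean) copied from Table 4.
TABLE_4 = {
    "Ablation-RHE": (0.73276, 4.32676, 0.23194),
    "Ablation-TRM": (0.93391, 4.17536, 0.22382),
    "Ablation-RSNN": (1.52874, 4.8148, 0.5162),
}


def implied_sample_size(std_dev: float, std_error: float) -> float:
    """Invert SE = SD / sqrt(n) to recover the number of pairs n."""
    return (std_dev / std_error) ** 2


def t_statistic(mean_diff: float, std_error: float) -> float:
    """Paired-samples t statistic: mean difference over its standard error."""
    return mean_diff / std_error


for name, (mean_diff, std_dev, std_error) in TABLE_4.items():
    n = implied_sample_size(std_dev, std_error)
    t = t_statistic(mean_diff, std_error)
    print(f"{name}: implied n = {n:.1f}, t = {t:.3f}, df = {round(n) - 1}")
```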
Table 5. Hyperparameters of the simulation benchmark.
Hyperparameters | Values
Maximum processing time | 60
Maximum number of operations per job | 15
Minimum number of operations per job | 5
Upper limit of the repetitive machine | 10
Random distribution of job simulations | $X \sim N(0, 1)$
Machine fault events | Equations (33) and (34)
Machine recovery events | Equation (35)
Variable processing time | Equation (36)
Machine types in aluminum profile processing | Table 6
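To make the Table 5 bounds concrete, the sketch below generates a random flexible job shop instance that respects them. It is a hypothetical illustration, not the authors' simulator. It reads "upper limit of the repetitive machine" as the maximum number of eligible machines per operation, which is an assumption, and it draws values uniformly rather than from the $N(0, 1)$-based scheme listed in the table. Machine faults, recovery, and variable processing times (Equations (33)–(36)) are not modeled.

```python
import random
from dataclasses import dataclass
from typing import Dict, List

MAX_PROC_TIME = 60          # Table 5: maximum processing time
MAX_OPS_PER_JOB = 15        # Table 5: maximum number of operations per job
MIN_OPS_PER_JOB = 5         # Table 5: minimum number of operations per job
MAX_ELIGIBLE_MACHINES = 10  # Table 5 upper limit, interpreted (assumption) as
                            # eligible machines per operation


@dataclass
class Operation:
    job: int
    index: int
    proc_times: Dict[int, int]  # machine id -> processing time


def generate_instance(n_jobs: int, n_machines: int, seed: int = 0) -> List[List[Operation]]:
    """Draw a random instance within the Table 5 bounds (uniform sampling is an assumption)."""
    rng = random.Random(seed)
    jobs: List[List[Operation]] = []
    for j in range(n_jobs):
        n_ops = rng.randint(MIN_OPS_PER_JOB, MAX_OPS_PER_JOB)
        ops = []
        for k in range(n_ops):
            # Each operation may run on a random subset of the machines.
            n_eligible = rng.randint(1, min(MAX_ELIGIBLE_MACHINES, n_machines))
            machines = rng.sample(range(n_machines), n_eligible)
            times = {m: rng.randint(1, MAX_PROC_TIME) for m in machines}
            ops.append(Operation(job=j, index=k, proc_times=times))
        jobs.append(ops)
    return jobs


instance = generate_instance(n_jobs=50, n_machines=25)
print(len(instance), "jobs,", sum(len(ops) for ops in instance), "operations")
```

Fixing the seed keeps an instance reproducible, which matters when several schedulers are compared on the same problem.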
Table 6. Hyperparameters of the operation and machine.
Machine | Operation | Processing Time
Sandblasting machine (dry/wet) | Sandblasting/Polishing | 3–8 min/$\mathrm{m}^2$
Polishing machine (vibratory/tumbler) | Sandblasting/Polishing | 5–15 min/m
Belt sander (high-gloss treatment) | Sandblasting/Polishing | 5–15 min/m
Phosphating | Chemical conversion treatment | 15–30 min
Chromium plating | Chemical conversion treatment | 15–30 min
Silane treatment | Chemical conversion treatment | 15–30 min
Electrostatic powder coating | Powder coating | 10–20 min/piece
Air spray robot | Liquid coating | 20–40 min/piece
High-pressure airless spray machine | Liquid coating | 20–40 min/piece
Transfer film laminator | Wood grain transfer printing | 15–25 min/piece
Thermal press furnace | Wood grain transfer printing | 15–25 min/piece
Automatic coating line (5–10 nozzles) | Fluorocarbon coating | 30–50 min/piece
Hot air recirculation oven (electric/gas heating) | Curing treatment | 15–30 min/furnace
Infrared curing furnace (energy-saving type) | Curing treatment | 15–30 min/furnace
UV curing machine (LED UV lamp) | Curing treatment | 2–5 min/piece
High-speed precision saw (hard alloy saw blade) | Cutting/Sawing | 1–3 min/piece
CNC band saw (irregular material cutting) | Cutting/Sawing | 1–5 min/m
Laser cutting machine (fiber/$\mathrm{CO}_2$) | Cutting/Sawing | 1–5 min/m
End milling machine (single-axis/dual-axis) | Milling/Drilling | 1–3 min/end
CNC machining center (3-axis/5-axis) | Milling/Drilling | 2–8 min/piece
Multi-axis drill (8–32 spindles) | Milling/Drilling | 2–8 min/piece
CNC tube bender (hydraulic servo drive) | Bending/Forming | 5–20 min/piece
Roll bender (three-roll/four-roll) | Bending/Forming | 5–20 min/piece
Press brake (servo-electrohydraulic) | Bending/Forming | 5–20 min/piece
Coordinate measuring machine (CMM) | Dimensional inspection | 5–15 min/piece
Coating thickness gauge (magnetic/eddy current) | Coating thickness detection | 1–3 min/batch
Cross-cutting device | Adhesion testing | 5–10 min/group
Pull-off tester | Adhesion testing | 5–10 min/group
Spectrophotometer (D65 light source) | Color difference detection | 2–4 min/batch
Automatic packaging line (robot + stretch film) | Film lamination/Labeling | 3–10 min/piece
Labeling machine (adhesive/thermal transfer) | Film lamination/Labeling | 3–10 min/piece
Table 7. Efficiency analysis for MGRL solving DFJSP.
Problem Size | CPU Time | MGRL Rank | MGRL Win Rate
80 × 40 | 5 | 1st | $4.19 \times 10^{-1}$
80 × 40 | 30 | 1st | $4.91 \times 10^{-1}$
80 × 40 | 60 | 1st | $4.74 \times 10^{-1}$
50 × 25 | 5 | 1st | $4.12 \times 10^{-1}$
50 × 25 | 30 | 1st | $5.03 \times 10^{-1}$
50 × 25 | 60 | 1st | $4.76 \times 10^{-1}$
Table 8. Device details.
Device Name | CPU and GPU
E5-1080 | Intel Xeon(R) E5-2680 v4 and NVIDIA RTX 1080Ti/11 GB
E5-3060 | Intel Xeon(R) E5-2680 v4 and NVIDIA RTX 3060/12 GB
X-2080 | Intel Xeon(R) Platinum 8336C and NVIDIA RTX 2080Ti/11 GB
X-3080 | Intel Xeon(R) Platinum 8352V and NVIDIA RTX 3080/10 GB
I3060 | Intel i5-10400F and NVIDIA RTX 3060/12 GB
Table 9. Device difference analysis.
Device Name | Mean | Std. Error Mean | 95% Confidence Interval | Sig. (2-Tailed)
E5-1080 and I3060 | 1.76000 | 0.54082 | [0.536, 2.983] | 0.071
E5-3060 and I3060 | 0.82000 | 0.35428 | [0.018, 1.621] | 0.743
X-2080 and I3060 | −0.73000 | 0.25168 | [−1.299, −0.160] | 0.059
X-3080 and I3060 | −2.12000 | 0.90453 | [−4.166, −0.073] | 0.064
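The 95% confidence intervals in Table 9 follow the usual form: the mean plus or minus a Student's t critical value times the standard error of the mean. The sketch below recomputes them from the Mean and Std. Error Mean columns. The sample size is not stated in this section; the reported intervals are numerically consistent with 9 degrees of freedom, so the critical value $t_{0.975,\,9} \approx 2.262$ is used here as an assumption. The function and dictionary names are illustrative.

```python
# Two-sided 95% critical value of Student's t for 9 degrees of freedom.
# The degrees of freedom are inferred from the reported intervals (assumption).
T_CRIT_DF9 = 2.262157

# (mean, std. error of the mean) copied from Table 9.
TABLE_9 = {
    "E5-1080 and I3060": (1.76000, 0.54082),
    "E5-3060 and I3060": (0.82000, 0.35428),
    "X-2080 and I3060": (-0.73000, 0.25168),
    "X-3080 and I3060": (-2.12000, 0.90453),
}


def confidence_interval(mean: float, std_error: float, t_crit: float = T_CRIT_DF9):
    """Return (lower, upper) = mean -/+ t_crit * std_error."""
    half_width = t_crit * std_error
    return mean - half_width, mean + half_width


for pair, (mean, std_error) in TABLE_9.items():
    lower, upper = confidence_interval(mean, std_error)
    print(f"{pair}: [{lower:.3f}, {upper:.3f}]")
```

The recomputed bounds agree with the tabulated ones to within about 0.001, a gap that rounding of the published means and standard errors would explain.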
Table 10. Simulation environment difference analysis.
Indicators | Mean Diff. | Std. Error Diff. | 95% Confidence Interval | Sig. (2-Tailed)
$T_{max}$ | −73.63700 | 63.86081 | [−207.80359, 60.52959] | 0.263958 > 0.05
$T_{min}$ | −5.53200 | 15.18274 | [−37.42976, 26.36576] | 0.719834 > 0.05
$X_{operation}$ | −1.212198 | 0.737865 | [−2.750196, 0.350196] | 0.121262 > 0.05
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

