Flexible Job Shop Scheduling with Job Precedence Constraints: A Deep Reinforcement Learning Approach

Li, Yishi; Yu, Chunlong

doi:10.3390/jmmp9070216


by

Yishi Li

and

Chunlong Yu

^*

School of Mechanical Engineering, Tongji University, Shanghai 201804, China

^*

Author to whom correspondence should be addressed.

J. Manuf. Mater. Process. 2025, 9(7), 216; https://doi.org/10.3390/jmmp9070216

Submission received: 15 May 2025 / Revised: 23 June 2025 / Accepted: 24 June 2025 / Published: 26 June 2025

(This article belongs to the Special Issue Smart Manufacturing in the Era of Industry 4.0, 2nd Edition)

Abstract

The flexible job shop scheduling problem with job precedence constraints (FJSP-JPC) is highly relevant in industrial production scenarios involving assembly operations. Traditional methods, such as mathematical programming and meta-heuristics, often struggle with scalability and efficiency when solving large instances. We propose a deep reinforcement learning (DRL) approach to minimize makespan in FJSP-JPC. The proposed method employs a heterogeneous disjunctive graph to represent the system state and a multi-head graph attention network for feature extraction. An actor–critic framework, trained using proximal policy optimization (PPO), is adopted to make operation sequencing and machine assignment decisions. The effectiveness of the proposed method is validated through comparisons with several classic dispatching rules and a state-of-the-art DRL approach. Additionally, the contributions of key mechanisms, such as information diffusion, node features, and action space, are analyzed through a full factorial design of experiments.

Keywords:

flexible job shop scheduling; job precedence constraints; graph neural network; disjunctive graph; deep reinforcement learning

1. Introduction

The flexible job shop scheduling problem (FJSP) is an important challenge widely faced in manufacturing industries. In contrast to the classic job shop scheduling problem (JSP), operations in FJSP can be processed on any machine from a set of eligible machines. The job precedence constraint (JPC) is another practical feature found in many production scenarios: some jobs can start processing only after one or several other jobs have been completed. That is, precedence relationships exist not only among the operations within a job, as in classic job shop scheduling, but also among different jobs.

An example of the JPC is shown in Figure 1. In this figure, each job (rectangle) represents an item, which could be either a product or a component, and each operation (circle) corresponds to a manufacturing process for that item. Edges indicate precedence relationships between operations or jobs. In this example, $J_{6}$ is the final product, and operation $O_{6, 1}$ represents its final assembly. The components of $J_{6}$ include $J_{1}$, $J_{2}$, and $J_{5}$, while $J_{2}$ itself is composed of parts $J_{3}$ and $J_{4}$. These hierarchical assembly relationships form a strict tree-like structure among the jobs, similar to a Bill of Materials (BOM).

Several methodologies have been proposed to solve FJSP [1]. One of the most important is mathematical programming, including mixed integer linear programming (MILP) [2,3] and constraint programming [4]. Yet, due to the NP-hardness of FJSP, the direct solving of these models tends to be difficult even for medium-scale instances [5]. Recent works for FJSP have focused on meta-heuristic algorithms, including genetic algorithm [6], Grey Wolf Algorithm [7], particle swarm optimization [8], etc. Li and Gao [9] proposed a method combining the population-based global search of genetic algorithm with the local improvement of tabu search, achieving a higher solution performance and shorter computational time for FJSP. Xie et al. [10] proposed a hybrid algorithm for the distributed FJSP that combines the global search capability of genetic algorithms with the local search strength of tabu search. He et al. [11] developed a method based on the ant colony optimization algorithm, incorporating heuristic rules to address both scheduling and transportation tasks in the FJSP.

Although traditional operations research methods can generally produce promising results, they often involve complex and time-consuming processes. Their computational burden grows rapidly with problem size, making them less suitable for dynamic environments where quick decision-making is essential. To balance computational cost and solution quality, recent research in JSP and FJSP has increasingly shifted toward artificial intelligence approaches, including reinforcement learning (RL) [12] and deep reinforcement learning (DRL) [13]. DRL, in particular, leverages neural networks to map environmental information to optimal decision actions [14]. Johnson et al. [15] proposed a multi-agent reinforcement learning (MARL) method for solving real-time FJSP in robotic assembly cells, demonstrating high flexibility and efficiency in dynamic environments. Luo et al. [16] introduced a hierarchical multi-agent method based on proximal policy optimization (PPO), tailored for real-time scheduling in discrete flexible manufacturing systems. Du et al. [17] studied a multi-objective FJSP with crane transportation and preparation time constraints. They developed a double deep Q-network algorithm with a specialized network architecture to minimize both makespan and total energy consumption. Han and Yang [18] proposed an end-to-end DRL framework based on the 3D disjunctive graph and an improved pointer network, which integrates both static and dynamic features for FJSP. Xu et al. [19] developed a scheduling framework based on the transformer and PPO for FJSP, which captures the relationships between state features and enhances performance through composite dispatching rules and a dense reward function.

These DRL methods rely on manually designed environmental features, which may overlook important information and struggle to handle the complex constraint relationships between operations and machines. To address this limitation, Graph Neural Networks (GNNs) have gradually been applied to FJSP. By representing operations, machines, and constraints as a graph structure, GNNs can effectively capture the structural characteristics of scheduling problems [20]. Park et al. [21] proposed a GNN-PPO framework for solving JSP, which models the problem structure using graphs and demonstrates strong generalization capabilities across untrained datasets of varying sizes. Song et al. [22] developed an end-to-end approach based on GNN and PPO for FJSP, employing a heterogeneous graph to capture the complex relationships between operations and machines. Wang et al. [23] proposed a DRL framework based on a dual attention network and PPO for FJSP. It constructs attention blocks for operation messages and machine messages to accurately represent their interconnections and performs well on large-scale instances. Lei et al. [24] introduced an end-to-end DRL framework for FJSP, utilizing disjunctive graphs to represent the local system state and designing a multi-action PPO algorithm that learns job and machine action policies, effectively handling instances of various sizes.

Besides the aforementioned studies, relatively few works have addressed the FJSP with job precedence constraints. Xiong et al. [25] proposed novel scheduling rules that account for job tardiness in addressing dynamic job shop scheduling problems with job batch releases and extended technological precedence constraints. Zhu and Zhou [26] investigated a job shop scheduling problem with job precedence constraints for bicycle assembly and proposed a multi-micro-swarm leadership hierarchy-based optimization algorithm to address this problem. Zhang et al. [27] studied the FJSP with multi-level assembly structures and proposed a distributed ant colony optimization algorithm to optimize makespan and total tardiness. Lin et al. [28] introduced the FJSP with job precedence constraints accommodating both machining and assembly operations, designing a genetic algorithm with an innovative two-dimensional encoding method to solve it.

Table 1 summarizes the recent works on job shop scheduling discussed above and contrasts them with our method. In summary, most current methods for the FJSP-JPC are based on mathematical programming and meta-heuristic algorithms, which may struggle to adapt to dynamic scenarios due to their relatively long computational times. DRL-based methods, on the other hand, show significant potential for improving solution efficiency while maintaining solution quality. However, to the best of our knowledge, no DRL-based approach has yet been proposed for FJSP-JPC. To fill this gap, this paper investigates the FJSP-JPC with the objective of minimizing makespan. The main contributions of this paper are as follows:

1.: We formulate the FJSP-JPC as a MILP model and develop a heterogeneous disjunctive graph model for improving problem representation.
2.: We propose a DRL-based approach to solve the FJSP-JPC, which outperforms traditional priority dispatching rules (PDRs) as well as a state-of-the-art DRL-based method.
3.: We analyze the impact of various factors in the proposed DRL framework, such as node embeddings, information diffusion range, and action space selection, on the algorithm performance.

The remainder of this paper is organized as follows. Section 2 provides the definition and mathematical model of FJSP-JPC. Section 3 introduces the proposed method. Section 4 reports the experimental results. Section 5 concludes this paper.

2. Problem Formulation

2.1. Problem Description

A set of $n$ jobs $J = \{J_{1}, J_{2}, \dots, J_{n}\}$ is to be processed by a set of $m$ machines $M = \{M_{1}, M_{2}, \dots, M_{m}\}$ in the job shop. Each job $J_{i} \in J$ consists of $n_{i}$ consecutive operations $\{O_{i 1}, O_{i 2}, \dots, O_{i n_{i}}\}$. Each operation $O_{i j}$ can be processed by a machine $M_{k}$ selected from the eligible subset $M_{ij} \subseteq M$, with processing time $p_{i j k}$.

In accordance with the process routing, there are precedence constraints among the jobs. Let $P_{i}$ be the set of precedent jobs for job $J_{i}$, meaning that $J_{i}$ can start processing only after all the jobs in $P_{i}$ are completed. Note that $P_{i}$ may be empty. The objective of the FJSP-JPC is to minimize the makespan $C_{\max}$, defined as the total time span from the start of the first operation to the completion of the last operation. Other characteristics of the problem are as follows.

1.: Each machine can process at most one operation at a time.
2.: Machines are reliable, and no breakdowns can occur.
3.: Machine setup times are included in the processing times.
4.: Transportation between machines is not considered.

2.2. Mathematical Programming Model

In this section, we formulate the problem using a MILP model as follows.

2.2.1. Parameters

n: total number of jobs.
m: total number of machines.
$J$ : set of jobs.
$M$ : set of machines.
$J_{i}$ : the ith job, $1 ⩽ i ⩽ n$ .
$M_{k}$ : the kth machine, $1 ⩽ k ⩽ m$ .
$n_{i}$ : total number of operations belonging to job $J_{i}$ .
$O_{i j}$ : the jth operation of job $J_{i}$ , $1 ⩽ j ⩽ n_{i}$ .
$M_{ij}$ : set of eligible machines that can process operation $O_{i j}$ , $M_{ij} \subseteq M$ .
$p_{i j k}$ : processing time of operation $O_{i j}$ on machine $M_{k}$ .
$P_{i}$ : set of precedent jobs for job $J_{i}$ .
H: a large number.

2.2.2. Decision Variables

$α_{i j k}$ : 1 if $O_{i j}$ is assigned to machine $M_{k}$ ; 0 otherwise.
$β_{i j i' j'}$ : 1 if $O_{i j}$ is scheduled before $O_{i' j'}$ ; 0 otherwise.
$s t_{i j}$ : starting time of operation $O_{i j}$ .
$C_{m a x}$ : makespan of the entire production process.

2.2.3. Objective and Constraints

\min C_{\max}

(1)

\sum_{k \in M_{ij}} α_{i j k} = 1, \forall 1 ⩽ i ⩽ n, 1 ⩽ j ⩽ n_{i}

(2)

s t_{i j} ⩾ s t_{i (j - 1)} + \sum_{k \in M_{i (j - 1)}} α_{i (j - 1) k} p_{i (j - 1) k}, \forall 1 ⩽ i ⩽ n, 2 ⩽ j ⩽ n_{i}

(3)

\begin{matrix} s t_{i j} ⩾ s t_{i' j'} + p_{i' j' k} - (2 - α_{i j k} - α_{i' j' k} + β_{i j i' j'}) H \\ \forall i, i', j, j', (i, j) \neq (i', j'), k \in M_{ij} \cap M_{i' j'} \end{matrix}

(4)

\begin{matrix} s t_{i' j'} ⩾ s t_{i j} + p_{i j k} - (3 - α_{i j k} - α_{i' j' k} - β_{i j i' j'}) H \\ \forall i, i', j, j', (i, j) \neq (i', j'), k \in M_{ij} \cap M_{i' j'} \end{matrix}

(5)

C_{m a x} ⩾ s t_{i n_{i}} + \sum_{k \in M_{i n_{i}}} p_{i n_{i} k} α_{i n_{i} k}, \forall i

(6)

s t_{i 1} ⩾ s t_{i' n_{i'}} + \sum_{k \in M_{i' n_{i'}}} p_{i' n_{i'} k} α_{i' n_{i'} k}, \forall i, i' \in P_{i}

(7)

α_{i j k} \in \{0, 1\}, \forall i, j, k \in M_{ij}

(8)

β_{i j i' j'} \in \{0, 1\}, \forall i, j, i', j'

(9)

The objective (1) is to minimize the makespan $C_{\max}$. Constraint (2) ensures that each operation is assigned to exactly one eligible machine. Constraint (3) ensures that an operation can start only after the completion of its predecessor in the same job. Constraints (4) and (5) together ensure that two distinct operations, $O_{i j}$ and $O_{i' j'}$, assigned to the same machine $M_{k}$ do not overlap in processing time and follow the order determined by $β_{i j i' j'}$: when both are assigned to $M_{k}$, Constraint (5) is active if $β_{i j i' j'} = 1$ and Constraint (4) is active if $β_{i j i' j'} = 0$, while the large number $H$ relaxes the inactive one. Constraint (6) defines the makespan. Constraint (7) ensures that a job cannot start until all its precedent jobs have been completed. Constraints (8) and (9) define the binary decision variables.
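As a sanity check on the model, the constraints can be verified against a candidate schedule. The following Python sketch is an illustration only: the function name `check_schedule`, the dictionary layout, and the toy instance are assumptions made for this example and are not part of the original formulation.

```python
from itertools import combinations

def check_schedule(proc, prec, assign, start):
    """Return the makespan if the schedule satisfies constraints (2)-(7), else raise.

    proc[(i, j)]   -> {machine k: processing time p_ijk} (keys form M_ij)
    prec[i]        -> set of precedent jobs P_i
    assign[(i, j)] -> chosen machine (the k with alpha_ijk = 1)
    start[(i, j)]  -> starting time st_ij
    """
    end = {}
    for op, k in assign.items():
        # Constraint (2): the single chosen machine must be eligible.
        if k not in proc[op]:
            raise ValueError(f"{op} assigned to ineligible machine {k}")
        end[op] = start[op] + proc[op][k]

    last = {}  # index n_i of the last operation of each job
    for (i, j) in proc:
        last[i] = max(last.get(i, 0), j)

    for (i, j) in proc:
        # Constraint (3): operations of one job run in routing order.
        if j > 1 and start[(i, j)] < end[(i, j - 1)]:
            raise ValueError(f"routing violated at {(i, j)}")
        # Constraint (7): a job starts only after its precedent jobs finish.
        if j == 1:
            for ip in prec.get(i, ()):
                if start[(i, 1)] < end[(ip, last[ip])]:
                    raise ValueError(f"job {i} starts before job {ip} ends")

    # Constraints (4)-(5): no overlap between operations on a shared machine.
    for a, b in combinations(assign, 2):
        if assign[a] == assign[b] and start[a] < end[b] and start[b] < end[a]:
            raise ValueError(f"{a} and {b} overlap on machine {assign[a]}")

    # Constraint (6): the makespan is the latest job completion time.
    return max(end[(i, last[i])] for i in last)

# Toy instance: job 2 may start only after job 1 is completed.
proc = {(1, 1): {1: 3, 2: 4}, (1, 2): {2: 2}, (2, 1): {1: 5}}
prec = {2: {1}}
assign = {(1, 1): 1, (1, 2): 2, (2, 1): 1}
start = {(1, 1): 0, (1, 2): 3, (2, 1): 5}
makespan = check_schedule(proc, prec, assign, start)  # 10 for this schedule
```

The checker replaces the big-H disjunction of Constraints (4) and (5) with a direct interval-overlap test, which is equivalent once the machine assignment is fixed.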

3. Methodology

FJSP-JPC involves two types of decisions: assigning an operation to one of its eligible machines, and sequencing the operations on the same machine. Though FJSP-JPC is a combinatorial optimization problem, the schedule construction can be viewed as a sequential decision-making procedure. This allows for the possibility to formulate the schedule procedure as a Markov decision process (MDP), where the optimal policy can be estimated by RL algorithms. In this section, the details of our proposed approach are presented. Firstly, we explain the scheduling environment for FJSP-JPC. Then, the definitions of state, action, reward function, and state transition logic are described, along with several improvement strategies we propose.

3.1. MDP Formulation of the Scheduling Problem

At the first scheduling step ($t = 1$), the system clock is initialized to $T = 0$, with all machines idle and all operations unscheduled. Let $s_{t}$ denote the current system state. Based on $s_{t}$, the agent takes an action $a_{t}$, which involves selecting an unscheduled operation and scheduling its start time on one of its eligible machines. This action transitions the system to a new state $s_{t + 1}$, and the environment provides a reward $r_{t}$.

Typically, only a subset of operations can be scheduled at time T, such as those whose predecessors have been completed [23]. However, the definition of this subset is not unique. If no such operations are available, or all machines that can process these operations are busy, the system clock advances to the earliest time at which a machine completes a scheduled operation. An operation is considered completed once its scheduled completion time is no later than the system clock. This process continues until all operations have been scheduled and completed.

3.2. State

The state representation of FJSP typically summarizes the environment information at time T. Traditional methods often use manually crafted features, such as job completion rate and machine load rate, as state descriptors [13]. The advantage of this approach lies in its ability to abstract information into fixed-size feature vectors, making it applicable to systems of different scales and allowing the agent a certain degree of generalization capability. However, the limitations of this method are also evident. In principle, the state representation provided to the agent should comprehensively capture all of the relevant information about the environment at time T. In practice, the complexity of manufacturing systems makes it difficult for manually designed features to include all critical aspects. Important factors such as job precedence constraints and machine-operation compatibility may be overlooked. To address this issue and better capture the complex characteristics of the environment, we propose a state representation method based on a heterogeneous disjunctive graph, combined with a multi-head graph attention network (GAT), to learn rich and structured features automatically.

3.2.1. Heterogeneous Disjunctive Graph

The disjunctive graph is a graph model used to represent the scheduling process of JSP [29]. The typical disjunctive graph is a directed acyclic graph written as $G = (O, A, E)$. Here, $O = \{O_{i j} ∣ \forall i, j\} \cup \{Start, End\}$ is the set of nodes consisting of all operations and two dummy operations representing the start and end of processing. $A$ is the set of directed conjunctive arcs, representing the precedence relationship between operations. $E$ is the set of disjunctive arcs connecting the operations that can be processed on the same machine $M_{k}$. Solving JSP involves fixing the direction of each disjunctive arc and, thus, arranging the sequence of operations on the same machine.

Unlike JSP, each operation in FJSP can have more than one eligible machine. To better represent this relationship between operations and machines, a heterogeneous disjunctive graph $H = (O, M, A, D)$ is designed. As shown in Figure 2a, the graph retains the operation nodes $O$ and directed conjunctive arcs $A$, while introducing a new type of node, i.e., machine nodes $M$, to represent the machines. The disjunctive arcs $D$ build connections between the operations and their eligible machines.

The scheduling process involves selecting one disjunctive arc for each operation node, converting it into a conjunctive arc, and determining the operation start time, which is stored in the node features. A possible solution is illustrated in Figure 2b. The detailed step-by-step scheduling procedure can be found in Appendix A.
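To make the graph structure concrete, the following Python sketch stores the four components of the heterogeneous disjunctive graph and shows what fixing one operation–machine arc does. The class name, field names, and the simplification of recording the chosen arc in an `assigned` dictionary are assumptions for illustration; the paper's actual procedure is the one given in Appendix A.

```python
from dataclasses import dataclass, field

@dataclass
class HeteroDisjunctiveGraph:
    """Minimal container for H = (O, M, A, D); field names are illustrative."""
    operations: set = field(default_factory=set)     # O: operation nodes
    machines: set = field(default_factory=set)       # M: machine nodes
    conjunctive: set = field(default_factory=set)    # A: (pred_op, succ_op) arcs
    disjunctive: dict = field(default_factory=dict)  # D: (op, machine) -> proc. time
    assigned: dict = field(default_factory=dict)     # op -> (machine, start time)

    def add_operation(self, op, eligible):
        """Add an operation node and one disjunctive arc per eligible machine."""
        self.operations.add(op)
        for machine, p in eligible.items():
            self.machines.add(machine)
            self.disjunctive[(op, machine)] = p

    def schedule(self, op, machine, start):
        """Fix one O-M arc of `op`, drop its other disjunctive arcs,
        and return the completion time of the operation."""
        p = self.disjunctive[(op, machine)]
        for key in [k for k in self.disjunctive if k[0] == op]:
            del self.disjunctive[key]
        self.assigned[op] = (machine, start)
        return start + p

g = HeteroDisjunctiveGraph()
g.add_operation("O11", {"M1": 3, "M2": 4})
g.add_operation("O12", {"M2": 2})
g.conjunctive.add(("O11", "O12"))
done = g.schedule("O11", "M1", 0)  # O11 runs on M1 during [0, 3)
```

After the call, only the arcs of still-unscheduled operations remain in `disjunctive`, which mirrors the idea that each scheduling step resolves the machine choice of exactly one operation.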

We embed a set of features into the graph to represent scheduling information. Table 2 reports the features for each operation node in $O$, machine node in $M$, and undirected arc in $D$. Most of these features are identical to those used in [23] for FJSP; we modify a subset of them to accommodate the JPC characteristics of our problem.

Compared with the features proposed by Wang et al. [23], the newly designed features are (o2), (o8), (o9), and (a3), which are adapted to the tree-like JPC in our problem. When computing the estimated completion time of operation $O_{i j}$ in (o2), we consider all operations related to $O_{i j}$ through JPC across the entire graph: instead of assuming $C_{i 0} = 0$, we determine the estimated value of $C_{i 0}$ by Equation (19). Similarly, when calculating the number of remaining operations (o8) and the remaining work (o9), we account for all operations connected to $O_{i j}$ via JPC, i.e., all unscheduled operations along the directed path from $O_{i j}$ to End. The calculation of feature (a3) also differs from that in Wang et al. [23]: since we use a different action space, the set of candidate operations that a machine, say $M_{k}$, can process at a given time T is larger than that in the original version.

3.2.2. Multi-Head GAT

Traditional neural network architectures, such as the multi-layer perceptron (MLP), can only process inputs with fixed-size dimensions. However, in FJSP-JPC, the size of the graph structures representing the states varies between instances. To tackle this difficulty, we utilize a multi-head GAT to process the graph features. The multi-head GAT takes a graph of arbitrary size as input and outputs a graph of the same topology with updated feature vectors in each node and arc. The GAT assigns different attention weights to neighboring nodes, allowing it to focus more on important neighbors during feature aggregation. Meanwhile, the graph attention layer uses an attention mechanism to learn critical structural features of the graph during the embedding process and places greater emphasis on local information by applying attention masks. The attention mechanism is described in the following subsection.

3.2.3. Attention Mechanism

The heterogeneous disjunctive graph can essentially be viewed as a superposition of two distinct subgraphs, i.e., the operation relationship graph and the machine relationship graph. Hence, two separate attention blocks are employed in the model to embed the features of operations and machines, respectively. This design facilitates the learning of complex connections both between operation nodes and between operation nodes and machine nodes. While the two attention blocks share similar structural designs, they have different dimensions and parameters.

Operation attention block. The inputs to this block are the features of the operation nodes. The precedence constraints between operation nodes are represented by an adjacency matrix, which is used to propagate information between nodes with strong correlations. Specifically, each operation $O_{i j} \in O$ is associated with an input feature $h_{O_{i j}}$. Let $N_{i j}$ denote the set of direct predecessor and successor nodes of $O_{i j}$, i.e., the nodes exactly one edge away from it, together with $O_{i j}$ itself. Notably, $N_{i j}$ may also include cross-job operations that do not belong to the same job as $O_{i j}$. For example, in Figure 3a, the operation node $O_{41}$ aggregates information from two of its predecessors in different jobs, namely $O_{12}$ and $O_{22}$. This mechanism enables information to diffuse across the graph, enhancing overall performance.

The methodology for computing the attention coefficient between

O_{i j}

and

O_{i' j'}

is as follows:

\begin{matrix} e_{i j i' j'}^{o} & = LeakyReLU ({\vec{a}}^{⊤} [(W h_{O_{i j}}) ∥ (W h_{O_{i' j'}})]), \\ \forall O_{i' j'} \in N_{i j}, \end{matrix}

(10)

where

W \in R^{d_{o}^{'} \times d_{o}}

is a linear transformation that projects the node features

h_{O_{i j}}

and

h_{O_{i' j'}}

into a higher-dimensional space. The transformed features are concatenated and then passed through a single-layer feedforward neural network with a

LeakyReLU

activation function. The network weights are denoted by

\vec{a} \in R^{2 d_{o}^{'}}

. Afterwards, we normalize attention coefficients by the softmax function as follows:

\begin{matrix} μ_{i j i' j'}^{o} & = Softmax (e_{i j i' j'}^{o}) = \frac{\exp (e_{i j i' j'}^{o})}{\sum_{O_{i'' j''} \in N_{i j} \setminus U_{i j}} \exp (e_{i j i'' j''}^{o})}, \\ \forall O_{i' j'} \in N_{i j} \setminus U_{i j}, \end{matrix}

(11)

where

U_{i j}

represents the set of completed and dummy operation nodes whose effect on the scheduling process is not considered. Having the attention coefficients, we update the features for each operation node by

h_{O_{i j}}^{'} = ELU (\sum_{O_{i' j'} \in N_{i j}} μ_{i j i' j'}^{o} W h_{O_{i' j'}}),

(12)

where

ELU

is the exponential linear unit nonlinear activation function.

This approach updates the node embeddings by aggregating information from their neighbors. Let

h_{O_{i j}}^{(l)}

represent the node embedding after passing through l attention layers. By doing so,

h_{O_{i j}}^{(l)}

incorporates information from a broader range of nodes, thereby expanding the receptive field.
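The operation attention update of Equations (10)–(12) can be sketched in plain Python as follows. This is a single-head illustration under stated assumptions: the function and argument names are hypothetical, the weights `W` and `a` are fixed by hand rather than learned, the LeakyReLU slope of 0.2 is an assumed value, and the fallback used when every neighbor is masked is a choice made only for this example.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def elu(x):
    return x if x > 0 else math.exp(x) - 1.0

def operation_attention(h, neighbors, masked, W, a):
    """One single-head update of all operation embeddings.

    h[o]         -> input feature vector of operation o
    neighbors[o] -> N_o: direct predecessors/successors plus o itself
    masked       -> U: completed or dummy nodes excluded from attention
    W, a         -> projection matrix and attention vector (learned in practice)
    """
    Wh = {o: matvec(W, x) for o, x in h.items()}
    out = {}
    for o in h:
        active = [n for n in neighbors[o] if n not in masked]
        if not active:  # every neighbour is masked: keep the own projection
            out[o] = [elu(v) for v in Wh[o]]
            continue
        # Eq. (10): unnormalised score on the concatenated projections.
        e = [leaky_relu(sum(ai * v for ai, v in zip(a, Wh[o] + Wh[n])))
             for n in active]
        # Eq. (11): softmax restricted to the unmasked neighbourhood.
        m = max(e)
        w = [math.exp(x - m) for x in e]
        mu = [x / sum(w) for x in w]
        # Eq. (12): attention-weighted sum of the neighbours' projections.
        agg = [sum(mu_n * Wh[n][d] for mu_n, n in zip(mu, active))
               for d in range(len(W))]
        out[o] = [elu(v) for v in agg]
    return out

h = {"O11": [1.0, 0.0], "O12": [0.0, 1.0], "O21": [1.0, 1.0]}
neighbors = {"O11": ["O11", "O12"],
             "O12": ["O11", "O12", "O21"],
             "O21": ["O12", "O21"]}
W = [[1.0, 0.0], [0.0, 1.0]]   # identity projection, for readability
a = [0.0, 0.0, 0.0, 0.0]       # zero scores give uniform attention
new_h = operation_attention(h, neighbors, masked={"O11"}, W=W, a=a)
```

With the zero attention vector every unmasked neighbor receives the same weight, so the updated embedding of `O12` is simply the average of the projections of `O12` and `O21`. Stacking several such layers lets each node see neighbors that are more than one edge away.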

Machine attention block. The attention block for machines is similar to that for operations; the difference is in the consideration of neighbors. In FJSP, machines are related to each other due to the competition of unscheduled operations. More specifically, if an unscheduled operation can be processed by two machines, say

M_{k}

and

M_{k'}

, there can be a competitive relationship between them. As shown in Figure 3b,

M_{1}

is related to

M_{2}

through the competition for

O_{12}

, while it is related to

M_{3}

by the competition for

O_{21}

and

O_{32}

. Let

C_{k k'}

be the set of operations that

M_{k}

and

M_{k'}

compete for, and

N_{k}

be the set of machines competing with machine

M_{k}

(including

M_{k}

itself). The parameter that measures the competitive intensity between

M_{k}

and

M_{k'}

can be defined as follows:

κ_{k k'} = \sum_{O_{i j} \in C_{k k'}} h_{O_{i j}} .

(13)

Then, the attention coefficient between machines

M_{k}

and

M_{k'}

can be computed by

e_{k k'}^{m} = LeakyReLU ({\vec{b}}^{⊤} [(P h_{M_{k}}) ∥ (P h_{M_{k'}}) ∥ (Q κ_{k k'})]), \forall M_{k'} \in N_{k},

(14)

where

P \in R^{d_{m}^{'} \times d_{m}}

is a linear transformation scaling up the machine feature vectors, and

Q

is a matrix evaluating the influence of a competitive intensity on non-self-attention. Specifically, if

C_{k k'}

is empty,

M_{k}

and

M_{k'}

do not affect each other, and the attention coefficients between them are set to zero by a mask. Finally, we update the machine features

h_{M_{k}}

by

μ_{k k'}^{m} = Softmax (e_{k k'}^{m}),

(15)

h_{M_{k}}^{'} = ELU (\sum_{M_{k'} \in N_{k}} μ_{k k'}^{m} P h_{M_{k'}}) .

(16)
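The competition sets that drive the machine attention block can be computed directly from the eligibility relation. The sketch below is illustrative: the function names are hypothetical, and the eligibility sets and feature vectors are assumed values chosen to mirror the relationships described for Figure 3b ($M_{1}$ competes with $M_{2}$ for $O_{12}$, and with $M_{3}$ for $O_{21}$ and $O_{32}$).

```python
from itertools import combinations

def competition_sets(eligible, unscheduled):
    """Return (C, N): C[(k, k2)] holds the unscheduled operations both machines
    can process; N[k] holds the machines competing with k, plus k itself."""
    machines = sorted({k for ms in eligible.values() for k in ms})
    C = {}
    N = {k: {k} for k in machines}
    for k, k2 in combinations(machines, 2):
        shared = {o for o in unscheduled
                  if k in eligible[o] and k2 in eligible[o]}
        if shared:  # an empty C_kk' means the pair is masked out
            C[(k, k2)] = C[(k2, k)] = shared
            N[k].add(k2)
            N[k2].add(k)
    return C, N

def competitive_intensity(C, h_op, k, k2):
    """Eq. (13): element-wise sum of the features of the contested operations."""
    ops = C.get((k, k2), set())
    dim = len(next(iter(h_op.values())))
    return [sum(h_op[o][d] for o in ops) for d in range(dim)]

eligible = {"O12": {"M1", "M2"}, "O21": {"M1", "M3"}, "O32": {"M1", "M3"}}
h_op = {"O12": [1.0, 2.0], "O21": [0.5, 0.5], "O32": [1.5, 0.5]}
C, N = competition_sets(eligible, unscheduled=set(eligible))
kappa_13 = competitive_intensity(C, h_op, "M1", "M3")
```

In this toy case $M_{2}$ and $M_{3}$ share no unscheduled operation, so the pair is absent from `C` and neither machine appears in the other's neighborhood.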

3.2.4. Multi-Head Attention

According to Vaswani et al. [30], the expressive power of the attention mechanism can be enhanced by capturing information across different dimensions using multiple independent attention heads. To leverage this, we employ $H$ attention heads within both the operation and machine attention blocks. More specifically, each head uses the same raw features of operation nodes (or machine nodes) as inputs, while the attention coefficients $μ_{i j i' j'}^{o}$ (or $μ_{k k'}^{m}$) are computed in parallel and independently. Then, the outputs of the heads are combined by an aggregation operator. Following GAT, we use concatenation in all attention layers except the last, where the heads are combined by averaging.

3.2.5. Global Average Pooling

Besides the node features, we apply an average pooling on all the active nodes to obtain a global feature vector

h_{g_{t}}^{(l)}

describing the overall state of the graph, which is given by

h_{g_{t}}^{(l)} = [(\frac{1}{|𝒪|} \sum_{O_{i j} \in 𝒪} h_{O_{i j}}^{(l)}) ∥ (\frac{1}{|M|} \sum_{M_{k} \in M} h_{M_{k}}^{(l)})] .

(17)

3.2.6. The Overall Architecture

The overall architecture of the multi-head GAT is illustrated in Figure 4. The input consists of features of nodes and operation–machine arcs. After passing through the attention layers, the node features are updated. The output is a graph with the same structure but with updated features, which serve as the state to guide scheduling decisions.

3.3. Action

In prior studies on DRL-based FJSP [13], priority dispatching rules (PDRs) are often employed as actions for decision-making. However, this action space, derived from human experience, while capable of quickly generating suboptimal solutions, fails to fully encompass all feasible actions in

s_{t}

, potentially resulting in performance limitations.

To maintain the degree of exploration, we define the action set $A_{t}$ as all feasible operation–machine pairs that can be selected at step t. A similar approach is adopted in [31]. More specifically, an operation–machine pair $(O_{i j}, M_{k})$ is feasible at step t when $O_{i j}$ is unscheduled and all its predecessors are completed or being processed at time $T (t)$. Note that this action space differs from the classic one used in prior studies (e.g., [23]), which only considers operations whose predecessors are completed. The classic action space results in a non-delay schedule and may therefore exclude the optimal schedule. Our scheme generates a semi-active schedule, as machines are allowed to stay idle while an operation is waiting for processing. The benefit of this scheme is illustrated by numerical results.
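The difference between the two action spaces can be illustrated with a short Python sketch. The function name, the status labels, and the toy data are assumptions made for this example; the `allow_in_progress` flag switches between the rule described here and the classic rule that requires all predecessors to be completed.

```python
def feasible_pairs(eligible, preds, status, allow_in_progress=True):
    """List the operation-machine pairs selectable at the current step.

    eligible[o] -> iterable of machines that can process operation o
    preds[o]    -> predecessor operations of o (same-job and cross-job)
    status[o]   -> "unscheduled", "processing", or "completed"
    """
    ok = {"completed", "processing"} if allow_in_progress else {"completed"}
    return [(o, k)
            for o in sorted(eligible)
            if status[o] == "unscheduled"
            and all(status[p] in ok for p in preds.get(o, ()))
            for k in sorted(eligible[o])]

eligible = {"O11": ["M1"], "O12": ["M1", "M2"], "O21": ["M2"]}
preds = {"O12": ["O11"]}
status = {"O11": "processing", "O12": "unscheduled", "O21": "unscheduled"}

wide = feasible_pairs(eligible, preds, status)
narrow = feasible_pairs(eligible, preds, status, allow_in_progress=False)
```

Here `wide` contains the two pairs for `O12` in addition to `("O21", "M2")`, because its predecessor `O11` is already in progress, whereas `narrow` contains only `("O21", "M2")`.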

3.4. Reward

The reward after one step of state transition is defined by

r_{t} = {\bar{C}}_{\max} (t) - {\bar{C}}_{\max} (t + 1) .

(18)

Here, ${\bar{C}}_{\max} (t)$ represents the estimated makespan in step t, which is defined as the maximum completion time among all operations. The completion times for completed and ongoing operations are known, while those for unscheduled operations are estimated by

{\bar{C}}_{i j} = \max_{O_{i' j'} \in P_{i j}} {\bar{C}}_{i' j'} + \min_{k \in M_{i j}} p_{i j k},

(19)

where $P_{i j}$ is the set of predecessor operations of $O_{i j}$.

When the scheduling process is completed and the discount factor is $γ = 1$, the cumulative reward is given by

G = \sum_{t = 0}^{|O|} r_{t} = {\bar{C}}_{\max} (0) - C_{\max},

(20)

where ${\bar{C}}_{\max} (0)$ is a constant for a specific instance, and $C_{\max}$ is the actual makespan. The intermediate estimates cancel in the telescoping sum, so G decreases as $C_{\max}$ increases, which means that maximizing the cumulative reward G is equivalent to minimizing $C_{\max}$.
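The link between the step rewards and the final makespan can be checked numerically. The Python sketch below estimates the makespan with the recursion of Equation (19) and accumulates the rewards of Equation (18) over a toy episode. The function name, the three-operation instance, and the choice of 0 as the ready time of operations without predecessors are assumptions for illustration.

```python
def estimated_makespan(proc, preds, known):
    """Estimated C_max: known completion times are kept, the others follow Eq. (19).

    proc[o]  -> {machine: processing time}
    preds[o] -> predecessor operations of o
    known[o] -> completion time of operations that are already scheduled
    """
    est = dict(known)

    def completion(o):
        if o not in est:
            # Assumption: an operation without predecessors is ready at time 0.
            ready = max((completion(p) for p in preds.get(o, ())), default=0)
            est[o] = ready + min(proc[o].values())
        return est[o]

    return max(completion(o) for o in proc)

proc = {"O11": {"M1": 3, "M2": 4}, "O12": {"M1": 2}, "O21": {"M1": 3}}
preds = {"O12": ["O11"]}

# Known completion times before any decision and after each of three decisions
# (all three operations end up on M1, so they are processed one after another).
snapshots = [{},
             {"O11": 3},
             {"O11": 3, "O21": 6},
             {"O11": 3, "O21": 6, "O12": 8}]
estimates = [estimated_makespan(proc, preds, k) for k in snapshots]
rewards = [estimates[t] - estimates[t + 1] for t in range(len(estimates) - 1)]
```

For this episode the estimates are 5, 5, 6, and 8, the rewards are 0, -1, and -2, and their sum equals the initial estimate minus the final makespan (5 - 8 = -3).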

3.5. Decision-Making

We adopt an actor–critic framework parameterized by

θ

and

ϕ

for decision-making, as shown in Figure 5. The actor network aims to generate a policy function

π_{θ} (a_{t} | s_{t})

for actions. More specifically, for each

a_{t}

in

A_{t}

, say

a_{t} = (O_{i j}, M_{k})

, the extracted features of

O_{i j}

and

M_{k}

, the global features, and the operation–machine arc features are concatenated into a single vector and fed into the actor network, which outputs the score by

s c (a_{t} | s_{t}) = {MLP}_{θ} [h_{O_{i j}}^{(l)} ∥ h_{M_{k}}^{(l)} ∥ h_{g_{t}}^{(l)} ∥ h_{a r c_{i j k}}] .

(21)

Then, the probability of selecting

a_{t}

is

π_{θ} (a_{t} | s_{t}) = Softmax (s c (a_{t} | s_{t})) .

(22)
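As a small illustration of Equation (22), the following Python sketch turns the scores of the feasible actions into a probability distribution and samples from it. The function names and the example scores are assumptions; in the actual method the scores come from the actor network.

```python
import math
import random

def action_probabilities(scores):
    """Softmax over the scores of all feasible actions."""
    m = max(scores.values())  # subtract the maximum for numerical stability
    exps = {a: math.exp(s - m) for a, s in scores.items()}
    total = sum(exps.values())
    return {a: v / total for a, v in exps.items()}

def sample_action(scores, rng=random):
    """Draw one action according to the softmax probabilities."""
    probs = action_probabilities(scores)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]

scores = {("O12", "M1"): 1.0, ("O12", "M2"): 1.0, ("O21", "M2"): 0.0}
probs = action_probabilities(scores)
action = sample_action(scores, random.Random(0))
```

Because the softmax is taken only over the feasible pairs, the size of the distribution changes from step to step, which is why the actor scores each candidate pair separately instead of using a fixed-size output layer.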

The critic network is used to estimate the state value function. This estimation serves as a baseline to stabilize and improve the efficiency of the training process by reducing variance in the policy gradient. The state value function is estimated by

s v (s_{t}) = {MLP}_{ϕ} (h_{g_{t}}^{(l)}) .

(23)

The loss function of [32] is used, which consists of three parts: the policy loss computed by Generalized Advantage Estimation (GAE), the value loss, and the entropy loss used to encourage exploration:

L_{t o t a l} = ϵ_{1} L_{p o l i c y} + ϵ_{2} L_{v a l u e} + ϵ_{3} L_{e n t r o p y} .

(24)
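The three loss terms can be sketched in plain Python as follows. This is a generic illustration of generalized advantage estimation and the clipped PPO objective, not the authors' implementation: the function names, the clipping range of 0.2, the GAE parameter `lam`, and the default coefficients standing in for $ϵ_{1}$, $ϵ_{2}$, and $ϵ_{3}$ are all assumed values.

```python
import math

def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized advantage estimates for one finished episode.

    values has one more entry than rewards; the last one is the value of the
    terminal state (0 for a finished schedule).
    """
    adv, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def ppo_loss(logp_new, logp_old, adv, values, returns, entropy,
             clip=0.2, c_policy=1.0, c_value=0.5, c_entropy=0.01):
    """Weighted sum of policy, value, and entropy terms, averaged over steps."""
    n = len(adv)
    policy = value = 0.0
    for lp, lo, a, v, g in zip(logp_new, logp_old, adv, values, returns):
        ratio = math.exp(lp - lo)                    # pi_theta / pi_theta_old
        clipped = min(max(ratio, 1.0 - clip), 1.0 + clip)
        policy -= min(ratio * a, clipped * a) / n    # clipped surrogate
        value += (v - g) ** 2 / n                    # squared value error
    ent = -sum(entropy) / n                          # minimising this raises entropy
    return c_policy * policy + c_value * value + c_entropy * ent

adv = gae([0.0, -1.0, -2.0], [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
loss = ppo_loss(logp_new=[math.log(2.0)], logp_old=[0.0], adv=[1.0],
                values=[0.0], returns=[0.0], entropy=[0.0])
```

With a zero value baseline and `lam = 1`, the advantages reduce to the reward-to-go of each step. In the second call the probability ratio of about 2 is clipped to 1.2, so the policy term, and here the whole loss, is about -1.2.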

3.6. Training Process

The actor–critic framework is trained using the Proximal Policy Optimization (PPO) algorithm. The pseudocode is given in Algorithm 1. To improve training stability, we use a reference actor network $θ_{old}$, which has the same structure as the actor network but lagged parameters. In addition, clipping is used to limit the update magnitude of the policy.

We begin by initializing the model parameters $\{ω, θ, ϕ\}$ and sampling $B$ training instances along with $V$ testing instances from the simulation environment. The model is then trained over $N_{it}$ iterations, each starting by setting $θ_{old} = θ$. For each instance, a simulation is performed in which the agent interacts with the environment. At each decision point t, an action is sampled from $π_{θ_{old}}$ based on the current state $s_{t}$, and the resulting state transition data is stored in the memory buffer $D$. The model parameters are then updated K times using the collected data. Additionally, the policy is evaluated on the testing instances every $N_{val}$ iterations, and the training instances are resampled every $N_{re}$ iterations. Finally, the memory buffer is cleared at the end of each iteration. We name the proposed approach Multi-Head GAT-based PPO for FJSP-JPC (MGPPO).

Algorithm 1 Multi-Head GAT-based PPO for FJSP-JPC

Input: Multi-Head GAT, actor network, and critic network with trainable parameters $ω$, $θ$, and $ϕ$; reference behavior actor network $θ_{old}$; memory $D$
1: Sample a batch of $V$ testing instances;
2: Sample a batch of $B$ training instances;
3: for $iter = 1, 2, \dots, N_{it}$ do
4:     $θ_{old} \leftarrow θ$;
5:     for $b = 1, 2, \dots, B$ do
6:         Initialize state $s_{t}$ based on instance $b$;
7:         while $s_{t}$ is not terminal do
8:             Sample action $a_{t} \sim π_{θ_{old}}(\cdot ∣ s_{t})$;
9:             Receive reward $r_{t}$ and next state $s_{t+1}$;
10:             Collect the transition $(s_{t}, a_{t}, r_{t}, s_{t+1})$ in $D$;
11:             Update state $s_{t} \leftarrow s_{t+1}$;
12:         end while
13:         Compute the generalized advantage estimates $\hat{A}_{t}$ for each step using the collected transitions;
14:     end for
15:     for $k = 1, 2, \dots, K$ do
16:         Compute the total loss function $L$ with the data in $D$;
17:         Update the parameters $θ$, $ω$, and $ϕ$;
18:     end for
19:     if $iter \bmod N_{val} = 0$ then
20:         Test $π_{θ}$ on $V$ testing instances;
21:     end if
22:     if $iter \bmod N_{re} = 0$ then
23:         Resample a batch of $B$ training instances;
24:     end if
25:     Empty $D$;
26: end for
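Step 13 of Algorithm 1 can be illustrated with the usual backward recursion for generalized advantage estimation. The sketch below is a generic version, not the authors' code: the function name is hypothetical, and the discount factor `gamma` and the GAE parameter `lam` are assumed defaults because their values are not stated in this section.

```python
import numpy as np

def generalized_advantage_estimates(rewards, values, gamma=1.0, lam=0.95):
    """Compute A_t = delta_t + gamma * lam * A_{t+1} by a backward pass.

    `values` must hold one more entry than `rewards`; the last entry is the
    value of the final state (0 for a terminal state).
    gamma and lam are assumed defaults, not values taken from the paper.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    advantages = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error at step t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting `lam=0` reduces the estimate to the one-step temporal-difference error, while `lam=1` accumulates every later error in the episode.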

4. Numerical Results

In this section, we first describe the experimental setup. Then, we explore the impact of several mechanisms on the scheduling performance of MGPPO. Finally, we validate the performance of MGPPO by comparing it to some benchmarks.

4.1. Experimental Setup

Test instances. Since there is no established benchmark for the FJSP-JPC, we generated test instances based on the real-world case of a hybrid bicycle assembly job shop presented in [26]. In our formulation, the jobs making up a product are referred to as a job group, which is generated as follows:

1. Each job group contains 12 operations distributed across six jobs. The number of operations per job is sampled without replacement from the set $\{1, 1, 2, 2, 3, 3\}$.
2. There are three manufacturing stages, as illustrated in Figure 1. Jobs $J_{1}, J_{2}, J_{5}$ belong to the first stage and have no predecessors. Job $J_{3}$ is in the second stage, with both predecessor and successor jobs. Job $J_{6}$ belongs to the third stage, i.e., the final assembly stage, which starts only after all other jobs have been completed.
3. Following Zhu and Zhou [26], the processing time for each operation is randomly sampled from a uniform distribution $U(100, 200)$.
4. All operations in the final assembly stage must be assigned to a designated machine. For all other operations, two available machines are randomly selected (without repetition) from the remaining machine pool.
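The generation rules above can be sketched as follows. This is an illustration under stated assumptions, not the authors' generator: the function name, the 0-based indices, the choice of machine 0 as the designated assembly machine, and the use of one real-valued processing time shared by both eligible machines of an operation are all hypothetical. The job precedence structure of Figure 1 is not modeled here.

```python
import random

def generate_job_group(num_machines=5, assembly_machine=0, final_job=5, rng=None):
    """Return six jobs; each job is a list of {machine: processing_time} dicts."""
    rng = rng or random.Random()
    ops_per_job = [1, 1, 2, 2, 3, 3]
    rng.shuffle(ops_per_job)  # draws the six counts without replacement
    other_machines = [m for m in range(num_machines) if m != assembly_machine]
    jobs = []
    for job_idx, n_ops in enumerate(ops_per_job):
        operations = []
        for _ in range(n_ops):
            # Assumption: a single time per operation, used on every eligible machine.
            p = rng.uniform(100, 200)
            if job_idx == final_job:
                eligible = [assembly_machine]  # final assembly uses the designated machine
            else:
                eligible = rng.sample(other_machines, 2)  # two distinct machines
            operations.append({m: p for m in eligible})
        jobs.append(operations)
    return jobs
```

Calling `generate_job_group` once per job group and concatenating the results would give an instance of the sizes described below.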

The smallest instance includes two job groups, each consisting of six jobs and 12 operations (as illustrated in Figure 1), and involves five machines, denoted as 2 × 12 × 5. Additionally, we consider two larger instance sizes: 3 × 12 × 5 and 4 × 12 × 5. For each size, 1000 instances are randomly generated.

Configurations. We set the training parameters as follows: training iterations $N_{it} = 1000$ and instance batch sizes $V = 100$, $B = 20$. The model parameters are updated $K = 4$ times per iteration. The training instances are resampled every $N_{re} = 20$ iterations, and the policy is tested every $N_{val} = 10$ iterations. The multi-head GAT module has $l = 2$ embedding layers. Each embedding layer has four attention heads in both the operation attention and machine attention blocks. The input dimensions of each head are $d_{o-in}^{1} = 10$, $d_{m-in}^{1} = 8$ for the first layer and $d_{o-in}^{2} = d_{m-in}^{2} = 128$ for the second layer. The output dimensions are $d_{o-out}^{1} = d_{m-out}^{1} = 32$ for the first layer and $d_{o-out}^{2} = d_{m-out}^{2} = 8$ for the second layer.

In the PPO module, both the actor and critic networks have two hidden layers with dimension $d_{h} = 64$ and tanh activation. In the loss function (24), the coefficients are $ϵ_{1} = 1$, $ϵ_{2} = 0.5$, and $ϵ_{3} = 0.01$; the clipping parameter is $ϵ_{clip} = 0.2$. The learning rate is $lr = 3 \times 10^{-4}$, and the minibatch size is 1024.
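For reference, the settings above can be gathered in one place. The dataclass below is only a convenience sketch; the class and field names are hypothetical, while the values are those reported in this subsection. The final check rests on an assumption: that the four head outputs of the first layer are concatenated, which would explain why the second-layer input dimension is 4 × 32 = 128.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MGPPOConfig:
    n_iterations: int = 1000          # N_it
    n_test_instances: int = 100       # V
    n_train_instances: int = 20       # B
    updates_per_iteration: int = 4    # K
    resample_every: int = 20          # N_re
    validate_every: int = 10          # N_val
    n_gat_layers: int = 2             # l
    n_heads: int = 4
    op_in_dims: tuple = (10, 128)     # per-layer input dims, operation nodes
    machine_in_dims: tuple = (8, 128) # per-layer input dims, machine nodes
    out_dims: tuple = (32, 8)         # per-layer output dims of each head
    hidden_dim: int = 64              # d_h for actor and critic
    loss_coefs: tuple = (1.0, 0.5, 0.01)
    eps_clip: float = 0.2
    learning_rate: float = 3e-4
    minibatch_size: int = 1024

cfg = MGPPOConfig()
# Assumed concatenation of head outputs between the two embedding layers.
assert cfg.n_heads * cfg.out_dims[0] == cfg.op_in_dims[1]
```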

Performance Metric. We use the results obtained by solving the MILP model with an off-the-shelf solver, Gurobi, as the benchmark. With a time limit of 1800 s, Gurobi can achieve an optimal solution for 2 × 12 × 5 instances and a near-optimal one for 3 × 12 × 5 instances. Given an algorithm, we evaluate its performance by the average relative gap between the makespan obtained by the algorithm, $C_{\max}$, and the benchmark, $C_{\max}^{*}$:

$$\mathrm{Gap}^{*} = \left(\frac{C_{\max}}{C_{\max}^{*}} - 1\right) \times 100\% \tag{25}$$
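A short sketch of the metric in (25), averaged over a set of instances, is given below. The function name is hypothetical, and the sketch assumes that the gap is computed per instance and then averaged, which is how we read "average relative gap".

```python
def average_relative_gap(makespans, benchmark_makespans):
    """Mean over instances of (C_max / C_max^* - 1) * 100, in percent."""
    if len(makespans) != len(benchmark_makespans):
        raise ValueError("both sequences must cover the same instances")
    gaps = [
        (c / c_star - 1.0) * 100.0
        for c, c_star in zip(makespans, benchmark_makespans)
    ]
    return sum(gaps) / len(gaps)
```

For example, `average_relative_gap([110, 120], [100, 100])` gives approximately 15. Under this reading, the mean of per-instance gaps generally differs from the gap between mean makespans, so the two should not be expected to match exactly.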

4.2. Effect of Proposed Mechanisms

We performed a full factorial design of experiments to investigate the impacts of several proposed mechanisms on the performance of MGPPO. The factors and levels are as follows:

Information: Single-job, Cross-job;
Node feature: Old, New;
Action space: No waiting, Allow waiting;
N_group: 2,3,4.

The factor “Information” indicates whether the information diffusion is allowed to cross different jobs (cross-job) or restricted to a single job (single-job). The factor “Node feature” indicates whether the node uses features given by [23] (old) or newly designed features (new). The factor “Action space” indicates whether operations are allowed to wait before an idle machine (allow waiting) or not (no waiting). The factor “N_group” indicates the instance size. Each experiment was run for 50 replications. In each replication, the model was first trained and then tested on 1000 instances of the specified size. The gap defined in (25) was used as the response.
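The full factorial design can be enumerated mechanically. The sketch below uses the factor levels listed above; the variable names are illustrative. Iterating with the first factor varying slowest reproduces the OR1 to OR8 ordering of Table 3, and crossing those eight combinations with the three instance sizes gives 24 experimental conditions.

```python
from itertools import product

information = ["Single-job", "Cross-job"]
node_feature = ["Old", "New"]
action_space = ["No waiting", "Allow waiting"]
n_group = [2, 3, 4]

# Label the eight mechanism combinations in the order used by Table 3.
mechanism_combos = {
    f"OR{i}": combo
    for i, combo in enumerate(
        product(information, node_feature, action_space), start=1
    )
}

# Cross every mechanism combination with every instance size.
design = list(product(mechanism_combos, n_group))
```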

The main effect plot is shown in Figure 6a. All three modifications clearly improve performance. Introducing the new node features has the largest effect, followed by allowing cross-job information diffusion and allowing operation waiting. From Figure 6b, we observe that allowing operation waiting is more beneficial for small- and medium-sized instances (i.e., 2 × 12 × 5 and 3 × 12 × 5), while its impact is negligible for 4 × 12 × 5. Allowing operation waiting expands the solution space to include promising solutions; however, as the instance size increases, this advantage is offset by less efficient search.

Figure 7 presents boxplots of the average relative gap for MGPPO. The labels OR1–OR8 correspond to the factor combinations listed in Table 3, with each boxplot summarizing results from 1000 instances. Subfigures (a)–(c) in Figure 7 illustrate the results under the greedy strategy, while (d)–(f) show the results under the sampling strategy. OR1 serves as the baseline configuration without any modifications to the mechanisms, whereas OR8 represents the complete configuration proposed in this paper. As shown, OR8, which incorporates all proposed components, outperforms the other configurations for nearly all problem sizes under both strategies.

4.3. Comparisons with Benchmark Algorithms

We compare MGPPO to four classic priority dispatching rules (PDRs) including First In First Out (FIFO), Most Operations Remaining (MOR), Shortest Processing Time (SPT), and Most Work Remaining (MWKR). These PDRs are combined with machine selection rules to address FJSP-JPC. The implementation details of these rules are provided as follows:

1. First In First Out (FIFO): Select the candidate operation that is ready at the earliest time and assign it to the first available eligible machine.
2. Most Operations Remaining (MOR): Select the candidate operation that has the highest number of remaining successor operations (cross-job), and assign it to a random machine that is immediately available.
3. Shortest Processing Time (SPT): Select the $(O_{i j}, M_{k})$ combination with the shortest processing time.
4. Most Work Remaining (MWKR): Select the candidate operation with the highest average remaining processing time of the remaining successor operations (cross-job), along with a random machine that is immediately available to process the operation.
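Two of the rules above, SPT and FIFO, can be sketched on a toy data structure. Everything about the representation is an assumption made for illustration: each candidate operation is a dict with a `name`, a `ready_time`, and a `proc_times` mapping from eligible machine to processing time, and "first available" is read as the eligible machine with the earliest free time.

```python
def spt_select(candidates):
    """SPT: return the (operation, machine) pair with the shortest processing time."""
    pairs = [
        (op["proc_times"][m], op["name"], m)
        for op in candidates
        for m in op["proc_times"]
    ]
    _, name, machine = min(pairs)  # ties fall back to name, then machine
    return name, machine

def fifo_select(candidates, machine_free_time):
    """FIFO: earliest-ready operation on its eligible machine that frees up first."""
    op = min(candidates, key=lambda o: o["ready_time"])
    machine = min(op["proc_times"], key=lambda m: machine_free_time[m])
    return op["name"], machine
```

MOR and MWKR would follow the same pattern, with the selection key replaced by the number of remaining successor operations or the remaining work, and the machine drawn at random from those immediately available.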

We also adopted a state-of-the-art DRL algorithm named DANIEL [23] as a benchmark, which was proposed for FJSP. We used the same training and validation parameters as in the original paper. Additionally, we compared the proposed method with a genetic algorithm introduced in [9] with an encoding scheme tailored for FJSP-JPC. The genetic algorithm parameters were configured as follows: maximum iterations 100, population size 400, crossover rate 0.8, and mutation rate 0.1.

Table 4 presents the average makespan, Gap*, and computation time of these algorithms on 1000 different instances of specific sizes. For DRL methods, we employed two evaluation methods: greedy and sampling [22]. The former generates a schedule based on the maximum action score, while the latter uses action sampling with (22) to solve an instance 20 times and selects the best one as its result. This allows for utilizing more computational resources to improve the solution quality. Results of the DRL methods are averaged from 50 replications.
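The difference between the two evaluation strategies can be sketched as below. The helper names are hypothetical, and `rollout` stands for any function that builds one complete schedule by sampling actions from the policy and returns a `(makespan, schedule)` pair; it is not part of the published code.

```python
import random

def greedy_action(scores):
    """Greedy strategy: pick the index of the highest-scoring action."""
    return max(range(len(scores)), key=lambda i: scores[i])

def sample_action(probs, rng):
    """Sampling strategy: draw an action index from the policy distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def best_of_n(rollout, n=20):
    """Solve the instance n times and keep the result with the smallest makespan."""
    return min((rollout() for _ in range(n)), key=lambda result: result[0])
```

Because the sampling strategy repeats the rollout 20 times, its runtime is expected to be roughly 20 times that of the greedy strategy, in exchange for a better best-found schedule.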

Gurobi and GA clearly deliver superior performance; however, their computational times are very long, especially for large instances. In contrast, although PDRs require shorter computational times, they exhibit poorer performance. DANIEL offers better performance than PDRs while maintaining a short runtime. The proposed algorithm outperforms both DANIEL and PDRs across all instances, achieving runtimes comparable to DANIEL. Furthermore, when employing the sampling strategy, the performance of MGPPO is significantly enhanced even with only 20 samples.

5. Discussion

The proposed DRL algorithm has proven to be a promising approach to FJSP-JPC. Its advantages stem from the tailored node features, the information diffusion mechanism, and the well-designed action space of MGPPO. The proposed method imposes no special requirements on the scale or structure of the instances. This means that even if jobs and machines in the environment change during the scheduling process, the method remains feasible. Theoretically, our method is applicable not only to FJSP-JPC but also to the classical FJSP (where the set of predecessor jobs of each job is empty). In the future, experiments can be designed to validate the generality of this method.

Unlike methods such as Gurobi and GA, which generate the entire schedule at once, this approach makes incremental decisions based on the real-time status of the shop floor. As a result, even if the manufacturing process is disrupted by disturbances, the algorithm can respond flexibly. Moreover, the negligible decision time for each step makes the proposed algorithm highly applicable in dynamic environments, such as the modular flexible automotive assembly workshop, which is highly relevant to FJSP-JPC. Companies of this kind are frequently confronted with abrupt disruptions such as order changes or machine failures, and our method can effectively enhance production efficiency in such scenarios.

6. Conclusions

In this paper, we propose a deep reinforcement learning approach, named MGPPO, for solving FJSP-JPC. It employs a heterogeneous disjunctive graph to represent the shop floor status and utilizes a multi-head graph attention network to efficiently extract problem features. These features are then fed into an actor–critic framework, where the actor generates operation sequencing and machine assignment decisions simultaneously, while the critic evaluates the policy. The entire model is trained using the PPO algorithm.

Through experiments, we show that the proposed approach consistently outperforms traditional dispatching rules and a state-of-the-art DRL method. Key factors contributing to its performance include improved node features, cross-job information diffusion, and an enhanced action space.

Our study of FJSP-JPC still has limitations, which suggest several directions for future work. First, the experiments in this paper consider no disturbance factors. In actual production environments, multi-source disturbances such as machine failures, order changes, and raw material shortages are often intertwined, which further increases the difficulty of solving scheduling problems. Future work will focus on extending this approach to dynamic manufacturing environments with real-time disruptions and on further enhancing its generalizability across various production scheduling problems.

Second, this paper only considers makespan as the optimization objective, without taking into account other practical production objectives such as equipment energy consumption, load balancing, and total tardiness. Future work can be expanded to multi-objective optimization by incorporating the diversified needs of intelligent manufacturing, designing weighted reward functions or Pareto front search mechanisms to balance indicators like production efficiency and sustainability.

Third, in the common scenarios of FJSP-JPC investigated in this paper (e.g., automotive assembly), companies typically face constraints on transportation resources and must schedule transportation and processing resources jointly. Operations not only require a processing machine to be selected but must also be matched with available transportation equipment, as transportation delays may become production bottlenecks in actual manufacturing. Future work can develop problem models that consider transportation resource constraints.

Author Contributions

Conceptualization, Y.L. and C.Y.; methodology, Y.L. and C.Y.; software, Y.L.; validation, Y.L. and C.Y.; formal analysis, Y.L.; investigation, Y.L.; resources, C.Y.; data curation, Y.L.; writing—original draft preparation, Y.L.; writing—review and editing, Y.L. and C.Y.; visualization, Y.L.; supervision, C.Y.; project administration, C.Y.; funding acquisition, C.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under Grant [72301196] and Tongji University “Fundamental Research Funds for the Central Universities”.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article and the source codes are available on 4 March 2025 at https://github.com/liyishia1/fjsp-jpc-drl.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

FJSP-JPC	Flexible job shop scheduling problem with job precedence constraints
DRL	Deep reinforcement learning
RL	Reinforcement learning
PPO	Proximal policy optimization
FJSP	Flexible job shop scheduling problem
BOM	Bill of Materials
MILP	Mixed integer linear programming
MARL	Multi-agent reinforcement learning
GNN	Graph neural network
PDRs	Priority dispatching rules
MDP	Markov decision process
GAT	Graph attention network
MLP	Multi-layer perceptron
GAE	Generalized advantage estimation
FIFO	First In First Out
MOR	Most Operations Remaining
SPT	Shortest Processing Time
MWKR	Most Work Remaining
GA	Genetic algorithm
ACO	Ant colony optimization
M2SLHO	Multi-micro-swarm leadership hierarchy-based optimization algorithm
TS	Tabu search

Appendix A. Detailed Scheduling Process

Figure A1. An example of the scheduling process.

This section provides details on the scheduling process. Figure A1 illustrates the step-by-step scheduling procedure in a heterogeneous disjunctive graph for a 4-job, 3-machine instance. The first five steps are explained as follows:

1. Initially, the system clock is $T = 0$, and three operations are available for scheduling: $O_{1,1}$, $O_{2,1}$, and $O_{3,1}$. In our approach, an operation becomes active for scheduling when all its preceding operations have either been completed or have started processing.
2. Suppose the selected arc at this step is $\{O_{2,1}, M_{1}\}$, meaning that $O_{2,1}$ is scheduled on machine $M_{1}$. Since $M_{1}$ is idle at this moment, the start time is $st = 0$, and the completion time is calculated as $ct = 7$, given that the processing time of $O_{2,1}$ on $M_{1}$ is 7. The active operation set is then updated: since $O_{2,1}$ has started at $T = 0$, $O_{2,2}$ is now available for scheduling. The active operation set becomes $\{O_{1,1}, O_{2,2}, O_{3,1}\}$.
3. Suppose the selected arc at this step is $\{O_{3,1}, M_{3}\}$. The start and completion times are calculated as 0 and 9, respectively. The active operation set updates to $\{O_{1,1}, O_{2,2}, O_{3,2}\}$.
4. Suppose the selected arc at this step is $\{O_{3,2}, M_{2}\}$. Since $O_{3,2}$ can only start after the completion of $O_{2,1}$ at $T = 7$, its start time is 7, and its completion time is 11. The active operation set updates to $\{O_{1,1}, O_{3,2}\}$.
5. At this point, all machines connected to the active nodes are busy. The system clock is advanced to the earliest time when an operation is completed, i.e., $T = 7$.

This process continues until all operations are completed.
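The timing arithmetic in the steps above can be sketched with two small helpers. Both function names are hypothetical, and the sketch assumes that an operation starts as soon as both its ready time (imposed by its predecessors) and the free time of the chosen machine have been reached.

```python
def schedule_operation(ready_time, machine_free_time, processing_time):
    """Return (start, completion) for an operation placed on a machine."""
    start = max(ready_time, machine_free_time)
    return start, start + processing_time

def next_clock(current_time, completion_times):
    """Advance the clock to the earliest completion strictly after current_time."""
    return min(t for t in completion_times if t > current_time)

# Step 2: O_{2,1} on idle M_1 with processing time 7.
print(schedule_operation(0, 0, 7))   # (0, 7)
# Step 4: O_{3,2} on idle M_2, ready at 7, processing time 11 - 7 = 4.
print(schedule_operation(7, 0, 4))   # (7, 11)
# Step 5: operations complete at 7, 9, and 11, so the clock moves to 7.
print(next_clock(0, [7, 9, 11]))     # 7
```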

References

1. Dauzère-Pérès, S.; Ding, J.; Shen, L.; Tamssaouet, K. The flexible job shop scheduling problem: A review. Eur. J. Oper. Res. 2024, 314, 409–432.
2. Liu, A.; Luh, P.B.; Yan, B.; Bragin, M.A. A Novel Integer Linear Programming Formulation for Job-Shop Scheduling Problems. IEEE Robot. Autom. Lett. 2021, 6, 5937–5944.
3. Shen, L.; Dauzère-Pérès, S.; Neufeld, J.S. Solving the flexible job shop scheduling problem with sequence-dependent setup times. Eur. J. Oper. Res. 2018, 265, 503–516.
4. Echeverria, I.; Murua, M.; Santana, R. Leveraging constraint programming in a deep learning approach for dynamically solving the flexible job-shop scheduling problem. Expert Syst. Appl. 2025, 265, 125895.
5. Cheng, L.; Tang, Q.; Zhang, L.; Li, Z. Inventory and total completion time minimization for assembly job-shop scheduling considering material integrity and assembly sequential constraint. J. Manuf. Syst. 2022, 65, 660–672.
6. Hao, L.; Zou, Z.; Liang, X. Solving multi-objective energy-saving flexible job shop scheduling problem by hybrid search genetic algorithm. Comput. Ind. Eng. 2025, 200, 110829.
7. Li, Y.; Tao, Z.; Wang, L.; Du, B.; Guo, J.; Pang, S. Digital twin-based job shop anomaly detection and dynamic scheduling. Robot. Comput.-Integr. Manuf. 2023, 79, 102443.
8. Xu, Y.; Zhang, M.; Yang, M.; Wang, D. Hybrid quantum particle swarm optimization and variable neighborhood search for flexible job-shop scheduling problem. J. Manuf. Syst. 2024, 73, 334–348.
9. Li, X.; Gao, L. An effective hybrid genetic algorithm and tabu search for flexible job shop scheduling problem. Int. J. Prod. Econ. 2016, 174, 93–110.
10. Xie, J.; Li, X.; Gao, L.; Gui, L. A hybrid genetic tabu search algorithm for distributed flexible job shop scheduling problems. J. Manuf. Syst. 2023, 71, 82–94.
11. He, N.; Sahnoun, M.; Zhang, D.; Bettayeb, B. A hybrid approach using ant colony optimisation for integrated scheduling of production and transportation tasks within flexible manufacturing systems. Comput. Oper. Res. 2025, 180, 107059.
12. Li, X.; Guo, A.; Yin, X.; Tang, H.; Wu, R.; Zhao, Q.; Li, Y.; Wang, X. A Q-learning improved differential evolution algorithm for human-centric dynamic distributed flexible job shop scheduling problem. J. Manuf. Syst. 2025, 80, 794–823.
13. Luo, S. Dynamic scheduling for flexible job shop with new job insertions by deep reinforcement learning. Appl. Soft Comput. 2020, 91, 106208.
14. Li, H.; Cai, R.; Liu, N.; Lin, X.; Wang, Y. Deep reinforcement learning: Algorithm, applications, and ultra-low-power implementation. Nano Commun. Netw. 2018, 16, 81–90.
15. Johnson, D.; Chen, G.; Lu, Y. Multi-Agent Reinforcement Learning for Real-Time Dynamic Production Scheduling in a Robot Assembly Cell. IEEE Robot. Autom. Lett. 2022, 7, 7684–7691.
16. Luo, S.; Zhang, L.; Fan, Y. Real-Time Scheduling for Dynamic Partial-No-Wait Multiobjective Flexible Job Shop by Deep Reinforcement Learning. IEEE Trans. Autom. Sci. Eng. 2022, 19, 3020–3038.
17. Du, Y.; Li, J.; Li, C.; Duan, P. A Reinforcement Learning Approach for Flexible Job Shop Scheduling Problem With Crane Transportation and Setup Times. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 5695–5709.
18. Han, B.; Yang, J.J. A Deep Reinforcement Learning Based Solution for Flexible Job Shop Scheduling Problem. Int. J. Simul. Model. 2021, 20, 375–386.
19. Xu, S.; Li, Y.; Li, Q. A Deep Reinforcement Learning Method Based on a Transformer Model for the Flexible Job Shop Scheduling Problem. Electronics 2024, 13, 3696.
20. Park, J.; Bakhtiyar, S.; Park, J. ScheduleNet: Learn to solve multi-agent scheduling problems with reinforcement learning. arXiv 2021, arXiv:2106.03051.
21. Park, J.; Chun, J.; Kim, S.H.; Kim, Y.; Park, J. Learning to schedule job-shop problems: Representation and policy learning using graph neural network and reinforcement learning. Int. J. Prod. Res. 2021, 59, 3360–3377.
22. Song, W.; Chen, X.; Li, Q.; Cao, Z. Flexible Job-Shop Scheduling via Graph Neural Network and Deep Reinforcement Learning. IEEE Trans. Ind. Inform. 2023, 19, 1600–1610.
23. Wang, R.; Wang, G.; Sun, J.; Deng, F.; Chen, J. Flexible Job Shop Scheduling via Dual Attention Network-Based Reinforcement Learning. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 3091–3102.
24. Lei, K.; Guo, P.; Zhao, W.; Wang, Y.; Qian, L.; Meng, X.; Tang, L. A multi-action deep reinforcement learning framework for flexible Job-shop scheduling problem. Expert Syst. Appl. 2022, 205, 117796.
25. Xiong, H.; Fan, H.; Jiang, G.; Li, G. A simulation-based study of dispatching rules in a dynamic job shop scheduling problem with batch release and extended technical precedence constraints. Eur. J. Oper. Res. 2017, 257, 13–24.
26. Zhu, Z.; Zhou, X. Flexible job-shop scheduling problem with job precedence constraints and interval grey processing time. Comput. Ind. Eng. 2020, 149, 106781.
27. Zhang, S.; Li, X.; Zhang, B.; Wang, S. Multi-objective optimisation in flexible assembly job shop scheduling using a distributed ant colony system. Eur. J. Oper. Res. 2020, 283, 441–460.
28. Lin, W.; Deng, Q.; Han, W.; Gong, G.; Li, K. An effective algorithm for flexible assembly job-shop scheduling with tight job constraints. Int. Trans. Oper. Res. 2022, 29, 496–525.
29. Zhang, C.; Song, W.; Cao, Z.; Zhang, J.; Tan, P.S.; Chi, X. Learning to Dispatch for Job Shop Scheduling via Deep Reinforcement Learning. Adv. Neural Inf. Process. Syst. 2020, 33, 1621–1632.
30. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 261–272.
31. Lee, J.H.; Kim, H.J. Graph-Based Imitation Learning for Real-Time Job Shop Dispatcher. IEEE Trans. Autom. Sci. Eng. 2025, 22, 8593–8606.
32. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347.

Figure 1. The tree-like structure of FJSP-JPC.

Figure 2. Heterogeneous disjunctive graph of FJSP-JPC. The dashed line indicates possible machine choices for unscheduled operations, while the solid line represents the actual machine assignment for scheduled operations. (a) An FJSP-JPC instance. (b) A possible solution.

Figure 3. An illustration of the attention mechanism. (a) Attention by an operation node ($O_{41}$) on its neighborhood ($O_{12}$ and $O_{22}$). (b) Attention by a machine node ($M_{1}$) on its neighborhood ($M_{2}$ and $M_{3}$).

Figure 4. Architecture of the multi-head GAT.

Figure 5. Architecture of the actor network and the critic network.

Figure 6. Results of the design of experiments. (a) Main effect plot. (b) Interaction plot.

Figure 7. Performance of factor combinations OR1 to OR8.

Table 1. Existing methods for the job shop scheduling problem.

Research	State	Algorithm	Objective	Problem	Special Constraint
Li and Gao [9]	-	GA, TS	Makespan	FJSP
Xie et al. [10]	-	GA, TS	Makespan	FJSP
He et al. [11]	-	ACO	Makespan	FJSP	Transportation
Johnson et al. [15]	Vector	DDQN	Makespan	FJSP
Luo et al. [16]	Vector	PPO	Total tardiness, machine utilization, machine workload	FJSP	Partial-no-wait constraints
Du et al. [17]	Vector	DDQN	Makespan, total energy consumption	FJSP	Transportation, preparation time
Luo [13]	Vector	DDQN	Total tardiness	FJSP
Han and Yang [18]	Vector	REINFORCE	Makespan	FJSP
Xu et al. [19]	Vector	PPO	Makespan	FJSP
Park et al. [21]	Disjunctive graph	PPO	Makespan	JSP
Song et al. [22]	Disjunctive graph	PPO	Makespan	FJSP
Wang et al. [23]	Disjunctive graph	PPO	Makespan	FJSP
Lei et al. [24]	Disjunctive graph	PPO	Makespan	FJSP
Xiong et al. [25]	-	PDRs	Total tardiness, tardy jobs rate	JSP	Technological precedence constraints
Zhu and Zhou [26]	-	M2SLHO	Makespan	FJSP	Job precedence constraints
Zhang et al. [27]	-	ACO	Makespan, total tardiness, total workload	FJSP	Job precedence constraints
Lin et al. [28]	-	GA	Makespan	FJSP	Job precedence constraints
This paper	Disjunctive graph	PPO	Makespan	FJSP	Job precedence constraints

Table 2. Features for operation nodes, machine nodes, and undirected arcs.

ID	Description
Features of operation node $O_{i j}$
(o1)	Scheduling flag: 0 if $O_{i j}$ is unscheduled; otherwise, 1.
(o2)	Estimated completion time (see Equation (20)).
(o3)	Minimum processing time among all machines.
(o4)	Span of processing time among all machines.
(o5)	Average processing time among all machines.
(o6)	Waiting time: The time elapsed since the ready time of $O_{i j}$ until the current time T.
(o7)	Remaining processing time: From T to estimated completion time (0 if unscheduled).
(o8)	Number of remaining operations from $O_{i j}$ to End.
(o9)	Remaining work: Sum of average processing times of unscheduled operations from $O_{i j}$ to End.
(o10)	Number of machines that $O_{i j}$ can be processed on.
Features of machine node $M_{k}$
(m1)	Number of candidate operations that $M_{k}$ can process.
(m2)	Number of unscheduled operations that $M_{k}$ can process.
(m3)	Minimum processing time of operations that $M_{k}$ can process.
(m4)	Average processing time of operations that $M_{k}$ can process.
(m5)	Waiting time: The time elapsed since the machine $M_{k}$ became free until the current time T.
(m6)	Remaining processing time: The duration from the current time T until $M_{k}$ becomes free (0 if $M_{k}$ is free).
(m7)	Free time: The moment when $M_{k}$ is free.
(m8)	Working flag: 0 if $M_{k}$ is free; otherwise, 1.
Features of undirected arc $(O_{i j}, M_{k})$
(a1)	Processing time $p_{i j k}$ .
(a2)	Ratio of $p_{i j k}$ to the maximum processing time of $O_{i j}$ .
(a3)	Ratio of $p_{i j k}$ to the maximum processing time of candidate operations that $M_{k}$ can process at the current time.
(a4)	Ratio of $p_{i j k}$ to the maximum processing time of unscheduled operations.
(a5)	Ratio of $p_{i j k}$ to the maximum processing time of unscheduled operations that $M_{k}$ can process.
(a6)	Ratio of $p_{i j k}$ to the maximum processing time of all feasible $(O_{i j}, M_{k})$ .
(a7)	Ratio of $p_{i j k}$ to the remaining work of $O_{i j}$ .
(a8)	Summation of the “Waiting time” feature of $O_{i j}$ and $M_{k}$ .

Table 3. Notations for factor combinations.

Exp. ID	Information	Node Feature	Action Space
OR1	Single-job	Old	No waiting
OR2	Single-job	Old	Allow waiting
OR3	Single-job	New	No waiting
OR4	Single-job	New	Allow waiting
OR5	Cross-job	Old	No waiting
OR6	Cross-job	Old	Allow waiting
OR7	Cross-job	New	No waiting
OR8	Cross-job	New	Allow waiting

Table 4. Performance of different algorithms on instances of size (Number of job groups × Number of operations per group × Number of machines).

Size	Metric	Gurobi ¹	GA ²	MGPPO (greedy)	DANIEL (greedy)	MGPPO (sampling)	DANIEL (sampling)	FIFO (PDR)	MOR (PDR)	SPT (PDR)	MWKR (PDR)
2 × 12 × 5	$C_{\max}$ (s)	1193.16 (100%)	1207.84 (48.3%)	1323.23	1445.10	1269.43	1327.28	1471.01	1597.04	1943.43	1593.45
	Time (s)	0.891	51.801	0.043	0.041	0.857	0.822	0.095	0.099	0.089	0.096
	Gap*	-	1.23%	11.05%	21.59%	6.44%	11.60%	23.74%	34.61%	64.48%	34.24%
3 × 12 × 5	$C_{\max}$ (s)	1492.24 (92.5%)	1537.31 (12.5%)	1712.46	1955.90	1633.42	1773.81	1993.91	2167.73	2727.02	2166.79
	Time (s)	161.346	127.597	0.151	0.145	3.019	3.008	0.287	0.291	0.290	0.266
	Gap*	-	3.02%	14.74%	31.49%	9.40%	19.18%	33.98%	45.87%	84.04%	45.82%
4 × 12 × 5	$C_{\max}$ (s)	1816.46 (47.9%)	1872.23 (13.5%)	2109.14	2485.37	2007.18	2251.23	2562.07	2754.27	3447.57	2747.28
	Time (s)	1047.276	229.356	0.356	0.352	7.112	7.046	0.580	0.542	0.416	0.427
	Gap*	-	3.07%	15.98%	37.06%	10.36%	24.11%	41.23%	52.02%	90.57%	51.66%

¹ (.%): percentage of instances solved optimally within 1800 s. ² (.%): percentage of instances solved equally or better than Gurobi within 1800 s.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
