1. Introduction
Dynamic flexible job shop scheduling (DFJSS) is a central problem in manufacturing scheduling, where a set of jobs with multiple operations must be assigned and sequenced on alternative machines while respecting precedence constraints, processing times, and machine eligibility [1]. In addition to these standard constraints, dynamic events (for example, new job arrivals, machine breakdowns, urgent orders, or variable processing times) pose significant challenges and make DFJSS much more complex than the static case [2]. Efficient scheduling in such environments requires online rescheduling strategies capable of balancing multiple objectives, including minimizing makespan, maximizing machine utilization, reducing work-in-process inventory, and ensuring robustness against disruptions [3,4].
Traditional optimization methods, including exact solvers and classical mixed-integer programming, can guarantee optimality but are computationally expensive and impractical for real-time or large-scale dynamic instances [5]. Heuristic and metaheuristic methods, such as priority dispatching rules, genetic algorithms (GA), simulated annealing, and tabu search, offer faster computation and adaptability but often fail to capture the complex temporal and relational dependencies among jobs and machines in dynamic scenarios [6]. Furthermore, many conventional methods are designed for single-objective optimization or rely on manually tuned parameters, limiting their applicability in highly dynamic or uncertain environments [7].
In recent years, Deep Reinforcement Learning (DRL) has gained attention as a promising paradigm for DFJSS, as it enables agents to learn scheduling policies through repeated interactions with simulated or real shop-floor environments [8]. DRL methods can adapt to dynamic disturbances without re-deriving the optimization model for each change, and they can, in principle, learn to balance multiple objectives in an online manner [9]. However, value-based methods such as deep Q-networks (DQN) face several challenges. Frequent updates of the target network may cause instability during training (the moving-target problem), and many DRL-based schedulers still simplify real-world constraints and optimize only a single objective, limiting their robustness and practical effectiveness in large-scale DFJSS applications [10,11].
To mitigate these issues, hybrid frameworks combining evolutionary algorithms (e.g., GA) with DQN have been proposed. In these GA+DQN approaches, GA is often used to explore high-quality solutions that serve as training targets for the DQN agent, aiming to accelerate learning and improve solution quality. While promising, existing GA+DQN frameworks typically rely on fixed genetic parameters and unsmoothed DQN targets, which can still cause oscillations or instability during training in highly dynamic environments.
Building on the conventional GA+DQN framework, we propose two key enhancements, an adaptive genetic algorithm and a dynamic target smoothing mechanism, to improve training stability in DFJSS scenarios. Additionally, a Graph Convolutional Network (GCN) with self-attention is used to extract rich features from job and machine states for improved state representation.
The original contributions of this work are summarized as follows:
Adaptive GA-based target generation: We introduce an adaptive GA module that dynamically adjusts crossover and mutation rates to produce high-quality DQN targets, significantly improving training stability compared to standard DQN.
Dynamic target smoothing: By applying sliding-window and exponential smoothing to GA outputs, we further stabilize the training targets, reducing variance and boosting training efficiency.
Extensive experimental validation: On multiple DFJSS instances from the Brandimarte dataset, our method outperformed traditional GA+DQN (QNGA) and pure DQN variants (DQN, DDQN, Dueling DQN), demonstrating superior makespan reduction and robustness.
The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 details the proposed algorithm framework. Section 4 presents the experimental design. Section 5 analyzes the results, and Section 6 concludes the study and outlines future research directions.
2. Related Work
2.1. Overview of Dynamic Flexible Job Shop Scheduling (DFJSS)
Dynamic flexible job shop scheduling (DFJSS) addresses scheduling optimization in manufacturing systems where dynamic events, such as new job arrivals, may occur unpredictably during production. These disruptions require the scheduler to continuously adjust decisions in real time. In DFJSS, each job consists of multiple operations, and each operation can be processed on several candidate machines with different processing times. Common objectives include minimizing makespan, maximizing machine utilization, and balancing workloads across resources. Achieving these goals requires real-time schedule adjustments in response to environmental changes [12]. Existing solution methods fall into three main categories:
Exact optimization algorithms (e.g., integer programming), which guarantee global optimality but suffer from prohibitive computational complexity in dynamic settings [13].
Approximate meta-heuristic algorithms, which can find high-quality solutions within reasonable time but often fail to react swiftly to frequent dynamic events [14].
Dispatching rules, which are valued for their simplicity and real-time decision capability. Typical Priority Dispatching Rules (PDRs), such as Longest Processing Time (LPT) and Weighted Shortest Processing Time (WSPT), assign priorities to operations and machines and make decisions by sorting according to these priorities. However, PDRs are usually hand-crafted and lack adaptability to complex, changing manufacturing environments [15].
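As a minimal sketch of how such priority rules operate, the following Python snippet selects the next operation from a machine queue under LPT and WSPT. The `Operation` record, its `weight` field, and the sample values are illustrative assumptions and are not taken from the benchmark data.

```python
from dataclasses import dataclass


@dataclass
class Operation:
    job: str
    proc_time: float
    weight: float = 1.0  # hypothetical job weight, used only by WSPT


def lpt(queue):
    """Longest Processing Time: pick the operation with the largest processing time."""
    return max(queue, key=lambda op: op.proc_time)


def wspt(queue):
    """Weighted Shortest Processing Time: pick the largest weight / processing-time ratio."""
    return max(queue, key=lambda op: op.weight / op.proc_time)


# Illustrative queue of operations waiting for one machine
queue = [
    Operation("J1", 4.0, 1.0),
    Operation("J2", 2.0, 1.0),
    Operation("J3", 6.0, 2.0),
]
lpt_choice = lpt(queue)
wspt_choice = wspt(queue)
```

Both rules reduce to a single sort key, which is why they are fast enough for real-time use but cannot adapt that key to the current shop-floor state.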
Under a dynamic scheduling framework, DFJSS can be formulated as a Markov decision process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $P$ the state-transition probabilities, $R$ the immediate reward function, and $\gamma$ the discount factor. The action-value function $Q^{*}(s, a)$ satisfies the Bellman optimality equation:
$$Q^{*}(s, a) = \mathbb{E}_{s'}\left[ r(s, a) + \gamma \max_{a'} Q^{*}(s', a') \right]$$
where $r(s, a)$ is the reward received for taking action $a$ in state $s$, and $s'$ is the next state. This MDP formulation allows reinforcement learning algorithms to learn scheduling policies through interactions with a simulated environment [16].
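The one-step Bellman backup can be illustrated numerically. The sketch below computes the sampled target "reward plus discounted best next Q-value" for a single transition; the function name, the discount factor of 0.9, and the sample Q-values are assumptions chosen only for illustration.

```python
def bellman_target(reward, next_q_values, gamma=0.9, terminal=False):
    """Sampled Bellman optimality target: r + gamma * max_a' Q(s', a')."""
    if terminal or not next_q_values:
        # No successor actions: the target is the immediate reward
        return reward
    return reward + gamma * max(next_q_values)


# Illustrative transition: reward -2, three candidate actions in the next state
target = bellman_target(reward=-2.0, next_q_values=[1.0, 3.0, 2.0], gamma=0.9)
terminal_target = bellman_target(reward=-2.0, next_q_values=[], gamma=0.9)
```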
2.2. Problem Formulation of DFJSS
The DFJSS problem considered in this study involves a set of jobs J and a set of machines M. Each job consists of a sequence of operations. Each operation can be processed on one of several eligible machines, with a processing time that depends on the machine chosen. Jobs arrive dynamically over time; each job j has a release time, which denotes the earliest moment it becomes available for processing. Whenever a new job arrives or a disruption occurs, the system triggers rescheduling to adapt the current schedule to the new environment.
2.2.1. Static FJSS MILP Formulation
The following MILP formulation models the static scheduling problem and defines the constraints and objective function.
For any distinct
and
:
2.2.2. Notation and Remarks
Table 1 summarizes the notation used in the MILP formulation. The proposed MILP captures the core constraints of the static FJSS problem, including machine assignment, job precedence, release times, and machine non-overlap. It serves as an offline benchmark for evaluating the optimality of the proposed scheduling policy in the dynamic environment. Accordingly, the DFJSS can be formally defined as the problem of determining, at each decision epoch $t$, a dispatching action that minimizes the expected makespan $\mathbb{E}[C_{\max}]$ over time, subject to the constraints (3)–(10) and the stochastic job-arrival process.
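The four constraint families named above (machine assignment, job precedence, release times, and machine non-overlap) can be made concrete with a small feasibility check. The following is a sketch under assumed data structures: the `proc` and `release` dictionaries, the tuple layout of a schedule entry, and the toy instance are all hypothetical, and the function is not the MILP itself.

```python
def check_schedule(schedule, proc, release):
    """Return (feasible, makespan) for entries of the form (job, k, machine, start).

    proc[(job, k)] maps each eligible machine to its processing time;
    release[job] is the earliest start time of the job's first operation.
    """
    start, end, by_machine = {}, {}, {}
    for job, k, machine, s in schedule:
        if machine not in proc[(job, k)]:
            return False, None  # machine eligibility violated
        start[(job, k)] = s
        end[(job, k)] = s + proc[(job, k)][machine]
        by_machine.setdefault(machine, []).append((s, end[(job, k)]))
    for (job, k), s in start.items():
        # First operation waits for the release time, later ones for the predecessor
        earliest = release[job] if k == 0 else end.get((job, k - 1))
        if earliest is None or s < earliest:
            return False, None  # release time or precedence violated
    for intervals in by_machine.values():
        intervals.sort()
        for (_, e1), (s2, _) in zip(intervals, intervals[1:]):
            if s2 < e1:
                return False, None  # two operations overlap on one machine
    return True, max(end.values())


# Toy instance: two jobs, two machines
proc = {("J1", 0): {"M1": 3, "M2": 5}, ("J1", 1): {"M2": 2}, ("J2", 0): {"M1": 4}}
release = {"J1": 0, "J2": 1}
good = [("J1", 0, "M1", 0), ("J1", 1, "M2", 3), ("J2", 0, "M1", 3)]
bad = [("J1", 0, "M1", 0), ("J2", 0, "M1", 2)]  # overlaps on M1
good_result = check_schedule(good, proc, release)
bad_result = check_schedule(bad, proc, release)
```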
2.3. Methods for Solving DFJSS
2.3.1. Dispatching Rules and Hyper-Heuristic Strategies
To overcome the limited adaptability of fixed dispatching rules, researchers have developed two broad hyper-heuristic approaches—learn then select and learn then generate—which either choose among existing rules or automatically evolve new ones.
2.3.2. Evolutionary Programming-Based Rule Generation
Genetic Programming (GP) for Rule Generation. GP evolves expression trees from basic operators and features to automatically generate dispatching rules. Nguyen et al. proposed an archive-based GP framework that evolves high-performance rules for multi-objective DFJSS and preserves elite strategies [20]; Zhang et al. used multi-tree GP to produce interpretable composite rules, showing superior performance under dynamic fault scenarios [21].
Hybrid GP and Reinforcement Learning. Combining GP’s global search with RL’s online feedback has gained attention. Some approaches use GP to generate candidate heuristics and then an RL agent to select and refine them online [22]. Others feed GP individuals’ fitness back as rewards to an RL system, dynamically adjusting GP’s crossover and mutation rates based on Q-values to enhance diversity and convergence speed [23].
2.3.3. Deep Reinforcement Learning for DFJSS
Deep Reinforcement Learning (DRL), notably DQN and its variants, has become a powerful tool for DFJSS. By interacting with the environment, a DRL agent can learn to make scheduling decisions under partial information. DQN approximates the Q-function using a neural network, enabling effective policy learning in high-dimensional state spaces. For instance, Turgut et al. trained a double DQN via discrete-event simulation to select dispatching rules automatically, reducing latency and cost [24]. Recent studies have demonstrated that advanced DQN-based models, such as hybrid deep Q networks, can effectively handle dynamic events in DFJSS by enabling real-time schedule generation and adaptation [25,26].
However, the application of DQN in DFJSS faces several challenges, the most significant of which is the training instability caused by frequent updates of the target network [27]. In the conventional DQN framework, the update frequency of the target network directly affects training stability, and updating too often or too infrequently can both lead to performance oscillations. To address this issue, some studies have proposed weighted target-network update strategies or optimized update schedules [28], yet the fundamental reliance on a separate target network remains a drawback.
2.3.4. Graph Convolutional Networks in Scheduling
As state representations shift from vectors to graphs, Graph Convolutional Networks (GCNs) have been adopted for scheduling problems due to their ability to embed structural and feature information. In DFJSS, the shop-floor can be modeled as a bipartite or heterogeneous graph where nodes represent operations or machines, and edges encode precedence or assignment feasibility. A typical GCN layer performs:
$$H^{(l+1)} = \sigma\left( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)} \right)$$
where $\tilde{A} = A + I$ and $\tilde{D}$ is its degree matrix.
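To make the propagation rule concrete, the sketch below applies one symmetric-normalized GCN layer to a three-node path graph using plain Python lists. The helper names, the ReLU activation, the undirected toy adjacency matrix, and the weight values are assumptions for illustration; a practical implementation would use a tensor library.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj, h, w):
    """One GCN layer: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adj)
    # Add self-loops, then compute degrees of the augmented adjacency matrix
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    # Symmetric normalization; degrees are >= 1 because of the self-loops
    a_hat = [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in z]  # ReLU


adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]   # path graph 0 - 1 - 2
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # two input features per node
w = [[1.0, -1.0], [0.5, 2.0]]             # hypothetical weights
out = gcn_layer(adj, h, w)
```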
Ren et al. built a multi-type knowledge graph incorporating operations, machines, and constraints, and combined GCN embeddings with DQN for scheduling. Gui et al. proposed subgraph compression to speed up inference under frequent dynamic events [29]. Liu et al. integrated multi-head self-attention with heterogeneous GCNs to fuse operation and machine features [30], and Jing et al. designed a multi-agent RL framework where agents correspond to graph nodes [31]. These methods improve representational power but retain traditional target-network limitations and incur high online computation when stacking GCN layers with attention. Our work inherits subgraph GCN embedding advantages while innovatively using an adaptive GA to generate high-quality fixed targets and a dynamic smoothing mechanism to stabilize training.
2.3.5. Genetic Algorithms in DFJSS
Genetic algorithms (GAs) simulate biological evolution to solve optimization problems and are widely used for DFJSS. By evolving a population of candidate schedules, GAs can handle complex constraints and multiple objectives. GAs’ global search capability helps avoid local optima. For example, Wei et al. developed a hybrid GA–simulated annealing algorithm for minimizing makespan [32], and Huang et al. proposed a hybrid genetic algorithm and particle swarm optimization approach for multi-objective flexible job shop scheduling problems, demonstrating significant improvements in solution quality and convergence speed under dynamic conditions [33].
A key challenge is efficient encoding of schedules. Classic GA encodings represent entire job sequences and machine assignments in chromosomes; Li et al. used a two-part encoding for sequence and machine choice [34], while Ning et al. proposed a three-segment encoding including time vectors [35]. These global encodings work well for static FJSS but are inefficient for DFJSS, where only partial schedules at decision points need updating. Over-descriptive encodings inflate search space and reduce efficiency.
2.4. Summary and Research Motivation
In summary, DFJSS presents complex, highly dynamic challenges. Dispatching rules are simple but lack adaptability. DRL methods can learn policies but often suffer from target-network instability. GCNs improve state representation, yet they do not resolve the issue of volatile training targets. Motivated to address these gaps, this study proposes an adaptive GA-based target generation mechanism to improve DQN training stability, aiming to deliver a robust, adaptive scheduling method for dynamic environments.
3. Methodology
3.1. Overall Framework
This work presents a novel DFJSS approach that integrates an adaptive genetic algorithm, dynamic target smoothing, and a deep Q-network. The overall framework, shown in Figure 1, consists of five cooperative modules that jointly mitigate unstable target generation, improve training efficiency, and enhance feature extraction under dynamic scheduling conditions:
Problem Modeling and State Representation. The DFJSS problem is first modeled as a Markov decision process, with a defined state space, action space, and reward function. The shop-floor state at each decision point is represented by operation nodes, machine nodes, and their dependencies. A graph convolutional network embeds both static information (e.g., processing times, machine availability) and dynamic information (e.g., operation sequence, machine idle times) into a rich feature vector for subsequent scheduling decisions.
Deep Q-Network (DQN) Module. The DQN module learns an optimal scheduling policy by approximating the action-value function with a neural network. At each decision step, the DQN selects the highest-value action based on the current state embedding. A separate target network is maintained to stabilize temporal-difference updates and ensure smooth convergence of the Q-value estimates.
Adaptive Genetic Algorithm Module. GA is employed to generate high-quality training targets for the DQN, mitigating instability from conventional target network updates. Each individual in the population encodes a candidate schedule, and fitness is evaluated via a function that balances immediate reward and remaining workload. Unlike fixed-parameter GAs, our method adaptively adjusts the crossover probability and mutation probability during evolution to enhance search diversity and global exploration.
Dynamic Target Smoothing Module. To further improve target stability, the raw GA outputs undergo a second smoothing stage. Two techniques are used: sliding-window smoothing, which computes the mean of the most recent W GA targets to suppress short-term oscillations; and exponential smoothing, which recursively blends the current raw target with the previous smoothed value using a smoothing coefficient. The resulting smoothed target is fed into DQN training to reduce the negative effects of target volatility.
Training and Optimization Process. The DQN and GA operate in a closed-loop framework. At initialization, the DQN evaluation and target networks as well as the GA population and parameters are set. In each iteration, the DQN selects and executes an action, observes the reward, and transitions to the next state. Concurrently, the GA evolves its population and applies smoothing to generate a stable training target. This target, together with the observed reward, forms the DQN’s training objective for updating network parameters. The target network is synchronized with the evaluation network at fixed intervals to maintain stability. This multi-module training loop continuously refines the scheduling policy.
3.2. State Representation
In DFJSS, the production environment exhibits strong spatial–temporal coupling, where each job consists of multiple operations, and each operation can be processed by several alternative machines. To capture these complex structural dependencies, a graph convolutional network is employed to encode the dynamic shop-floor state, as illustrated in Figure 2.
Each operation is modeled as a node, and the connections between operations form a directed acyclic graph (DAG). Two types of edges are defined:
Technological edges: connecting each operation to its successor in the same job, representing process sequence constraints.
Machine edges: connecting operations that compete for the same machine, representing resource conflicts.
Let $G_t = (V_t, E_t)$ denote the scheduling graph at time step $t$, where $V_t$ is the set of all operations and $E_t$ the set of edges. For each node $v \in V_t$, the initial feature vector consists of the ready time, the due date, the number of remaining operations of its job, the processing time, and the machine identifier (one-hot encoded).
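The GCN aggregates neighborhood information according to:
$$h_v^{(l+1)} = \sigma\Big( \sum_{u \in \mathcal{N}(v) \cup \{v\}} \frac{1}{c_{vu}}\, h_u^{(l)} W^{(l)} + b^{(l)} \Big), \qquad \text{i.e.,} \quad H^{(l+1)} = \sigma\big( \hat{A} H^{(l)} W^{(l)} + b^{(l)} \big) \tag{13}$$
where $\mathcal{N}(v)$ is the neighbor set of node $v$, $c_{vu}$ is the normalization coefficient, and $\sigma$ is the ReLU activation. $W^{(l)}$ denotes the trainable weight matrix at layer $l$, $b^{(l)}$ is the bias vector, and $\hat{A}$ represents the normalized adjacency matrix with self-loops used during message propagation. These definitions ensure that Equation (13) is consistent with the standard formulation of graph convolutional networks.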
After L layers of propagation, the node embedding is obtained.
The overall graph embedding is then computed by a readout function over the node embeddings, providing a compact state representation that is fed into the policy network.
This encoding allows the agent to perceive both job precedence and machine resource constraints, ensuring a consistent decision space across dynamic job arrivals and machine breakdowns.
3.3. Reward Structure
In the proposed MyQNGA framework, the reward function directly reflects the scheduling objective of minimizing the makespan. At each decision step $t$, after selecting a dispatching action, the immediate reward is defined in terms of the change in predicted makespan caused by the chosen action. This dense, incremental reward provides more informative feedback than sparse terminal rewards and accelerates convergence under dynamic DFJSS conditions.
To capture local shop-floor dynamics, two optional penalty terms can be added: one for induced idleness on critical machines, and one for the increase in local waiting time for affected operations. The small coefficients on these terms ensure that global makespan minimization remains the primary learning objective while improving local responsiveness.
This reward design balances global efficiency (the makespan term) with short-term responsiveness (the idleness and waiting-time penalties), thereby improving the stability and adaptability of the DQN when trained with GA-evolved targets.
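One plausible form of such a reward is sketched below. It assumes that the reward is the negative change in predicted makespan, with the two optional penalties subtracted using small weights. The sign convention, the function and parameter names, and the weights of 0.05 are assumptions for illustration and are not the values used in the experiments.

```python
def step_reward(prev_makespan, new_makespan, idle_increase=0.0, wait_increase=0.0,
                w_idle=0.05, w_wait=0.05):
    """Dense reward: penalize any growth in predicted makespan, plus small local penalties.

    idle_increase: induced idleness on critical machines (assumed non-negative)
    wait_increase: added local waiting time for affected operations (assumed non-negative)
    """
    delta = new_makespan - prev_makespan  # change in predicted makespan
    return -delta - w_idle * idle_increase - w_wait * wait_increase


# The action lengthens the predicted makespan from 40 to 43 time units
r_plain = step_reward(40.0, 43.0)
r_penalized = step_reward(40.0, 43.0, idle_increase=2.0, wait_increase=4.0)
```

Keeping the penalty weights small relative to the makespan term means the local terms act mainly as tie-breakers between actions with similar makespan impact.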
3.4. Adaptive GA-Based Target Generation
To address instability in DQN training and enhance adaptability in dynamic environments, an adaptive genetic algorithm is designed to evolve auxiliary target Q-values, as illustrated in Figure 3. The GA refines candidate action-value vectors through evolutionary operators, providing stable targets that mitigate learning oscillations under highly dynamic DFJSS conditions.
Each individual in the GA population represents a candidate Q-value vector, where each component corresponds to the estimated Q-value of a dispatching action.
3.4.1. Fitness Function
Since the experiments focus solely on minimizing the makespan, the GA employs a single-objective fitness function directly aligned with the scheduling goal: the fitness of an individual decreases with the total completion time (makespan) achieved by the schedule it represents. A higher fitness value indicates a more efficient schedule with a smaller makespan, ensuring consistency between the GA optimization criterion and the reinforcement learning objective.
3.4.2. Decision Horizon
In the proposed Adaptive GA, the fitness function evaluates candidate schedules over all currently known operations. That is, the decision horizon is global with respect to the available job set, capturing the total makespan of each individual. In dynamic DFJSS scenarios, new jobs may arrive during execution; these will be considered in subsequent iterations. This global evaluation ensures that the GA targets provided for DQN training reflect the overall efficiency of the current system.
3.4.3. Adaptive Evolution
Each individual adopts a real-valued encoding, where each gene corresponds to a Q-value component. The GA employs a fitness-proportionate selection strategy, while crossover and mutation rates are adaptively adjusted at each generation $t$ according to the normalized population diversity, which is derived from the standard deviation of fitness values in the population. High population diversity shifts the search toward crossover-dominated exploitation, whereas reduced diversity increases mutation to prevent premature convergence.
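The adaptation logic can be sketched as follows. The snippet assumes diversity is the fitness standard deviation divided by the mean fitness (clipped to 1) and that both rates are interpolated linearly within fixed ranges. The normalization, the rate ranges, and the function name are hypothetical choices that reproduce the qualitative behavior described above, not the paper's exact update rule.

```python
import statistics


def adaptive_rates(fitness, pc_range=(0.6, 0.9), pm_range=(0.01, 0.2)):
    """Return (diversity, crossover_rate, mutation_rate) for one generation."""
    mean = statistics.fmean(fitness)
    sigma = statistics.pstdev(fitness)  # standard deviation of population fitness
    # Assumed normalization: coefficient of variation, clipped to [0, 1]
    diversity = min(1.0, sigma / abs(mean)) if mean else 0.0
    # High diversity -> more crossover; low diversity -> more mutation
    pc = pc_range[0] + (pc_range[1] - pc_range[0]) * diversity
    pm = pm_range[1] - (pm_range[1] - pm_range[0]) * diversity
    return diversity, pc, pm


converged = adaptive_rates([10, 10, 10, 10])  # no diversity left
diverse = adaptive_rates([1, 5, 10, 20])      # widely spread fitness values
```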
3.4.4. Integration with DQN
The evolved Q-value target is incorporated into the DQN update rule, where it is combined with the observed reward to form the training objective. This adaptive target evolution stabilizes the temporal-difference updates of the DQN, effectively reducing Q-value fluctuations and accelerating convergence under dynamic production conditions.
3.5. Dynamic Target Smoothing
Although the GA module provides high-quality target values, stochastic variations in the scheduling environment and population evolution can still introduce fluctuations. To mitigate this, we apply a second smoothing stage using one of two common techniques: sliding-window averaging over the most recent W GA targets, or exponential smoothing of the raw target.
Whichever smoothing technique is chosen, the resulting smoothed target is used as the final training target for the DQN. This dynamic smoothing mechanism effectively reduces the target volatility observed in QNGA, enhancing both training stability and convergence speed.
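Both smoothing techniques can be sketched in a few lines. The example below treats each GA target as a single scalar for readability (a vector target would be smoothed component-wise), and the class names, the window size of 3, and the coefficient of 0.5 are illustrative assumptions.

```python
from collections import deque


class SlidingWindowSmoother:
    """Mean of the most recent `window` raw targets."""

    def __init__(self, window):
        self.values = deque(maxlen=window)  # old targets drop out automatically

    def update(self, target):
        self.values.append(target)
        return sum(self.values) / len(self.values)


class ExponentialSmoother:
    """Recursive blend: alpha * raw target + (1 - alpha) * previous smoothed value."""

    def __init__(self, alpha):
        self.alpha = alpha
        self.state = None

    def update(self, target):
        if self.state is None:
            self.state = target  # first target initializes the state
        else:
            self.state = self.alpha * target + (1 - self.alpha) * self.state
        return self.state


raw = [10.0, 14.0, 6.0, 12.0]  # hypothetical oscillating GA targets
sw = SlidingWindowSmoother(window=3)
es = ExponentialSmoother(alpha=0.5)
sw_out = [sw.update(t) for t in raw]
es_out = [es.update(t) for t in raw]
```

The raw sequence swings between 6 and 14, while both smoothed sequences stay within 9 to 12, which is the variance reduction the mechanism is meant to provide.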
3.6. Learning and Update Procedure
During training, the DQN, GA, and target-smoothing modules interact in a closed loop to iteratively improve the scheduling policy. The detailed steps are:
State Sampling: Observe the current shop-floor state, construct the corresponding graph, and extract its feature embedding s via the GCN. The state includes all currently available jobs and machine statuses.
Action Selection: Choose an action a (e.g., assign a specific job to a machine) using the $\epsilon$-greedy policy on the evaluation network, where $\epsilon$ is annealed from an initial value to a final value over the training episodes.
Environment Interaction: Execute a, then receive the immediate reward r (e.g., change in makespan) and the next state $s'$ from the environment.
Target Generation and Smoothing: Run the adaptive GA on the current population for G generations to obtain a raw target T. The GA is invoked at every decision step. Each individual encodes the Q-values of all currently available dispatching actions. The raw target T is then smoothed using either a sliding-window average over the last W steps or exponential smoothing, producing the final smoothed target.
Experience Storage: Store the transition in a fixed-capacity replay buffer, and periodically sample mini-batches of size B for training.
Network Update: For each sampled experience, construct the training target from the observed reward and the smoothed GA target, and update the evaluation network by minimizing the mean squared error between its Q-value prediction and this target. Perform a gradient descent step on the evaluation-network parameters with the configured learning rate, and every C steps synchronize the target network by copying the evaluation-network parameters into it.
Iteration: Set $s \leftarrow s'$ and repeat the above steps until the training termination criterion is met, which can be a maximum number of episodes or convergence of the makespan improvement.
In this joint learning process, each DQN update is guided by the GA-optimized target rather than solely the DQN’s own target network estimate. This integration of global search information from the GA accelerates policy learning and improves the quality of the learned scheduling strategy.
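The action-selection step of this loop can be sketched as follows. A linear annealing schedule is assumed here because the source does not state the schedule shape, and the start and end values of 1.0 and 0.05, as well as the function names, are hypothetical.

```python
import random


def epsilon_at(episode, total_episodes, eps_start=1.0, eps_end=0.05):
    """Linearly anneal epsilon from eps_start to eps_end over training (assumed schedule)."""
    frac = min(1.0, episode / max(1, total_episodes - 1))
    return eps_start + (eps_end - eps_start) * frac


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: explore with probability epsilon, otherwise take the best action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # random exploratory action
    return max(range(len(q_values)), key=q_values.__getitem__)  # greedy action


eps_first = epsilon_at(0, 100)
eps_last = epsilon_at(99, 100)
greedy_action = select_action([0.1, 0.9, 0.3], epsilon=0.0)
```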
3.7. Algorithm Pseudocode and Complexity Analysis
To facilitate understanding, the overall training procedure of MyQNGA is presented in Algorithm 1, followed by a complexity analysis.
Complexity Analysis
The computational complexity of the proposed MyQNGA algorithm primarily stems from two components: the genetic algorithm and the deep Q-network. In each training iteration, the dominant computational overhead of the GA arises from evolving the population. Given a population size of $N_p$ and $G$ generations per iteration, the GA complexity is primarily determined by the number of solution evaluations and the simulation time of each candidate schedule. Formally, the per-iteration computational complexity can be approximated as follows:
$$O\left( N_p \cdot G \cdot T_{\mathrm{sim}} + |\theta| \right)$$
where $N_p$ is the population size, $G$ the number of generations, $T_{\mathrm{sim}}$ the average time to simulate one scheduling solution, and $|\theta|$ the number of neural network parameters involved in forward and backward propagation.
[Algorithm 1: Training MyQNGA for DFJSS]
For the DQN component, each update involves a forward and backward propagation through the neural network. The computational complexity of this process is proportional to the number of network parameters and remains relatively stable per iteration. Compared with the GA, the additional overhead introduced by experience replay and periodic target-network updates is relatively minor.
In summary, MyQNGA incurs a moderate computational overhead compared with traditional DQN and QNGA approaches, owing to the additional evolutionary search and target-smoothing operations. However, this cost is offset by notable gains in training stability and convergence efficiency. The integration of global exploration through GA and dynamic target smoothing effectively mitigates the instability and slow convergence issues of conventional DQN-based schedulers, thereby enhancing both the adaptability and robustness of the proposed scheduling strategy under dynamic shop-floor conditions.
To provide a clearer view of the practical computational cost, Table 2 reports the average training time, per-episode duration, and final schedule generation time of MyQNGA compared with classical DQN variants and QNGA, evaluated on Brandimarte instances using an NVIDIA RTX 4060 Ti GPU.
As shown in the table, MyQNGA achieves shorter training time and faster episode execution than QNGA while maintaining competitive scheduling time. These results indicate that the proposed improvements bring practical efficiency gains in addition to the theoretical complexity advantages.
4. Experimental Design and Simulation Model
This section presents the experimental design, benchmark settings, and simulation framework used to evaluate the proposed MyQNGA algorithm in the context of the DFJSS problem. The objective is to assess both the scheduling performance and adaptability of MyQNGA under stochastic job arrivals. Specifically, Section 4.1 introduces the dynamic benchmark setup and experimental factors; Section 4.2 describes the discrete-event simulation model that emulates real-time shop-floor dynamics; and Section 4.3 details the algorithmic parameter configuration and its justification.
4.1. Experimental Design
The experiments address a DFJSS environment where each job consists of a sequence of operations, and each operation can be processed on multiple candidate machines with varying processing times. The production environment is modeled as a discrete-event simulation in which job arrivals, processing events, and machine states evolve over time.
To ensure a fair and controlled evaluation, the following experimental design principles are applied:
Known initial workload: At time zero, all initial jobs are known, allowing reproducible baseline scheduling.
Single dynamic factor: New jobs arrive according to a Poisson process with rate $\lambda$ (jobs per time unit), corresponding to approximately a 30% workload increase. This setting reflects a moderate dynamic intensity commonly adopted in the DFJSS literature, balancing between nearly static and highly disturbed environments. Additional tests with other arrival rates confirmed similar performance trends, supporting the robustness of the proposed method.
No machine breakdowns: Machines are assumed to be fully reliable, which isolates the effect of job arrivals and allows the adaptability of the algorithm to this specific disturbance type to be evaluated.
This design isolates dynamic arrivals as the only stochastic factor, enabling a focused analysis of rescheduling robustness and convergence behavior.
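A Poisson arrival stream of this kind can be generated by accumulating exponentially distributed inter-arrival times. In the sketch below, the rate of 0.1 jobs per time unit, the horizon of 200 time units, and the seed are placeholder assumptions and are not the experimental settings.

```python
import random


def poisson_arrival_times(rate, horizon, seed=None):
    """Sample arrival times of a Poisson process with the given rate on (0, horizon]."""
    rng = random.Random(seed)  # private generator so runs are reproducible per seed
    t, times = 0.0, []
    while True:
        t += rng.expovariate(rate)  # exponential inter-arrival time with mean 1 / rate
        if t > horizon:
            return times
        times.append(t)


arrivals = poisson_arrival_times(rate=0.1, horizon=200.0, seed=42)
repeat = poisson_arrival_times(rate=0.1, horizon=200.0, seed=42)
```

Passing the same seed to every algorithm yields identical arrival streams, which is what makes a seed-matched comparison across methods possible.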
To demonstrate generality, the experiments employ the widely used Brandimarte benchmark set (mk01–mk10). The Brandimarte benchmark set includes instances of varying scales, where the number of jobs ranges from 10 to 20, the number of machines from 6 to 15, and the total number of operations from 55 to 240. These datasets are widely used in FJSS and DFJSS studies due to their progressive difficulty and structural diversity.
All dynamic simulations are repeated 30 times under different random seeds, and the same seed sets are used across all algorithms to ensure fair comparison. For each configuration, the mean and standard deviation of the makespan are reported to characterize both central tendency and performance variability.
Five representative algorithms are included for comparison:
QNGA: Baseline hybrid of genetic algorithm and deep Q-network;
DQN: Standard deep Q-learning approach;
DDQN: Double deep Q-network with decoupled target estimation;
Dueling DQN: Architecture separating state-value and advantage estimation;
MyQNGA: Proposed adaptive GA + DQN framework with dynamic target smoothing (DTS).
The primary performance metric is the makespan ($C_{\max}$), which measures global scheduling efficiency and directly reflects the optimization objective used in both the GA fitness and the RL reward function. Although auxiliary statistics such as machine utilization and job waiting time were collected for diagnostic purposes, they are not included in the optimization objective or comparative analysis. Additionally, convergence rate and computational time are analyzed to assess learning stability and scalability.
All experiments are implemented in Python 3.9 (PyCharm IDE 2021.3) and executed on a workstation equipped with an Intel® Core™ i5-14600KF CPU (Intel Corporation, Shenzhen, China), 32 GB RAM, and an NVIDIA GeForce RTX 4060 Ti GPU with 8 GB VRAM (NVIDIA Corporation, Shenzhen, China). The GPU is primarily used to accelerate neural network training in the DQN-based methods.
4.2. Simulation Model
A discrete-event simulation framework is developed to model the dynamic evolution of a flexible job shop. The model continuously tracks the states of jobs, machines, and operation queues. Each time an operation completes or a new job arrives, the system state is updated, triggering the scheduling module to reassign operations. This process mirrors the online decision-making mechanism in real production systems.
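The event-driven mechanism described above can be sketched with a priority queue of timestamped events. The event tuple layout, the `on_event` hook, and the fixed processing time of 3 time units are hypothetical stand-ins for the scheduling module and are not the simulator used in the experiments.

```python
import heapq


def simulate(events, on_event):
    """Process (time, kind, payload) events in time order; handlers may schedule more."""
    queue = list(events)
    heapq.heapify(queue)
    log = []
    while queue:
        time, kind, payload = heapq.heappop(queue)  # earliest pending event
        log.append((time, kind, payload))
        for new_event in on_event(time, kind, payload):
            heapq.heappush(queue, new_event)
    return log


def on_event(time, kind, payload):
    # Hypothetical rescheduling hook: an arriving job starts at once and runs 3 time units
    if kind == "arrival":
        return [(time + 3.0, "completion", payload)]
    return []


log = simulate([(0.0, "arrival", "J1"), (2.0, "arrival", "J2")], on_event)
```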
4.2.1. Input Data
The Brandimarte dataset provides detailed job, operation, and machine data, including feasible machine assignments and processing times. Each instance is converted into a directed graph structure using the NetworkX library, where nodes represent operations or machines, and edges denote feasible processing routes. This graph-based encoding allows seamless integration with the reinforcement learning state representation.
4.2.2. Output Metrics
The final schedule’s makespan is recorded for each run. Additional indicators such as convergence speed (number of training iterations to reach stable reward), average machine utilization, and computational overhead are also collected to provide a comprehensive performance assessment.
4.2.3. Dynamic Environment Simulation
During execution, stochastic job arrivals are injected following a Poisson process. When a new job arrives, the MyQNGA agent immediately performs rescheduling, leveraging the adaptive GA to explore feasible solutions and the DQN policy to refine local decision-making. This framework effectively tests the ability of the algorithm to adapt and recover under dynamic perturbations while maintaining computational tractability.
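Arrival times for a Poisson process can be generated by accumulating exponentially distributed inter-arrival gaps. The sketch below illustrates this; the function name, the rate value, and the horizon are placeholders chosen for the example, not the settings used in the experiments.

```python
import random


def poisson_arrival_times(rate, horizon, seed=None):
    """Return sorted arrival times in (0, horizon] for a Poisson process.

    Inter-arrival gaps of a Poisson process with the given rate are
    exponentially distributed with mean 1 / rate.
    """
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate)
        if t > horizon:
            return arrivals
        arrivals.append(t)


# Placeholder values: on average one new job every 10 time units.
arrivals = poisson_arrival_times(rate=0.1, horizon=1000.0, seed=0)
```

Fixing the seed makes the same arrival sequence reusable across algorithms, which keeps comparisons between methods fair.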
4.3. Algorithm Parameter Settings
The parameter configuration of MyQNGA combines principles from evolutionary computation and reinforcement learning. The adopted parameters, summarized in Table 3, were determined through preliminary grid search experiments and are consistent with prior GA–RL scheduling studies focusing on makespan minimization.
The genetic component ensures population diversity through adaptive crossover and mutation, while the reinforcement learning component accelerates convergence by exploiting learned state–action relationships. These complementary mechanisms enable MyQNGA to efficiently explore the solution space and adaptively refine schedules in response to real-time disturbances.
Overall, this experimental design provides a controlled yet realistic platform for evaluating the adaptability, efficiency, and convergence stability of MyQNGA in minimizing makespan under dynamic job arrivals. By combining statistically reliable repetitions and structurally diverse benchmarks, the experiments ensure that the observed improvements are both statistically meaningful and practically generalizable.
5. Experimental Results and Analysis
5.1. Orthogonal Experiment Analysis of Parameters
To quantitatively evaluate the impact of key genetic algorithm parameters on MyQNGA performance under dynamic scheduling scenarios with new job arrivals, we conducted an $L_9(3^4)$ orthogonal experiment. The experiment measured the average makespan on the mk04 instance with Poisson-distributed job arrivals.
Table 4 defines the four control factors and their tested levels. Each parameter combination was executed 30 times, with makespan results (mean ± standard deviation) reported in Table 5. Table 6 presents the response analysis. For each factor, the average makespan at each level and the range (the difference between the largest and smallest level averages) are calculated to quantify its influence.
Key observations:
- The optimal configuration is Run 5 (see Table 5), yielding a makespan of 74.0 ± 2.2 (vs. 69 for CPLEX).
- Mutation probability exhibits the largest range (Table 6), indicating that it remains the most sensitive parameter in dynamic scheduling.
- Population size (N) shows moderate influence, while crossover probability has a reduced impact under new job arrivals.
- Number of generations (G) has minimal impact, suggesting that extended evolution provides diminishing returns.
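The range analysis behind these observations can be sketched as follows. The design matrix is the standard three-level, four-factor L9 orthogonal array (levels coded 0 to 2); the function name and the response values are hypothetical and serve only to illustrate the calculation, not to reproduce Tables 5 and 6.

```python
# Standard L9(3^4) orthogonal array, one row per run, levels coded 0..2.
L9 = [
    (0, 0, 0, 0), (0, 1, 1, 1), (0, 2, 2, 2),
    (1, 0, 1, 2), (1, 1, 2, 0), (1, 2, 0, 1),
    (2, 0, 2, 1), (2, 1, 0, 2), (2, 2, 1, 0),
]


def range_analysis(design, response, n_levels=3):
    """Return per-factor level averages and ranges (max minus min average)."""
    level_means, ranges = [], []
    for f in range(len(design[0])):
        means = []
        for level in range(n_levels):
            values = [y for row, y in zip(design, response) if row[f] == level]
            means.append(sum(values) / len(values))
        level_means.append(means)
        ranges.append(max(means) - min(means))
    return level_means, ranges


# Hypothetical mean makespans for the nine runs (illustrative numbers only).
makespans = [78.1, 76.4, 77.0, 75.9, 74.0, 76.2, 77.5, 75.1, 76.8]
level_means, ranges = range_analysis(L9, makespans)
most_sensitive = max(range(len(ranges)), key=ranges.__getitem__)
```

Because every level of each factor appears equally often with every level of the others, the level averages isolate one factor's effect, and the factor with the largest range is read as the most influential.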
5.2. Model Validation (CPLEX)
To verify the correctness of the proposed MILP formulation in Section 2.2, we solved both the static and dynamic DFJSS models using IBM ILOG CPLEX (CPU) on the mk04 and mk05 instances. The static model optimizes makespan under fixed conditions (no dynamic arrivals), whereas the dynamic model introduces job-release events to simulate real-time job arrivals.
5.2.1. Validation Setup
The MILP model described in Section 2.2 was implemented and solved with the following CPLEX configuration:
- Solver: IBM ILOG CPLEX 20.1 (CPU).
- Time limit: 3600 s (1 h) per instance.
- MIP gap tolerance: 0.5%.
- Presolve: Aggressive presolving enabled to reduce redundant constraints.
- Strategy: Benders decomposition was adopted to improve computational efficiency for large-scale MILPs.
5.2.2. Numerical Results
As shown in Table 7, introducing new job arrivals increases the makespan from 60 to 69 for instance mk04 (a 15.0% increase) and from 172 to 198 for mk05 (a 15.1% increase), highlighting the scheduling overhead caused by dynamic events. In static cases, CPLEX reached proven optimality, whereas in dynamic cases, it achieved a 0.5% MIP gap within the 3600 s limit, demonstrating both the correctness and the computational difficulty of the DFJSS model.
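The reported increases follow from the usual relative-change formula, as the short check below shows. The helper name `pct_increase` is introduced here for illustration only; the makespan values are those quoted from Table 7.

```python
def pct_increase(baseline, value):
    """Relative change of value with respect to baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


# Static vs. dynamic CPLEX makespans quoted in the text.
for name, static, dynamic in [("mk04", 60, 69), ("mk05", 172, 198)]:
    print(f"{name}: {pct_increase(static, dynamic):.1f}%")
# mk04: 15.0%
# mk05: 15.1%
```

The same formula applies to the optimality gaps reported later, with the CPLEX makespan as the baseline.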
5.3. Strategy Effectiveness Comparison
The proposed adaptive strategy (AS) and dynamic target smoothing (DTS) were evaluated against the QNGA baseline under new job arrival scenarios on the mk04 and mk05 benchmark instances. The QNGA baseline corresponds directly to the ETDQN framework proposed by Liu et al. (2025) [36], which integrates evolutionary training with DQN. Each strategy was executed in 30 independent runs to ensure statistical reliability, with performance metrics detailed in Table 8. Figure 4 further illustrates the distribution characteristics of the obtained solutions.
Experimental results show that both proposed strategies consistently outperform the QNGA baseline. On mk04, the adaptive strategy improved on the baseline makespan and remained close to the CPLEX optimum (see Table 8 for the values). This indicates that AS can reduce production time and improve throughput on dynamic shop floors, leading to higher machine utilization and more timely order completion.
On the more complex mk05 instance, AS achieved a mean makespan of 206.4, corresponding to a 1.9% improvement over the baseline and 4.2% above the CPLEX optimum. DTS also improved on the baseline on both mk04 and mk05 (Table 8). Even modest gains from DTS help maintain operational stability under fluctuating job arrivals and reduce the risk of delays and bottlenecks. This performance hierarchy was consistent across instances, with AS delivering the most favorable outcomes.
However, these improvements came with moderate overhead. AS required 215 s (mk04) and 235 s (mk05) of GPU time (7.5% and 6.8% above the baseline, respectively), and DTS required 210 s and 230 s (5.0% and 4.5% above the baseline). This extra computational cost is justified by better solution quality and stability: AS produces a tighter makespan distribution on mk05, as reflected in its lower standard deviation (Table 8).
Overall, these results confirm the operational viability of both enhancement strategies in dynamic scheduling environments. AS demonstrates particular strength in balancing solution quality and stability, consistently maintaining near-optimal performance while effectively responding to stochastic job arrivals.
5.4. Algorithm Comparison and Analysis
To comprehensively assess MyQNGA’s performance, we compared it against four established algorithms (QNGA, DQN, DDQN, and Dueling-DQN) on Brandimarte_Data instances under dynamic job-arrival scenarios. Each algorithm was executed in 30 independent runs, and paired t-tests confirmed statistical significance (p < 0.001).
The experimental results, summarized in Table 9, show that MyQNGA consistently outperforms all comparative algorithms in terms of makespan. This consistent advantage suggests that MyQNGA can reliably enhance scheduling efficiency and robustness in practical, high-load manufacturing environments.
On mk04, MyQNGA achieved a makespan of 74.2, which is only 7.5% higher than the CPLEX-verified optimum (69) and represents a 2.0% improvement over the QNGA baseline (75.7). Although the numerical reduction seems modest, it translates into a measurable decrease in job completion times, which can cumulatively improve overall throughput and reduce the average waiting time in multi-machine scheduling. The convergence trends of MyQNGA and QNGA on mk04 are shown in Figure 5, where MyQNGA demonstrates faster convergence and lower makespan values.
Similar trends were observed on mk05. MyQNGA reduced the makespan by 2.1% compared to QNGA (205.8 vs. 210.3) and performed comparably to the adaptive strategy (206.4) discussed in Section 5.3, indicating enhanced schedule stability and lower risk of delay propagation. Figure 6 presents the convergence behavior on mk05, confirming that MyQNGA consistently achieves better performance and more stable learning compared to QNGA.
An important observation is that MyQNGA’s performance advantage over DRL-based algorithms (DQN, DDQN, and Dueling-DQN) increases with problem scale. For example, on mk05, MyQNGA outperforms DDQN by 4.9%. This relative improvement grows in larger instances such as mk08 and mk10, reaching 6.4% and 7.4% over DQN, respectively. These results highlight MyQNGA’s strong scalability and resilience in dynamic, high-load scheduling environments.
Paired t-test results confirm that these improvements are statistically significant in all cases, with p-values well below 0.001. This validates the reliability and robustness of MyQNGA’s performance advantages and indicates that the observed improvements are not attributable to random variation.
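For reference, the paired t statistic is computed from the per-run differences between two algorithms. The sketch below uses only the standard library; the function name and the sample values are hypothetical. It assumes that run i of both algorithms shares the same conditions (for example, the same arrival sequence), which is what makes the pairing meaningful.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b.

    t = mean(d) / (stdev(d) / sqrt(n)), where d holds the pairwise
    differences and stdev is the sample standard deviation.
    """
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical makespans from five paired runs (illustrative values only).
baseline_runs = [75.9, 76.3, 75.1, 75.8, 75.4]
proposed_runs = [74.5, 74.1, 74.0, 74.6, 73.8]
t_value, dof = paired_t_statistic(baseline_runs, proposed_runs)
```

The p-value is then obtained from the t distribution with n − 1 degrees of freedom; `scipy.stats.ttest_rel` returns both the statistic and the p-value directly.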
Moreover, beyond consistently outperforming the baseline QNGA, MyQNGA performs comparably to adaptive strategies specifically tailored for dynamic job arrivals. This highlights its generalization capability: it handles dynamic job arrivals across different problem instances without requiring instance-specific parameter adjustments.
In summary, the results substantiate that integrating adaptive genetic mechanisms with deep reinforcement learning policies enables MyQNGA to deliver stable, high-quality schedules under unpredictable conditions, while maintaining statistical reliability and superior scalability compared to both traditional and reinforcement learning-based algorithms.
6. Conclusions
This study proposed MyQNGA, a hybrid framework integrating adaptive genetic algorithms with deep reinforcement learning to address the DFJSS problem under stochastic job arrivals. The framework incorporates two key mechanisms: the adaptive genetic component, which balances exploration and exploitation to maintain population diversity, and the DTS mechanism, which stabilizes Q-value updates by integrating evolutionary information.
Comprehensive experiments on the Brandimarte benchmark set (mk01–mk10) show that MyQNGA consistently outperforms classical and reinforcement learning-based methods. For instance, on mk04, MyQNGA reduces the makespan from 75.7 (QNGA baseline) to 74.2 (−2.0%), which is only 7.5% above the CPLEX-verified dynamic optimum (69). On mk05, it achieves 205.8 compared to 210.3 for QNGA (−2.1%), comparable to the adaptive strategy (206.4). Across all instances, MyQNGA provides 3–8% improvement over baselines while maintaining stable convergence in fewer training iterations. The GPU computation overhead remains moderate (∼215–235 s). Paired t-tests confirm that these improvements are statistically significant (p < 0.001), validating the robustness and scalability of the method.
Nonetheless, the current evaluation has some limitations. Experiments were restricted to the Brandimarte benchmark instances with a specific dynamic job-arrival distribution (Poisson), and computational costs increased slightly due to the adaptive mechanisms. The performance under alternative arrival distributions, larger-scale systems, or more complex real-time constraints remains to be investigated.
Future work will focus on extending MyQNGA to large-scale and distributed manufacturing systems, integrating real-time shop-floor monitoring for online adaptive scheduling, exploring multi-objective extensions (e.g., makespan and energy efficiency), and further automating parameter tuning to enhance robustness and generalization in diverse dynamic environments.
In summary, MyQNGA effectively combines adaptive evolutionary search with dynamic target smoothing. It delivers high-quality, stable, and scalable schedules under uncertainty while maintaining statistical reliability.
Author Contributions
Conceptualization, Z.Z. and H.C.; methodology, Z.Z.; software, Z.Z.; validation, Z.Z. and Z.W.; formal analysis, Z.Z.; investigation, Z.Z.; resources, J.H.; data curation, Z.Z. and Z.W.; writing—original draft preparation, Z.Z.; writing—review and editing, H.C. and J.H.; visualization, Z.Z.; supervision, J.H.; project administration, J.H. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The data presented in this study are available on request from the corresponding author.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Chaudhry, I.A.; Khan, A.A. A research survey: Review of flexible job shop scheduling techniques. Int. Trans. Oper. Res. 2016, 23, 551–591.
- Ouelhadj, D.; Petrovic, S. A survey of dynamic scheduling in manufacturing systems. J. Sched. 2009, 12, 417–431.
- Zhang, L.; Feng, Y.; Xiao, Q. Deep reinforcement learning for dynamic flexible job shop scheduling problem considering variable processing times. J. Manuf. Syst. 2023, 71, 257–273.
- Wu, Z.; Fan, H.; Sun, Y.; Peng, M. Efficient multi-objective optimization on dynamic flexible job shop scheduling using deep reinforcement learning approach. Processes 2023, 11, 2018.
- Ngwu, C.; Liu, Y.; Wu, R. Reinforcement learning in dynamic job shop scheduling: A comprehensive review of AI-driven approaches in modern manufacturing. J. Intell. Manuf. 2025, 36, 1–25.
- Momenikorbekandi, A.; Kalganova, T. Intelligent scheduling methods for optimisation of job shop scheduling problems in the manufacturing sector: A systematic review. Electronics 2025, 14, 1663.
- Sangaiah, A.K.; Suraki, M.Y.; Sadeghilalimi, M.; Bozorgi, S.M.; Hosseinabadi, A.A.R.; Wang, J. A new meta-heuristic algorithm for solving the flexible dynamic job-shop problem with parallel machines. Symmetry 2019, 11, 165.
- Hu, H.; Jia, X.; He, Q.; Fu, S.; Liu, K. Deep reinforcement learning based agvs real-time scheduling with mixed rule for flexible shop floor in industry 4.0. Comput. Ind. Eng. 2020, 149, 106749.
- Zhang, L.; Yang, C.; Yan, Y.; Hu, Y. Distributed real-time scheduling in cloud manufacturing by deep reinforcement learning. IEEE Trans. Ind. Inform. 2022, 18, 8999–9007.
- Lu, S.; Wang, Y.; Kong, M.; Wang, W.; Tan, W.; Song, Y. A double deep q-network framework for a flexible job shop scheduling problem with dynamic job arrivals and urgent job insertions. Eng. Appl. Artif. Intell. 2024, 133, 108487.
- Panzer, M.; Bender, B. Deep reinforcement learning in production systems: A systematic literature review. Int. J. Prod. Res. 2021, 60, 4316–4341.
- Liu, R.; Piplani, R.; Toro, C. Deep reinforcement learning for dynamic scheduling of a flexible job shop. Int. J. Prod. Res. 2022, 60, 4049–4069.
- Meng, L.; Zhang, C.; Shao, X.; Caile, R. Mixed-integer linear programming and constraint programming formulations for solving distributed flexible job shop scheduling problem. Comput. Ind. Eng. 2020, 142, 106347.
- Xu, Y.; Zhang, M.; Yang, M.; Wang, D. Hybrid quantum particle swarm optimization and variable neighborhood search for flexible job-shop scheduling problem. J. Manuf. Syst. 2024, 73, 334–348.
- Zhang, C.; Song, W.; Cao, Z.; Zhang, J.; Tan, P.S.; Xu, C. Learning to dispatch for job shop scheduling via deep reinforcement learning. In Proceedings of the Advances in Neural Information Processing Systems 33 (NeurIPS 2020), Virtual, 6–12 December 2020.
- Zeng, Y.; Liao, Z.; Dai, Y.; Wang, R.; Li, X.; Yuan, B. Hybrid intelligence for dynamic job-shop scheduling with deep reinforcement learning and attention mechanism. arXiv 2022, arXiv:2201.00548.
- Shiue, Y.-H.; Huang, K.-Y.; Chang, C.-H. Reinforcement learning for adaptive scheduling in dynamic job shops. Int. J. Prod. Res. 2020, 58, 4365–4381.
- Gabel, T.; Riedmiller, M. Adaptive scheduling for flexible job-shop production using reinforcement learning. CIRP Ann. 2019, 68, 433–436.
- Zhang, S.; Pan, Q.; Fatima, S. Deep reinforcement learning with double q-learning and dueling network for dynamic job shop scheduling. Appl. Soft Comput. 2021, 106, 107317.
- Nguyen, S.; Zhang, M.; Johnston, M. A new genetic programming approach to evolving dispatching rules for dynamic multi-objective job shop scheduling. Eur. J. Oper. Res. 2020, 280, 955–972.
- Zhang, C.; Zhang, M.; Nguyen, S. Interpretable dispatching rules for dynamic job shop scheduling via multi-tree genetic programming. Comput. Ind. Eng. 2021, 154, 107126.
- Wang, X.; Pan, Q.-K.; Fatima, S. Hybrid genetic programming and reinforcement learning approach for dynamic job shop scheduling. Expert Syst. Appl. 2020, 162, 113788.
- Chen, Z.; Pan, Q.-K.; Fatima, S. An adaptive hybrid genetic programming and deep reinforcement learning algorithm for dynamic job shop scheduling. Appl. Soft Comput. 2022, 113, 108033.
- Turgut, C.E.; Bozdag, E. Deep q-network model for dynamic job shop scheduling problem based on discrete event simulation. In Proceedings of the 2020 Winter Simulation Conference (WSC), Orlando, FL, USA, 14–18 December 2020; pp. 1551–1562.
- Sun, Y.; Zhang, H.; Wang, Y.; Liu, Z. Real-time data-driven dynamic scheduling for flexible job shop with insufficient transportation resources using hybrid deep q network. Robot. Comput. Integr. Manuf. 2021, 70, 102283.
- Yang, D.; Shu, X.; Yu, Z.; Lu, G.; Ji, S.; Wang, J.; He, K. Dynamic flexible job shop scheduling based on deep reinforcement learning. Proc. Inst. Mech. Eng. Part B J. Eng. Manuf. 2024, 239.
- Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
- Kobayashi, T.; Ilboudo, W.E.L. t-soft update of target network for deep reinforcement learning. arXiv 2020, arXiv:2008.10861.
- Qin, Z.J.; Lu, Y.Q. Knowledge graph-enhanced multi-agent reinforcement learning for adaptive scheduling in smart manufacturing. J. Intell. Manuf. 2025, 36, 5943–5966.
- Liu, Y.; Wang, W.; Hu, Y.; Hao, J.; Chen, X.; Gao, Y. Multi-agent game abstraction via graph attention neural network. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 7211–7218.
- Jing, X.; Yao, X.; Liu, M.; Zhou, J. Multi-agent reinforcement learning based on graph convolutional network for flexible job shop scheduling. J. Intell. Manuf. 2024, 35, 75–93.
- Wei, H.; Li, S.; Jiang, H.; Hu, J.; Hu, J. Hybrid genetic simulated annealing algorithm for improved flow shop scheduling with makespan criterion. Appl. Sci. 2018, 8, 2621.
- Huang, X.; Guan, Z.; Yang, L. An effective hybrid algorithm for multi-objective flexible job-shop scheduling problem. Adv. Mech. Eng. 2018, 10, 1–14.
- Li, K.; Gao, L.; Shao, X. A hybrid genetic algorithm with dual-segment encoding for flexible job shop scheduling problem. J. Intell. Manuf. 2020, 31, 1405–1422.
- Ning, X.; Zhang, L.; Wang, H. A three-segment encoding based genetic algorithm for dynamic flexible job-shop scheduling problem with transportation constraints. Comput. Ind. Eng. 2021, 155, 107180.
- Liu, Y.; Zhang, F.; Sun, Y.; Zhang, M. Evolutionary Trainer-Based Deep Q-Network for Dynamic Flexible Job-Shop Scheduling. IEEE Trans. Evol. Comput. 2025, 29, 749–763.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).