1. Introduction
With the rapid advancement of automation, digitalization, and intelligent technologies, modern manufacturing systems are increasingly shifting toward distributed and energy-efficient production modes. In this context, energy-aware distributed manufacturing has attracted significant attention from both academia and industry [1,2]. It enables geographically dispersed factories to collaboratively share production resources, reduce idle time, and improve overall energy utilization efficiency [3]. This paradigm enhances operational flexibility and responsiveness while supporting global objectives for sustainable and low-carbon manufacturing.
The distributed flexible job shop scheduling problem (DFJSP) has become an important model for representing such complex production environments. It extends the classical flexible job shop scheduling problem (FJSP) by including the factory assignment decision, in which each job is assigned to one factory and then processed by multiple machines within that factory. This problem structure captures the hierarchical and multi-resource characteristics of distributed production systems that are widely observed in industries such as aerospace manufacturing, precision machining, and electronics assembly [4,5]. Compared with centralized scheduling, the distributed variant significantly enlarges the decision space and increases computational complexity, making efficient optimization approaches essential.
Existing research on the DFJSP has primarily focused on several aspects, including minimizing makespan or total tardiness, improving load balance among factories, and optimizing machine utilization efficiency [6,7]. These studies have significantly advanced the theoretical modeling and algorithmic design of distributed scheduling systems. However, energy efficiency has not been adequately incorporated into most DFJSP formulations. In modern industrial environments, energy consumption has become a critical performance indicator due to the growing emphasis on sustainable production and environmental responsibility. Consequently, energy-aware scheduling has emerged as a key research direction that seeks to achieve a balance between operational efficiency and energy consumption [8].
In real-world distributed manufacturing, jobs usually differ in their importance and urgency depending on customer demand, contractual requirements, and delivery deadlines. Ignoring these differences may lead to unbalanced resource allocation and inefficient scheduling decisions. Moreover, job priority affects both machine utilization and energy consumption, as high-priority jobs often require schedule adjustments that change the system load distribution [9]. Therefore, integrating job priority into energy-aware distributed scheduling is vital to improve fairness and efficiency in real manufacturing systems. However, only a few studies have simultaneously considered energy consumption and job priority within a distributed flexible job shop framework. This limitation reveals a significant gap between theoretical research and industrial application.
To address this gap, this study develops the energy-aware distributed flexible job shop scheduling problem with job priority (EA-DFJSP-JP). The proposed model aims to minimize total weighted tardiness and total energy consumption simultaneously. It integrates the decisions of factory assignment, machine sequencing, and job prioritization under energy constraints to better reflect real production systems. This model provides a balanced approach to achieving both production efficiency and sustainable energy management.
The EA-DFJSP-JP is a complex combinatorial optimization problem that poses substantial computational challenges. Traditional exact methods, such as mixed-integer programming and branch-and-bound algorithms, are often unsuitable for large-scale instances because of exponential computational complexity [10,11]. Consequently, heuristic, metaheuristic, and hybrid intelligent algorithms have been widely used to obtain near-optimal solutions within acceptable computational time. Recently, learning-based optimization methods have emerged as powerful alternatives for solving large-scale scheduling problems. In particular, deep reinforcement learning (DRL) has demonstrated strong potential by learning adaptive decision policies through continuous interaction with the environment [12]. However, conventional DRL approaches often experience slow convergence, unstable learning, and limited interpretability when applied to discrete scheduling problems, primarily because they do not effectively utilize domain knowledge.
To overcome these challenges, this paper presents a knowledge-guided deep reinforcement learning approach for solving the EA-DFJSP-JP. Among various DRL algorithms, the Double Deep Q-Network (D2QN) is specifically selected due to its superior sample efficiency and stability in handling discrete scheduling decision spaces compared to on-policy methods like PPO or A2C. In this framework, domain knowledge such as job precedence relationships, energy consumption patterns, and machine utilization features is embedded into a D2QN to guide the adaptive selection of local search operators. In addition, a co-evolutionary learning mechanism is incorporated to balance global exploration and local exploitation, thereby improving convergence speed and maintaining population diversity. The proposed algorithm, referred to as the double deep Q-network-based co-evolutionary algorithm (D2QN-COEA), integrates domain knowledge with deep reinforcement learning to achieve intelligent and energy-efficient optimization for distributed scheduling problems.
The main contributions of this study are as follows.
- (1)
A new EA-DFJSP-JP is formulated, incorporating both energy efficiency and job priority considerations to represent the practical characteristics of distributed manufacturing environments. The model simultaneously minimizes total weighted tardiness and total energy consumption, providing a more balanced and sustainable production scheduling framework.
- (2)
A D2QN-COEA algorithm is proposed, which embeds domain-specific knowledge into a Double Deep Q-Network and integrates a co-evolutionary learning mechanism to enhance learning stability and optimization efficiency.
- (3)
Extensive computational experiments on benchmark datasets and a real-world industrial case validate the effectiveness of the proposed method in achieving high-quality and energy-efficient scheduling solutions.
The remainder of this paper is organized as follows.
Section 2 reviews related work on distributed job shop scheduling and optimization methods.
Section 3 formulates the EA-DFJSP-JP model.
Section 4 presents the proposed D2QN-COEA.
Section 5 reports the computational experiments and results.
Section 6 concludes the study and discusses future research directions.
3. Mathematical Model
3.1. Definitions and Assumptions
The EA-DFJSP-JP is defined as follows. A set of n jobs {J1, J2, …, Jn} is to be processed in a distributed manufacturing system consisting of f factories {F1, F2, …, Ff}. Each factory Fk includes a set of machines Mk. Each job Ji comprises ni operations, denoted by {Oi,1, Oi,2, …, Oi,ni}. The operations of each job must follow a strict precedence order, where an operation can start only after the completion of its predecessor. Each operation can be processed on one of several eligible machines, and the processing times differ across machines, representing the flexibility characteristic of the problem. The scheduling decision process involves three hierarchical levels. At the first level, each job must be assigned to exactly one factory, and all its operations must be executed within that factory, capturing the distributed feature of the system. At the second level, each operation is assigned to an appropriate machine within the selected factory. At the third level, the processing sequence of all operations on each machine is determined. Each job Ji is associated with fixed attributes including a priority weight wi, a discrete priority level Li, and a due date di. It is assumed that these priority parameters are static and determined by customer importance prior to scheduling. A higher priority weight indicates greater significance in the objective function, and delayed completion of such jobs results in larger penalties. Energy consumption is considered in two states: processing and idle. Machines consume processing power during operation and idle power during non-productive periods between operations. Appropriate assignment and sequencing decisions can effectively reduce idle time, thereby lowering total energy consumption.
To clearly articulate the decision hierarchy without additional graphical content, we explicitly define the workflow as follows. The system operates on three levels. First, the factory assignment level receives customer orders (with priorities wi) and allocates them to specific factories. Second, within the assigned factory, the machine selection level determines the processing resource. Finally, the operation sequencing level arranges the execution order. This hierarchical structure ensures that the objective of minimizing weighted tardiness is directly addressed by the sequencing decisions guided by our priority-based operators.
The EA-DFJSP-JP is formulated as a bi-objective optimization problem. The first objective minimizes the total weighted tardiness, while the second minimizes total energy consumption, which includes both processing and idle energy. These two objectives are inherently conflicting: increasing production speed often raises energy consumption, whereas reducing energy usage may extend job completion times. The goal is to identify Pareto-optimal schedules that balance production efficiency and energy performance under given technological and resource constraints.
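To make the two objectives concrete, the following minimal Python sketch evaluates total weighted tardiness and total energy consumption for a decoded schedule. It is an illustration rather than the paper's implementation; all function and parameter names (`w`, `d`, `C`, `busy`, `idle`, `p_proc`, `p_idle`) are our own.

```python
def total_weighted_tardiness(w, d, C):
    """TWT = sum_i w_i * max(0, C_i - d_i), over jobs with weights w,
    due dates d, and completion times C (parallel lists)."""
    return sum(wi * max(0.0, Ci - di) for wi, di, Ci in zip(w, d, C))


def total_energy(busy, idle, p_proc, p_idle):
    """TEC = processing energy + idle energy, accumulated per machine.
    busy/idle give each machine's total busy and idle durations;
    p_proc/p_idle give its processing and standby power draw."""
    processing = sum(b * pp for b, pp in zip(busy, p_proc))
    standby = sum(g * pi for g, pi in zip(idle, p_idle))
    return processing + standby
```

A schedule that speeds up completion (lowering TWT) typically increases machine activity, so the two functions move in opposite directions, which is exactly the conflict the Pareto search must resolve.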
The problem formulation is based on the following assumptions:
- (1)
Each machine can process at most one operation at any given time.
- (2)
Once an operation begins, it cannot be interrupted or transferred to another machine.
- (3)
The operations of each job must strictly follow the predefined technological order and cannot be processed in parallel.
- (4)
Once a job is assigned to a factory, all its operations must be completed within that factory without cross-factory processing.
- (5)
Transportation time between factories is negligible or included in the processing time.
3.2. Objectives and Constraints
This problem is formulated with two conflicting objectives. The first is to minimize the total weighted tardiness, as defined in Equation (1).
The second objective is to minimize the total energy consumption, as defined in Equation (2).
The total energy consumption consists of two main components: processing energy and idle energy. The processing energy is computed as the total energy consumed by all machines during operation execution, as defined in Equation (3). The idle energy represents the energy consumed during the idle periods between consecutive operations and is obtained by multiplying the idle duration by the corresponding idle power, as defined in Equation (4). The mechanism for energy saving is explicitly embedded in the minimization of the idle energy in Equation (4). Since the processing energy is fixed once operations are assigned to machines, the optimization algorithm focuses on compressing the idle time gaps between consecutive operations. By adjusting the operation sequence, the algorithm effectively "squeezes" these gaps, thereby reducing the total duration the machines spend in the standby state.
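Equations (1)–(4) are not reproduced in this excerpt. Under the notation of Section 3.1 (wi, di for job weight and due date), plausible forms consistent with the surrounding text are sketched below; the symbols T, C, P, p, x, and I are assumptions of this reconstruction.

```latex
% Sketch of Equations (1)-(4); only w_i, d_i follow the paper's notation.
\min f_1 = \sum_{i=1}^{n} w_i T_i, \qquad T_i = \max\{0,\; C_i - d_i\}  % (1)
\min f_2 = E_{\mathrm{total}} = E_{\mathrm{proc}} + E_{\mathrm{idle}}   % (2)
E_{\mathrm{proc}} = \sum_{f}\sum_{m}\sum_{i}\sum_{j}
    P^{\mathrm{proc}}_{fm}\, p_{ijfm}\, x_{ijfm}                        % (3)
E_{\mathrm{idle}} = \sum_{f}\sum_{m} P^{\mathrm{idle}}_{fm}\, I_{fm}    % (4)
```

Here C_i is the completion time of job J_i, x_{ijfm} is a binary variable indicating that operation O_{i,j} is processed on machine m of factory f with processing time p_{ijfm}, P^proc and P^idle are the machine power levels in the two states, and I_{fm} is machine (f, m)'s total idle duration.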
The objectives are optimized subject to the following constraints. Constraints (5) and (6) ensure proper job–factory allocation: each job is assigned to exactly one factory, and all its operations are executed within the same facility. Constraint (7) guarantees that every operation is uniquely assigned to a specific factory–machine–position combination, whereas Constraint (8) enforces the sequential utilization of positions on each machine, preventing unoccupied positions between scheduled operations. Constraint (9) preserves the technological precedence among operations of the same job: each subsequent operation can start only after its predecessor has been completed. Constraint (10) restricts each machine position to process no more than one operation at a time, while Constraints (11) and (12) eliminate temporal overlaps among operations assigned to the same machine. Constraints (13) and (14) link the start times of operations with the corresponding machine positions, ensuring temporal consistency across all assignments. Constraint (15) defines the completion time of each operation as the sum of its start time and processing duration, and Constraint (16) specifies that the job completion time equals that of its final operation. Constraints (17) and (18) define job tardiness as the nonnegative difference between the completion time and the due date. Constraints (19) and (20) guarantee non-negativity for all time-related variables, and Constraint (21) imposes binary restrictions on all assignment and sequencing decision variables.
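As an executable illustration of the two central constraint families, the Python sketch below verifies machine non-overlap (cf. Constraints 10–12) and job precedence (cf. Constraint 9) for a decoded schedule. The tuple layout `(job, op, factory, machine, start, end)` is an assumption of this example, not the paper's data structure.

```python
def check_schedule(schedule):
    """Return True iff no machine processes two operations at once and
    every job's operations respect their technological order.
    schedule: list of (job, op, factory, machine, start, end) tuples."""
    by_machine, by_job = {}, {}
    for job, op, f, m, s, e in schedule:
        by_machine.setdefault((f, m), []).append((s, e))
        by_job.setdefault(job, []).append((op, s, e))
    # Machine non-overlap: sorted intervals on a machine must not intersect.
    for ops in by_machine.values():
        ops.sort()
        if any(a_end > b_start
               for (_, a_end), (b_start, _) in zip(ops, ops[1:])):
            return False
    # Precedence: operation k of a job must finish before operation k+1 starts.
    for ops in by_job.values():
        ops.sort()
        if any(a_end > b_start
               for (_, _, a_end), (_, b_start, _) in zip(ops, ops[1:])):
            return False
    return True
```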
4. The Proposed D2QN-COEA
4.1. Encoding and Decoding
To represent feasible schedules for the EA-DFJSP-JP, the proposed D2QN-COEA employs a three-layer encoding scheme that captures the hierarchical decision structure of distributed scheduling. The three layers correspond to the Operation Sequence (OS), Factory Assignment (FA), and Machine Selection (MS). This representation separates sequencing, factory allocation, and machine selection decisions, thereby enhancing search flexibility and maintaining feasibility during optimization. The operation sequence (OS) is expressed as an integer vector whose length equals the total number of operations, where each element denotes a job identifier. Job Ji appears ni times, corresponding to its ni operations. The order of elements determines the global processing sequence of all operations across factories. The factory assignment (FA) is an integer vector of length n, in which the i-th element indicates the factory to which job Ji is assigned. All operations of a job must be executed within the same factory, ensuring compliance with the distributed manufacturing structure. The machine selection (MS) vector, whose length equals that of the OS, specifies the processing machine for each operation in the given sequence. Each selected machine must belong to the eligible machine set associated with the job and the assigned factory.
To demonstrate the encoding mechanism, consider a simplified example with three jobs and two factories, each containing three machines. Job 1 consists of three operations, while Jobs 2 and 3 include two operations each, yielding seven operations in total. The job parameters and partial processing times are shown in Table 2 and Table 3.
A feasible encoding can be expressed as OS = [2, 1, 3, 1, 2, 3, 1], FA = [1, 2, 1], and MS = [1, 1, 2, 3, 1, 2, 2]. This representation implies that Jobs 1 and 3 are assigned to Factory 1, whereas Job 2 is assigned to Factory 2. The processing sequence follows the OS vector, producing the following operation order: Job 2–O1 (F2–M1), Job 1–O1 (F1–M1), Job 3–O1 (F1–M2), Job 1–O2 (F1–M3), Job 2–O2 (F2–M1), Job 3–O2 (F1–M2), and Job 1–O3 (F1–M2). The decoding process translates this encoding into a feasible schedule. Operations are processed sequentially according to the OS. For the k-th position, the algorithm determines the corresponding operation from the job ID and its occurrence count, retrieves the factory from FA, and assigns the machine from the MS vector. The start time of each operation equals the maximum of the completion time of its predecessor and the availability time of the assigned machine, while the completion time equals the start time plus the processing duration. The final decoded scheduling results are summarized in Table 4, which lists all operation assignments, their start and completion times, and the associated machine utilization.
From Table 4, the completion times of Jobs 1, 2, and 3 are 16, 7, and 8, respectively, all meeting their due dates, so the total weighted tardiness is zero. Assuming that all machines consume 10 kW during processing and 2 kW when idle, the total processing energy equals 10 kW multiplied by the total machine busy time reported in Table 4. During the idle interval of Machine 2 in Factory 1 (from time 8 to 11), an additional 3 × 2 = 6 kWh of idle energy is consumed; the total energy consumption is the sum of the processing and idle components.
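The decoding rule described above can be sketched in a few lines of Python. This is an illustration only: the processing-time dictionary below uses invented durations, not the values of Table 3, so the resulting times differ from Table 4.

```python
def decode(OS, FA, MS, proc_time):
    """Decode the three-layer encoding into a schedule.
    OS: job id per position; FA: job -> factory; MS: machine per position;
    proc_time: (job, op_index, factory, machine) -> duration (assumed data).
    Returns a list of (job, op, factory, machine, start, end)."""
    op_count = {}       # how many operations of each job are already placed
    machine_free = {}   # (factory, machine) -> time the machine becomes free
    job_ready = {}      # job -> completion time of its previous operation
    schedule = []
    for pos, job in enumerate(OS):
        op = op_count.get(job, 0)
        op_count[job] = op + 1
        f, m = FA[job], MS[pos]
        dur = proc_time[(job, op, f, m)]
        # Start = max(predecessor completion, machine availability).
        start = max(job_ready.get(job, 0), machine_free.get((f, m), 0))
        end = start + dur
        job_ready[job] = end
        machine_free[(f, m)] = end
        schedule.append((job, op, f, m, start, end))
    return schedule
```

Running it on the example encoding OS = [2, 1, 3, 1, 2, 3, 1], FA = {1: 1, 2: 2, 3: 1}, MS = [1, 1, 2, 3, 1, 2, 2] with any consistent processing-time table yields a feasible active schedule by construction.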
4.2. Algorithm Framework
The proposed D2QN-COEA integrates the global exploration capability of co-evolutionary algorithms with the adaptive decision-making mechanism of deep reinforcement learning to iteratively approximate the Pareto-optimal front for the EA-DFJSP-JP. Conceptually, the algorithm employs population-based evolutionary operators to explore the solution space, while a D2QN adaptively selects local search operators to enhance exploitation in promising regions. Furthermore, an energy-aware adjustment mechanism is incorporated to reduce total energy consumption, and an elite archive is maintained to preserve the set of non-dominated solutions throughout the optimization process.
Step 1. Population initialization. The algorithm begins by generating an initial population of size pop_size using a hybrid initialization strategy that combines random generation, priority-based heuristics, and due-date-oriented heuristics. Each individual is decoded to evaluate the objective functions, including total weighted tardiness and total energy consumption. All non-dominated solutions are stored in the elite archive (Archive), the D2QN network is initialized, and the number of function evaluations (NFEs) is set to the initial population size.
Step 2. Global exploration. Global exploration is conducted through evolutionary operations to enhance population diversity. Parent individuals are selected via a tournament mechanism, followed by crossover and mutation to generate offspring. Each offspring is decoded and evaluated to determine its objective values. This process is repeated until a complete offspring population (Offspring) of size pop_size is obtained, after which NFEs is updated accordingly.
Step 3. Environmental selection. The parent and offspring populations are merged, and fast non-dominated sorting is applied to classify individuals into Pareto dominance levels. Individuals are then selected sequentially from the highest-ranking fronts until pop_size individuals are retained, forming the next generation of the main population (P). This procedure ensures a balance between convergence pressure and population diversity.
Step 4. D2QN-guided local search. Once the experience replay buffer of the D2QN contains sufficient samples (i.e., ≥ batch_size), local search is executed on each solution in the elite archive. For every solution sol, a state vector is extracted to represent its scheduling characteristics. The D2QN determines the most appropriate local search operator based on the current state and applies it to generate a modified solution sol′. A reward is computed according to the improvement in solution quality, and the transition (state, action, reward, state′) is stored in the replay buffer for network training. The D2QN parameters are updated using the stored experiences, and if the obtained reward is positive, sol is replaced by sol′ in the archive.
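Step 4's bookkeeping can be sketched as follows. The reward here is a simple weighted-sum improvement of the two objectives; this scalarization and the buffer capacity are assumptions of the sketch, as the paper's exact reward definition is not reproduced in this excerpt.

```python
import random
from collections import deque

def reward(old_obj, new_obj, weights=(0.5, 0.5)):
    """Positive when the local-search move improves the weighted sum of
    (TWT, TEC); assumed scalarization for illustration."""
    old = sum(w * o for w, o in zip(weights, old_obj))
    new = sum(w * o for w, o in zip(weights, new_obj))
    return old - new

# Experience replay buffer storing (state, action, reward, next_state).
replay = deque(maxlen=10_000)

def store(state, action, r, next_state):
    replay.append((state, action, r, next_state))

def sample(batch_size):
    """Draw a random minibatch for training the D2QN."""
    return random.sample(list(replay), min(batch_size, len(replay)))
```

With this convention, `reward(...) > 0` corresponds exactly to the acceptance rule in Step 4: the modified solution sol′ replaces sol in the archive only when the move improved the scalarized objective.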
Step 5. Archive maintenance and energy adjustment. The elite archive is updated by merging it with the current population and reapplying the non-dominated sorting procedure. Only the first Pareto front is retained to maintain elite convergence quality and diversity. An energy-adjustment operator is then applied to each archived solution to minimize idle power consumption without deteriorating scheduling performance.
The energy-adjustment strategy exploits the scheduling slack of non-critical jobs to reduce idle energy consumption, as illustrated in Figure 1. Since the processing energy remains constant for a given assignment of operations to machines, the optimization explicitly focuses on minimizing the idle energy through operation repositioning, without affecting the makespan. The strategy first identifies critical jobs, defined as those whose completion times reach or fall within 5% of the makespan. These jobs constitute the critical path and therefore cannot be delayed without degrading scheduling performance. In contrast, non-critical jobs possess sufficient slack time, allowing their start times to be postponed while preserving the original makespan. As shown in Figure 1a, the initial schedule contains fragmented idle gaps distributed across machines, particularly in the middle of the production timeline. Such dispersed idle periods lead to inefficient energy usage, as machines remain in standby mode between consecutive operations. The proposed energy-adjustment operator mitigates this inefficiency by right-shifting non-critical jobs to eliminate intermediate idle gaps. Specifically, in Figure 1b, the non-critical job J2 (highlighted in green) is postponed from its original start time, consolidating the idle time on Machine M2 from a mid-timeline gap into a single contiguous block at the beginning of the schedule. This repositioning yields two key benefits. First, idle time is concentrated into continuous periods, enabling practical energy-saving actions such as delayed machine startup or temporary shutdown. Second, the reduction in frequent start–stop cycles lowers transition-related energy overhead. Although the total idle duration remains unchanged, this structural reorganization enables tangible energy savings in real manufacturing environments: concentrated idle periods at the beginning of a machine's timeline allow delayed activation and eliminate unnecessary warm-up energy, while consolidated idle blocks at the end permit earlier shutdown. Moreover, reducing fragmented idle gaps mitigates excessive power cycling, improving both energy efficiency and equipment longevity. Importantly, the original makespan is preserved (makespan = 12 in Figure 1), ensuring that delivery deadlines are not compromised while achieving a more energy-efficient scheduling configuration.
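The right-shift idea can be illustrated for a single machine. In the sketch below, each operation carries a precomputed latest feasible start time (derived from job precedence and due dates, which we take as given); shifting operations as late as these limits allow consolidates idle time at the front of the machine's timeline without changing any completion deadline. The tuple layout is an assumption of this example.

```python
def right_shift(ops):
    """Consolidate idle time on one machine by delaying non-critical work.
    ops: list of (start, duration, latest_start) sorted by start time.
    Returns new (start, duration) pairs; each operation starts as late as
    its latest_start and the next operation on the machine permit."""
    shifted = []
    next_start = float("inf")  # start time of the following operation
    for _start, dur, latest in reversed(ops):
        new_start = min(latest, next_start - dur)
        shifted.append((new_start, dur))
        next_start = new_start
    return list(reversed(shifted))
```

For example, two operations at (0, 3) and (5, 4) with latest starts 2 and 8 become (2, 3) and (8, 4): the two scattered gaps merge into one contiguous idle block before time 2, mirroring the J2 consolidation of Figure 1b.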
Step 6. Termination condition. The iterative process continues until the number of function evaluations reaches the predefined limit (max_nfes). At termination, the elite archive (Archive) is output as the final approximation of the Pareto-optimal solution set.
The complete computational procedure is summarized below, with the algorithmic workflow and pseudo-code illustrated in Figure 2 and Algorithm 1, respectively.
| Algorithm 1. Overall framework of D2QN-COEA |
| Input: |
| data: problem instance (jobs, factories, machines, etc.) |
| pop_size: population size |
| archive_size: elite archive size |
| max_nfes: maximum number of function evaluations |
| Output: Pareto-optimal solution set Archive |
| 1: P ← InitializePopulation(pop_size) // Hybrid initialization: random, priority-based, and due-date-based heuristics |
| 2: NFEs ← pop_size |
| 3: Archive ← UpdateArchive(∅, P) |
| 4: DQN ← InitializeDoubleDQN(state_size, action_size) |
| 5: while NFEs < max_nfes do |
| 6: Offspring ← ∅ |
| 7: for i = 1 to pop_size do // Generate offspring population |
| 8: parent1 ← TournamentSelection(P) |
| 9: parent2 ← TournamentSelection(P) |
| 10: child ← Crossover(parent1, parent2) |
| 11: child ← Mutation(child) |
| 12: Offspring ← Offspring ∪ {child} |
| 13: NFEs ← NFEs + pop_size |
| 14: P ← EnvironmentalSelection(P ∪ Offspring, pop_size) |
| 15: if |DQN.memory| ≥ batch_size then // Execute local search if sufficient samples exist |
| 16: for sol in Archive do |
| 17: state ← GetStateVector(sol) // Extract state vector |
| 18: action ← DQN.SelectAction(state) // Select local search operator via D2QN |
| 19: sol′ ← ApplyLocalSearch(sol, action) // Apply selected local search operator |
| 20: reward ← CalculateReward(sol, sol′) // Compute reward based on improvement |
| 21: state′ ← GetStateVector(sol′) // Extract new state vector |
| 22: DQN.StoreExperience(state, action, reward, state′) |
| 23: DQN.Train() |
| 24: if reward > 0 then |
| 25: sol ← sol′ |
| 26: for sol in Archive do // Energy-saving adjustment |
| 27: sol ← ApplyEnergySaving(sol) |
| 28: Archive ← UpdateArchive(Archive, P) |
| 29: return Archive |
4.3. Global Search Strategy
The global search strategy is designed within a multi-objective evolutionary optimization framework, employing selection, crossover, and mutation operators to explore the solution space and generate a diverse set of candidate solutions. The fundamental purpose of this strategy is to maintain a balance between convergence quality and population diversity, thereby preventing premature convergence to local optima. The execution procedure of the global search phase is described as follows.
Step 1. Parent selection. A tournament selection mechanism based on Pareto dominance is employed to select parent individuals from the current population. A subset of individuals is randomly sampled to form a competition pool, and non-dominated solutions within this pool are identified. If several non-dominated individuals exist, one is randomly selected; otherwise, a random solution is chosen from the pool. This selection process preserves selection pressure while maintaining population diversity, preventing the algorithm from focusing excessively on a single objective.
Step 2. Crossover operation. Two selected parents undergo crossover to produce an offspring solution. Given the three-layer encoding structure of the EA-DFJSP-JP, the crossover is conducted independently on the operation sequence (OS), factory assignment (FA), and machine selection (MS) layers. The offspring inherits the OS from one parent to preserve the relative processing order of operations. For factory assignment, a uniform crossover operator is applied, where each job inherits its factory assignment from either parent with equal probability. Since modifications in factory assignment may lead to infeasible machine selections, the MS vector of the offspring is regenerated according to the updated OS and FA. Each operation attempts to retain the parent’s machine selection; if this is infeasible, a feasible machine is randomly selected from the available set within the assigned factory.
Step 3. Mutation operation. Mutation is applied to the offspring to introduce stochastic variations and enhance population diversity. The three layers of the encoding are modified independently. In the OS layer, two randomly selected positions are swapped to alter the processing order of operations. In the FA layer, a randomly chosen job is reassigned to another factory, and the MS vector is regenerated to maintain feasibility. In the MS layer, a randomly selected operation is reassigned to another available machine within the same factory. The FA and MS mutations are executed in a mutually exclusive manner to avoid redundant modifications.
Step 4. Offspring evaluation. Each offspring is decoded into a feasible schedule, and its objective values, including total weighted tardiness and total energy consumption, are calculated. The overall procedure of the global search strategy is summarized in Algorithm 2.
| Algorithm 2. Global search strategy |
| Input: |
| P: current population |
| pop_size: population size |
| Output: Offspring: offspring population |
| 1: Offspring ← ∅ // Initialize the offspring population |
| 2: for i = 1 to pop_size do // Generate offspring individuals |
| 3: competitors ← RandomSample(P, k) // Randomly select k competitors |
| 4: parent1 ← SelectNonDominated(competitors) // Select the first parent based on Pareto dominance |
| 5: competitors ← RandomSample(P, k) // Randomly select another set of competitors |
| 6: parent2 ← SelectNonDominated(competitors) // Select the second parent |
| 7: child ← CreateEmptySolution() // Create a new offspring solution |
| 8: child.OS ← parent1.OS // Inherit the operation sequence from parent 1 |
| 9: for j = 1 to n_jobs do // Perform uniform crossover for factory assignment |
| 10: if Random() < 0.5 then |
| 11: child.FA[j] ← parent1.FA[j] |
| 12: else |
| 13: child.FA[j] ← parent2.FA[j] |
| 14: child.MS ← RegenerateMachineSelection(child) // Regenerate machine selection based on updated FA |
| 15: if Random() < p_m then // Apply operation sequence mutation |
| 16: p, q ← RandomSampleTwo(1, |child.OS|) |
| 17: Swap(child.OS[p], child.OS[q]) // Swap two operations |
| 18: if Random() < p_m then // Reassign factory |
| 19: idx ← RandomInt(1, n_jobs) |
| 20: child.FA[idx] ← RandomInt(1, n_factories) // Ensure feasibility by regenerating machine selection |
| 21: child.MS ← RegenerateMachineSelection(child) |
| 22: else if Random() < p_m then |
| 23: idx ← RandomInt(1, |child.MS|) |
| 24: job_id, op_idx ← GetOperationInfo(child, idx) |
| 25: factory_id ← child.FA[job_id] |
| 26: available ← GetAvailableMachines(job_id, op_idx, factory_id) |
| 27: child.MS[idx] ← RandomChoice(available) // Randomly select a feasible machine |
| 28: Decode(child) // Decode and evaluate offspring |
| 29: Offspring ← Offspring ∪ {child} // Add offspring to population |
| 30: return Offspring |
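The two variation operators at the heart of Algorithm 2 can be sketched in Python. The machine-selection repair is omitted because it depends on instance data; the function names and the `rng` parameter are our own.

```python
import random

def uniform_fa_crossover(fa1, fa2, rng=random):
    """Uniform crossover of factory assignments: each job inherits its
    factory from either parent with equal probability (Algorithm 2, 9-13)."""
    return [a if rng.random() < 0.5 else b for a, b in zip(fa1, fa2)]

def swap_mutation(os_vec, p_m, rng=random):
    """With probability p_m, swap two positions of the operation sequence
    (Algorithm 2, 15-17). Returns a new list; the input is not modified."""
    child = list(os_vec)
    if rng.random() < p_m:
        i, j = rng.sample(range(len(child)), 2)
        child[i], child[j] = child[j], child[i]
    return child
```

Note that the swap never changes the multiset of job identifiers in OS, so each job still appears exactly as many times as it has operations, which is what keeps the offspring decodable without repair at this layer.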
4.4. Local Search Operator Design
The local search operators are specifically designed according to the structural characteristics of the EA-DFJSP-JP, aiming to enhance solution quality by refining the processing order of critical jobs. Crucially, apart from the weighted objective function, the job priority level (Li) is explicitly used to guide the structural transformation of the schedule. The priority-based operators (LS3 and LS4) enforce reordering mechanisms that structurally advance high-priority jobs in the operation sequence and adjust their machine assignments accordingly. Four distinct local search operators are developed, and the D2QN network adaptively selects one for execution based on the current solution state. Each operator focuses on rescheduling the most critical job with the largest tardiness to minimize total weighted tardiness while preserving the feasibility of the schedule.
LS1: Due-date-based swap operator
This operator identifies the job with the maximum tardiness and locates the position of its first operation in the operation sequence. It then searches for another job Jj, positioned earlier in the sequence, that satisfies dj > dc or Lj < Lc, where dj denotes the due date and Lj the priority level (the subscript c indicating the critical job). The two operations are then swapped to advance the processing of the critical job. This mechanism helps reduce tardiness by prioritizing jobs with earlier due dates or higher importance levels.
LS2: Due-date-based insertion operator
Similar to LS1, this operator identifies the tardiest job and removes its first operation from the current sequence. The operation is then inserted before the first operation of another job whose due date is later or whose priority level is lower. Compared with the swap operator, the insertion operation causes a smaller perturbation to the scheduling structure while still improving the completion timeliness of critical jobs. It is particularly suitable for fine-tuning solutions that are already close to local optima.
LS3: Priority-based swap operator
This operator focuses on job priorities rather than due dates. It identifies the tardiest job and searches for another job, positioned earlier in the sequence, whose priority level is lower than that of the tardiest job or whose priority weight is smaller. The selected jobs exchange their positions in the operation sequence. This operator promotes fairness among jobs by allowing lower-priority jobs to yield scheduling positions to higher-priority tardy jobs, effectively reducing total weighted tardiness without compromising balance among objectives.
LS4: Priority-based insertion operator
This operator also targets the tardiest job but adopts an insertion mechanism similar to LS2. The first operation of the tardiest job is moved and inserted before the first operation of another job whose priority level is lower or whose priority weight is smaller. The priority-based insertion operator combines the advantages of priority orientation and minimal structural disturbance, enabling efficient local improvements in both objectives.
All four operators are designed to refine the schedule around the job with the highest tardiness, directly targeting the reduction in total weighted tardiness. The D2QN network adaptively selects among these operators according to the current state representation of the solution, achieving an effective integration of problem-specific knowledge and reinforcement learning-based decision-making.
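A minimal sketch of the due-date-based swap (LS1) is given below. The direction of the acceptance condition (the critical job overtakes an earlier-positioned job with a later due date or lower priority level) is inferred from the operator descriptions above and should be read as an assumption of this example.

```python
def ls1_due_date_swap(os_vec, tardiness, due, level):
    """LS1 sketch: advance the tardiest job by swapping its first operation
    with an earlier-positioned operation of a less urgent job.
    os_vec: operation sequence (job ids); tardiness/due/level: job -> value."""
    critical = max(tardiness, key=tardiness.get)   # job with max tardiness
    pos_c = os_vec.index(critical)                 # its first operation
    for pos in range(pos_c):
        other = os_vec[pos]
        # Swap condition (assumed): later due date OR lower priority level.
        if other != critical and (due[other] > due[critical]
                                  or level[other] < level[critical]):
            new_seq = list(os_vec)
            new_seq[pos], new_seq[pos_c] = new_seq[pos_c], new_seq[pos]
            return new_seq
    return list(os_vec)  # no eligible partner found; leave sequence unchanged
```

LS2–LS4 differ only in the move (insertion instead of swap) and in which attributes appear in the condition, so they can reuse the same scan structure.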
4.5. Double Deep Q-Network Architecture
To clarify the technical realization of the knowledge-guided strategy, we explicitly define the source, integration, and contribution of domain knowledge as follows:
Source of knowledge: The domain knowledge is derived from classical scheduling rules, specifically involving job due dates and priority levels, which are critical indicators of job urgency and importance.
Form of integration: Knowledge is integrated into the learning framework through action space design rather than the reward function. Unlike traditional DRL approaches that use atomic actions (e.g., assigning a job to a machine), the action space in our D2QN consists of four knowledge-encapsulated local search operators (LS1–LS4) defined in Section 4.4. The agent learns to select the most appropriate heuristic rule for the current state. Additionally, domain knowledge is used in population initialization to provide a high-quality starting point for the evolutionary process.
Contribution to learning: This knowledge-guided design significantly reduces the search space dimensionality and avoids the “cold start” problem common in pure reinforcement learning. By learning to manage high-level heuristics instead of low-level movements, the agent achieves faster convergence and improved solution feasibility. It is noted that the reward function remains purely objective-driven (based on TWT and TEC improvement) to ensure the agent optimizes the true performance metrics without bias.
In the D2QN-COEA framework, the D2QN serves as the decision-making module that adaptively selects the most effective local search operator according to the current state of the solution. The D2QN is trained under a reinforcement learning framework, continuously interacting with the optimization environment to learn which operator yields the greatest improvement under varying scheduling conditions.
Figure 3 illustrates the overall methodological framework of the proposed D2QN-COEA, including problem modeling, state–action–reward design, DQN training and inference, scheduling decision generation, and solution evaluation.
State: The state space is designed to characterize the essential features of the current scheduling solution and to provide sufficient information for the decision-making process of the D2QN. Each state vector s consists of two components, representing positional and factory-related information, with a total dimension of 2n: s = [p, f]. Here, p = (p_1, …, p_n) denotes the position feature vector, and f = (f_1, …, f_n) represents the factory assignment feature vector. Specifically, the i-th element of p indicates the normalized position of job i's first operation in the operation sequence, while the i-th element of f corresponds to the normalized factory index assigned to job i. The state vector is defined in Equation (21):

s = [p_1, …, p_n, f_1, …, f_n], with p_i = π_i / L and f_i = φ_i / F, (21)

where π_i represents the first occurrence position of job i in the sequence, L is the length of the operation sequence, φ_i denotes the factory assigned to job i, and F is the total number of factories. The positional features capture the relative scheduling priority of jobs, while the factory features reflect the distribution of workloads across factories.
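The state construction can be sketched as follows. This is a minimal sketch in which the exact normalization denominators (sequence length and number of factories) are assumptions, and the names `seq` and `factory_of` are illustrative.

```python
import numpy as np

def build_state(seq, factory_of, n_jobs, n_factories):
    """Build the 2n-dimensional state vector: normalized first-operation
    positions followed by normalized factory indices."""
    pos = np.array([seq.index(j) / len(seq) for j in range(n_jobs)])       # p_i
    fac = np.array([factory_of[j] / n_factories for j in range(n_jobs)])   # f_i
    return np.concatenate([pos, fac])
```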
Action: The action space A = {LS1, LS2, LS3, LS4} contains four discrete actions corresponding to the four local search operators introduced in Section 4.4. Each output neuron of the D2QN represents the Q-value associated with one operator. During training, the network learns to predict the expected cumulative reward for each action given the current state, enabling adaptive operator selection.
Reward: The reward function is designed to guide the learning process toward actions that yield the greatest improvement in scheduling performance. The total reward comprises three components: R = r_TWT + r_TEC + r_bonus, where r_TWT represents the improvement in total weighted tardiness, r_TEC represents the improvement in total energy consumption, and r_bonus provides an additional reward when both objectives are simultaneously improved. The individual components are defined as Equations (22)–(24):

r_TWT = α · (TWT_before − TWT_after) / (TWT_before + ε), (22)
r_TEC = β · (TEC_before − TEC_after) / (TEC_before + ε), (23)
r_bonus = r_b if both objectives improve, and 0 otherwise. (24)

Here, α and β are weighting coefficients, r_b is the bonus reward, and ε is a small constant to prevent division by zero.
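The composite reward can be sketched as follows, assuming normalized before/after improvements for each objective; the default weighting values are illustrative, not taken from the paper.

```python
def reward(twt_before, twt_after, tec_before, tec_after,
           alpha=1.0, beta=1.0, bonus=0.5, eps=1e-6):
    """Composite reward: weighted, normalized TWT and TEC improvements plus
    a bonus when both objectives improve simultaneously."""
    r_twt = alpha * (twt_before - twt_after) / (twt_before + eps)
    r_tec = beta * (tec_before - tec_after) / (tec_before + eps)
    r_bonus = bonus if (twt_after < twt_before and tec_after < tec_before) else 0.0
    return r_twt + r_tec + r_bonus
```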
Network Architecture: The D2QN employs a fully connected feedforward neural network structure consisting of one input layer, three hidden layers, and one output layer. The input layer receives the 2n-dimensional state vector. The hidden layers contain 256, 128, and 64 neurons, respectively, with ReLU activation functions applied to introduce nonlinearity. The output layer contains four neurons, each corresponding to one Q-value in the action space, and uses a linear activation function. The network parameters are optimized using the Adam optimizer, and the mean squared error (MSE) is adopted as the loss function to minimize the difference between the predicted and target Q-values.
Double DQN mechanism: To mitigate the overestimation bias commonly observed in traditional DQN models, the proposed algorithm incorporates a Double DQN structure consisting of two networks: the evaluation network Q(s, a; θ) and the target network Q(s, a; θ⁻). During training, the evaluation network is used for action selection, while the target network is used to estimate target Q-values. Given a transition tuple (s, a, r, s′), the target Q-value is computed as Equation (25):

y = r + γ · Q(s′, argmax_{a′} Q(s′, a′; θ); θ⁻), (25)

where γ is the discount factor. The evaluation network is updated by minimizing the mean squared error between the predicted and target Q-values, while the target network parameters θ⁻ are periodically synchronized with those of the evaluation network to ensure stable learning and convergence.
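The Double DQN target computation can be sketched as follows, where `q_eval` and `q_target` stand for forward passes of the two networks returning Q-value vectors over the four operators; the `done` flag is a common convention assumed here for terminal transitions.

```python
import numpy as np

def double_dqn_target(r, s_next, q_eval, q_target, gamma=0.95, done=False):
    """Double DQN target: the evaluation network selects the greedy action,
    while the target network evaluates it."""
    if done:
        return r
    a_star = int(np.argmax(q_eval(s_next)))           # action selection: evaluation net
    return r + gamma * q_target(s_next)[a_star]       # value estimation: target net
```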
Experience replay mechanism: To eliminate the temporal correlation between consecutive samples, an experience replay mechanism is adopted. A replay memory with a fixed capacity stores past transition tuples (s, a, r, s′). During each training iteration, a mini-batch of samples is randomly drawn from the memory to update the network. This mechanism improves data utilization efficiency and enhances the stability of the learning process.
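A minimal replay memory matching this description can be sketched as:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience replay: stores (s, a, r, s') transitions and
    returns uniformly random mini-batches to break temporal correlation."""
    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)  # oldest transitions are evicted

    def push(self, s, a, r, s_next):
        self.memory.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.memory, min(batch_size, len(self.memory)))

    def __len__(self):
        return len(self.memory)
```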
ε-Greedy exploration strategy: To balance exploration and exploitation, the D2QN adopts an ε-greedy policy during action selection. At each decision step, a random action is selected with probability ε to encourage exploration, while the action with the highest Q-value is chosen with probability 1 − ε to exploit the learned policy. As training progresses, the value of ε gradually decays, ensuring extensive exploration in the early stages and stable exploitation in later stages.
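The selection rule and a decaying schedule can be sketched as follows; the multiplicative decay rate and floor value are illustrative assumptions, not the paper's settings.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decay_epsilon(epsilon, rate=0.995, eps_min=0.05):
    """Multiplicative decay toward a floor (schedule values are illustrative)."""
    return max(eps_min, epsilon * rate)
```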
5. Experiment and Result Analysis
This section evaluates the effectiveness and superiority of the proposed D2QN-COEA algorithm through three groups of experiments. First, a set of orthogonal experiments is conducted to determine the optimal parameter configuration and analyze the sensitivity of key parameters to algorithm performance. Second, several comparative experiments are performed against classical multi-objective evolutionary algorithms, including NSGA-II [41], MOPSO [42], MOEA/D [43], and SPEA2, to validate the performance advantage of the proposed method in solving the EA-DFJSP-JP problem. Finally, an ablation study is carried out to assess the contribution of each major component, highlighting the respective roles of the Double DQN mechanism, the co-evolutionary strategy, and the energy-aware optimization scheme.
All experiments are implemented in Python 3.8 and executed on a 64-bit Windows 11 operating system equipped with an Intel(R) Core(TM) Ultra 7 155H processor (3.80 GHz) and 16 GB of RAM. To ensure statistical robustness, each experiment is independently repeated thirty times, and the average results are reported. The performance of all algorithms is comprehensively evaluated using four widely adopted multi-objective quality indicators, including hypervolume (HV), inverted generational distance (IGD), generational distance (GD), and spacing (SP), which collectively measure convergence accuracy, diversity preservation, and distribution uniformity of the obtained Pareto front.
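Two of the adopted indicators, GD and IGD, can be sketched as follows (minimal implementations over objective vectors; HV and SP are omitted for brevity, and the reference front is assumed to be available).

```python
import math

def gd(front, reference):
    """Generational distance: mean Euclidean distance from each obtained
    point to its nearest point on the reference front (lower is better)."""
    dists = [min(math.dist(p, r) for r in reference) for p in front]
    return sum(dists) / len(dists)

def igd(front, reference):
    """Inverted generational distance: GD measured from the reference front
    to the obtained front, capturing both convergence and diversity."""
    return gd(reference, front)
```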
5.1. Experimental Design
To comprehensively evaluate the performance of the proposed D2QN-COEA algorithm in solving the EA-DFJSP-JP, twenty-four benchmark instances are designed to cover small-, medium-, and large-scale scenarios. The instance configurations are derived from both classical benchmark settings in the literature and empirical data collected from real-world manufacturing enterprises, thereby ensuring both theoretical representativeness and practical relevance. Each instance is denoted using the format “F–M–N,” where F represents the number of factories, M the number of machines in each factory, and N the total number of jobs. For example, the instance “2–5–10” corresponds to a scheduling problem with two factories, each equipped with five machines, processing ten jobs in total. The parameter settings for instance generation are summarized as follows. (1) The number of factories reflects typical multi-factory collaborative manufacturing systems. (2) The number of machines per factory is determined according to standard shop-floor configurations. (3) The number of jobs covers various production scales ranging from small-batch to large-batch manufacturing. (4) The number of operations per job simulates the complexity of real-world process routes, and the processing time of each operation (in hours) is generated based on statistical analysis of historical production data collected from a cooperative manufacturing enterprise. (5) The priority weight of each job represents the relative importance of customer orders. (6) The due date of each job is set proportionally to the lower bound of the job’s completion time, scaled by a tightness factor calibrated to match the tight delivery constraints observed in real-world order fulfillment, ensuring that due dates remain feasible yet challenging. (7) Machine power parameters, including the processing power and idle power consumption of each machine (in kW), are specified based on standard CNC machine specifications.
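For reference, the “F–M–N” instance labels can be decoded with a small helper; this is a sketch, and the function name is illustrative.

```python
def parse_instance(label):
    """Parse an 'F-M-N' instance label into (factories, machines_per_factory,
    jobs); en dashes as printed in the paper are normalized to hyphens."""
    f, m, n = (int(x) for x in label.replace("\u2013", "-").split("-"))
    return f, m, n
```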
5.2. Sensitivity Analysis
The performance of D2QN-COEA in solving the EA-DFJSP-JP problem is significantly influenced by its parameter configuration. To identify the optimal parameter combination and analyze the effects of key parameters on algorithmic performance, an orthogonal experimental design was adopted. This approach enables efficient exploration of multiple factors simultaneously while minimizing the number of required experiments. Four key parameters were selected for analysis: population size, archive size, learning rate of the DQN, and exploration rate (ε). Each parameter was set at several discrete levels: population size ∈ {10, 20, 30, 40, 50}, archive size ∈ {25, 30, 40, 50, 65, 80}, learning rate ∈ {0.0001, 0.0005, 0.001, 0.005, 0.01}, and exploration rate ∈ {0.0, 0.05, 0.1, 0.2, 0.3}. The orthogonal table was constructed to ensure balanced representation of parameter interactions, and each experimental configuration was independently executed five times on the 2–5–20 instance. The same evaluation metrics as introduced earlier were used to assess convergence, diversity, and uniformity of the Pareto front. The average results are summarized in Table 5, and the overall performance trends are illustrated in Figure 4. In Figure 4, the star symbols denote the optimal parameter values for each metric, while the shaded areas represent the standard deviation of the results across five independent runs, reflecting the stability of the algorithm under different parameter settings.
According to the analysis of the orthogonal results, the optimal parameter configuration was determined as population size = 20, archive size = 80, learning rate = 0.01, and exploration rate = 0.3. The reasoning for each parameter selection is summarized as follows.
- (1)
Population size. A population size of 20 yielded the highest overall performance (HV = 0.6859, IGD = 0.3303, GD = 0.2595, SP = 0.0022). Although a population of 10 achieved a comparable HV value, its convergence stability was weaker. When the size exceeded 20, all metrics deteriorated significantly, with HV decreasing to 0.4306 at size 50. Hence, 20 was chosen as the most efficient and stable configuration.
- (2)
Archive size. An archive size of 80 achieved the best performance (HV = 0.6561, IGD = 0.3503, GD = 0.3266). Although SP slightly increased (0.0135), the improvement in convergence and diversity outweighed the marginal loss in distribution uniformity. Increasing archive capacity effectively preserved elite solutions and enhanced population diversity.
- (3)
Learning rate. A learning rate of 0.01 provided the highest HV (0.6422) and the lowest GD (0.3269). While IGD was slightly better at 0.005, the HV improvement was more substantial, suggesting that a higher learning rate facilitates faster convergence and better adaptation in policy learning. Therefore, 0.01 was selected as the optimal setting.
- (4)
Exploration rate. An exploration rate (ε) of 0.3 produced the best trade-off between exploration and exploitation, achieving superior results in HV (0.6413), IGD (0.3694), and GD (0.3194). Although SP was marginally higher, the gains in convergence and robustness were dominant. In contrast, low or greedy exploration (ε ≤ 0.05) led to premature convergence, verifying the importance of moderate stochastic exploration for maintaining search diversity.
To evaluate the robustness of D2QN-COEA under different problem characteristics and system configurations, comprehensive sensitivity analyses were conducted with respect to due-date tightness and energy-related parameters, which commonly vary in real-world distributed manufacturing environments.
Three due-date tightness scenarios were examined by adjusting the tightness factor in the due-date definition introduced in Section 5.1. Specifically, a tight scenario represents highly urgent orders with minimal slack time, a medium scenario serves as the baseline reflecting typical industrial conditions, and a loose scenario corresponds to relaxed delivery requirements. Six representative instances covering small to extra-large problem scales were selected: 2-3-20, 2-5-40, 3-4-40, 3-6-60, 4-5-100, and 5-6-160. Each scenario–instance combination was independently executed 30 times, and the average values of four performance metrics were recorded.
The experimental results are summarized in Table 6. Under tight due-date conditions, D2QN-COEA exhibited moderately degraded but still competitive performance, with an average HV of 1.2684 and corresponding GD and IGD values of 0.0865 and 0.1047, respectively. This performance degradation is expected, as stringent due dates substantially reduce the feasible solution space and limit flexibility in balancing tardiness minimization and energy efficiency.
Conversely, under loose due-date conditions, performance improved consistently across all metrics. As shown in Table 6, the average HV increased to 1.3892, while GD and IGD decreased to 0.0647 and 0.0775, respectively. The expanded feasible region allows the algorithm to explore a broader range of trade-off solutions that balance delivery performance and energy consumption more effectively. Notably, the spacing metric reported in Table 6 remained highly stable across all three scenarios, varying only between 0.0006 and 0.0010. This stability indicates that D2QN-COEA maintains a uniform distribution of Pareto-optimal solutions regardless of due-date tightness, which can be attributed to the archive maintenance strategy and the co-evolutionary framework. Importantly, although absolute performance values varied across scenarios, the relative superiority of D2QN-COEA over baseline algorithms remained consistent. Even under the most restrictive tight scenario, D2QN-COEA outperformed the second-best baseline by 15.3% in HV and achieved reductions of 62.7% and 58.4% in GD and IGD, respectively. Wilcoxon rank-sum tests confirmed that all pairwise comparisons were statistically significant across all due-date scenarios.
To assess robustness with respect to variations in energy consumption parameters, sensitivity analyses were performed by uniformly scaling both processing power and idle power. Three scenarios were considered: a low-power scenario (−10%), a baseline scenario, and a high-power scenario (+10%), as defined in Section 5.1. The same six representative instances were tested, with 30 independent runs for each configuration. The results are presented in Table 7. When both processing and idle power were reduced by 10%, the average HV decreased marginally to 1.3104, while GD and IGD increased slightly to 0.0801 and 0.0963, respectively. These minor changes are primarily caused by compression of the energy objective range, which reduces the effective space for exploring trade-offs between tardiness and energy consumption. Nevertheless, convergence quality remained substantially superior to that achieved by all baseline algorithms under standard conditions.
In the high-power scenario, performance metrics remained stable, with average HV, GD, and IGD values of 1.3476, 0.0724, and 0.0881, respectively. The expanded energy objective range facilitates clearer differentiation among trade-off solutions, resulting in slightly improved convergence behavior. Across all power configurations, the spacing metric reported in Table 7 remained nearly unchanged, further confirming robust diversity preservation. An additional experiment examined sensitivity to the idle-to-processing power ratio, which directly affects the potential for energy savings through idle time consolidation. Three ratio configurations were tested while maintaining comparable total energy consumption. Although absolute energy values varied across ratios, the structure of the Pareto fronts and convergence behavior remained stable. Across all configurations, HV values deviated by less than 2.8% from the baseline, while D2QN-COEA consistently maintained superiority margins exceeding 200% in HV and 75% in GD and IGD compared with baseline algorithms. Overall, the sensitivity analyses confirm the robustness of D2QN-COEA under variations in both problem-specific and system-level parameters. Across all tested configurations, the minimum observed performance advantage relative to the second-best baseline algorithm was 13.1% in HV improvement, 56.3% in GD reduction, and 54.2% in IGD reduction.
Statistical significance was verified using Wilcoxon rank-sum tests with Bonferroni correction, with all adjusted p-values below 0.001. Effect size analysis using Cohen’s d yielded values ranging from 1.18 to 2.94, indicating consistently large practical significance. Furthermore, core algorithmic mechanisms remained stable under parameter perturbations. The Double DQN–based operator selection mechanism exhibited coefficients of variation below 12%, indicating robust decision-making across scenarios. The co-evolutionary framework maintained effective diversity preservation, with spacing variation below 14.3%. The energy-adjustment strategy consistently reduced idle energy consumption by an average of 24.3%, confirming its effectiveness across diverse energy configurations. In summary, these results demonstrate that D2QN-COEA is robust to variations in due-date constraints and energy parameters, thereby validating its practical applicability in real-world distributed manufacturing environments.
5.3. Comparison with Other Algorithms
To comprehensively evaluate the performance of the proposed D2QN-COEA, four classical multi-objective evolutionary algorithms, namely NSGA-II, MOPSO, MOEA/D, and SPEA2, were selected for comparative experiments. These algorithms have been extensively applied in multi-objective optimization and are widely recognized for their robustness and representativeness. To ensure the fairness of comparison, all algorithms adopted the same encoding and decoding scheme as well as identical constraint-handling mechanisms. The population size and the maximum number of function evaluations were kept consistent across all algorithms. The experiments were conducted on twenty-four benchmark instances of the EA-DFJSP-JP problem, covering small, medium, and large-scale configurations. Each algorithm was independently executed five times on every instance, and the average and standard deviation of four evaluation indicators, including HV, IGD, GD, and Spacing, were recorded. The detailed comparative results are reported in Table 8, Table 9, Table 10 and Table 11, where the best values for each instance are highlighted in bold.
Figure 5 presents the boxplot comparison of the four indicators among the five algorithms.
The comparative results clearly demonstrate the superior performance of D2QN-COEA across almost all instances and evaluation metrics. Regarding convergence measured by the GD metric, D2QN-COEA achieved the best results in twenty-three out of twenty-four instances, outperforming the other algorithms by a considerable margin. The reduction in GD was more than half compared with the second-best algorithm on average, indicating significantly enhanced convergence capability. Even in large-scale instances, D2QN-COEA maintained stable and low GD values, reflecting its strong ability to converge efficiently in complex search spaces. In contrast, MOPSO generally produced larger GD values, revealing its lower search efficiency when solving discrete optimization problems such as job shop scheduling.
For the IGD metric, which simultaneously reflects convergence and diversity, D2QN-COEA again achieved dominant performance on most test instances. Its average IGD value was substantially lower than that of the comparative algorithms, demonstrating a closer approximation to the true Pareto front. Moreover, the standard deviation of D2QN-COEA was the smallest among all algorithms, indicating its robustness and high repeatability. This stability benefits from the adaptive learning mechanism of the Double Deep Q-Network, which dynamically selects local search operators based on the current state of solutions, allowing the algorithm to consistently maintain high-quality Pareto sets under different problem conditions.
For the Spacing metric, D2QN-COEA obtained the best or nearly best results in more than half of the benchmark cases, while the remaining results were still close to optimal. Its average Spacing value was markedly smaller than that of the other algorithms, which implies that the obtained Pareto fronts were more uniformly distributed with less variation among neighboring solutions. Such balanced distributions are particularly valuable in multi-objective decision-making because they provide decision-makers with a well-dispersed set of trade-off solutions. The improvement in distribution uniformity mainly arises from the cooperative effect between the elite archive updating mechanism and the energy-aware adjustment strategy, which together maintain structural balance and diversity in the solution set.
The comparison of the HV metric further verifies the comprehensive superiority of the proposed algorithm. D2QN-COEA achieved the highest HV values in the vast majority of instances, with only minor differences in a few cases compared with NSGA-II. Its average HV value was significantly higher than those of the comparative algorithms, reflecting better convergence and diversity simultaneously. The performance advantage of D2QN-COEA became even more evident in large-scale problems, where it maintained rapid convergence and high-quality Pareto fronts despite the growing search space. This result highlights that the combination of co-evolutionary global exploration and knowledge-guided reinforcement learning effectively enhances the optimization capability of the algorithm.
From the perspective of stability and robustness, D2QN-COEA consistently exhibited smaller standard deviations across all indicators, confirming its reliability under repeated independent runs. For instance, the variance of its GD values was much lower than that of other algorithms, demonstrating excellent consistency in convergence performance. Such stability is of practical importance for manufacturing scheduling, as it ensures that the optimization results remain reliable and reproducible under different computational conditions. Overall, D2QN-COEA significantly outperformed traditional multi-objective evolutionary algorithms in convergence accuracy, diversity preservation, and solution distribution. By integrating the adaptive local search capability of the Double Deep Q-Network, the global exploration ability of the co-evolutionary framework, and the energy-aware optimization mechanism, the proposed algorithm achieves a superior balance between efficiency and solution quality, thereby validating its effectiveness and advancement for solving the EA-DFJSP-JP problem.
5.4. Case Study
To further evaluate the practical applicability of the proposed D2QN-COEA, a case study was conducted based on a representative medium-scale distributed manufacturing scenario. The case involves two cooperative factories, each equipped with five CNC machines, responsible for completing the production of eighty customer orders. Each order has a specific priority level and delivery deadline, and the number of operations for each job ranges from three to eight. The scheduling objective is to simultaneously minimize the total weighted tardiness and the total energy consumption while satisfying all technological and precedence constraints.
The optimization results of the five algorithms for this case are summarized in Table 12, where the mean and standard deviation values of four performance metrics (HV, GD, IGD, and Spacing) are reported. The best results for each indicator are highlighted in bold. As shown in Table 12, D2QN-COEA consistently achieved the best performance across all metrics. Its HV value was significantly higher than those of the other algorithms, exceeding the second-best method by more than 15%, which demonstrates its superior overall optimization capability. Regarding convergence metrics, both GD and IGD values obtained by D2QN-COEA were markedly lower than those of the comparative algorithms, indicating faster and more stable convergence toward the true Pareto front. In terms of solution distribution, D2QN-COEA achieved the smallest Spacing value, revealing that its solutions are more uniformly distributed in the objective space and offer a better balance between conflicting objectives. Furthermore, the standard deviations of all metrics were relatively small, confirming the stability and consistency of the proposed algorithm across multiple independent runs.
A representative Pareto-optimal schedule obtained by D2QN-COEA is illustrated in Figure 6, which shows the Gantt chart of one typical solution. The selected solution achieves a well-balanced trade-off between total weighted tardiness and total energy consumption. As observed from the Gantt chart, production loads are evenly distributed between the two factories, and machine utilization is compact and well-organized, ensuring that all precedence and resource constraints are satisfied. These results demonstrate that D2QN-COEA can effectively generate high-quality scheduling solutions in complex distributed manufacturing environments, confirming its practicality and superiority for real-world applications of the EA-DFJSP-JP problem.
6. Conclusions
This study proposed the EA-DFJSP-JP, which integrates energy efficiency and job prioritization within a distributed manufacturing framework. A bi-objective optimization model was formulated to minimize total weighted tardiness and total energy consumption, considering both processing and idle power usage. To effectively solve this NP-hard problem, a D2QN-COEA was developed. The proposed approach embeds domain knowledge into a deep reinforcement learning framework to enable adaptive operator selection, while a co-evolutionary strategy enhances global exploration and convergence stability.
Comprehensive experiments on 24 benchmark instances and a real-world case study demonstrated that D2QN-COEA consistently outperforms classical multi-objective evolutionary algorithms in terms of convergence accuracy, diversity preservation, and energy efficiency. The algorithm achieves superior hypervolume values, lower convergence distances, and more uniform Pareto front distributions, confirming its robustness and scalability for large-scale distributed scheduling problems. The case analysis further verified its practical applicability, showing balanced workload distribution and significant reductions in tardiness and energy consumption across cooperative factories. Quantitatively, the proposed algorithm achieves an average improvement of over 15% in HV and a reduction of more than 50% in GD, indicating significantly superior convergence accuracy and solution quality. The validity of these findings is subject to the boundary conditions of the proposed EA-DFJSP-JP model. Specifically, the results are based on the assumptions that (1) total energy consumption comprises only processing and idle power, (2) cross-factory processing of a single job is prohibited once assigned, and (3) inter-factory transportation time is considered negligible or integrated into processing durations.
In summary, integrating domain knowledge with deep reinforcement learning provides an effective pathway toward intelligent, energy-efficient decision-making in distributed manufacturing systems. Future research will focus on extending the proposed framework to stochastic and dynamic environments, incorporating real-time energy feedback and digital twin technologies to further enhance adaptability and industrial applicability.