Profit-Oriented Multi-Objective Dynamic Flexible Job Shop Scheduling with Multi-Agent Framework Under Uncertain Production Orders

Ma, Qingyao; Lu, Yao; Chen, Huawei

doi:10.3390/machines13100932

Open AccessArticle

Profit-Oriented Multi-Objective Dynamic Flexible Job Shop Scheduling with Multi-Agent Framework Under Uncertain Production Orders

by

Qingyao Ma

¹

,

Yao Lu

^1,*

and

Huawei Chen

²

¹

School of Mechanical and Electrical Engineering, Guizhou Normal University, Guiyang 550025, China

²

School of Mechanical and Electrical Engineering, Guang’an Institute of Technology, Guang’an 638000, China

^*

Author to whom correspondence should be addressed.

Machines 2025, 13(10), 932; https://doi.org/10.3390/machines13100932

Submission received: 25 August 2025 / Revised: 19 September 2025 / Accepted: 22 September 2025 / Published: 9 October 2025

(This article belongs to the Section Industrial Systems)

Download

Browse Figures

Versions Notes

Abstract

In the highly competitive manufacturing environment, customers are increasingly demanding punctual, flexible, and customized deliveries, compelling enterprises to balance profit, energy efficiency, and production performance while seeking new scheduling methods to enhance dynamic responsiveness. Although deep reinforcement learning (DRL) has made progress in dynamic flexible job shop scheduling, existing research has rarely addressed profit-oriented optimization. To tackle this challenge, this paper proposes a novel multi-objective dynamic flexible job shop scheduling (MODFJSP) model that aims to maximize net profit and minimize makespan on the basis of traditional FJSP. The model incorporates uncertainties such as new job insertions, fluctuating due dates, and high-profit urgent jobs, and establishes a multi-agent collaborative framework consisting of “job selection–machine assignment.” For the two types of agents, this paper proposes adaptive state representations, reward functions, and variable action spaces to achieve the dual optimization objectives. The experimental results show that the double deep Q-network (DDQN), within the multi-agent cooperative framework, outperforms PPO, DQN, and classical scheduling rules in terms of solution quality and robustness. It achieves superior performance on multiple metrics such as IGD, HV, and SC, and generates bi-objective Pareto frontiers that are closer to the ideal point. The results demonstrate the effectiveness and practical value of the proposed collaborative framework for solving MODFJSP.

Keywords:

double deep Q-network; dynamic flexible job shop scheduling problem; multi-agent reinforcement learning; uncertain production orders

1. Introduction

Over the past few decades, economic globalization has reshaped industrial and value chains with unprecedented depth and breadth. While unlocking economies of scale and accelerating technological diffusion, it has also continuously exposed traditional manufacturing to uncertainties such as demand fluctuations and factor price shocks [1]. As the cornerstone of industrial systems and the macroeconomy, the capacity of the manufacturing sector directly reflects a nation’s comprehensive strength and international competitiveness [2]. With the advancement of intelligent manufacturing technologies, manufacturing paradigms are shifting from uniformity to diversification, imposing higher demands on enterprises’ production organization and management [3].

The job shop scheduling problem (JSP) is one of the most fundamental and frequently deliberated decision-making issues in enterprise production management [4]. Its essence lies in arranging task sequences and allocating resources under limited time and resource constraints, with the aim of minimizing production cycles, reducing costs, and maximizing efficiency [5]. As a significant extension of JSP, the flexible job shop scheduling problem (FJSP) allows each operation to be processed on multiple alternative machines and enables flexible adjustment of process routes according to task requirements [6]. This feature not only enhances the flexibility and resource utilization of production systems but also alleviates bottlenecks, enabling enterprises to maintain high production efficiency and delivery reliability even under complex scenarios such as fluctuating order structures and tight due dates [7]. Consequently, FJSP has become an indispensable tool for modern production scheduling optimization and enterprise operation management, and it has been widely applied in high-demand industries such as semiconductor manufacturing, automotive production, and metallurgical processing [8].

In real-world production, enterprises under fierce market competition and increasing customer demands are often required to simultaneously shorten production cycles and enhance profits [9]. However, the production process is frequently accompanied by dynamic events such as job insertions, machine breakdowns, and due date changes, which transform the scheduling problem from a static FJSP into a dynamic FJSP (DFJSP), thereby significantly increasing decision-making complexity [10]. Moreover, a single optimization objective can no longer meet the needs of modern manufacturing systems, and thus the DFJSP has further evolved into the multi-objective flexible job shop scheduling problem (MODFJSP) [11].

For the MODFJSP, researchers have proposed various heuristic, metaheuristic, and machine learning approaches [12,13,14]. Although metaheuristic algorithms can effectively cope with uncertainties, they often require re-tuning when applied to new production lines or disturbances, thus limiting their generalization ability [9]. Deep reinforcement learning (DRL) offers advantages in adaptability and real-time responsiveness, and in particular, multi-agent DRL (MADRL) has demonstrated greater robustness and scalability in solving MODFJSP [15]. However, existing MADRL-based studies mainly focus on minimizing production time and reducing costs, while placing insufficient emphasis on profit maximization. Especially under uncertain production orders, how to simultaneously ensure production efficiency and maximize net profit remains an urgent challenge. The main contributions of this research are as follows.

For the MODFJSP with dynamic new job arrivals, high-profit urgent jobs, and fluctuating due dates, an optimization model is constructed with two objectives: minimizing makespan and maximizing net profit.
A sequential multi-agent collaborative framework of “job selection → machine assignment” is constructed: the job agent first decides the job to be processed, and the machine agent then assigns the pending operation of the selected job to a feasible machine set, thereby enabling efficient solution of the proposed MODFJSP.
Customized reward functions are designed for the two agents with the objectives of maximizing profit and minimizing makespan, accompanied by adaptive state representations and variable action output mechanisms, which significantly enhance learning efficiency and enable multi-objective collaborative optimization.
The system conducts a comparative analysis of various mainstream DRL algorithms and classical dispatching rules in solving the proposed MODFJSP, thereby identifying the optimal DRL-based solution.

2. Literature Review

In this section, we first review JSP-related studies focusing on cost or profit, then summarize the research contributions on solving FJSP using DRL, and finally highlight the existing research gaps.

2.1. Research on Production Scheduling Problems Considering Cost or Profit

In recent years, an increasing number of researchers have focused on the JSP with energy cost considerations under time-of-use (TOU) electricity pricing, aiming to reduce operating expenses for manufacturing enterprises [16]. For example, Shen et al. [17] studied the TOU mechanism by optimizing operation start times, sequencing, and machine assignments, demonstrating that even with a slight increase in makespan, energy costs can be effectively reduced. The study provided a rigorous and practical basis for cost reduction and efficiency improvement in FJSP under TOU pricing. Similarly, Park and ham [18] incorporated TOU pricing and planned machine downtime to minimize makespan and total energy cost, employing integer linear programming and constraint programming approaches to effectively lower energy consumption in manufacturing.

By contrast, fewer studies have approached scheduling from a profit-oriented perspective, considering the combined impact of scheduling decisions on enterprise revenue and cost. Zouadi et al. [19] innovatively addressed profit orientation and order acceptance by applying a robust decomposition method to maximize production profit, particularly suitable for industries with volatile prices and fragmented customer demand. In the existing FJSP literature, profit-oriented research remains scarce. However, thanks to the dual flexibility of jobs and machines, FJSP still holds substantial potential for profit improvement. Moreover, the MODFJSP can further embed various dynamic events to achieve global adaptive optimization of production and profit. This study innovatively incorporates enterprise profit into the MODFJSP framework, thereby enhancing the effectiveness of scheduling decisions.

2.2. Research on Solving FJSPs with DRL

With the increasing complexity of scheduling problems and the presence of multiple dynamic disturbances (such as job insertions and price fluctuations), traditional heuristic algorithms can hardly provide efficient responses [12]. DRL, leveraging its trial-and-error mechanism and adaptive policy learning, offers a new decision-making paradigm for such high-dimensional, dynamic, and non-stationary environments [20]. In the field of FJSP, an increasing number of researchers have proposed various solutions for FJSP [21]. For example, Luo [22] designed a Deep Q-Network (DQN) algorithm to address DFJSP by incorporating new job insertions to reduce delays. Lu et al. [23] applied a Double Deep Q-Network (DDQN) to solve DFJSP with dynamic job arrivals and urgent job insertions, aiming to minimize makespan (C_max) and average tardiness. Lei et al. [12] proposed a proximal policy optimization (PPO) algorithm with a multi-pointer graph network (MPGN) structure to reduce makespan in FJSP. Existing studies have further verified that when solving FJSP with single-agent DRL methods, the state–action space often suffers from combinatorial explosion, whereas adopting a multi-agent approach can effectively alleviate this issue [15].

Therefore, multi-agent deep reinforcement learning has been widely applied to solve DFJSP [24]. For example, Liu et al. [25] focused on DFJSP with dynamic job arrivals, aiming to minimize total tardiness. Yan et al. [26] also considered DFJSP with dynamic job arrivals and employed a DQN-based multi-agent framework to address factory selection, job allocation, and operation sequencing, achieving the objective of minimizing total tardiness through agent collaboration. Ref. [27] proposed a multi-agent deep reinforcement learning based scheduling framework to address the complex scheduling problem in flexible job shops with the integration of automated guided vehicles (AGVs), achieving collaborative optimization of production and transportation resources. Table 1 summarizes the above studies on applying deep reinforcement learning methods to FJSP, along with our proposed approach.

2.3. Research Gaps

By summarizing the existing studies, three research gaps can be identified:

Most cost- or profit-oriented FJSP studies mainly focus on electricity costs under TOU pricing policies, while giving insufficient attention to the penalty costs incurred by tardiness.
Most studies focus solely on cost; however, in practice, some jobs differ in the remuneration offered by customers due to process complexity or specific requirements. How to incorporate such heterogeneous profits into DFJSP and maximize net profit remains to be further investigated.
Most DRL-based studies on DFJSP primarily consider dynamic events such as new job insertions, while neglecting profit and due date fluctuations. Moreover, when targeting profit maximization and makespan minimization, further technical challenges arise in the design of scheduling strategies and reward functions to improve agent learning efficiency.

3. Problem Description for MODFJSP

In this section, to illustrate the considered MODFJSP, we first provide a detailed description. Next, the notations and constraints are presented. Finally, the mathematical model is formulated.

3.1. Problem Description

The MODFJSP considered in this study can be defined as follows. A set of n consecutively arriving jobs J = {1, 2, …, n} must be processed on m machines M = {1, 2, …, m}. Each job i consists of n_i operations O_i = {1, …, n_i}, where O_i,j denotes the jth operation of job i. Each operation O_i,j can be processed on any machine selected from its predefined feasible machine set M_i,j (M_i,j⊆M). The processing time of operation O_i,j on machine k is denoted as t_i_,j_,k. The arrival time and due date of job i are denoted as A_i and D_i, respectively. S_i_,j_,k and C_i,j,k represent the actual start time and completion time of operation O_i_,j on machine k, respectively. The system consists of initial orders and dynamically arriving

{J o b}_{n e w}

new jobs. Due dates are adjusted according to environmental changes, and some jobs are urgent, generating higher revenue but incurring stricter tardiness penalties. The energy consumption cost is calculated based on TOU electricity pricing. The objectives are to maximize net profit and minimize makespan under resource constraints. The decision process proceeds in two steps: first, selecting the job to be processed, and second, assigning its next operation to a machine. For ease of understanding, Figure 1 illustrates the considered MODFJSP.

3.2. Mathematical Model

Before presenting the mathematical model of MODFJSP, we provide a detailed description of the notations used. Table 2 summarizes these notations, including the variables related to jobs, machine resources, processing times, and scheduling decisions.

In real-world production, production efficiency is often reflected in the reduction in C_max. Job processing generates electricity consumption costs, while delivery delays incur penalties, which are even higher for urgent orders. Due to differences in process complexity, customer remuneration varies across jobs, and energy consumption also differs among machines. Based on these considerations, this study proposes a bi-objective model aiming to minimize C_max and maximize net profit. The objectives and constraints are summarized as follows.

F_{1} = Minimize (C_{m a x})

(1)

F_{2} = Maximum (\sum_{i \in J} T P_{i}) = Maximum (\sum_{i \in J} p_{i} - \sum_{t \in T} c_{t} \cdot e_{t} - \sum_{i \in J} p e n a l t y_{i})

(2)

Subject to:

\sum_{i \in J} x_{t}^{i, j, k} \leq 1, \forall t \in T, k \in M_{i, j}, j \in O_{i}

(3)

b_{t}^{k} \geq x_{t}^{i, j, k}, \forall t \in T, k \in M_{i, j}, i \in J, j \in O_{i}

(4)

C_{i} \geq 0, \forall i \in J

(5)

t_{i, j, k} > 0, \forall i \in J, j \in O_{i}, k \in M_{i, j}

(6)

(C_{i, j} - t_{i, j, k} - C_{i, j - 1}) \cdot x_{t}^{i, j, k} \geq 0, \forall i \in I_{n}, j \in J_{i}, k \in M_{i, j}

(7)

(C_{i, 1} - t_{i, 1, k} - A_{i}) \cdot x_{t}^{i, 1, k} \geq 0, \forall i \in I_{n}, k \in M_{i, j}

(8)

Objective Equation (1) minimizes the C_max, while Objective Equation (2) maximizes net profit after accounting for energy costs under TOU pricing and tardiness penalties. Constraint Equation (3) ensures that each job can be processed on at most one machine at any given time. Constraint Equation (4) specifies that if a job is assigned to machine k at time t, then the machine must be in the “on” state at that time. Constraint Equation (5) requires that the completion time of each job be non-negative. Constraint Equation (6) ensures that the processing time of each operation is non-negative. Constraint Equation (7) guarantees that the operations of each job are executed in sequence. Constraint Equation (8) ensures that jobs can only be processed after their arrival time.

4. Multi-Agent Collaboration Framework for the Proposed MODFJSP

Table 3 provides the explanations of the variables and notations used in this section, serving as a reference for the subsequent content. This section first explains the symbols and notations used, then introduces the core algorithms adopted by the multi-agent collaboration framework. Subsequently, it elaborates on the design of each agent’s state inputs, reward functions, and action spaces. Finally, it presents the complete implementation process and technical details for solving the MODFJSP.

4.1. The DRL Algorithm for the Multi-Agent Cooperative Framework

The interaction between reinforcement learning and the scheduling environment is modeled as a Markov Decision Process (MDP) [28]. MDP is the standard framework for sequential decision-making problems and is typically represented as a four-tuple (S, A, P, R) [29]. Here, S denotes the state space, A the action space, R the reward function, and P the state transition function. Figure 2 illustrates the workflow of reinforcement learning (RL) under the MDP framework [30]. DRL is an extension of RL that integrates deep neural networks into state representation and policy learning. Its main challenge lies in how to stably and efficiently optimize the policy to maximize cumulative rewards, for which various algorithmic strategies (e.g., DQN, DDQN, PPO) have been proposed [15]. Among them, DDQN is an improvement over DQN designed to alleviate the overestimation of Q-values [31]. DDQN introduces two networks: the action-selection network (policy network) and the target-evaluation network (target network), thereby decoupling action selection from value estimation [32]. The target update formula is given as

y_{s t}^{D D Q N} = r_{s t} + γ Q (s_{s t + 1}, \underset{a^{'}}{a r g m a x} Q (s_{s t + 1}, a^{'}; θ); θ^{-})

(9)

Here,

θ

denotes the parameters of the online network (policy network), and

θ^{-}

denotes the parameters of the target network. First, the online network is used to select the optimal action at the next decision point st + 1, i.e.,

\underset{a^{'}}{a r g m a x} Q (s_{s t + 1}, a^{'}; θ^{'})

. a′ presents candidate action from the action space. Then, the target network evaluates the Q-value of this action.

4.2. Definition of DRL Problem

Given the characteristics of FJSP, this study applies DDQN to solve the MOFJSP by employing a new multi-agent framework with a sequential decision-making process: first, the job agent selects the job to be processed and interacts with the MODFJSP environment constructed in this study; then, the environment generates the candidate machine states for the current operation of the selected job; finally, the machine agent chooses an appropriate machine based on this information, forming a “job selection–machine assignment” multi-agent collaborative framework. The overall process is illustrated in Figure 3, which provides a more intuitive representation of the cooperation between agents.

For the above components, this section sequentially introduces the state inputs, reward functions, and action spaces of the two agents, followed by the detailed procedure of the complete multi-agent solution for MODFJSP.

4.3. State Input

In existing DRL-based studies for solving MODFJSP, the processing status of all jobs is often abstracted into a set of variables [33], through which agents perceive the shop-floor environment. However, such approaches usually lack fine-grained characterization of unfinished jobs. Based on the characteristics of FJSP and the optimization objectives, this study treats the relevant features of each unfinished job

i \in U C_{s t}^{j o b}

and its pending operation on each candidate machine k ∈ M_i,j as variable states, which are then recognized by the job agent and the machine agent. This variable state design enables agents to enhance their decision-making capabilities more effectively.

4.3.1. State Input of the Job Agent

To support the job agent’s decision-making in job selection, the job state

S_{s t}^{'}

is defined. Each unfinished job state

s_{s t}^{i}

within it varies with

{UC}_{s t}^{j o b}

and can be expressed as

S_{s t}^{'} = {s_{s t}^{i} | i \in {UC}_{s t}^{j o b}}, s_{s t}^{i} = [i, R_{i}^{o p}, T_{i}^{r e m}, {TD}_{i}^{n o r m}, U f_{i}, I_{i}^{u r g}, {TP}_{i}^{n o r m}]

(10)

Except for the job index i, all other features are normalized to enhance the feature discriminability and convergence efficiency of the neural network. A detailed description is provided in Table 4.

4.3.2. State Input of the Machine Agent

For the selected job

i_{s t}

and its pending operation

j \leftarrow O_{s t}^{i_{s t}} + 1

, the machine agent constructs a state vector for each available machine

{k \in M}_{i, j}

:

S_{s t}^{″} = {s_{s t}^{k}, k \in M_{i_{s t}, j}}, s_{s t}^{k} = [k, e_{k}^{n o r m}, t_{i_{s t}, j, k}^{n o r m}, Cos t_{k}^{n o r m}, A_{k}^{n o r m}, T P_{i_{s t}}^{n o r m}]

(11)

Table 5 summarizes the detailed descriptions of the available machine state variables.

4.4. Reward Functions

4.4.1. Reward Function of the Job Agent

The job agent takes into account the effects of tardiness and job urgency. If a job is urgent and incurs tardiness, the reward function provides stronger positive guidance while imposing penalties on the degree of tardiness. In this way, under the same conditions, jobs with higher urgency and existing tardiness are prioritized, thereby reducing overall tardiness penalties and improving delivery reliability. Algorithm 1 presents the pseudocode of the reward function.

Algorithm 1 Reward function of the job agent
Input: Selected job $i_{s t}$ , $U_{i}, {TD}_{i_{s t}}$ ; $U f_{i_{s t}}$
Output: $r_{s t}^{'}$
1.	$r_{s t}^{'} \leftarrow 0$
2.	${T D}_{m a x} \leftarrow {a r g m a x}_{i \in J} {{T D}_{i}}$
3.	$T D_{i_{s t}}^{n o r m} \leftarrow \frac{T D_{i_{s t}}}{T D_{m a x} + ε}, ε$ = 1 × 10⁻⁶
4.	if selected job $i_{s t}$ $\in$ $U_{i}$ then
5.	$r_{s t}^{'}$ + = $ω_{1} \cdot U f_{i_{s t}}$ − $ω_{2} \cdot {TD}_{i_{s t}}^{n o r m}$
6.	else
7.	$r_{s t}^{'} - = {TD}_{i_{s t}}^{n o r m}$
8.	end if
9.	return $r_{s t}^{'}$

4.4.2. Reward Function of the Machine Agent

When selecting machines, the machine agent comprehensively considers processing profit, electricity cost, and tardiness impact. First, these three factors are normalized as

{T P}_{i_{s t}}^{n o r m}, {C o s t}^{n o r m}

, and

T D_{i_{s t}}^{n o r m}

, respectively. In the decision-making process, profit is assigned the highest weight, followed by electricity cost from energy consumption, and finally the impact on tardiness. The electricity cost is normalized as the ratio of the actual electricity cost to the benchmark of the “highest electricity price,” where

c^{h i g h}

denotes the maximum unit price in the TOU tariff. Algorithm 2 presents the pseudocode of the reward function for machine selection.

Algorithm 2 Reward function of the machine agent
Input: Selected job i_st and machine k_st, ${T P}_{i_{s t}}, T D_{i}$
1.	${T D}_{m a x} \leftarrow {a r g m a x}_{i \in J} {{T D}_{i}}$
2.	${C o s t}_{k_{s t}}^{n o r m} \leftarrow \frac{\sum_{t \in [s_{i_{s t}, j, k_{s t}}, c_{i_{s t}, j, k_{s t}}]} c_{t} \cdot e_{t}}{\sum_{t \in [s_{i_{s t}, j, k_{s t}}, c_{i_{s t}, j, k_{s t}}]} c^{h i g h} \cdot e_{t}}$
3.	${T P}_{i_{s t}}^{n o r m} \leftarrow \frac{{T P}_{i_{s t}}}{{a r g m a x}_{i \in J} {{T P}_{i}}}$ , ${T P}_{i_{s t}}$ is calculated according to Equation (2)
4.	$T D_{i_{s t}}^{n o r m} \leftarrow \frac{T D_{i_{s t}}}{T D_{m a x} + ε}, ε$ = $1 \times 10^{- 6}$
5.	$r_{s t}^{″} = ω_{3} \cdot {T P}_{i_{s t}}^{n o r m} - ω_{4} \cdot {C o s t}_{k_{s t}}^{n o r m} - T D_{i_{s t}}^{n o r m}$
6.	return $r_{s t}^{″}$

4.5. Definition of Action Spaces

4.5.1. Action Space of the Job Agent

In the decision-making process, the job agent constructs its action space from the states of unfinished jobs, computes the corresponding action values, and then selects the optimal action using the ϵ-greedy strategy. The job index corresponding to the chosen action is the job actually scheduled.

A c t i o n {s p a c e}^{'} = {s_{s t}^{i} | i \in U C_{s t}^{j o b}}

(12)

a_{s t}^{i} = \underset{i \in U C_{s t}^{j o b}}{a r g m a x} Q (s_{s t}^{i}; θ)

(13)

Figure 4 presents the overall structure of the job agent and its interaction process with the environment. At each decision point st, the job agent first identifies the states of unfinished jobs as input. The policy network outputs Q-value estimates for all actions, and actions are selected using a greedy strategy. The target network has the same structure as the policy network but is updated differently: the policy network parameters are updated via gradient descent based on the mean squared error (MSE) loss function, while the target network parameters are gradually migrated from the policy network through soft updates, thereby improving training stability and convergence. Meanwhile, DDQN introduces an experience replay mechanism, in which the state transition samples

(S_{s t}^{'}, a_{s t}^{i}, r_{s t}^{'}, S_{s t}^{'})

obtained from environment interactions are stored in a replay buffer. During training, mini-batches of samples are randomly drawn to break data correlations and improve sample efficiency.

4.5.2. Action Space of the Machine Agent

The action space of the machine agent is composed of the set of state feature vectors of all candidate machines capable of processing the job.

A c t i o n {s p a c e}^{″} = {s_{s t}^{k} | k \in M_{i_{s t}, j}}

(14)

The machine agent computes the corresponding action values based on the given features and selects the optimal action using the ϵ-greedy strategy. The machine index corresponding to the chosen action is then assigned as the processing machine for

O_{i_{s t}, j} .

a_{s t}^{k} = \underset{k \in M_{i_{s t}, j}}{a r g m a x} Q (s_{s t}^{k}; θ)

(15)

The network structure of the machine agent is illustrated in Figure 5. At each scheduling point st, after the job agent selects the job to be processed, the MODFJSP environment provides the machine states of the candidate machine set

M_{i_{s t}, j}

for its pending operation. The policy network then outputs Q-value estimates for all candidate machine features, and a greedy strategy is adopted for action selection. The target network has the same structure as the policy network, and its update and sampling procedures are identical to those of the job agent.

4.6. Multi-Agent Collaborative Framework for the Proposed MODFJSP

The two agents learn and update independently but collaborate in the same environment following the sequence of “job selection → machine assignment.” Multi-agent DDQN(MADDQN), experience replay, and soft updates of the target network are employed to enhance stability and reduce overestimation bias. The detailed procedure is given in Algorithm 3. For further illustration, Figure 6 demonstrates the process: the environment provides job states → the job agent selects a job

i_{s t}

→ the environment provides feasible machine states

S_{s t}^{″} (i_{s t})

→ the machine agent selects a machine

k_{s t}

→ the environment updates the schedule and computes energy consumption and tardiness → the transition is stored in the replay buffer and the networks are updated, and this cycle continues until all jobs are scheduled.

Algorithm 3 Pseudocode of the training process for solving MODFJSP using MADDQN
1.	Initialize job agent replay memory buffer 1 to capacity N, and machine agent buffer 2 to capacity N′
2.	Initialize job agent’s online network Q with random weight $θ$ and target network $θ^{-}$
3.	Initialize machine agent’s online network Q with random weight $θ$ and target network $θ^{-}$
4.	for epoch = 1: L do
5.	Generate the job agent state $S_{s t}^{'} = {s_{s t}^{i} \| i \in U C_{s t}^{j o b}}, s_{s t}^{i} = [i, 0,0, 0,0, 0,0]$
6.	for $s t$ = 1: $\sum_{i \in J} n_{i}$ do//where $s t$ represents a decision point corresponding to an operation completion or the arrival of a new job
7.	Job agent execute action $a_{s t}^{i}$ to obtain the selected job $i_{s t}$ ; the environment then provides the available machine state $S_{s t}^{″}$ for the pending operation ( $j \leftarrow O_{s t}^{i_{s t}} + 1$ ).
8.	The machine agent makes the decision $a_{s_{t}}^{k}$ , interacts with the environment to assign the selected machine $k_{s t}$ , and then the environment updates to provide the next job agent state $S_{s t + 1}^{'}$ . The next available machine state $S_{s t + 1}^{″}$ is observed after the job agent selects next job $i_{s t + 1}$ .
9.	Compute the reward function $r_{s t}^{'}$ for job agent and $r_{s t}^{″}$ for machine agent based on Section 4.4.1 and Section 4.4.2. buffer 1←( $S_{s t}^{'}, a_{s t}^{i}, r_{s t}^{'}, S_{s t + 1}^{'}$ ), buffer 2← ( $S_{s t}^{″}, a_{s t}^{k}, r_{s t}^{″}, S_{s t + 1}^{″})$ .
10.	If the size of buffer 1 $\geq$ batch size 1 then:
11.	Sample minibatch from buffer 1
12.	for each sample:
13.	$y_{s t}^{D D Q N} = r_{s t}^{'} + γ Q (S_{s t + 1}^{'}, \underset{a^{'}}{\arg \max} Q (S_{s t + 1}^{'}, a^{'}; θ); θ^{-})$ //Use target network to evaluate target value
14.	end for
15.	$L_{l o s s}$ = MSE $(y_{s t}^{D D Q N}, Q (S_{s t}^{'}, a_{s t}^{i}; θ))$ //Compute job agent loss
16.	Update $θ$ by minimizing $L_{l o s s} . θ^{-}$ ← (1 − $τ$ ) $θ^{-} + τ \cdot θ$ //Soft update of job agent’s target network
17.	end if
18.	If the size of buffer 2 $\geq$ batch size 2 then:
19	Sample minibatch from buffer 2
20.	for each sample:
21.	$y_{s t}^{D D Q N} = r_{s t}^{″} + γ Q (S_{s t + 1}^{″}, \underset{a^{'}}{\arg \max} Q (S_{s t + 1}^{″}, a^{'}; θ); θ^{-})$ //Use target network to evaluate target value
22.	end for
23.	$L_{l o s s}$ = MSE $(y_{s t}^{D D Q N}, Q (S_{s t}^{″}, a_{s t}^{k}; θ))$ //Compute machine agent loss
24.	Update $θ$ by minimizing $L_{l o s s} . θ^{-}$ ← (1- $τ$ ) $θ^{-} + τ \cdot θ$ //Soft update of machine agent’s target network
25.	end if
26.	end for
27.	end for

5. Numerical Experiments

This section first introduces the method of generating numerical simulation instances, followed by a description of the multi-objective evaluation metrics and the necessary parameter settings to ensure the validity of the experiments. To verify the effectiveness of the proposed multi-agent collaborative model in solving MODFJSP, it is compared with commonly used multi-agent reinforcement learning algorithms as well as classical scheduling rules suitable for this study, thereby further demonstrating the advantages of the approach. Finally, the optimization results are visualized to provide more intuitive research insights.

5.1. Generation of Numerical Instances

Since classical FJSP benchmark instances do not contain information on job arrival times or due dates, this study randomly generates numerical simulation instances for training and testing the proposed multi-agent collaborative model. Following the method in [22], the inter-arrival times of newly inserted jobs are assumed to follow an exponential distribution, Exponential (1/

λ_{n e w}

). The due date of job i is calculated as the sum of its arrival time

A_{i}

and the average processing time, scaled by a due date tightness factor (DDT):

D_{i} = A_{i} + D D T \cdot \sum_{j = 1}^{n_{i}} \frac{\sum_{k \in M_{i, j}} t_{i, j, k}}{| M_{i, j} |}

=

A_{i} + D D T \cdot \sum_{j = 1}^{n_{i}} {\bar{t}}_{i, j},

where

{\bar{t}}_{i, j}

denotes the average processing time of operation O_i,j over its feasible machine set M_i,j. By combining different values of

λ_{n e w}

and DDT, a total of 24 numerical instances are generated. For urgent jobs, customers provide higher prices. The detailed experimental settings are summarized in Table 6. Electricity costs are calculated based on the TOU pricing scheme referenced in [34], with specific settings provided in Table 7.

5.2. Evaluation Metrics

Three indicators are used to evaluate the proposed method in the multi-objective optimization problem: Set Coverage (SC) [35], Inverted Generational Distance (IGD) [36], and Hypervolume (HV) [37].

(1): IGD

It is used to measure the closeness and coverage of a solution set generated by an algorithm relative to the true Pareto Front. However, due to the complexity of MODFJSP, the ideal Pareto Front P* is unknown, and thus the Pareto front approximated from the non-dominated solutions obtained by all algorithms is used as P* [38]. Moreover, a lower IGD value indicates that the solutions derived from the evaluated algorithm exhibit better convergence and diversity. It is calculated using Equation (16).

I G D (Q, P^{*}) = \frac{\sum_{p_{i} \in P^{*}} \min_{q \in Q} ‖p_{i} - q‖}{| P^{*} |}

(16)

P^{*} = {p_{1}, p_{2}, . . ., p_{N}}

: the true Pareto front.

Q = \{q_{1}, q_{2}, . . ., q_{V}\} :

the set of non-dominated solutions generated by the algorithm.

\min_{q \in Q} ‖p_{i} - q‖

: the Euclidean distance from a true solution

p_{i}

to its closest point in the approximate solution set

Q

.

(2): HV

In multi-objective optimization, HV is a metric frequently used to measure the volume of the objective space covered by a set of solutions. Specifically, it represents the union of hyper-rectangular regions formed between each non-dominated point

q_{i}

and a reference point r. A larger HV value indicates that the solution set is closer to the Pareto front, has broader coverage, and is of higher quality. It is calculated by applying Equation (17).

H V (Q, r) = V o l u m e (\underset{q_{i} \in Q}{\cup} [q_{i}, r])

(17)

The reference point is set as r = (1, 1, 1).

(3): SC

It is used to measure the dominance relationship between two sets of non-dominated solutions, i.e., to compare their relative quality and convergence. Here, U denotes the non-dominated solution set generated by the current algorithm, and V denotes that generated by the comparison algorithm. u dominates v means that u is no worse in all objectives and strictly better in at least one.

SC(U,V) = 1: all solutions in U dominate all solutions in V → clear superiority.
SC(U,V) = 0: none of the solutions in V are dominated by any solution in U → clear inferiority.
0 < SC(U,V) < 1: the solution set U has a certain degree of dominance over V.

The closer the value of SC(U,V) is to 1, the higher the solution quality of the current algorithm and the stronger its dominance advantage. Conversely, the closer it is to 0, the weaker the advantage, even indicating inferiority.

S C (U, V) = \frac{|{v \in V | \exists u \in U, u ≺ v}|}{| V |}

(18)

To facilitate measurement and reduce the influence caused by differences in units or magnitudes among objectives, all objective values are normalized using Equation (19). In this case,

f_{g}

represents the original value of the gth objective, while

f_{g}^{m i n}

and

f_{g}^{m a x}

denote the minimum and maximum values of the gth objective across all solutions [39]. By unifying the scale, this normalization procedure improves the reliability and comparability of multi-objective performance measures.

\hat{f_{g}} = \frac{f_{g} - f_{g}^{\min}}{f_{g}^{\max} - f_{g}^{\min}}, g \in {1, 2}

(19)

5.3. Experimental Parameter Settings

The numerical model was coded in Python 3.10.0 and executed on a Windows 11 system equipped with a 2.30 GHz Intel Core i7-12700H processor and 16 GB of RAM. Both the proposed algorithm and the baseline algorithms are implemented in Python. To realize the proposed approach, Python is employed as the primary programming language, supported by several libraries: Pytorch = 1.12.0 is used to construct and train neural networks, while NumPy = 1.23.0 handles numerical computations and data processing, and matplotlib (3.5.3) is employed for result visualization.

Unlike heuristic algorithms, RL relies on agents continuously exploring and trying actions to achieve convergence; therefore, the number of training epochs is crucial for deep reinforcement learning [40]. We conducted experiments with different epoch settings on an instance with

{J o b}_{i n i t i a l} = 15, {J o b}_{n e w} = 10, m = 8, λ_{n e w} = 20, D D T = 20

. For each epoch setting, 10 runs were performed to calculate the HV metric in order to determine the optimal epoch number. A boxplot, as a statistical tool, was used to characterize data dispersion and accurately describe the stability of optimization performance [4]. Figure 7. box plots of HV values under different training epochs; different colors represent different numbers of training epochs. As shown in Figure 7, across 10 runs of the algorithm, the model trained with 1800 epochs achieved an average HV value of 0.767600375, which was higher than the other six settings (0.370214, 0.510288, 0.616378, 0.696846, 0.585371, and 0.617534, respectively). Therefore, the model trained for 1800 epochs overall produced the best results and demonstrated greater stability across multiple runs.

In addition, the hyperparameters of the two agents were tuned, and the optimal values were determined by monitoring the cumulative reward in each training epoch. Table 8 and Table 9 summarize the optimal parameter settings obtained from the experiments, while Figure 8 illustrates the cumulative reward results per epoch under different job batch sizes, where agent1 represents the job agent and agent2 represents the machine agent. As the batch size increases, the number of scheduling operations and optimal actions also rises, leading to higher cumulative rewards. This demonstrates that the proposed method can achieve stable convergence under different operating conditions.

5.4. Experimental Results

5.4.1. Comparison of Results

To evaluate the effectiveness and stability of the proposed multi-agent reinforcement learning framework in solving the MODFJSP described in this study, commonly used DRL baselines in the FJSP domain (PPO, DQN, DDQN) are selected for comparison [12,25,26]. Table 10 lists several typical combinations of cooperative agents. In addition, classical scheduling rules applicable to the context of this study, as shown in Table 11, are introduced as benchmarks [22]. A systematic evaluation of solution quality is then conducted based on three evaluation metrics, with particular emphasis on maximizing net profit and minimizing makespan.

In this study, each method is independently executed 10 runs on 24 instances, and the average IGD and HV values as well as the SC metric are calculated to comprehensively evaluate the advantages and disadvantages of the proposed method compared with other approaches. Moreover, the dominance relationships between the non-dominated solutions generated by our method and those of the comparison methods are analyzed. The evaluation results are presented in Table 12, Table 13 and Table 14, with the best values highlighted in bold.

First, for fixed composite scheduling rules such as EDD–SPT-M and FIFO–EAM, the lack of adaptability to optimization objectives often leads to repeatedly generating similar solutions for the two objectives, thereby limiting their optimization potential. Although random selection of jobs and machines provides exploratory capability, the strong randomness in decision-making makes it difficult to form a stable learning process and achieve convergence. In addition, among the reinforcement learning combinations, our proposed DDQN–DDQN method performs slightly worse than DDQN–DQN in large-scale scenarios with higher DDT values and a larger number of jobs. Nevertheless, their HV and IGD values remain close, indicating only minor differences in the overall quality of the Pareto front. Finally, although PPO has shown advantages in some FJSP studies, in this work its instability and low sample efficiency are amplified by the variable action sets and complex state representations, resulting in inferior overall performance compared with other reinforcement learning methods.

5.4.2. Visual Comparison of Results

For a deeper comparison, Figure 9 illustrates the bi-objective Pareto fronts obtained after non-dominated filtering for different instance scales. Overall, DDQN–DDQN is closer to the ideal point in most instances and maintains good front coverage. Even in the few cases where it does not outperform in terms of HV, IGD, or SC metrics, its pareto front remains close to that of the best-performing method, with only minor visual differences.

Unlike intelligent optimization algorithms that can generate multiple candidate solutions in parallel, DRL produces one scheduling solution per epoch. Guided by the reward functions and through continuous iterations, it can eventually approach the optimal solution. Specifically, the DRL agent interacts with the environment through actions, generating one scheduling solution in each training epoch; under the guidance of reward signals, it updates iteratively, gradually improving solution quality and finally converging to a (near-)optimal solution. As shown in Figure 10, the proposed collaborative multi-agent method quickly explores high-reward solutions in the early training stage under reward-driven guidance, and subsequently stabilizes and converges to an approximately optimal solution.

5.5. Decision Support Module

Due to inherent conflicts between the two objectives, such as an excessive pursuit of low C_max potentially increasing the number of jobs processed simultaneously, thereby driving up electricity costs and reducing profits; conversely, maximizing profits may delay certain processes to off-peak electricity periods, consequently increasing C_max. To comprehensively evaluate the potential of the proposed model, the technique for order preference by similarity to ideal solution (TOPSIS) method is employed to identify optimal or near-optimal solutions among the non-inferior solutions [41,42]. Table 15 presents representative non-dominated solutions obtained by the algorithm.

Decision-makers can obtain the most satisfactory solution from non-inferior solutions using the TOPSIS method [43]. First, establish weight vectors based on decision-makers’ preferences: C_max priority W1 = [0.7, 0.3], profit priority W2 = [0.3, 0.7].

Based on the Pareto solution design decision matrix X in Table 15

X = (\begin{matrix} \begin{matrix} \begin{matrix} 400 & 1069352 \end{matrix} \\ \begin{matrix} 261 & 1063606 \end{matrix} \\ \begin{matrix} 262 & 1064400 \end{matrix} \end{matrix} \\ \begin{matrix} ⋮ \\ \begin{matrix} 353 & 1068191 \end{matrix} \\ \begin{matrix} 320 & 1067856 \end{matrix} \\ \begin{matrix} 382 & 1068462 \end{matrix} \end{matrix} \end{matrix})

Due to the differing dimensions of various objectives, it is challenging to evaluate performance metrics. By normalizing different values, a normalized decision matrix Y is constructed.

Y = (\begin{matrix} 0.420782 & 0.354204 \\ 0.274560 & 0.352301 \\ ⋮ & ⋮ \\ 0.371340 & 0.353819 \\ 0.336625 & 0.353708 \\ 0.401846 & 0.353909 \end{matrix})

Given a weight vector reflecting decision-maker preferences, the normalized matrix Y is weighted row by row to obtain the weighted normalized matrix:

V^{(1)} = (v_{m, n}^{(1)}) = Y \cdot d i a g (W 1), V^{(2)} = (v_{m, n}^{(2)}) = Y \cdot d i a g (W 2), m = 1, \dots, 8, n = 1, 2 .

Since one objective should be as small as possible while the other should be as large as possible, the ideal solution and negative ideal solution can be defined as

A^{*} = {v_{1}^{* (k)}, v_{2}^{* (k)}}, A^{-} = {v_{1}^{- (k)}, v_{2}^{- (k)}}

Among them,

\{\begin{matrix} v_{1}^{* (k)} = \underset{m}{m i n} v_{m, 1}^{(k)}, v_{1}^{- (k)} = \underset{m}{m a x} v_{m, 1}^{(k)}, p r e f e r e n c e f o r C_{m a x} \\ v_{2}^{* (k)} = \underset{m}{m a x} v_{m, 2}^{(k)}, v_{2}^{- (k)} = \underset{m}{m i n} v_{m, 2}^{(k)}, p r e f e r e n c e f o r \sum_{i \in J} {TP}_{i} \end{matrix}

Calculate the Euclidean distance between the ideal solution and the negative ideal solution:

D_{m}^{* (k)} = \sqrt{(v_{m, 1}^{(k)} - v_{1}^{* (k)})^{2} + (v_{m, 2}^{(k)} - v_{2}^{* (k)})^{2}}, D_{m}^{- (k)} = \sqrt{(v_{m, 1}^{(k)} - v_{1}^{- (k)})^{2} + (v_{m, 2}^{(k)}, v_{2}^{- (k)})^{2}}

Define the relative closeness coefficient using the TOPSIS method:

C_{m}^{(k)} = \frac{D_{m}^{- (k)}}{D_{m}^{* (k)} + D_{m}^{- (k)}}, m = 1, \dots, 8, 0 \leq C_{m}^{(k)} \leq 1

The larger

C_{m}^{(k)}

, the better the plan, which means it is based on the decision-maker’s preferences. Therefore, Table 16 shows that the weight vector W1 corresponds to “C_max priority,” and the optimal solution calculated based on this weight is S2; the weight vector W2 corresponds to “profit priority,” and the optimal solution calculated based on this weight is S3.

6. Conclusions

In this study, for the MODFJSP, we construct a “job selection–machine assignment” multi-agent collaborative framework that comprehensively considers factors such as dynamic job arrivals, fluctuating due dates, and high-profit urgent orders, with the dual objectives of maximizing net profit and minimizing makespan. In terms of method design, variable state representations, reward functions, and variable action spaces are customized for the two types of agents to enhance learning efficiency and stability. Systematic experiments are then conducted on 18 instances, comparing the proposed framework with PPO, DQN, and other deep reinforcement learning algorithms as well as classical scheduling rules. The results show that the proposed framework demonstrates a certain advantage in solution quality and robustness. Statistics indicate that, compared with other algorithms and scheduling rules, it can achieve a 0.368% increase in profit.

The profit function employed by this research does not account for labor, maintenance, inventory, or other financial factors, thus retaining a certain gap from real-world business metrics. Furthermore, the methodology has not yet been validated against traditional heuristic algorithms, and its effectiveness and advantages require further exploration.

Subsequent work will focus on integrating more practical profit functions and introducing more robust and efficient multi-agent reinforcement learning algorithms to optimize the overall trade-off between profitability and timeliness in dynamic and uncertain industrial environments. Additionally, flexible profit models may be considered, enabling users to adjust profit parameters according to actual needs, thereby enhancing the model’s applicability and practical value.

Author Contributions

Q.M.: Conceptualization, Writing—Original Draft, Formal analysis, Data curation, Visualization, Software, Funding acquisition. Y.L.: Methodology, Conceptualization, Formal analysis, Writing—Review & Editing, Funding acquisition, Supervision. H.C.: Supervision, Writing—Review & Editing. All authors have read and agreed to the published version of the manuscript.

Funding

The research reported in this paper is financially supported by the Graduate Research Fund of Guizhou Province (Grant No. 2024YJSKYJJ165), and the Guizhou Provincial Basic Research Program (Natural Science) (Grant No. Qian Ke He Ji Chu-ZK [2024] General 439).

Data Availability Statement

All data generated or analyzed during this study are included in this paper. https://github.com/maqingyao123/The-data-for-Profit-Maximizing-Multi-Agent-....- (accessed on 19 September 2025).

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Abbreviations

The following abbreviations are used in this manuscript:

Acronym	Full name	Explanation
RL	Reinforcement learning	A machine learning paradigm where an agent learns to make sequential decisions by interacting with an environment, receiving rewards or penalties, and optimizing its policy to maximize long-term cumulative reward.
DRL	Deep reinforcement learning	A method that combines deep learning with reinforcement learning to solve sequential decision-making problems in high-dimensional state and action spaces.
MDP	Markov decision process	A mathematical framework to formalize reinforcement learning environments, defined by states, actions, rewards, and transition probabilities.
TOU	Time of use	An electricity pricing scheme where the unit price of electricity varies depending on the time of day, encouraging users to shift consumption to off-peak periods.
PPO	Proximal policy optimization	An improved policy gradient algorithm that constrains policy updates to ensure training stability and higher sample efficiency.
DQN	Deep Q learning	A reinforcement learning algorithm that combines deep neural networks with Q-learning to approximate the action-value function.
DDQN	Double deep Q-network	An extension of DQN that uses two networks to reduce overestimation bias in Q-value estimation, improving performance.
MODFJSP	Multi-objective dynamic flexible job shop scheduling	A scheduling problem that considers both dynamic environments and multiple optimization objectives.
SC	Set coverage	A metric used to evaluate the coverage of a solution set across a set of test instances.
IGD	Inverted generational distance	A performance indicator that measures how close the obtained solutions are to the true Pareto front.
HV	Hypervolume	A metric that evaluates the volume covered by a solution set in the objective space, reflecting both convergence and diversity.
JSP	Job shop scheduling problem	A classical scheduling problem where jobs must be processed on machines, typically aiming to minimize makespan or delays.
FJSP	Flexible job shop scheduling problem	An extension of JSP that allows operations to be assigned to one of several alternative machines.
DFJSP	Dynamic flexible job shop scheduling problem	A further extension of FJSP that accounts for dynamic job arrivals and environmental uncertainties.
C_max	Max end time (Makespan).	The maximum completion time of all jobs, often used as a key optimization objective in scheduling.
TOPSIS	Technique for order preference by similarity to ideal solution	A multi-criteria decision-making method that ranks alternatives based on their relative closeness to an ideal and anti-ideal solution.

References

Geng, K.; Liu, L.; Wu, S. A Reinforcement Learning Based Memetic Algorithm for Energy-Efficient Distributed Two-Stage Flexible Job Shop Scheduling Problem. Sci. Rep. 2024, 14, 30816. [Google Scholar] [CrossRef]
Peng, W.; Yu, D.; Xie, F. Multi-Mode Resource-Constrained Project Scheduling Problem with Multiple Shifts and Dynamic Energy Prices. Int. J. Prod. Res. 2025, 63, 2483–2506. [Google Scholar] [CrossRef]
Luo, C.; Gong, W.; Lu, C. Knowledge-Driven Two-Stage Memetic Algorithm for Energy-Efficient Flexible Job Shop Scheduling with Machine Breakdowns. Expert Syst. Appl. 2024, 235, 121149. [Google Scholar] [CrossRef]
Yao, Y.; Li, X.; Gao, L. A DQN-Based Memetic Algorithm for Energy-Efficient Job Shop Scheduling Problem with Integrated Limited AGVs. Swarm Evol. Comput. 2024, 87, 101544. [Google Scholar] [CrossRef]
Destouet, C.; Tlahig, H.; Bettayeb, B.; Mazari, B. Flexible Job Shop Scheduling Problem under Industry 5.0: A Survey on Human Reintegration, Environmental Consideration and Resilience Improvement. J. Manuf. Syst. 2023, 67, 155–173. [Google Scholar] [CrossRef]
Zhang, Q.; Shao, W.; Shao, Z.; Pi, D.; Gao, J. Deep Reinforcement Learning Driven Trajectory-Based Meta-Heuristic for Distributed Heterogeneous Flexible Job Shop Scheduling Problem. Swarm Evol. Comput. 2024, 91, 101753. [Google Scholar] [CrossRef]
Yu, W.; Zhang, L.; Ge, N. An Adaptive Multiobjective Evolutionary Algorithm for Dynamic Multiobjective Flexible Scheduling Problem. Int. J. Intell. Syst. 2022, 37, 12335–12366. [Google Scholar] [CrossRef]
Lv, L.; Fan, J.; Zhang, C.; Shen, W. A Multi-Agent Reinforcement Learning Based Scheduling Strategy for Flexible Job Shops under Machine Breakdowns. Robot. Comput. Integr. Manuf. 2025, 93, 102923. [Google Scholar] [CrossRef]
Zhao, F.; Wang, W.; Zhu, N.; Xu, T. An Inverse Reinforcement Learning Algorithm with Population Evolution Mechanism for the Multi-Objective Flexible Job-Shop Scheduling Problem under Time-of-Use Electricity Tariffs. Appl. Soft Comput. 2025, 170, 112764. [Google Scholar] [CrossRef]
Ahmadi, E.; Zandieh, M.; Farrokh, M.; Emami, S.M. A Multi Objective Optimization Approach for Flexible Job Shop Scheduling Problem under Random Machine Breakdown by Evolutionary Algorithms. Comput. Oper. Res. 2016, 73, 56–66. [Google Scholar] [CrossRef]
Wang, H.; Cheng, J.; Liu, C.; Zhang, Y.; Hu, S.; Chen, L. Multi-Objective Reinforcement Learning Framework for Dynamic Flexible Job Shop Scheduling Problem with Uncertain Events. Appl. Soft Comput. 2022, 131, 109717. [Google Scholar] [CrossRef]
Lei, K.; Guo, P.; Zhao, W.; Wang, Y.; Qian, L.; Meng, X.; Tang, L. A Multi-Action Deep Reinforcement Learning Framework for Flexible Job-Shop Scheduling Problem. Expert Syst. Appl. 2022, 205, 117796. [Google Scholar] [CrossRef]
Lv, Z.; Zhao, Y.; Kang, H.; Gao, Z.; Qin, Y. An Improved Harris Hawk Optimization Algorithm for Flexible Job Shop Scheduling Problem. Comput. Mater. Contin. 2024, 78, 2337–2360. [Google Scholar] [CrossRef]
Wang, Z.; He, M.; Wu, J.; Chen, H.; Cao, Y. An Improved MOEA/D for Low-Carbon Many-Objective Flexible Job Shop Scheduling Problem. Comput. Ind. Eng. 2024, 188, 109926. [Google Scholar] [CrossRef]
Zhang, X.; Zhu, G.-Y. A Literature Review of Reinforcement Learning Methods Applied to Job-Shop Scheduling Problems. Comput. Oper. Res. 2025, 175, 106929. [Google Scholar] [CrossRef]
Catanzaro, D.; Pesenti, R.; Ronco, R. Job Scheduling under Time-of-Use Energy Tariffs for Sustainable Manufacturing: A Survey. Eur. J. Oper. Res. 2023, 308, 1091–1109. [Google Scholar] [CrossRef]
Shen, L.; Dauzère-Pérès, S.; Maecker, S. Energy Cost Efficient Scheduling in Flexible Job-Shop Manufacturing Systems. Eur. J. Oper. Res. 2023, 310, 992–1016. [Google Scholar] [CrossRef]
Park, M.-J.; Ham, A. Energy-Aware Flexible Job Shop Scheduling under Time-of-Use Pricing. Int. J. Prod. Econ. 2022, 248, 108507. [Google Scholar] [CrossRef]
Zouadi, T.; Chargui, K.; Zhani, N.; Charles, V.; Sreedharan, V.R. A Novel Robust Decomposition Algorithm for a Profit-Oriented Production Routing Problem with Backordering, Uncertain Prices, and Service Level Constraints. Ann. Oper. Res. 2024, 341, 1361–1362. [Google Scholar] [CrossRef]
Yuan, E.; Wang, L.; Song, S.; Cheng, S.; Fan, W. Dynamic Scheduling for Multi-Objective Flexible Job Shop via Deep Reinforcement Learning. Appl. Soft Comput. 2025, 171, 112787. [Google Scholar] [CrossRef]
Zhang, C.; Juraschek, M.; Herrmann, C. Deep Reinforcement Learning-Based Dynamic Scheduling for Resilient and Sustainable Manufacturing: A Systematic Review. J. Manuf. Syst. 2024, 77, 962–989. [Google Scholar] [CrossRef]
Luo, S. Dynamic Scheduling for Flexible Job Shop with New Job Insertions by Deep Reinforcement Learning. Appl. Soft Comput. 2020, 91, 106208. [Google Scholar] [CrossRef]
Lu, S.; Wang, Y.; Kong, M.; Wang, W.; Tan, W.; Song, Y. A Double Deep Q-Network Framework for a Flexible Job Shop Scheduling Problem with Dynamic Job Arrivals and Urgent Job Insertions. Eng. Appl. Artif. Intell. 2024, 133, 108487. [Google Scholar] [CrossRef]
Wang, H.; Lin, W.; Peng, T.; Xiao, Q.; Tang, R. Multi-Agent Deep Reinforcement Learning-Based Approach for Dynamic Flexible Assembly Job Shop Scheduling with Uncertain Processing and Transport Times. Expert Syst. Appl. 2025, 270, 126441. [Google Scholar] [CrossRef]
Liu, R.; Piplani, R.; Toro, C. Deep Reinforcement Learning for Dynamic Scheduling of a Flexible Job Shop. Int. J. Prod. Res. 2022, 60, 4049–4069. [Google Scholar] [CrossRef]
Yan, Y.; Yi, W.; Pei, Z.; Chen, Y. Multi-Agent Reinforcement Learning for Distributed Flexible Job Shop Scheduling with Random Job Arrival. IEEE Access 2025, 13, 80941–80957. [Google Scholar] [CrossRef]
Popper, J.; Ruskowski, M. Using multi-agent deep reinforcement learning for flexible job shop scheduling problems. Procedia CIRP 2022, 112, 63–67. [Google Scholar] [CrossRef]
Yu, K.; Jin, K.; Deng, X. Review of Deep Reinforcement Learning. In Proceedings of the 2022 IEEE 5th Advanced Information Management, Communicates, Electronic and Automation Control Conference (IMCEC), Chongqing, China, 16 December 2022; IEEE: New York, NY, USA, 2022; pp. 41–48. [Google Scholar]
Kaloev, M.; Krastev, G. Experiments Focused on Exploration in Deep Reinforcement Learning. In Proceedings of the 2021 5th International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), Ankara, Turkey, 21 October 2021; IEEE: New York, NY, USA, 2021; pp. 351–355. [Google Scholar]
Wan, L.; Cui, X.; Zhao, H.; Li, C.; Wang, Z. An Effective Deep Actor-Critic Reinforcement Learning Method for Solving the Flexible Job Shop Scheduling Problem. Neural Comput. Appl. 2024, 36, 11877–11899. [Google Scholar] [CrossRef]
Yue, L.; Peng, K.; Ding, L.; Mumtaz, J.; Lin, L.; Zou, T. Two-Stage Double Deep Q-Network Algorithm Considering External Non-Dominant Set for Multi-Objective Dynamic Flexible Job Shop Scheduling Problems. Swarm Evol. Comput. 2024, 90, 101660. [Google Scholar] [CrossRef]
Yuan, M.; Huang, H.; Li, Z.; Zhang, C.; Pei, F.; Gu, W. A Multi-Agent Double Deep-Q-Network Based on State Machine and Event Stream for Flexible Job Shop Scheduling Problem. Adv. Eng. Inform. 2023, 58, 102230. [Google Scholar] [CrossRef]
Luo, S.; Zhang, L.; Fan, Y. Dynamic Multi-Objective Scheduling for Flexible Job Shop by Deep Reinforcement Learning. Comput. Ind. Eng. 2021, 159, 107489. [Google Scholar] [CrossRef]
Chen, W.; Wang, J.; Yu, G. Energy-Efficient Scheduling for a Hybrid Flow Shop Problem While Considering Multi-Renewable Energy. Int. J. Prod. Res. 2024, 62, 8352–8372. [Google Scholar] [CrossRef]
Zitzler, E.; Thiele, L. Multiobjective Evolutionary Algorithms: A Comparative Case Study and the Strength Pareto Approach. IEEE Trans. Evol. Comput. 1999, 3, 257–271. [Google Scholar] [CrossRef]
Coello, C.A.C.; Cortés, N.C. Solving Multiobjective Optimization Problems Using an Artificial Immune System. Genet. Program. Evolvable Mach. 2005, 6, 163–190. [Google Scholar] [CrossRef]
Wang, J.; Wu, C.; Peng, Y. Multi-Objective Scheduling for an Energy-Efficient Flexible Job Shop Problem with Peak Power Constraint. Appl. Soft Comput. 2024, 167, 112330. [Google Scholar] [CrossRef]
Wei, L.; He, J.; Guo, Z.; Hu, Z. A Multi-Objective Migrating Birds Optimization Algorithm Based on Game Theory for Dynamic Flexible Job Shop Scheduling Problem. Expert Syst. Appl. 2023, 227, 120268. [Google Scholar] [CrossRef]
Wu, J.; Liu, Y. A Modified Multi-Agent Proximal Policy O ptimization Algorithm for Multi-Objective Dynamic Partial-Re-Entrant Hybrid Flow Shop Scheduling Problem. Eng. Appl. Artif. Intell. 2025, 140, 109688. [Google Scholar] [CrossRef]
Serrano-Ruiz, J.C.; Mula, J.; Poler, R. Job Shop Smart Manufacturing Scheduling by Deep Reinforcement Learning. J. Ind. Inf. Integr. 2024, 38, 100582. [Google Scholar] [CrossRef]
Wang, Y.-M.; Elhag, T.M.S. Fuzzy TOPSIS Method Based on Alpha Level Sets with an Application to Bridge Risk Assessment. Expert Syst. Appl. 2006, 31, 309–319. [Google Scholar] [CrossRef]
Roy, A.; Wallenius, J. Nonlinear and Unconstrained Multiple-Objective Optimization: Algorithm, Computation, and Application. Nav. Res. Logist. 1991, 38, 623–635. [Google Scholar] [CrossRef]
Dai, M.; Tang, D.; Giret, A.; Salido, M.A. Multi-Objective Optimization for Energy-Efficient Flexible Job Shop Scheduling Problem with Transportation Constraints. Robot. Comput. Integr. Manuf. 2019, 59, 143–157. [Google Scholar] [CrossRef]

Figure 1. The framework model of the MODFJSP considered in this study.

Figure 2. RL framework under the MDP.

Figure 3. Multi-agent collaborative framework of “job selection–machine assignment”.

Figure 4. Structure of the job agent.

Figure 5. Structure of the machine agent.

Figure 6. Decision-making sequence of the proposed multi-agent framework for MODFJSP.

Figure 7. Boxplot of HV values across different epochs.

Figure 8. Cumulative rewards per epoch under different operating conditions.

Figure 9. Bi-objective Pareto fronts across instance scales after non-dominated filtering.

Figure 10. Per-epoch convergence of each objective on different problem scales.

Table 1. Studies on DRL-based flexible job shop scheduling problems.

Work	RL Algorithm	Agent	Problem	Dynamic Events	Objectives
[22]	DQN	Single	DFJSP	New job insertions	Tardiness
[23]	DDQN	Single	MODFJSP	New job insertions and urgent job insertions	C_max, Tardiness
[12]	PPO	Single	FJSP	None	C_max
[25]	DDQN	Multiple	DFJSP	New job insertions	Total tardiness
[26]	DQN	Multiple	DFJSP	New job insertions	Total tardiness
[27]	PPO	Multiple	FJSP with AGVs	None	Makespan; Average lateness; Max lateness
Our work	DDQN	Multiple	MODFJSP	New job insertions, high-profit urgent job insertions, and due day changes	C_max, Total net Profit

Table 2. Notations for the mathematical model.

Parameters	Description
$J$	Set of n jobs, with $J$ = {1, 2, …, n}.
$M$	Set of machines, with M = {1, 2, …, m}.
$O_{i}$	Set of operations of job i, with $O_{i}$ = {1, …, n_i}.
$T$	Set of all time points within the entire scheduling time horizon, with T = {1,2, 3, …, C_max}.
$U C_{s t}^{j o b}$	Set of unfinished jobs at decision point st.
$U_{i}$	Set of urgent jobs.
$t_{i, j, k}$	Processing time of operation O_i,j on machine k, $i \in J$ , $k \in M$ , $j \in O_{i}$ .
${J o b}_{n e w}$	Number of newly arrived jobs.
$M_{i, j}$	The set of available machines for operation, $i \in J$ , $j \in O_{i}$ .
$p_{i}$	Selling price of job i, i.e., the price offered by the customer, $i \in J$ .
$c_{t}$	TOU electricity price (yuan/kWh) at time t, $t \in T$ .
$T D_{i}$	Tardiness of job i, $i \in J$ .
$e_{k}$	Energy processing consumption of machine k (kW), $k \in M$ .
$e_{k i}$	Idle power consumption of machine k (kW), $k \in M$ .
$e_{t}$	Total power consumption of the shop at time t (kW): $e_{t} = \sum_{k \in M} \sum_{i \in J} \sum_{j \in O_{i}} x_{t}^{i, j, k} \cdot e_{k} + (b_{t}^{k} - x_{t}^{i, j, k}) \cdot e_{k i}$ , including both processing and idle power consumption, $i \in J$ , $k \in M$ , $j \in O_{i}$ .
$U f_{i}$	Urgency level of urgent jobs, $i \in U_{i}$ , $U f_{i} \in [0,1]$ .
$T P_{i}$	Net profit of job i, $i \in J$ .
$O_{s_{t}}^{i}$	Index of the operation of job i completed at scheduling point st, $i \in J$ .
$p e n a l t y_{i}$	Tardiness penalty of job i, $i \in J$ .
C_i	Completion time of job i.
C_i,j	Completion time of operation O_i,j.
Variables	Description
$x_{t}^{i, j, k}$	Time-related task and resource allocation variables: at time t, operation O_i,j of job j being processed on machine k is 1.
$b_{t}^{k}$	Equal to 1 if and only if machine k is in the ON state.

Table 3. Explanation of variables covered in Section 4.

Parameters	Description
$S_{s t}^{'}$	At decision point st, the set of job states. $S_{s t}^{'}$ = { $s_{s t}^{1}, s_{s t}^{2}, \dots, s_{s t}^{i}, \dots$ }
$s_{s t}^{i}$	The state corresponding to job i at the decision point, where $i \in U C_{s t}^{j o b}$
$S_{s t}^{″}$	At decision point st, the set of available machine statuses for the next machining operation of the selected job. $S_{s t}^{″}$ = { $s_{s t}^{1}, s_{s t}^{2}, {\dots, s}_{s t}^{k}, \dots$ }
$s_{s t}^{k}$	The state corresponding to machine k, ${k \in M}_{i, j}$
$O_{s t}^{i}$	Index of completed machining operations for job i, where $i \in U C_{s t}^{j o b}$
$r_{s t}^{'}$	Job agent reward
$r_{s t}^{″}$	Machine agent reward
${TD}_{i}$	Job i delay time
$A c t i o n s p a c e'$	Action space of the job agent
$A c t i o n s p a c e ″$	Action space of machine agent
$a_{s t}^{i}$	At decision point st, the action taken by the job agent is to select the job state with the highest Q-value.
$i_{s t}$	At decision point st, the job index corresponding to the job state selected by $a_{s t}^{i} .$
$a_{s t}^{k}$	At decision point st, the action taken by the machine agent is the selection of the machine state with the highest Q-value.
$k_{s t}$	At decision point st, the machine index corresponding to $a_{s t}^{k} .$
$τ$	DDQN soft update parameters.

Table 4. Descriptions of the state input variables of the job agent.

Features	Description
$i$	Job index, where ${UC}_{s t}^{j o b}$ denotes the set of unfinished jobs.
$R_{i}^{o p}$	Ratio of the remaining operations to the total operations of the job.
$T_{i}^{r e m}$	Estimated remaining processing time intensity based on the average operation processing time, normalized.
$U f_{i}$	Urgency level of the job, set to 0 if it is not an urgent job.
$I_{i}^{u r g}$	Urgent job indicator: 1 if job i belongs to the urgent job set, and 0 otherwise.
${TP}_{i}^{n o r m}$	Normalized profit of job i.

Table 5. Descriptions of the available machine state variables.

Features	Description
k	Machine index, ${k \in M}_{i_{s t}, j}$
$e_{k}^{n o r m}$	Power of the machine when processing the operation, normalized by the maximum power
$t_{i_{s t}, j, k}^{n o r m}$	Processing time of machine k for operation $O_{i_{s t}, j}$ , normalized by the maximum processing time among its available machines
$Cos t_{k}^{n o r m}$	Estimated electricity cost incurred by processing on machine k.
$A_{k}^{n o r m}$	Available start time of machine k for processing the pending operation $j \leftarrow O_{s t}^{i_{s t}} + 1$ of job $i_{s t}$ , normalized by the maximum available time of all machines.
${TP}_{i_{s t}}^{n o r m}$	Estimated profit generated by processing on machine k, normalized by the maximum profit among all candidates.

Table 6. Parameter settings of numerical simulation instances.

Parameters	Value
Set of the number of initial jobs ${J o b}_{i n i t i a l}$	{15, 30, 50, 100}
Set of the number of new inserted jobs ${J o b}_{n e w}$	{10, 6, 12}
Processing time $t_{i, j, k}$ per operation	U {1, 50}
Set of the number of total machines m	{8, 10, 20, 30}
Processing energy consumption $e_{k}$	U {12, 17}KW
Idle power $e_{k i}$	$e_{k i} =$ 30% × $e_{k}$
Due date tightness DDT	{0.5, 1, 1.5}
λ_new	{20, 50}
Number of available machines of each operation	U {1, m − 1}
Number of operations belonging to a job	U {1, 5}
Number of urgent jobs	20% of total jobs
Non-urgent profit $P_{i}$	$p_{i} = a \cdot \sum_{j = 1}^{n_{i}} \underset{k \in M_{i, j}}{m a x} (t_{i, j, k}) \cdot c^{h i g h} + b$
Profit adjustment for urgent jobs	${p_{u r g e n t} = p}_{i} \times U [1.3,1.7]$
Tardiness penalty for job i	$p e n a l t y_{i} = β \cdot T D_{i} \cdot (1 + U f_{i})$

Table 7. TOU price.

	TOU Power $Price c_{t}$	Period of Time
Electricity purchase price (yuan/kWh)	Peak: 1.241	9:00~12:00; 17:00~22:00
	Flat: 0.779	8:00~9:00; 12:00~17:00; 22:00~23:00
	Valley: 0.488	0:00~8:00; 23:00~24:00

Table 8. Job agent parameter settings.

Hyperparameters	Value
Number of training epochs L	1800
Learning rate	1 × 10⁻³
Discount factor γ	0.99
ε-greedy exploration rate	0.1
Minimum ε-value	0.05
Batch size 1	64
Buffer capacity	10,000
Soft update coefficient $τ$	0.01
Optimizer	Adam
Weight parameters in the reward function: $ω_{1}, ω_{2}$	6, 4

Table 9. Machine agent parameter settings.

Hyperparameters	Value
Number of training epochs L	1800
Learning rate	1 × 10⁻³
Discount factor γ	0.99
ε-greedy exploration rate	0.1
Minimum ε-value	0.05
Batch size 2	64
Buffer capacity	10,000
Soft update coefficient $τ$	0.01
Optimizer	Adam
Weight parameters in the reward function: $ω_{3}, ω_{4}$	5, 4

Table 10. Description of compared algorithms.

Algorithms	Description
DDQN-DDQN	Our method: Job agent with DDQN; Machine agent with DDQN
DDQN-DQN	Job agent with DDQN; Machine agent with DQN
PPO-DDQN	Job agent with PPO; Machine agent with DDQN
PPO-PPO	Job agent with PPO; Machine agent with PPO

Table 11. Description of scheduling rules.

Rules	Priority Description
EDD-SPT-M	Job selection: EDD prioritizes the job with the earliest due date; Machine selection: SPT-M prioritizes the machine with the shortest processing time.
FIFO-EAM	Job selection: FIFO prioritizes the earliest arriving job; Machine selection: EAM prioritizes the earliest available machine.
Random	Randomly select both job and machine.

Table 12. IGD values for solutions obtained by DDQN–DDQN, DDQN–DQN,PPO-DDQN, PPO-PPO and classical scheduling rules.

$λ_{n e w}$	DDT	${(J o b}_{i n i t i a l}, {J o b}_{n e w}, m)$	DDQN-DDQN	PPO-DDQN	PPO-PPO	DDQN-DQN	FIFO-EAM	EDD-SPT-M	Random
20	0.5	(15,10,8)	0.054859	0.057435	0.173360	0.061103	0.380170	0.490278	0.289240
		(30,6,10)	0.042265	0.051395	0.104919	0.0445135	0.420481	0.483181	0.300935
		(50,12,20)	0.022450	0.039487	0.149944	0.027592	0.494244	0.411686	0.405403
		(100,12,30)	0.000510	0.010893	0.093641	0.002407	0.338582	0.537263	0.437711
	1	(15,10,8)	0.053759	0.074556	0.084140	0.056995	0.352341	0.439906	0.154800
		(30,6,10)	0.044555	0.064114	0.084884	0.052853	0.360538	0.542772	0.305371
		(50,12,20)	0.029263	0.048181	0.107453	0.032112	0.436225	0.451014	0.361074
		(100,12,30)	0.097993	0.071701	0.019414	0.105504	0.04047	0.405968	0.110002
	1.5	(15,10,8)	0.060887	0.089392	0.160640	0.050309	0.332754	0.447768	0.185778
		(30,6,10)	0.042315	0.070105	0.117966	0.044867	0.408433	0.46043	0.323373
		(50,12,20)	0.054425	0.061335	0.071502	0.044325	0.044744	0.790365	0.399020
		(100,12,30)	0.049678	0.040152	0.106691	0.015308	0.090093	0.104488	0.092739
50	0.5	(15,10,8)	0.047233	0.063687	0.274370	0.05187	0.417043	0.464195	0.301849
		(30,6,10)	0.048418	0.052739	0.061828	0.042306	0.355806	0.595167	0.268373
		(50,12,20)	0.034162	0.061210	0.080823	0.050394	0.044621	0.740737	0.264720
		(100,12,30)	0.037861	0.04037	0.07939	0.040909	0.044794	0.400768	0.441831
	1	(15,10,8)	0.045325	0.064684	0.161846	0.059643	0.365292	0.492496	0.331768
		(30,6,10)	0.394420	0.060557	0.076853	0.048337	0.387177	0.433037	0.309464
		(50,12,20)	0.062509	0.076393	0.086067	0.054579	0.333961	0.782902	0.372529
		(100,12,30)	0.070671	0.039404	0.066345	0.060627	0.333424	0.215337	0.267733
	1.5	(15,10,8)	0.055236	0.078507	0.077265	0.058287	0.350005	0.478642	0.126799
		(30,6,10)	0.040968	0.093851	0.063894	0.049112	0.391466	0.503196	0.189945
		(50,12,20)	0.041931	0.071664	0.08153	0.061443	0.031274	0.783230	0.188447
		(100,12,30)	0.073273	0.084930	0.076034	0.105645	0.090458	0.113773	0.151171

Table 13. HV values for solutions obtained by DDQN–DDQN, DDQN–DQN, PPO-DDQN, PPO-PPO and classical scheduling rules.

$λ_{n e w}$	DDT	${(J o b}_{i n i t i a l}, {J o b}_{n e w}, m)$	DDQN-DDQN	PPO-DDQN	PPO-PPO	DDQN-DQN	FIFO-EAM	EDD-SPT-M	Random
20	0.5	(15,10,8)	0.767600	0.744720	0.731368	0.758419	0.623297	0.055088	0.590809
		(30,6,10)	0.77246	0.743107	0.722714	0.765869	0.588382	0.012006	0.683379
		(50,12,20)	0.873795	0.843027	0.806715	0.847061	0.752313	0.009377	0.564390
		(100,12,30)	0.899389	0.876845	0.810713	0.893988	0.868386	0.875418	0.777251
	1	(15,10,8)	0.705976	0.675435	0.690266	0.690535	0.514127	0.081000	0.595298
		(30,6,10)	0.792162	0.791562	0.786056	0.782185	0.582115	0.012874	0.528932
		(50,12,20)	0.825888	0.832226	0.772855	0.827112	0.702721	0.025889	0.567808
		(100,12,30)	0.995241	0.942097	0.771155	0.993896	0.789911	0.858213	0.793605
	1.5	(15,10,8)	0.686472	0.678807	0.698275	0.699047	0.530544	0.084557	0.560230
		(30,6,10)	0.793845	0.774062	0.752306	0.785848	0.605827	0.014820	0.618818
		(50,12,20)	0.794735	0.793765	0.781856	0.801739	0.712925	0.038169	0.678874
		(100,12,30)	0.993544	0.95656	0.674900	0.858861	0.747503	0.881494	0.702293
50	0.5	(15,10,8)	0.714744	0.705503	0.713064	0.706853	0.492632	0.051258	0.668093
		(30,6,10)	0.768952	0.786448	0.761735	0.778419	0.595167	0.008706	0.590771
		(50,12,20)	0.825685	0.811636	0.791437	0.823030	0.701750	0.009807	0.612933
		(100,12,30)	0.998798	0.983789	0.740564	0.997938	0.823208	0.899702	0.815501
	1	(15,10,8)	0.773239	0.749264	0.768887	0.753430	0.580352	0.037332	0.517169
		(30,6,10)	0.799822	0.769466	0.778661	0.788496	0.550567	0.010945	0.502802
		(50,12,20)	0.814062	0.828795	0.795007	0.838231	0.745097	0.027127	0.600298
		(100,12,30)	0.987041	0.937499	0.748739	0.985659	0.814192	0.853308	0.744819
	1.5	(15,10,8)	0.646318	0.601106	0.635441	0.636972	0.516858	0.028875	0.492524
		(30,6,10)	0.613633	0.613161	0.613002	0.613548	0.610379	0.010002	0.414054
		(50,12,20)	0.814149	0.779443	0.777327	0.816031	0.707079	0.003183	0.772396
		(100,12,30)	0.966884	0.920609	0.680371	0.956579	0.773687	0.788774	0.730839

Table 14. SC values for solutions obtained by DDQN–DDQN, DDQN–DQN, PPO-DDQN, PPO-PPO and classical scheduling rules.

			DDQN-DDQN (U) vs. DDQN-DQN (V)		DDQN-DDQN (U) vs. PPO-DDQN (V)		DDQN-DDQN (U) vs. PPO-PPO (V)		DDQN-DDQN (U) vs. EDD-SPT-M (V)		DDQN-DDQN (U) vs. FIFO-EAM (V)		DDQN-DDQN (U) vs. Random (V)
$λ_{n e w}$	DDT	${(J o b}_{i n i t i a l}, {J o b}_{n e w}, m)$	SC(U,V)	SC(V,U)	SC(U,V)	SC(V,U)	SC(U,V)	SC(V,U)	SC(U,V)	SC(V,U)	SC(U,V)	SC(V,U)	SC(U,V)	SC(V,U)
20	0.5	(15,10,8)	0.353	0.462	0.478	0.346	0.759	0.154	1.000	0.000	1.000	0.000	1.000	0.115
		(30,6,10)	0.625	0.333	0.821	0.1875	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000
		(50,12,20)	0.591	0.263	1.000	0.000	0.957	0.000	1.000	0.000	1.000	0.000	0.500	0.184
		(100,12,30)	0.000	0.667	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000
	1	(15,10,8)	0.583	0.250	0.842	0.000	0.773	0.100	1.000	0.000	1.000	0.000	1.000	0.000
		(30,6,10)	0.538	0.53125	0.857	0.09375	0.941	0.0625	1.000	0.000	1.000	0.000	0.545	0.031
		(50,12,20)	0.618	0.263	0.952	0.026	0.926	0.000	1.000	0.000	1.000	0.000	1.000	0.000
		(100,12,30)	0.000	1.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000
	1.5	(15,10,8)	0.500	0.581	0.864	0.000	0.667	0.667	1.000	0.000	1.000	0.000	0.182	0.000
		(30,6,10)	0.576	0.207	0.909	0.000	0.957	0.000	1.000	0.000	0.207	0.000	1.000	0.000
		(50,12,20)	0.500	0.342	0.903	0.079	0.800	0.105	1.000	0.000	0.026	0.000	0.364	0.053
		(100,12,30)	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000
50	0.5	(15,10,8)	0.667	0.186	0.871	0.056	0.636	0.056	1.000	0.000	1.000	0.000	1.000	0.000
		(30,6,10)	0.216	0.750	0.911	0.100	0.931	0.050	1.000	0.000	1.000	0.000	0.500	0.000
		(50,12,20)	0.444	0.237	0.903	0.000	0.857	0.000	1.000	0.000	0.000	0.000	0.583	0.089
		(100,12,30)	0.333	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000
	1	(15,10,8)	0.679	0.241	0.913	0.034	0.722	0.172	1.000	0.000	1.000	0.000	0.414	0.000
		(30,6,10)	0.697	0.406	0.931	0.000	0.840	0.061	1.000	0.000	1.000	0.000	0.750	0.061
		(50,12,20)	0.622	0.394	0.926	0.091	0.500	0.091	1.000	0.000	0.000	0.000	0.636	0.152
		(100,12,30)	1.000	0.000	1.000	0.000	1.000	0.000	1.000	0.000	0.000	0.000	1.000	0.000
	1.5	(15,10,8)	0.423	0.375	0.917	0.038	0.781	0.115	1.000	0.000	1.000	0.000	1.000	0.000
		(30,6,10)	0.600	0.182	1.000	0.000	0.778	0.200	1.000	0.000	1.000	0.000	0.158	0.120
		(50,12,20)	0.069	0.692	0.971	0.026	0.762	0.103	1.000	0.000	0.000	0.077	0.200	0.128
		(100,12,30)	0.833	0.000	1.000	0.000	1.000	0.000	0.000	0.000	1.000	0.000	1.000	0.000

Table 15. The Pareto solutions of the case study.

Pareto Solution Numbers	C_max	$\sum_{i \in J} {TP}_{i}$
S1	400	1,069,352
S2	261	1,063,606
S3	262	1,064,400
S4	402	1,069,528
S5	269	1,067,695
S6	353	1,068,191
S7	320	1,067,856
S8	382	1,068,462

Table 16. Examples of optimal solutions in non-inferiority solution sets based on preference mechanisms.

Pareto Solution Numbers	C_max	$\sum_{i \in J} {TP}_{i}$	$C_{m}^{(1)}$	$C_{m}^{(2)}$
S1	400	1,069,352	0.015197	0.032514
S2	261	1,063,606	0.994364	0.970063
S3	262	1,064,400	0.991388	0.972908
S4	402	1,069,528	0.005636	0.029937
S5	269	1,067,695	0.943237	0.942523
S6	353	1,068,191	0.347535	0.348040
S7	320	1,067,856	0.581564	0.581684
S8	382	1,068,462	0.141909	0.143759

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Ma, Q.; Lu, Y.; Chen, H. Profit-Oriented Multi-Objective Dynamic Flexible Job Shop Scheduling with Multi-Agent Framework Under Uncertain Production Orders. Machines 2025, 13, 932. https://doi.org/10.3390/machines13100932

AMA Style

Ma Q, Lu Y, Chen H. Profit-Oriented Multi-Objective Dynamic Flexible Job Shop Scheduling with Multi-Agent Framework Under Uncertain Production Orders. Machines. 2025; 13(10):932. https://doi.org/10.3390/machines13100932

Chicago/Turabian Style

Ma, Qingyao, Yao Lu, and Huawei Chen. 2025. "Profit-Oriented Multi-Objective Dynamic Flexible Job Shop Scheduling with Multi-Agent Framework Under Uncertain Production Orders" Machines 13, no. 10: 932. https://doi.org/10.3390/machines13100932

APA Style

Ma, Q., Lu, Y., & Chen, H. (2025). Profit-Oriented Multi-Objective Dynamic Flexible Job Shop Scheduling with Multi-Agent Framework Under Uncertain Production Orders. Machines, 13(10), 932. https://doi.org/10.3390/machines13100932

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Profit-Oriented Multi-Objective Dynamic Flexible Job Shop Scheduling with Multi-Agent Framework Under Uncertain Production Orders

Abstract

1. Introduction

2. Literature Review

2.1. Research on Production Scheduling Problems Considering Cost or Profit

2.2. Research on Solving FJSPs with DRL

2.3. Research Gaps

3. Problem Description for MODFJSP

3.1. Problem Description

3.2. Mathematical Model

4. Multi-Agent Collaboration Framework for the Proposed MODFJSP

4.1. The DRL Algorithm for the Multi-Agent Cooperative Framework

4.2. Definition of DRL Problem

4.3. State Input

4.3.1. State Input of the Job Agent

4.3.2. State Input of the Machine Agent

4.4. Reward Functions

4.4.1. Reward Function of the Job Agent

4.4.2. Reward Function of the Machine Agent

4.5. Definition of Action Spaces

4.5.1. Action Space of the Job Agent

4.5.2. Action Space of the Machine Agent

4.6. Multi-Agent Collaborative Framework for the Proposed MODFJSP

5. Numerical Experiments

5.1. Generation of Numerical Instances

5.2. Evaluation Metrics

5.3. Experimental Parameter Settings

5.4. Experimental Results

5.4.1. Comparison of Results

5.4.2. Visual Comparison of Results

5.5. Decision Support Module

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI