1. Introduction
Intelligent manufacturing is the core of the new scientific and technological revolution: it uses information technology to raise productivity rapidly and to address societal problems such as energy consumption. Intelligent manufacturing, essentially the realization of “smart workshops” built on cyber-physical production systems (CPPS) [1], is the outlook and direction for the development of digitalized, networked, and intelligent manufacturing. Machine tools [2] are the “mothership” of the equipment manufacturing industry, and intelligent manufacturing cannot be achieved without intelligent machine tools.
In line with the ANSI/ISA-95.00.02-2018 standard [3], modern factories are closely integrated with actual workshop production scenarios: enterprise resource planning (ERP), manufacturing execution systems (MESs), and process control systems (PCSs) form the prototype structure, and manufacturing operations management (MOM) covers production, maintenance, quality, inventory, etc. Machine tool workshops are becoming more flexible and intelligent alongside the development of cloud computing, the Internet of Things [4], big data [5], machine learning, and other advanced technologies. Because automation hardware such as robots, CNC machine tools, stackers, and sensors is widely combined and deeply integrated with intelligent MOM, visualization system (WVS), and logistics management system (WMS) software, a manufacturing shop gains the capabilities of autonomous perception, analysis, decision making, and processing. Referring to the digital workshop system structure in the GB/T 37393-2019 standard [6], an intelligent workshop system integrating the physical production–data acquisition–recipe control–execution flow has been built; an example smart gear workshop structure is shown in Figure 1 [7].
Because flexible production in the workshop requires the dynamic configuration of production methods, the MES, as the core of intelligent manufacturing, is the focal point of the intelligent transformation upon which other functional requirements can be extended. Based on the flexible job shop scheduling problem (FJSP), a classic decision-making activity in the MES, this paper analyzes the current challenges of scheduling algorithms (a key technology of intelligent manufacturing) and provides theoretical and practical solutions that intelligently match the dispersed resources of the workshop (manpower, materials, processing equipment, etc.) in real time to meet the needs of diversified, customized, and small-batch production.
Compared to the classical job shop scheduling problem (JSP) [
8], the FJSP breaks through the uniqueness restriction of production resources and has been proven to be a strong NP-hard problem [
9], whose real-time optimization improves production efficiency while reducing costs. As shown in Figure 1, with real-time monitoring, data collection, machine tool machining, and a rapidly changing production state, the FJSP in a smart workshop presents the following characteristics. First, production dynamics: various uncertain events occur, such as random job arrivals, machine failures, and delivery date changes, all of which require rescheduling to adapt to dynamic changes in the production environment. Second, human–machine interaction constraints: when solving production scheduling, decision makers have preferences regarding order arrangement and production targets, and, without holographic modeling, the handling of unexpected events still requires their subjective opinions and judgment.
At present, the traditional methods for solving the dynamic FJSP (DFJSP) are mainly heuristic and metaheuristic algorithms. Heuristic methods, such as first in first out (FIFO) and first in last out (FILO), are simple and efficient, but their universality is poor and their solution quality is uneven, because different scheduling rules suit different types of scheduling problems and production objectives. Metaheuristic methods, such as the genetic algorithm (GA) [10] and the particle swarm algorithm [11], improve solution quality through parallel and iterative searching, but their time complexity is high, and they lack the real-time optimization capability required in a smart workshop.
With the advance in artificial intelligence and machine learning, Zhang et al. [
12] solved the JSP in 1995 using a temporal difference algorithm, which was the first time reinforcement learning (RL) was applied in the scheduling field. The core idea of using RL in solving the scheduling problem is to transform the dynamic scheduling process into a Markov decision process (MDP) [
13]. When an operation is finished or random events occur, a scheduling rule is determined according to the production state. Because different production objectives correspond to different reward functions and scheduling rules, traditional RL cannot simultaneously optimize all objectives to solve the multi-objective DFJSP (MODFJSP) [
14]. Hierarchical reinforcement learning (HRL) [
15,
16,
17] has long held the promise of learning such complex tasks, in which a hierarchy of policies is trained to perform decision making and control at different levels of spatiotemporal abstraction. A scheduling agent is trained using the two-layer policy, in which a higher-level controller learns a goal policy over a longer time scale and a lower-level actuator applies atomic actions to the production environment to satisfy the temporary objective. Therefore, HRL maximizes external cumulative return in the long run while achieving a satisfactory compromise considering multiple production objectives.
In the actual production environment of machine tool processing, completing jobs too early increases inventory pressure, whereas completing them late causes financial losses [18]. The total machine load not only affects financial costs but is also tied to energy saving and emission reduction. For real-time optimization and decision making in multi-objective flexible scheduling on a smart shop floor, an HRL method is proposed in this study to solve the MODFJSP with random job arrival so as to minimize the penalties for earliness and tardiness as well as the total machine load. The four contributions of this research are as follows:
- (1)
To the best of our knowledge, this is the first attempt to solve the MODFJSP with random job arrival and minimize the total penalties for earliness and tardiness, as well as total machine load, using HRL. The work can thus fill a research gap regarding solving the MODFJSP using HRL.
- (2)
A key problem in multi-objective optimization is solved by one human–machine interaction feature, i.e., the scheduling expert or management decision maker assigns the relative importance of the two objectives, which are combined with subjective decision information in the algorithmic optimization process to obtain a compromise solution.
- (3)
The HRL-based scheduling agent consists of a single-stream double deep Q-network (DDQN) as the high-level controller and a two-stream dueling DDQN (DDDQN) as the low-level actuator. This ensures the effectiveness and generalization of the proposed method while maintaining the agent’s learning speed.
- (4)
To balance and optimize the two production scheduling targets in real time, four state indicators are designed for the high-level goal, and each state indicator corresponds to an external reward function to maximize the cumulative return during training.
The overall structure of the study takes the form of six sections, including this introductory section.
Section 2 provides a brief review of RL-based dynamic scheduling methods. The mathematical model of the MODFJSP with random job arrival in a smart machine tool processing workshop is established in Section 3. Section 4 presents the background of DDQNs and DDDQNs and provides the implementation details of the proposed HRL method.
Section 5 provides a case study of the proposed HRL algorithm in the flexible production scheduling of gears and presents the results of numerical experiments. Conclusions and future research directions are summarized in
Section 6.
2. Related Works
To intelligently match the dispersed production resources of the smart workshop in real time, more and more researchers and practitioners have been paying attention to RL algorithms, software, and frameworks to solve production scheduling problems.
Fonseca et al. [19] applied Q-learning to the flow shop scheduling problem (FSP) to minimize completion time. He et al. [20] solved the dynamic FSP (DFSP) in the textile industry using multiple deep Q-network (DQN) agents to minimize cost and energy consumption. Shahrabi et al. [21] solved the dynamic JSP (DJSP) to minimize average flow time via Q-learning, dynamically adjusting the parameters of a variable neighborhood search algorithm. Kuhnle et al. [22] proposed a framework for the design, implementation, and evaluation of on-policy RL to solve the JSP with dynamic order arrival, maximizing machine utilization while minimizing order delivery time. Wang et al. [23] solved an assembly JSP with random assembly times using dual Q-learning agents to minimize the total weighted earliness penalty and completion time, where the top-level agent focused on the scheduling policy and the bottom-level agent optimized global targets. Bouazza et al. [24] used intelligent software products and a Q-learning algorithm to solve the partially flexible JSP with new job insertions to minimize makespan. Luo et al. [14] proposed a two-layer deep reinforcement learning model, in which a high-level DDQN determines the optimization objective and a low-level DDQN selects the scheduling rule, to solve the FJSP while minimizing total delay and maximizing average machine utilization. Johnson et al. [25] proposed a multi-agent system with multiple independent DDQN agents to solve the FJSP with random job arrival in a robotic assembly production cell to minimize makespan.
Table 1 summarizes the differences between the aforementioned studies and our work.
The above literature review shows that research has mainly focused on using RL to solve single-objective DFSPs, DJSPs, and DFJSPs; the use of RL to solve multi-objective DFJSPs has not been deeply explored. Additionally, no existing RL method considers the dynamic preference of decision makers toward production targets via human–computer interaction.
Studies in the literature [
19,
21,
23,
24] have used RL with linear value function approximation, which forces state discretization when dealing with continuous-state problems. Refined perception of the environment leads to an explosion in the number of discrete states [
26], with vast increases in the computational requirements of the model (e.g., state–action pair Q-tables) and reduced agent learning speed. Conversely, reducing computational complexity through coarse discretization discards critical information about the structure of the domain, which ultimately degrades the quality of the agent’s decision making.
In the work of Luo et al. [14], an HRL-based agent was utilized to solve the MODFJSP; however, a DDQN learns slowly as the number of actions increases [27]. Because no single heuristic rule performs well in all production scheduling problems, this study expands the action space by increasing the number of scheduling rules and applies a hierarchical combination of a DDQN and a DDDQN to solve the MODFJSP, thereby improving the learning efficiency and generalization of the algorithm.
4. Proposed HRL
4.1. Background of DDQNs and DDDQNs
Since DQNs [28] first combined RL with nonlinear value function approximation in 2013, a milestone in the field, RL has developed rapidly and become widely applicable. In RL, an agent interacts with an environment $E$ at each of a sequence of discrete time steps $t$: it perceives a state $s_t \in S$ (the set of all states) and selects an action $a_t$ from the possible action set $A(s_t)$ under its policy $\pi$, where $\pi$ maps each state to a probability $\pi(a \mid s)$ of selecting each action. In response to $a_t$, the environment $E$ presents a new state $s_{t+1}$ and assigns a scalar reward $r_{t+1}$ to the agent. They interact until the agent reaches a terminal state. The agent obtains the total accumulated return $G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}$, where the discount rate $\gamma \in [0,1]$ trades off immediate and delayed rewards. Solving an RL task means finding an optimal policy $\pi^{*}$ that maximizes the expected accumulated return from each state over the long run. The recursive expression of the state–action value function $Q^{\pi}(s,a)$ is shown in Equation (6); it turns the Bellman equation into an update rule for improving approximations of the desired value function and yields the dynamic programming algorithm.
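In our notation (the paper’s symbols may differ), this recursion takes the standard form

$$Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[\, r_{t+1} + \gamma \sum_{a'} \pi(a' \mid s_{t+1})\, Q^{\pi}(s_{t+1}, a') \;\middle|\; s_t = s,\; a_t = a \right].$$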
On-policy RL is attractive in continuous control, but off-policy RL provides more general, stable, and efficient learning in discrete control [15]. In the DQN, the parameters $\theta$ of the neural network are adjusted by randomly sampling state–action–reward transition tuples $(s_t, a_t, r_{t+1}, s_{t+1})$ at each time step $t$. The iterative update formulas of $\theta$ and the target $y_t$ are shown in Equations (7) and (8), respectively, where $\alpha$ is the learning rate used by the gradient descent algorithm. The state–action values from the same neural network are used both to select and to evaluate an action; therefore, the predicted maximum action value is substantially overestimated, which may reduce the learning quality of the agent.
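For reference, the standard DQN target and gradient update, which Equations (7) and (8) follow in spirit (notation is ours), are

$$y_t = r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a'; \theta), \qquad \theta \leftarrow \theta + \alpha \left[\, y_t - Q(s_t, a_t; \theta) \,\right] \nabla_{\theta} Q(s_t, a_t; \theta).$$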
To decouple action selection from action evaluation, the DDQN was designed by Hasselt et al. [29], in which the online network’s $Q(s,a;\theta)$ value is used to select an action and the target network’s $Q(s,a;\theta^{-})$ value is used to evaluate it. The iterative formula of the target $y_t^{\mathrm{DDQN}}$ is shown in Equation (9).
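In our notation, the double-Q target of Equation (9) takes the standard form, with the online network (parameters $\theta$) selecting the action and the target network (parameters $\theta^{-}$) evaluating it:

$$y_t^{\mathrm{DDQN}} = r_{t+1} + \gamma\, Q\!\left(s_{t+1},\ \arg\max_{a'} Q(s_{t+1}, a'; \theta);\ \theta^{-}\right).$$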
According to Equations (11) and (12), only one state–action value $Q(s_t, a_t)$ is updated at each time step, and all other action values of $s_t$ remain untouched. As the number of actions increases, an agent needs increasingly many action-value updates to learn. In addition, the differences among the action values of a given state are often very small relative to their magnitude. For example, after training with the DDQN in [18], the average gap between the $Q$ values of the best and the second-best action across visited states is roughly 0.06, whereas the average action value across those states is about 17. Action values are therefore frequently reordered, and the actions chosen by the behavior strategy change accordingly, which introduces noise into the updates.
For these two reasons, Wang et al. designed the DDDQN [27] with two streams that separately estimate the state value $V(s;\theta,\beta)$ and the advantage $A(s,a;\theta,\alpha)$ of each action. Here, $\theta$ represents the parameters of the shared layers, while $\alpha$ and $\beta$ denote the parameters of the two streams, respectively. The state–action value $Q(s,a;\theta,\alpha,\beta)$ is calculated using Equation (10), where the advantage $A(s,\cdot;\theta,\alpha)$ is an $|A|$-dimensional vector.
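For completeness, the aggregation used by the dueling architecture of [27], to which Equation (10) corresponds, is

$$Q(s,a;\theta,\alpha,\beta) = V(s;\theta,\beta) + \left( A(s,a;\theta,\alpha) - \frac{1}{|A|} \sum_{a'} A(s,a';\theta,\alpha) \right).$$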
4.2. Model Architecture
In this paper, an HRL framework is used to solve the MODFJSP. The algorithm model is shown in
Figure 2, including the manufacturing environment of a smart machine tool workshop, the hierarchical agent, and the reinforcement learning process. The instances are generated from the scheduling type, constraint conditions, and dynamic attributes defined by the production environment of a smart workshop. The production instance is expressed as a semi-Markov decision process (semi-MDP) [
30] through definitions of different levels of temporal abstraction, states, actions, and rewards. The agent continually interacts with the semi-MDP to obtain a set of training samples and learns its policies through the HRL algorithm.
The hierarchical agent consists of a high-level controller and a low-level actuator. According to the current production state of the workshop, the high-level controller determines a temporary production goal, which is directly related to the desired observation value. The low-level actuator chooses a scheduling rule according to the current state and the production goal and directly applies the resulting action to the shop floor environment.
The high-level controller is a DDQN consisting of five fully connected layers with three hidden layers. The numbers of nodes in the input and output layers are eight (the number of production state features) and four (the number of state indicators used as goals), respectively. The activation function is the rectified linear unit (ReLU), and the parameter optimizer is Adam.
The low-level actuator is a DDDQN consisting of six fully connected layers with four hidden layers; the hidden layers comprise two shared layers and two separating layers. The input layer has nine nodes (the eight state features plus the high-level goal), and the output layer has ten nodes (the number of scheduling rules). The activation function and optimizer are the same as those of the high-level controller. An illustrative sketch of both networks is given after the list below. The learning process is as follows:
- (1)
The high-level controller observes the current state of the flexible production environment of the smart machine tool machining workshop;
- (2)
The high-level controller determines a temporary optimization goal according to its online network’s Q-values and its behavior policy;
- (3)
The low-level actuator receives the current state and the goal;
- (4)
According to its online network’s Q-values and its behavior policy, the low-level actuator determines a scheduling rule, by which an operation and a feasible machine are selected;
- (5)
The smart machine tool machining workshop executes the selected operation on the selected machine and transitions to the next production state;
- (6)
The high-level controller obtains the extrinsic reward, and the corresponding experience tuple is stored in the high-level experience replay memory;
- (7)
The low-level actuator obtains the intrinsic reward from the high-level controller, and the corresponding experience tuple is stored in the low-level experience replay memory;
- (8)
Random minibatches are sampled from both replay memories to update the parameters of the respective online networks.
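The following PyTorch sketch illustrates the two networks described above; the hidden-layer width (64) and the exact depth of the dueling streams are assumptions, since they are not listed here.

```python
import torch
import torch.nn as nn

class HighLevelDDQN(nn.Module):
    """High-level controller: 8 state features -> Q-value for each of 4 goals."""
    def __init__(self, n_features: int = 8, n_goals: int = 4, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(                      # three hidden layers
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_goals),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

class LowLevelDDDQN(nn.Module):
    """Low-level actuator: 8 state features + goal -> Q-value for each of 10 rules."""
    def __init__(self, n_inputs: int = 9, n_rules: int = 10, hidden: int = 64):
        super().__init__()
        self.shared = nn.Sequential(                   # two shared hidden layers
            nn.Linear(n_inputs, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.value = nn.Sequential(                    # separating V(s) stream
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(                # separating A(s, a) stream
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, n_rules))

    def forward(self, state_and_goal: torch.Tensor) -> torch.Tensor:
        h = self.shared(state_and_goal)
        v, a = self.value(h), self.advantage(h)
        return v + a - a.mean(dim=-1, keepdim=True)    # dueling aggregation, Eq. (10)
```

Each level would additionally keep a target copy of its network for the double-Q updates described in Section 4.1.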
4.3. State Features
To comprehensively represent the complex production situation of a smart machine tool machining shop at the rescheduling point, eight generic state features are extracted, including three task-specific features, four environment-specific features, and one human–computer interaction feature. The specific details of these features are as follows:
- (1)
Average job arrival number per unit time ;
- (2)
Number of machines ;
- (3)
Number of newly added jobs ;
- (4)
Average utilization rate ;
The average utilization rate of the machines is denoted by
, which is calculated using Equation (11) [
14,
18]. At rescheduling point
, the completion time of the last operation on machine
and the current number of completed operations of job
are denoted by
and
, respectively.
- (5)
Estimated total machine load ;
represents the processing time of each operation processed by machine
with
operations having been processed at the current time
.
represents the load of machine
, and the machine load at the rescheduling point
is
.
represents the current estimated remaining processing time, so
is equal to
plus
. The calculating method for
is given in Algorithm 1; a compact Python sketch is also provided at the end of this subsection.
Algorithm 1 Procedure of calculating the estimated total machine load |
1: Input: |
2: , , , , |
3: Output: |
4: |
5: Procedure |
6: ← 0, ← 0 |
7: for (; ; ) do |
8: ← |
9: end for |
10: for (; ; ) do |
11: if < then |
12: ← 0 |
13: for (; ; ) do |
14: |
15: ← + |
16: end for |
17: ← + |
18: end if |
19: end for |
20: ← + |
21: Return |
- (6)
Estimated earliness and tardiness rate and estimated earliness and tardiness penalty ;
The methods for calculation of
and
are the same as those for the actual earliness and tardiness rate,
, and the actual earliness and tardiness penalty,
, in [
18], respectively.
- (7)
Degree of relative importance .
Since managers, scheduling decision makers, and experienced experts have preferences regarding the scheduling objectives, their subjective opinions and judgments are still needed to deal with emergencies under the premise of nonholographic modeling. The degree of relative importance, an integer between 1 and 9, is obtained through human–computer interaction and indicates the relative importance of one objective with respect to the other. Because the production target is affected by many complex factors, this degree is randomly generated in this study.
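As a complement to Algorithm 1 above, the following minimal sketch (with hypothetical variable names) reflects the description of feature (5): the estimated total machine load is the load already accumulated on the machines at the rescheduling point plus the estimated remaining processing time of the unscheduled operations.

```python
def estimated_total_machine_load(completed_load_per_machine, remaining_ops_mean_time):
    """completed_load_per_machine: processing time already assigned to each machine
    up to the rescheduling point; remaining_ops_mean_time: mean processing time
    (over eligible machines) of every operation not yet scheduled."""
    current_load = sum(completed_load_per_machine)
    estimated_remaining = sum(remaining_ops_mean_time)
    return current_load + estimated_remaining

# Example: two machines with 12 and 8 time units of assigned work, and three
# unscheduled operations whose mean processing times are 4, 6, and 5.
print(estimated_total_machine_load([12, 8], [4, 6, 5]))   # -> 35
```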
4.4. Action Set
According to [31,32], ten composite scheduling rules were designed to complete operation sequencing and machine selection for the FJSP; all of them operate on the set of unfinished jobs at the current time. The ten scheduling rules are as follows (an illustrative sketch of two representative rules is given after the list):
- (1)
Dispatching Rule 1: According to Equation (12), the job,
, is selected from the uncompleted job set,
, and the minimum redundancy time of its remaining operations is the selection principle. The operation,
, is selected, and the machine with the minimum completion time is allocated for
. Moreover, the machine is selected according to Equation (13) in dispatching rules (1)–(8).
- (2)
Dispatching Rule 2: According to Equation (14), the job,
, with the largest estimated remaining processing time is selected from the uncompleted job set,
, and its operation,
, is selected.
- (3)
Dispatching Rule 3: According to Equation (15), the job,
, with the smallest estimated remaining processing time is selected from the uncompleted job set,
, and its operation,
, is selected.
- (4)
Dispatching Rule 4: According to Equation (16), the job,
, with the smallest sum of the processing time of the current process and the average processing time of the subsequent process is selected from the uncompleted job set,
, and its operation,
, is selected.
- (5)
Dispatching Rule 5: According to Equation (17), the job,
, with the largest ratio of the processing time of the subsequent process to the estimated remaining processing time is selected from the uncompleted jobs,
, and its operation,
, is selected.
- (6)
Dispatching Rule 6: According to Equation (18), the job,
, with the smallest value for the subsequent processing time multiplied by the estimated remaining processing time is selected from the uncompleted jobs, and its operation,
, is selected.
- (7)
Dispatching Rule 7: According to Equation (19), the job,
, with the largest ratio of the processing time of the subsequent process to the estimated total processing time is selected from the uncompleted jobs, and its operation,
, is selected.
- (8)
Dispatching Rule 8: According to Equation (20), the job,
, with the smallest value for the subsequent processing time multiplied by the estimated total processing time is selected from the uncompleted job set,
, and its operation,
, is selected.
- (9)
Dispatching Rule 9: According to Equation (21), the job,
, with the earliest delivery date is selected from the uncompleted job set,
, and its operation,
, is selected. From the suitable machine set of
, the machine tool with the smallest load is then selected according to Equation (22).
- (10)
Dispatching Rule 10: According to Equation (23), the job,
, with the smallest critical ratio (CR) of the minimum redundancy time to the estimated remaining processing time is selected from
, and its operation,
, is selected. The machine is selected according to Equation (25).
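As referenced before the list, the sketch below illustrates the flavor of two of these composite rules. The job/operation data structures and attribute names are hypothetical, and the exact expressions of Equations (12)–(25) are not reproduced.

```python
def rule_earliest_due_date(uncompleted_jobs, machine_loads):
    """Rule (9) style: pick the job with the earliest delivery date, then assign
    its next operation to the least-loaded eligible machine."""
    job = min(uncompleted_jobs, key=lambda j: j.due_date)
    op = job.next_operation
    machine = min(op.eligible_machines, key=lambda m: machine_loads[m])
    return op, machine

def assign_min_completion_time(op, machine_ready_time, processing_time):
    """Machine assignment in the spirit of Rules (1)-(8): earliest completion time,
    i.e., machine ready time plus the operation's processing time on that machine."""
    return min(op.eligible_machines,
               key=lambda m: machine_ready_time[m] + processing_time[op][m])
```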
4.5. Reward Mechanism
In the traditional RL framework, the learned policy corresponds to a maximization of the expected return for a single reward function [
33]. In the HRL framework, a range of reward functions,
, which are indexed or parametrized by a current goal,
, are considered to accomplish complex control tasks.
Each goal,
, corresponding to a set of states,
, is considered to be achieved when the hierarchical agent is in any state,
[
33]. The high-level controller produces a goal, which becomes a state feature of the lower-level actuator; the lower-level policy performs actions that yield an observation close to the goal and receives an intrinsic reward.
4.5.1. High-Level Goals
Some state features form more natural goal subspaces because the high-level goal indicates desired relative changes in observations [15]. The two scheduling objectives of this paper are tracked through four state indicators: the estimated earliness and tardiness penalty, the estimated earliness and tardiness rate, the estimated total machine load, and the average utilization rate of the machines. Therefore, the value set of the high-level goal comprises these four indicators.
4.5.2. Extrinsic Reward
To keep the direction of increasing cumulative return consistent with the direction of optimizing the objectives and to avoid sparse rewards [16], the extrinsic reward function corresponds to the high-level goal and the four state indicators at times $t$ and $t+1$. Since two of the state indicator features are closely related to the production objectives, their corresponding rewards and punishments fluctuate more widely to improve the learning efficiency of the agent.
If the high-level goal
at rescheduling point
,
and
are selected as the feature indicators, and
is calculated using Equation (24).
If the high-level goal
,
and
are selected as the feature indicators, and
is calculated using Equation (25).
If the high-level goal
,
and
are selected as the feature indicators, and
is calculated using Equation (26).
If the high-level goal
,
and
are selected as the feature indicators, and
is calculated using Equation (27).
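A hedged sketch of the shape of these rewards is given below: the sign depends on whether the indicator selected by the current goal improved between rescheduling points, and the magnitudes (which Equations (24)–(27) define exactly) are placeholders rather than the paper’s values.

```python
def extrinsic_reward(goal, indicators_t, indicators_t1):
    """goal: one of 'penalty', 'tardy_rate', 'load', 'util' (hypothetical keys);
    indicators_t / indicators_t1: indicator values at rescheduling points t and t+1."""
    minimize = {'penalty', 'tardy_rate', 'load'}          # lower is better for these
    before, after = indicators_t[goal], indicators_t1[goal]
    if after == before:
        return 0.0
    improved = (after < before) if goal in minimize else (after > before)
    scale = 2.0 if goal in {'penalty', 'load'} else 1.0   # objective-linked indicators swing more
    return scale if improved else -scale
```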
4.5.3. Intrinsic Reward
Intrinsic motivation [
33], which is closely related to the intelligence level of an agent, involves learning with an intrinsically specified objective.
At decision point
, after receiving the goal,
, from the high-level controller, the low-level actuator selects a scheduling rule, which is applied to the smart workshop environment. The higher-level controller provides the low-level policy with an intrinsic reward,
. In [
15], the intrinsic reward,
, is parameterized based on the distance between the current state,
, and the goal state,
. It is calculated using Equation (28).
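Following the distance-based parameterization of [15] that the text refers to, Equation (28) plausibly takes a form such as the negative Euclidean distance between the next observed state features and the goal (our notation):

$$r^{\mathrm{in}}_{t} = -\left\lVert s_{t+1} - g_{t} \right\rVert_{2}.$$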
4.6. Action Selection Strategy
The production scheduling objective is dynamically controlled in a cooperative human–machine way, and the degree of relative importance,
, is added to the behavior selection strategy of the high-level controller. In this study, a
behavior strategy was designed, which is calculated using Equation (29) where
a random number between 0 and 1 is drawn. A larger value of the degree of relative importance means that the decision makers judge one objective to be more important than the other, and optimizing that objective is the priority at the current time. Furthermore, the indicator feature of the high-level goal is closely related to this preference. Therefore, when the preference condition in Equation (29) is satisfied, the high-level controller selects the goal associated with the currently preferred production target; otherwise, the linearly annealed ε-greedy strategy is used.
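The sketch below captures the intent of this preference-aware behavior strategy; the normalization of the preference degree (dividing by 10) and the exact form of Equation (29) are assumptions.

```python
import random

def select_goal(q_values, preferred_goal, p, epsilon):
    """q_values: dict goal -> Q estimate from the high-level online network;
    p in [1, 9]: decision-maker's degree of relative importance;
    epsilon: current value of the linearly annealed exploration rate."""
    if random.random() < p / 10.0:        # honour the human preference (assumed normalization)
        return preferred_goal
    if random.random() < epsilon:         # annealed epsilon-greedy exploration
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)
```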
4.7. Procedure of the HRL Algorithm
By defining five key elements (state, dispatching rule, goal, reward, and behavior strategy), the MODFJSP is formulated and transformed into an HRL problem. Algorithm 2 is the training method of the hierarchical scheduling agent, where
is the number of epochs to train the neural network,
is the training time in an epoch,
is a random integer between 1 and 9,
represents the rescheduling time,
is the sum of the operations of all current jobs at the current time
, and
is the update step of the target network.
Algorithm 2 The HRL training method |
1: Initialize replay memory to capacity and memory to capacity |
2: Initialize high-level online network action-value with random weights |
3: Initialize high-level target network action-value with weights |
4: Initialize low-level online network action-value with random weights |
5: Initialize low-level target network action-value with weights |
6: for epoch = 1: do |
7: for episode = 1: do |
8: Initialize a new production instance with , and |
9: Initialize state |
10: Initialize the high-level feature |
11: Select high goal according to and |
12: Initialize the low-level feature |
13: for t = 1: do |
14: Select an action according to and |
15: Execute action , calculate the immediate reward using Equations (24)–(27) and using Equation (28) and observe the next workshop state |
16: Set the high-level feature |
17: Select high goal according to and |
18: Set the low-level feature |
19: Store transition: |
20: Sample a random minibatch of transitions from |
21: Set |
22: Calculate the loss function and perform Adam with respect to the parameters of online network |
23: Sample a random minibatch of transitions from |
24: Set |
25: Calculate the loss function and perform Adam with respect to the parameters of online network |
26: Every steps, reset and |
27: end for |
28: end for |
29: end for |
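The targets set in steps 21 and 24 of Algorithm 2 are computed for sampled minibatches; assuming they take the double-Q form of Equation (9) at both levels, a compact sketch is as follows (function and tensor names are ours).

```python
import torch

@torch.no_grad()
def double_q_target(reward, next_input, online_net, target_net, gamma, done):
    """reward, done: tensors of shape (batch,); next_input: next state (high level)
    or next state plus goal (low level), shape (batch, n_inputs)."""
    best_next = online_net(next_input).argmax(dim=-1, keepdim=True)    # select with online net
    q_eval = target_net(next_input).gather(-1, best_next).squeeze(-1)  # evaluate with target net
    return reward + gamma * q_eval * (1.0 - done)
```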
5. Numerical Experiments
As mechanical components that transmit motion and power, gears are important basic elements of mechanical equipment. Due to their wide range of applications, improvements in green gear production efficiency contribute to the construction of advanced equipment manufacturing systems. Gears are rich in variety and involve different processes, such as pre-heat-treatment and post-heat-treatment machining for planetary gears and post-heat-treatment machining for disk gears and shaft gears. The gear production line involves a turning and milling unit, a tooth shaping unit, an internal grinding and flat grinding unit, an external grinding unit, and a gear grinding unit. The problem instances were generated from actual gear production data from a factory, and the HRL-based agent was trained to solve the MODFJSP with random gear arrival.
In this section, the process of training the scheduling agent, the settings of hyperparameters, and three performance metrics in terms of multi-objective optimization are provided, followed by a comparison of learning rates between the DDDQN and DDQN and performance comparisons of the proposed HRL algorithm with each action scheduling rule. To show the effectiveness, generality, and efficiency of the HRL algorithm, we compared it with other RL algorithms, metaheuristics, and heuristics with different production configurations. To further verify the generalization of the proposed method, the trained scheduling agent was tested on a new set of extended instances with larger production configurations. The training and test results and two videos demonstrating the MODFJSP being solved using the proposed HRL algorithm are available in the
Supplementary Materials.
5.1. Parameter Settings
5.1.1. Parameter Settings of Problem Instances
At the very beginning, there are several jobs in the flexible gear production workshop. The arrival of subsequent new gears obeys a Poisson distribution [18]; correspondingly, the arrival interval follows an exponential distribution with a given average rate [21].
represents a real interval uniform distribution, and
is an integer interval uniform distribution. The parameter settings are shown in
Table 3.
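A minimal sketch of generating these dynamic arrivals is shown below; the parameter name mean_interarrival is a placeholder for the value specified in Table 3.

```python
import numpy as np

def job_arrival_times(n_new_jobs, mean_interarrival, seed=None):
    """Poisson arrivals: exponential inter-arrival gaps accumulated into arrival times."""
    rng = np.random.default_rng(seed)
    gaps = rng.exponential(scale=mean_interarrival, size=n_new_jobs)
    return np.cumsum(gaps)

# Example: arrival times of 5 new gears with a mean inter-arrival time of 20 time units.
print(job_arrival_times(5, 20.0, seed=0))
```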
5.1.2. Hyperparameter Settings
In line with the literature [
34], MODFJSPs are divided into 27 classes using different parameter settings. The configuration in Table 2 is repeated two times, generating 54 different production instances. To evaluate the performance of the HRL algorithm more effectively, we randomly divided the 54 instances into 38 training instances (roughly 70% of all instances) and 16 validation instances (roughly 30%). The agent was trained for 200 epochs, with one episode generated on each training instance per epoch (38 episodes per epoch), so the agent was trained on a total of 200 × 38 = 7600 instances.
The low-level policy is updated under the control of the high-level policy. A high-level goal that corresponded to a particular low-level action in a past experience sample may correspond to a different action under the current low-level policy. Moreover, the actions of the low-level policy affect the state distribution seen by the high-level policy, resulting in unstable learning in the high-level controller. To address this nonstationarity, this study adopts the method in [
14]: the high-level replay memory size
is equal to minibatch size
. The hyperparameter settings and their values are shown in
Table 4.
The proposed HRL algorithm and the smart shop floor environment for machine tool processing of gears were coded in Python 3.8.3. The training and test experiments were performed on a PC with an Intel(R) Core (TM) i7-6700 @ 3.40 GHz CPU and 16 GB of RAM.
5.2. Performance Metrics
The main aim of solving the MODFJSP is to find a set of uniformly distributed nondominated solutions. To fully evaluate the quality of the Pareto-optimal front , three metrics are utilized to analyze performance in terms of convergence, distribution, and comprehensiveness. Because the real Pareto-optimal front is unknown in advance, the solutions obtained by all compared algorithms in the paper are merged, and those that are nondominated are taken as .
In general, a set of solutions with smaller values in generational distance (GD) [
14,
35], spread (Δ) [
14,
36], and inverted generational distance (IGD) [
14,
37] is preferred. A smaller GD value means that the Pareto-optimal front
is closer to the real Pareto-optimal front
, indicating higher convergence of the Pareto-optimal solutions. The smaller the Δ value, the more evenly the Pareto solutions are distributed in the objective space. The smaller the IGD value, the better the overall convergence and distribution of the obtained solutions.
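For reference, one common way to compute GD and IGD is sketched below; the cited references may use slightly different normalizations, and the spread metric Δ additionally requires the extreme points of the reference front, so it is omitted here.

```python
import numpy as np

def gd(front, reference):
    """Generational distance: front and reference are (n_points, n_objectives) arrays."""
    d = np.min(np.linalg.norm(front[:, None, :] - reference[None, :, :], axis=-1), axis=1)
    return np.sqrt(np.sum(d ** 2)) / len(front)

def igd(front, reference):
    """Inverted generational distance: mean distance from each reference point to the front."""
    d = np.min(np.linalg.norm(reference[:, None, :] - front[None, :, :], axis=-1), axis=1)
    return float(np.mean(d))
```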
5.3. Comparison of Learning Rates between the DDQN and DDDQN
To demonstrate the learning effectiveness of the low-level actuator, the DDDQN and the DDQN were trained with a single target:
. In addition, the reward function is shown in Equation (30), where
.
The single-objective value over the first 200 epochs for both algorithms is shown in Figure 3, where the value for one epoch is the average over the 38 training instances. The two curves show that the average objective value drops smoothly and that its volatility decreases gradually as training proceeds. Furthermore, the DDDQN converges faster than the DDQN, further demonstrating that the two-stream DDDQN improves the learning efficiency of the agent when solving DFJSPs with a large action space.
5.4. Comparisons of the HRL Algorithm with the Proposed Composite Dispatching Rules
To verify the effectiveness and generalization of the proposed HRL algorithm, 27 different test instances were generated, one for each type of MODFJSP. Moreover, a random policy (RA), which randomly selects the high-level goal and the low-level scheduling rule, was designed to demonstrate the learning ability of the HRL agent. On each instance, the HRL algorithm and the composite rules were independently run 20 times. The GD, Δ, and IGD values of the Pareto-optimal fronts obtained by each method are available in the
Supplementary Materials.
The experimental results show that the proposed HRL algorithm outperforms the comparative methods in terms of the convergence, diversity, and comprehensiveness of the Pareto solutions for most production instances. Firstly, compared with RA, the HRL algorithm obtained better results in all test instances, demonstrating its ability to learn difficult hierarchical policies when solving the MODFJSP. Secondly, compared with the scheduling rules, HRL obtained the best results for most instances in terms of convergence (GD), further indicating that no single scheduling rule performs optimally in all MODFJSPs, and it obtained the best results for all test instances in terms of diversity (Δ) and comprehensiveness (IGD).
The high-level goal
is highly correlated with the scheduling objectives. The high-level DDQN controller determines a feasible goal,
, based on the current state at each rescheduling point. The low-level DDDQN actuator selects a scheduling rule based on the production state and
. Through the long-term training process, the agent effectively trades off between the two production objectives. Accordingly, the proposed HRL algorithm outperforms the single scheduling rule in terms of effectiveness and generalization. The Pareto fronts obtained by HRL and 10 scheduling rules for some representative instances are shown in
Figure 4.
5.5. Comparison of HRL to Other Methods
To further verify the effectiveness and generalization of the proposed HRL algorithm, the trained agent was compared to three other RL algorithms (HRL with a DDQN as the low-level actuator (DDHRL), a DDDQN, and SARSA), a well-known metaheuristic algorithm (the GA), and two of the most commonly used heuristic rules (FIFO and shortest subsequent operation (SSO)). An instance was generated for each type of MODFJSP, and HRL and the other algorithms were independently run 20 times on each instance. The GD, Δ, and IGD values of the Pareto-optimal fronts obtained using the comparative methods are available in the
Supplementary Materials.
In this study, the only difference between the DDHRL method and the proposed HRL algorithm was the network structure of the low-level actuator. The DDDQN without
was the same as the low-level actuator of the proposed HRL algorithm. In SARSA, nine discrete states are designed using a neural network with a self-organizing mapping layer (SOM) from [
18,
38]. A Q-table with 9 × 10 Q-values was maintained. The immediate reward function of the single agents, the DDDQN and SARSA, was calculated using Equation (30).
In the GA [33], the method in [39,40] was used for fast nondominated sorting. The fitness calculation, the selection, crossover, and mutation operations, and the hyperparameter settings were taken from [18].
FIFO chose the next operation of the job with the earliest arrival time, and SSO chose the next operation of the unfinished job with the shortest subsequent processing time. For both, the processing machine was selected using Equation (14).
5.5.1. Effectiveness Analysis
To show the effectiveness of the proposed HRL algorithm, the average values of the three metrics for all the algorithms compared in all test instances were calculated, as shown in
Figure 5. As can be seen in
Figure 5, the proposed HRL algorithm outperformed the competing methods. It can also be seen that RL (HRL, DDHRL, DDDQN, and SARSA) outperformed the heuristic methods (FIFO and SSO) in almost all instances, indicating the effectiveness of the proposed scheduling rules in terms of the two investigated objectives, whereas hierarchical reinforcement learning (HRL and DDHRL) outperformed traditional RL (DDDQN and SARSA) and the metaheuristic method (GA), which confirms the necessity and effectiveness of using a single agent with a two-layer hierarchical policy. Additionally, the proposed HRL algorithm was superior to DDHRL, which demonstrates the superiority of the dueling architecture in low-level policy.
5.5.2. Generalization Analysis
To verify the generalizability of HRL, the winning rate was defined in terms of each metric, which was calculated in line with [
18], as shown in
Figure 6. For the convergence metric GD, HRL had the best results in 24 kinds of instances, and the winning rate was about 89%. For the diversity metric Δ, HRL had the smallest value for 20 scheduling problems, and the winning rate was about 74%. For the comprehensive metric IGD, HRL had the minimum value for 20 instances, with a winning rate of about 74%. The proposed HRL algorithm had the highest winning rate of the three metrics and generally performed at a level that was superior to the compared algorithms.
5.5.3. Efficiency Analysis
To show the efficiency of HRL, the average CPU times of all the algorithms that were used for comparison in all test instances were calculated and are available in the
Supplementary Materials. Because of the number of jobs greatly expanding the size of the scheduling solution space, average CPU times were grouped by the number of newly added jobs, as shown in
Table 5.
As can be seen in
Table 5, FIFO and SSO are highly efficient, but they have poor solution quality and generalization, as shown above. GA also does not exhibit real-time characteristics. In addition, the time complexity of hierarchical reinforcement learning (HRL and DDHRL) and traditional RL (DDDQN and SARSA) is roughly the same, but the proposed HRL algorithm outperforms traditional RL significantly in terms of effectiveness and generalization.
Furthermore, considering the number of jobs in the test instances [
34], the average scheduling time of HRL is 0.66 s on a PC with low specifications, which could reach the millisecond level or even less with the support of greater computing power. Therefore, HRL demonstrates the ability to optimize scheduling in real time in smart workshops.
It can be seen that, on the whole, the HRL algorithm proposed in this study clearly outperformed the other six methods in terms of effectiveness and generalization and has real-time characteristics. HRL solves the multi-objective scheduling problem as a semi-MDP, where the high-level policy determines the temporary objective according to the production state and the low-level policy determines the ongoing action based on the state and temporary objective. Therefore, hierarchical deep neural networks trained by HRL have multi-objective learning and decision-making capabilities at the rescheduling point and are more effective, robust, generalized, and efficient.
5.6. Extended Application of HRL
To further verify the effectiveness and generalization of HRL, the trained agent was applied to scheduling instances related to gear production with larger production configurations: new planetary gear, disk gear, and shaft gear arrivals of 135, 330, and 500, respectively;
set to 10, 30, and 50, respectively; 11 flexible machining machines; and the rest of the parameters the same as in training. Each comparison method in
Section 5.5 was independently repeated 20 times on each extended instance. The metric values of the Pareto front obtained by each compared method are shown in
Table 6. The Pareto fronts of the HRL algorithm and other algorithms for the three real instances are given in
Figure 7, where the yellow line in the enlarged figure represents the real Pareto optimal front
.
As can be seen in
Table 6, HRL had the best results for all three metrics in the three extended instances.
Figure 7 shows that the set of nondominated solutions (
) provided by the HRL algorithm was equal to the true Pareto front
for the instance with 135 new planetary gear arrivals, close to the
maximum (100) in the training set. One Pareto solution was not found by the HRL algorithm in the instance with 330 new disk gear arrivals, whereas two Pareto solutions were not found and one nondominated solution from the proposed method was not in
for the instance with 500 shaft gear arrivals.
It can be seen that, on the whole, the greater the difference between the extended instances and the original instances, the greater the degradation in HRL performance. However, the overall performance of HRL did not significantly deteriorate, and it outperformed the six other methods in terms of effectiveness and generalization in extended large instances.
6. Conclusions
This study introduced an HRL method for solving the multi-objective dynamic FJSP with random job arrival in a smart machine tool processing workshop to satisfy the dual objectives of minimizing penalties for earliness and tardiness and total machine load. On the basis of establishing a mathematical model, a combined DDQN and DDDQN two-hierarchy architecture for the MODFJSP was constructed, and the continuous-state features, scheduling rules with large action spaces, and external and internal rewards were accordingly designed. Moreover, the decision-maker’s preference for production targets was integrated into the HRL algorithm as a state feature by human–computer interaction. Thus, by adaptively learning the feasible goal and efficiently exploring the dispatching rule space, the HRL-based agent not only conducts the scheduling in real time but also achieves a satisfactory compromise considering different objectives in the long term.
Numerical experiments were conducted on a large set of production instances to verify the effectiveness and generalization of the proposed HRL algorithm in practical applications of gear production. We showed that the proposed HRL algorithm produced state-of-the-art results in 24 (on convergence) and 20 (on both diversity and comprehensiveness) of the 27 test instances compared to DDHRL, the DDDQN, SARSA, the GA, FIFO, and SSO, with no adjustment to the architecture or hyperparameters.
Real-time optimization of multi-objective DFJSPs through HRL intelligently matches the dispersed resources of the smart machine tool processing workshop and contributes to the implementation of an adaptive and flexible scheduling system, which meets the intellectualization and green needs of intelligent manufacturing. This work fills a research gap regarding solutions to MODFJSPs with random job arrival that minimize total penalties for earliness and tardiness as well as total machine load by using HRL. Moreover, the human–machine interaction feature integrates subjective decision information into the algorithmic optimization process to achieve a satisfactory compromise considering multiple objectives, which solves a key problem in multi-objective optimization.
In future work, the practical value of real-time scheduling for flexible machine tool machining in a smart workshop can be further improved by investigating additional dynamic events and production objectives. Meanwhile, the number of actions (equal to the number of operation-selection rules multiplied by the number of machine-selection rules) should be increased for a more general agent. Such large action spaces are difficult to explore efficiently, and thus successfully training DQN-like networks in this context is likely intractable [26]. Consequently, we will apply state-of-the-art policy-gradient methods, such as the deep deterministic policy gradient (DDPG) [26,41] and proximal policy optimization [42,43], to solve MODFJSPs.