Multi-Objective Flexible Flow Shop Production Scheduling Problem Based on the Double Deep Q-Network Algorithm

In this paper, motivated by the production process of electronic control modules in the digital electronic detonators industry, we study a multi-objective flexible flow shop scheduling problem. The objective is to find a feasible schedule that minimizes both the makespan and the total tardiness. Considering the constraints imposed by the jobs and the machines throughout the manufacturing process, a mixed integer programming model is formulated. By transforming the scheduling problem into a Markov decision process, the agent state features and the actions are designed based on the processing status of the machines and the jobs, along with heuristic rules. Furthermore, a reward function based on the optimization objectives is designed. Based on the deep reinforcement learning algorithm, the Dueling Double Deep Q-Network (D3QN) algorithm is designed to solve the scheduling problem by incorporating the target network, the dueling network, and the experience replay buffer. The D3QN algorithm is compared with heuristic rules, the genetic algorithm (GA), and the optimal solutions generated by Gurobi. The ablation experiments are designed. The experimental results demonstrate the high performance of the D3QN algorithm with the target network and the dueling network proposed in this paper. The scheduling model and the algorithm proposed in this paper can provide theoretical support to make the production plan of electronic control modules reasonable and improve production efficiency.


Introduction
Digital electronic detonators are advanced industrial devices that integrate traditional detonators with electronic control technologies. Electronic control modules are utilized to perform various functions, such as time constraints and safety control. The integration of advanced features, including precision, safety measures, and remote-control capabilities, significantly improves the effectiveness and safety of blasting operations. The electronic control module plays a vital role in the digital detonator, as it is responsible for precise programming and timing control to achieve explosion triggering with millisecond-level precision. The manufacturing process of the electronic control module comprises several sequential stages. It begins with the assembly and soldering of printed circuit boards (PCBs) in an automated production line of surface mount technology (SMT). Subsequently, these PCBs are processed using automated optical inspection (AOI) to transform them into semi-finished products. Then, these semi-finished products undergo a series of production and testing processes, including electrical performance testing of the semi-finished products, a visual inspection of the injection process, electrical performance testing of the finished products, a visual inspection of spot welding, electrical performance testing of the finished products using all-in-one machines, and a visual inspection of the resistance and bridge wire. The production flow of the electronic control modules is shown in Figure 1.

Figure 1. Production flow of the electronic control modules (automated production line; semi-finished production and testing processes; electronic control module).

During the practical production of electronic control modules, each manufacturing stage typically involves multiple processing and testing machines, so the production flow can be regarded as a typical flexible flow shop production environment. The allocation of machines and the sequencing of the jobs to be processed at each stage directly affect the overall efficiency of production. Good utilization of the multiple processing and testing machines corresponds to minimizing the makespan. How well due dates are met in practice is often measured by the objective of minimizing the total tardiness. Hence, it is necessary to find a scientific and reasonable scheduling scheme to ensure the overall efficiency of production.
The flexible flow shop scheduling (FFSS) problem contains features of ordinary flow shop and parallel machine scheduling at each stage. The field of FFSS encompasses two types of scheduling: single-objective and multi-objective.
For the single-objective FFSS problem, Han et al. [1], Shi et al. [2], and Malekpour et al. [3] develop intelligent optimization algorithms, namely the improved migratory bird optimization algorithm, the enhanced grey wolf algorithm, and the simulated annealing algorithm, to solve the FFSS problem with the objective of minimizing the makespan. Furthermore, Meng et al. [4] propose an enhanced genetic algorithm for minimizing energy consumption by incorporating energy-saving techniques and decoding methodologies. Azadeh et al. [5] adopt the minimization of the total delay time as the optimization objective and introduce an integrated algorithm that combines simulation, artificial neural networks, and genetic algorithms.
Reinforcement learning is a significant machine learning technique that focuses on determining optimal behaviors within a given environment to maximize expected benefits. For scheduling problems, reinforcement learning can flexibly select actions and generate scheduling policies based on the state characteristics, in line with actual production conditions. For the FFSS problem of minimizing the maximum completion time, Han et al. [6] and Zhu et al. [7] adopt the Q-learning algorithm and the proximal policy optimization algorithm, respectively. Reyna and Jimenez [8] introduce an improved Q-learning methodology for FFSS to minimize the makespan. Zhao et al. [9] study a hybrid approach combining the water wave algorithm with the Q-learning methodology; incorporating Q-learning strikes a balance between the exploration and exploitation capabilities of the algorithm and effectively addresses the FFSS problem in distributed assembly contexts. Ren et al. [10] solve the FFSS problem by employing a neural network trained with reinforcement learning in order to minimize the makespan.
Given the intricacies of production environments and the scheduling challenges encountered in real-world scenarios, there is growing interest in multi-objective FFSS problems. Li et al. [11] propose a multi-objective optimization model and develop an enhanced artificial bee colony algorithm for the FFSS problem to minimize both the makespan and the processing costs. Zhou et al. [12], Wang et al. [13], and Wang et al. [14] study the dual objectives of minimizing the total energy consumption and the makespan; to this end, they design the imperialist competitive algorithm, the decomposition-based hybrid multi-objective optimization algorithm, and the improved whale optimization algorithm, respectively. The genetic algorithm (GA) is also commonly used to solve multi-objective optimization problems (Rathnayake [15]). Kong et al. [16] design an improved GA to solve the FFSS problem with the objectives of minimizing the makespan, the total energy consumption, and the costs. Lin et al. [17] derive a hybrid optimization algorithm, which integrates the harmony search algorithm and the GA, to minimize the makespan and the average flow time. To minimize the total completion time, the total energy consumption, and carbon emissions, Shi et al. [18] consider a variable-priority dynamic scheduling optimization algorithm based on the GA. Hasani et al. [19] present the non-dominated sorting genetic algorithm (NSGA-II) to solve the multi-stage FFSS problem with the objective of minimizing the production costs and the total energy consumption. The NSGA-II algorithm is also employed by Wu et al. [20] and Feng et al. [21] for solving multi-objective FFSS problems. Gheisariha et al. [22] propose an enhanced algorithm based on the harmony search algorithm and the Gaussian mutation algorithm to optimize the makespan and the average delay time simultaneously. Zhang et al. [23] employ a three-stage method based on decomposition to solve the FFSS problem with the objective of minimizing the makespan and the total energy consumption. Mousavi et al. [24] present a heuristic algorithm based on the simulated annealing algorithm to solve the FFSS problem with the objective of minimizing the makespan and the total tardiness. Schulz et al. [25] propose an iterated local search algorithm to solve the FFSS problem with the objective of minimizing the makespan, the total energy costs, and the peak power. In addition, for reentrant FFSS, random FFSS, and blocking FFSS, some scholars have considered multiple optimization indicators, including the makespan, the total energy consumption, the total tardiness and earliness, and the advance quantity. The improved multi-objective ant lion algorithm [26], the multi-objective artificial bee colony algorithm [27], and the migrating birds optimization algorithm [28] have also been applied to the multi-objective FFSS problem. In summary, current research on the multi-objective FFSS problem primarily involves the development of intelligent optimization algorithms.
Although intelligent optimization algorithms are widely applied in the scheduling field, the quality of their solutions depends to a great extent on the setting of the initial values. If the initial values are not selected properly, both the convergence speed and the quality of the solution are greatly affected. Therefore, some scholars have tried new methods, such as deep reinforcement learning, to solve scheduling problems.
Deep reinforcement learning combines the ability of reinforcement learning to cope with large-scale state spaces and dynamically changing environments with the ability of deep learning to acquire knowledge from historical data via multi-layer neural networks. This combination enables the optimization of objective functions in more complex settings and has gained significant attention for solving scheduling problems in recent years. Due to the complexity of multi-objective problems, few scholars have used deep reinforcement learning to solve FFSS problems. At present, most scholars have applied deep reinforcement learning to the flexible job shop scheduling (FJSS) problem. Du et al. [29] propose a Deep Q-Network (DQN) to address the FJSS problem involving the objectives of crane transportation and preparation time. Their experimental results show that the DQN algorithm can obtain better solutions than intelligent optimization algorithms such as the GA. Luo et al. [30] propose a two-hierarchy DQN algorithm to solve the dynamic FJSS problem with the objective of optimizing both the total weighted tardiness and the average machine utilization rate. Du et al. [31] propose a hybrid multi-objective optimization algorithm, combining the estimation of distribution algorithm and the DQN algorithm, to address the FJSS problem with the objective of optimizing both the makespan and the total electricity price simultaneously. Wang et al. [32] investigate the occurrence of dynamic events in the scheduling problem and establish a multi-objective FJSS model to simulate the production environment. On the basis of the DQN algorithm, they incorporate a target network and formulate a Double Deep Q-Network (DDQN) algorithm. Their experiments demonstrate the superiority and stability of the approach in comparison to various combined rules, widely recognized scheduling rules, and conventional deep Q-learning algorithms. Wu et al. [33] propose a dual-layer DDQN structure to solve the dynamic FJSS problem with new job arrivals in order to optimize both the total delay time and the makespan.
The contributions of this paper are as follows: The multi-objective FFSS problem studied here is motivated by the actual production process of electronic control modules in the electronic detonators industry. On the basis of the DQN algorithm, the overestimation problem caused by the maximization process is addressed by using a target network. Additionally, the action value is decomposed into the optimal state value and the optimal advantage value by using a dueling network. This approach is particularly effective in handling states that exhibit lower correlation with actions, and it enhances algorithmic stability. Furthermore, by integrating the idea of experience replay, a D3QN algorithm is designed to solve the multi-objective FFSS problem and obtain a feasible schedule.
The remaining sections of this paper are structured as follows: Section 2 describes the problem and presents a mixed integer programming (MIP) model for the multi-objective FFSS problem. In Section 3, we present the D3QN algorithm to solve this scheduling problem. Section 4 reports the experimental results, and lastly, Section 5 presents the conclusion.

Problem Description
The multi-objective FFSS problem studied in this paper can be described as follows: There is a set of n jobs to be processed at s stages, where each stage has several identical parallel machines. Each job is associated with the distinct processing time on each machine at each stage. The jobs are processed sequentially at all stages. Several assumptions are made as follows: (1) Each job can only be processed on one machine at any time. (2) Each machine can process only one job at any time. (3) Job processing on a machine cannot be interrupted. (4) The job can be processed on any machine at each stage. (5) Each job has its due date. Table 1 lists all the parameters used in the model.

Table 1. Parameters used in the model.

Parameter        Meaning
$p_{i,j}$        processing time of job $i$ at stage $j$
$d_i$            due date of job $i$
$S_{i,j}$        starting time of job $i$ at stage $j$
$C_{i,j}$        completion time of job $i$ at stage $j$
$C_{\max}$       completion time of the last job (makespan)
$X_{i,j,k}$      $X_{i,j,k} = 1$ if and only if job $i$ is processed on machine $k$ at stage $j$; otherwise $X_{i,j,k} = 0$
$Y_{i,i',j}$     $Y_{i,i',j} = 1$ if and only if job $i$ is processed after job $i'$ at stage $j$; otherwise $Y_{i,i',j} = 0$

The objective of the FFSS problem is to find a feasible production schedule such that both the makespan and the total tardiness are minimized. The multi-objective FFSS problem is described by the triplet $FF_s || C_{\max}, T$, where $FF_s$ means that the flexible flow shop involves $s$ stages, and $C_{\max}$ and $T$ denote the makespan and the total tardiness, respectively. Hence, the MIP model presented in this paper consists of Objectives (1) and (2), subject to Constraints (3)-(7). Formulas (1) and (2) represent the objectives of minimizing the completion time of the last job at the final stage and the total tardiness of all jobs, respectively. Constraint (3) indicates that each job must be processed at all stages and is processed exactly once, on one machine, at each stage. Constraint (4) specifies the sequence of different jobs at the same stage. Constraint (5) specifies the job sequence on the same machine. Constraint (6) implies that the starting time of a job at each stage is determined by its completion time at the previous stage. Constraint (7) implies that the completion time of a job at each stage is determined by its starting time and its processing time.
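For orientation, the verbal descriptions of Formulas (1)-(7) can be written out as the following sketch. It is one plausible formulation, not necessarily the exact model of this paper: the machine set $M_j$ at stage $j$, the sufficiently large constant $G$, the reading of $Y_{i,i',j} = 1$ as "job $i$ is processed after job $i'$", and the specific forms chosen for Constraints (4) and (5) are all assumptions.

```latex
\begin{align}
\min\quad & C_{\max} = \max_{1 \le i \le n} C_{i,s} \tag{1}\\
\min\quad & T = \sum_{i=1}^{n} \max\{\, C_{i,s} - d_i,\ 0 \,\} \tag{2}\\
\text{s.t.}\quad
& \sum_{k \in M_j} X_{i,j,k} = 1, && \forall i,\ j \tag{3}\\
& Y_{i,i',j} + Y_{i',i,j} = 1, && \forall i \ne i',\ j \tag{4}\\
& S_{i,j} \ge C_{i',j} - G\,\bigl(3 - Y_{i,i',j} - X_{i,j,k} - X_{i',j,k}\bigr), && \forall i \ne i',\ j,\ k \in M_j \tag{5}\\
& S_{i,j} \ge C_{i,j-1}, && \forall i,\ j \ge 2 \tag{6}\\
& C_{i,j} = S_{i,j} + p_{i,j}, && \forall i,\ j \tag{7}
\end{align}
```

In this sketch, Constraint (5) only becomes binding when jobs $i$ and $i'$ share machine $k$ at stage $j$ and $i$ follows $i'$; in every other case the large constant $G$ relaxes it.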

D3QN Algorithm for Solving Problem $FF_s || C_{\max}, T$
The problem $FF_s || C_{\max}, T$ has been proved to be NP-hard. As the size of the problem expands, its complexity increases exponentially, making it challenging for traditional exact algorithms to find optimal solutions. The deep reinforcement learning approach combines reinforcement learning with neural networks and autonomously acquires representations of environments and tasks from raw data, which makes it more suitable for intricate decision-making problems.
Initially, the selection of the machines at each stage and the job sequence to be processed on each machine are determined by a sequential decision-making process. Subsequently, the scheduling problem is converted into a Markov decision process, where the state is defined by the parameters of the machines and the jobs, including the average utilization rate of all machines, the average processing completion rate of all jobs, and the average processing tardiness rate of all jobs. Heuristic rules are used as actions, and the immediate rewards are calculated from the current makespan and total tardiness, as well as from the change in state following the execution of the actions.
Considering the increasing computational complexity of the DQN algorithm, coupled with the challenges of parameter adjustment and the tendency to overfit, a target network is incorporated to address the Q-value overestimation of the algorithm. In addition, the idea of Q-value decomposition is introduced: the proposed D3QN algorithm uses a dueling network to combine the state value function and the advantage function, together with an experience replay buffer. This method is applied to the problem $FF_s || C_{\max}, T$ with the objective of optimizing the scheduling scheme and ultimately improving enterprise productivity.

Problem Transformation
The problem $FF_s || C_{\max}, T$ is solved by the D3QN algorithm. The primary step is to transform the scheduling problem into a Markov decision process. In this process, the state is utilized to depict the variations and characteristics of the overall manufacturing system environment. Actions are used to represent the decision-making behavior of the agent, while the rewards are employed to reflect the outcomes of the interactions between the agent and the environment. The definitions of the state, the action, and the reward are outlined as follows.

State Features
The state is defined by the variations in the characteristics of the machines and the jobs. Given the fluctuation of certain properties and the inconsistency of their dimensions, six critical features are selected to describe the states of the scheduling problem. State features 1 and 2 describe the characteristics of the machines: State feature 1 represents the average utilization rate of all machines. State feature 2 represents the standard deviation of the utilization rates of all machines. State features 3-6 describe the characteristics of the jobs: State feature 3 represents the average processing completion rate of all jobs. State feature 4 represents the standard deviation of the processing completion rates of all jobs. State feature 5 represents the average processing tardiness rate of all jobs. State feature 6 represents the standard deviation of the processing tardiness rates of all jobs.
Let t denote a decision moment. A decision moment refers to the moment when the agent needs to choose the action that will be rewarded the most according to the state features. The agent makes decisions at the moment of the state transition. The decision moment t is the moment when the state transitions for the t-th time. At the decision moment t, the state features are defined as follows.
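State feature 1, $U_{ave}(t)$: the average utilization rate of all machines. In its definition, the indicator $\alpha$ equals 1 if job $i$ is processed on machine $k$ and 0 otherwise; $CT_k(t)$ denotes the total overload time of machine $k$; and $OP_i(t)$ denotes the number of stages that job $i$ has completed.
State feature 2, $U_{std}(t)$: the standard deviation of the machine utilization rates.
State feature 3, $CRJ_{ave}(t)$: the average processing completion rate of all jobs, where the processing completion rate of job $i$ is taken relative to the number of stages $s$.
State feature 4, $CRJ_{std}(t)$: the standard deviation of the processing completion rates of all jobs.
State feature 5, $Tard_{ave}(t)$: the average processing tardiness rate of all jobs, where $Tard_i(t)$ represents the tardiness rate of job $i$.
State feature 6, $Tard_{std}(t)$: the standard deviation of the processing tardiness rates of all jobs.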
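The assembly of the six state features can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact formulas: machine utilization is assumed to be accumulated processing time divided by $CT_k(t)$, the completion rate of job $i$ is assumed to be $OP_i(t)/s$, the tardiness rates are passed in as given values, and the population standard deviation is used. The function name and its arguments are hypothetical.

```python
from statistics import mean, pstdev


def state_features(busy_time, ct, op, s, tard):
    """Return [U_ave, U_std, CRJ_ave, CRJ_std, Tard_ave, Tard_std].

    busy_time[k]: processing time accumulated on machine k so far (assumed input)
    ct[k]:        CT_k(t) of machine k; a machine with ct[k] == 0 gets utilization 0
    op[i]:        OP_i(t), the number of stages job i has completed
    s:            number of stages
    tard[i]:      Tard_i(t), tardiness rate of job i (passed in, formula assumed unknown)
    """
    # Assumed utilization: busy time relative to CT_k(t).
    util = [b / c if c > 0 else 0.0 for b, c in zip(busy_time, ct)]
    # Assumed completion rate: completed stages relative to all s stages.
    crj = [o / s for o in op]
    return [
        mean(util), pstdev(util),
        mean(crj), pstdev(crj),
        mean(tard), pstdev(tard),
    ]


features = state_features(
    busy_time=[4, 6], ct=[8, 8], op=[1, 2, 3], s=6, tard=[0.0, 0.5, 0.0]
)
```

All six values are ratios or spreads of ratios, so jobs and machines with different raw scales contribute to the state on a comparable footing.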

Action
The actions in the D3QN algorithm for solving the problem $FF_s || C_{\max}, T$ are determined based on the decision of the jobs and the machines. To minimize the action space and describe the actual production process accurately, the rules for job selection and machine selection are designed with reference to the heuristic scheduling rules and the objective functions $C_{\max}$ and $T$. The rules for job selection are outlined as follows.

1. SPT rule: Select a job using the shortest processing time (SPT) rule. The jobs are indexed according to the SPT rule;
2. LPT rule: Select a job using the longest processing time (LPT) rule. The jobs are indexed according to the LPT rule;
3. EDD rule: Select a job using the earliest due date (EDD) rule. The jobs are indexed according to the EDD rule;
4. ODD rule: Select a job using the operation due date (ODD) rule. The jobs are indexed according to the ODD rule;
5. SRP rule: Select a job using the shortest remaining processing time (SRP) rule. The jobs are indexed according to the SRP rule;
6. LNP rule: Select a job using the longest processing time for the next process (LNP) rule. The jobs are indexed according to the LNP rule;
7. SNP rule: Select a job using the shortest processing time for the next process (SNP) rule. The jobs are indexed according to the SNP rule.

The rules for selecting a machine for the jobs are as follows:

1. FCFS rule: Select the machine using the first come, first served (FCFS) rule. The machines are indexed according to the FCFS rule;
2. WINQ rule: Select the machine using the shortest total processing time (WINQ) rule. The machines are indexed according to the WINQ rule.
By combining the above two rule sets, a total of 14 combination rules are obtained, which serve as the actions in the D3QN algorithm for solving the problem $FF_s || C_{\max}, T$. The specific actions are shown in Table 2. In the initial state $s_0$, all machines are idle, and the agent selects a job and allocates it to a machine. Subsequently, the system transitions into a new state whenever a job finishes processing on a machine. At a decision moment $t$, the agent selects an action $a_t$ based on $s_t$. The state then transfers into $s_{t+1}$ at the next decision moment, while the agent receives a delayed reward $r_t$. As previously mentioned, state transitions occur when any job completes a certain process on a machine, and the agent then needs to select an action in the new state. Once all the jobs have completed the last process, the agent finishes the work.
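The construction of the action set can be sketched as follows. The code is illustrative only: the ordering of the pairs is an assumption (Table 2 fixes the actual numbering), `select_job` is a hypothetical helper, and only three of the seven job rules are spelled out.

```python
from itertools import product

JOB_RULES = ["SPT", "LPT", "EDD", "ODD", "SRP", "LNP", "SNP"]
MACHINE_RULES = ["FCFS", "WINQ"]

# Each action is a (job rule, machine rule) pair: 7 x 2 = 14 actions.
# The ordering produced here is an assumption, not the numbering of Table 2.
ACTIONS = list(product(JOB_RULES, MACHINE_RULES))


def select_job(rule, waiting, proc_time, due_date):
    """Pick one job from `waiting` for the current stage (hypothetical helper).

    proc_time[i]: processing time of job i at the current stage
    due_date[i]:  due date d_i of job i
    """
    if rule == "SPT":
        return min(waiting, key=lambda i: proc_time[i])
    if rule == "LPT":
        return max(waiting, key=lambda i: proc_time[i])
    if rule == "EDD":
        return min(waiting, key=lambda i: due_date[i])
    raise NotImplementedError(f"rule {rule} is not sketched here")


job_rule, machine_rule = ACTIONS[0]
chosen = select_job(job_rule, [0, 1, 2], {0: 5, 1: 3, 2: 7}, {0: 20, 1: 30, 2: 15})
```

Using rule pairs as actions keeps the action space at a fixed size of 14 regardless of how many jobs or machines an instance contains.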

Reward
In a deep reinforcement learning algorithm, the reward function can be formulated to guide the learning process of the agent to meet the requirements of multi-objective optimization. The reward function of the D3QN algorithm is formulated by evaluating the differences between the current state and the state after the action is executed in two aspects: the makespan and the total tardiness. In the process of action selection, the ε-greedy strategy is adopted to balance exploration and exploitation, with the aim of better capturing the relationships among the states, the actions, and the rewards. The specific steps are outlined as follows.
Step 1. Exploration. Using the ε-greedy strategy, a random number is compared with ε. If the random number is less than ε, the agent randomly selects an action from the 14 job-machine combination rules. This strategy ensures that the agent can explore the environment, rather than being limited solely to the existing optimal actions.
Step 2. Exploitation. According to the ε-greedy strategy, the exploitation phase is entered when the random number exceeds ε. By calculating the Q-value of each available action in the current state, the agent selects the action with the maximum Q-value and executes it.
Step 3. The reward is calculated from the makespan and the total tardiness through two terms: $f_1(t)$ represents the relationship between the makespan and the reward, and $f_2(t)$ represents the relationship between the total tardiness and the reward.
Step 4. The instant reward $r_t$ at decision moment $t$ combines $f_1(t)$ and $f_2(t)$. The final reward function, denoted as $R$, is defined as the summation of the rewards across $K$ decision moments, as shown in Equation (17):

$$R = \sum_{t=1}^{K} r_t \tag{17}$$

where $K$ is the total number of moments at which the agent needs to make a decision. With the reward function formulated above, the agent can be directed towards efficient learning and decision-making in addressing the problem $FF_s || C_{\max}, T$ with the objective of optimizing both the makespan and the total tardiness.
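Steps 1 and 2 and the cumulative reward can be sketched as follows. This is a minimal illustration: the function names are hypothetical, the Q-values are passed in as a plain list with one entry per action, and the reward terms $f_1(t)$ and $f_2(t)$ are not modelled.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, greedy otherwise."""
    if rng.random() < epsilon:
        # Step 1 (exploration): pick any of the available actions uniformly.
        return rng.randrange(len(q_values))
    # Step 2 (exploitation): pick the action with the largest Q-value.
    return max(range(len(q_values)), key=q_values.__getitem__)


def episode_return(rewards):
    """Cumulative reward R: the sum of the instant rewards over all decision moments."""
    return sum(rewards)


greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
total = episode_return([1.0, -0.5, 2.0])
```

With `epsilon=0.0` the choice is always greedy, and with `epsilon=1.0` it is always random, which makes the two branches easy to check in isolation.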

D3QN Algorithm
The traditional DQN algorithm uses a single neural network to fit the optimal action value function. To achieve a more efficient approximation of the optimal action value function, a dueling network comprising two subnetworks is designed for the D3QN algorithm. These two subnetworks are used to approximate the state value function and the advantage function, respectively. The D3QN algorithm decomposes the optimal action value into the optimal state value and the optimal advantage value by using the dueling network. In the D3QN algorithm, the inputs of the dueling network are the states of the machines and the jobs. The outputs of the two subnetworks are the value of the state and the advantage of each action, respectively, and together they determine the optimal action value. The specific calculation is given by Formula (18), where $Q$ represents the dueling network, $Q(s, a; w)$ represents the action value of action $a$ in state $s$, and $w = (w^V, w^D)$, where $w^V$ and $w^D$ are the parameters of the optimal state value network and the optimal advantage network, respectively. $V(s; w^V)$ denotes the state value produced by the state value subnetwork.
The dueling network gives the agent a more accurate and efficient capacity for learning the state value function of the problem $FF_s || C_{\max}, T$. The network structure of the target network in the D3QN algorithm is identical to that of the dueling network. The issue of overestimating the action value is addressed by calculating action values through the target network. The incorporation of experience replay in the D3QN algorithm breaks the correlation of the data sequence, making the training data mutually independent. Additionally, it permits the reuse of the collected data, so that the objective of the algorithm is achieved with less data. The D3QN algorithm is designed to solve the problem $FF_s || C_{\max}, T$ by combining the target network, the dueling network, and the experience replay. The implementation flow of the D3QN algorithm is depicted in Figure 2. The specific steps of the D3QN algorithm presented in this paper are given below.
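The way a dueling network combines its two outputs can be sketched as follows. The sketch uses the aggregation that is standard for dueling networks, in which the state value is added to each advantage after subtracting a baseline (the mean advantage, or alternatively the maximum advantage). Which variant Formula (18) uses is an assumption here, and the function name is hypothetical.

```python
def dueling_q_values(state_value, advantages, use_max=False):
    """Combine V(s) and the per-action advantages D(s, a) into Q(s, a).

    Q(s, a) = V(s) + D(s, a) - baseline, where the baseline is the mean
    advantage by default or the maximum advantage when use_max is True.
    """
    if use_max:
        baseline = max(advantages)
    else:
        baseline = sum(advantages) / len(advantages)
    return [state_value + a - baseline for a in advantages]


q_mean = dueling_q_values(2.0, [1.0, 0.0, -1.0])
q_max = dueling_q_values(2.0, [1.0, 0.0, -1.0], use_max=True)
```

Subtracting a baseline removes the ambiguity of the decomposition: without it, a constant could be shifted between the state value and the advantages without changing the resulting action values.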
Step 1. Initialize the parameters of the problem $FF_s || C_{\max}, T$ and of the D3QN algorithm. Scheduling problem parameters: the set of jobs $N$, the set of machines $M$, the processing time $p_{i,j}$ of job $i \in N$ at stage $j \in S$, and the due date $d_i$ of job $i \in N$. Algorithm parameters: the learning rate $\alpha$, the exploration rate $\varepsilon$, the discount factor $\gamma$, the sample batch size batch_size for updating the network, the number of steps $C$ between target network updates, and the maximum iteration number Max_episode. Initialize the dueling network $Q$ and the target network $Q^{-}$ with the random dueling network parameters $w_{now}$ and the target network parameters $w^{-}_{now}$. Let episode = 0.
Step 2. Define the initial state $s_0 = \{U_{ave}, U_{std}, CRJ_{ave}, CRJ_{std}, Tard_{ave}, Tard_{std}\}$.
Step 3. Choose an action $a_t$ randomly with probability $\varepsilon$; otherwise, use the dueling network to choose the optimal action for the current state, $a_t = \arg\max_{a \in A} Q(s_t, a; w_{now})$. Execute action $a_t$, update the state to $s_{t+1}$, receive the reward $r_t$, and store the resulting quadruple $(s_t, a_t, r_t, s_{t+1})$ in the experience replay $D$.
Step 4. Check whether the quantity of data in $D$ is greater than batch_size. If it is, randomly take batch_size quadruples $(s_j, a_j, r_j, s_{j+1})$, $j \in \{0, 1, 2, \ldots, |D|\}$, for training; otherwise, return to Step 3. The training process is as follows:
(1) With the dueling network parameters $w_{now}$ and the state $s_j$, use the dueling network $Q$ for forward propagation. According to Formula (18), obtain the q value of the action $a_j$, $\hat{q}_j = Q(s_j, a_j; w_{now})$.
(2) Use the dueling network to select the action with the maximum q value in state $s_{j+1}$, $a^{*} = \arg\max_{a \in A} Q(s_{j+1}, a; w_{now})$.
(3) Use the target network to obtain the q value of state $s_{j+1}$ under $a^{*}$, $\hat{q}_{j+1} = Q^{-}(s_{j+1}, a^{*}; w^{-}_{now})$.
(4) Calculate the TD target $y_j = r_j + \gamma \hat{q}_{j+1}$ and the TD error $\delta_j = \hat{q}_j - y_j$.
(5) Perform backpropagation on the dueling network and obtain the gradient $\nabla_w Q(s_j, a_j; w_{now})$.
(6) Update the dueling network parameters by the average stochastic gradient descent algorithm, $w_{new} \leftarrow w_{now} - \alpha \delta_j \nabla_w Q(s_j, a_j; w_{now})$.
(7) Assign the parameters of the dueling network to the target network after $C$ steps.
Step 5. Determine whether all jobs have been processed. If so, go to Step 6; otherwise, return to Step 3.
Step 6. Determine whether episode has reached Max_episode. If it is less than Max_episode, set episode = episode + 1 and return to Step 2. Otherwise, output the optimal scheduling solutions and end the process.
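The replay buffer of Step 3 and the target computation of Step 4 (2)-(4) can be sketched as follows. This is an illustration under assumptions, not the paper's code: `q_online` and `q_target` are hypothetical callables that map a state to a list of action values, and the `done` flag for terminal transitions is an addition that the quadruples in the text do not carry.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity experience replay D with uniform random sampling."""

    def __init__(self, capacity):
        self.data = deque(maxlen=capacity)

    def store(self, s, a, r, s_next, done=False):
        self.data.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(list(self.data), batch_size)

    def __len__(self):
        return len(self.data)


def double_dqn_target(r, s_next, done, q_online, q_target, gamma):
    """TD target y = r + gamma * Q^-(s', a*), with a* chosen by the online network."""
    if done:
        return r  # assumed convention: no bootstrap after the final transition
    online = q_online(s_next)
    a_star = max(range(len(online)), key=online.__getitem__)  # Step 4 (2)
    return r + gamma * q_target(s_next)[a_star]               # Step 4 (3)-(4)


def td_error(q_hat, y):
    """TD error delta = q_hat - y, as in Step 4 (4)."""
    return q_hat - y


buffer = ReplayBuffer(capacity=100)
for step in range(5):
    buffer.store(s=step, a=0, r=1.0, s_next=step + 1)
batch = buffer.sample(3)

y = double_dqn_target(
    r=1.0, s_next=None, done=False,
    q_online=lambda s: [1.0, 5.0, 2.0],
    q_target=lambda s: [10.0, 3.0, 20.0],
    gamma=0.95,
)
```

In the example, the online network prefers action 1, so the target uses the target network's value for action 1 (3.0) rather than its own maximum (20.0); this decoupling of selection and evaluation is what limits overestimation.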
The flow chart of the D3QN algorithm is shown in Figure 3.
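The action selection in Step 3 and the TD computation in sub-steps (1)-(4) of Step 4 can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in the experiments: the function names `select_action` and `double_dqn_td` are hypothetical, and the q values are passed in as plain lists instead of being produced by the dueling and target networks.

```python
import random


def select_action(q_values, epsilon, rng=random):
    """Step 3: explore with probability epsilon, otherwise take the arg max."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def double_dqn_td(q_online_sj, q_online_sj1, q_target_sj1, a_j, r_j, gamma):
    """Sub-steps (1)-(4) of Step 4 for a single quadruple (s_j, a_j, r_j, s_j+1).

    q_online_sj  : q values of the dueling network at s_j    (assumed given)
    q_online_sj1 : q values of the dueling network at s_j+1  (assumed given)
    q_target_sj1 : q values of the target network at s_j+1   (assumed given)
    """
    q_j = q_online_sj[a_j]                                   # (1) q value of a_j
    a_star = max(range(len(q_online_sj1)),
                 key=lambda a: q_online_sj1[a])              # (2) dueling network selects
    q_next = q_target_sj1[a_star]                            # (3) target network evaluates
    y_j = r_j + gamma * q_next                               # (4) TD target
    delta_j = q_j - y_j                                      # (4) TD error
    return a_star, y_j, delta_j


# Illustrative numbers only; they are not taken from the experiments.
a_star, y_j, delta_j = double_dqn_td(
    q_online_sj=[1.0, 2.0, 0.5],
    q_online_sj1=[0.2, 0.1, 0.9],
    q_target_sj1=[0.3, 0.4, 2.0],
    a_j=1, r_j=1.0, gamma=0.95,
)
```

Selecting $a^*$ with the dueling network and evaluating it with the target network is what distinguishes this target from one that takes the maximum of the target network's own q values.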

Computational Experiments
In this section, we report the computational experiments that evaluate the performance of the D3QN algorithm for the problem $FF_s \| C_{\max}, T$. To validate the adaptability of the D3QN algorithm to various problem sizes, the test instances are randomly generated, and the results of the D3QN algorithm are compared with those of Gurobi and the heuristic rules.

Experimental Environment and Parameter Settings
The D3QN algorithm is coded in Python, and the program is run in PyCharm Community Edition 2021.3.3. The experiments are conducted on a personal computer with an Intel(R) Core i5-6300HQ CPU @ 2.30 GHz and 8.00 GB of RAM.
The parameters of the D3QN algorithm are set as follows: $\alpha$ is the learning rate, which controls the magnitude of the weight-parameter updates in the training process of the neural network, and it is set to $\alpha = 0.001$; $\gamma$ is the discount factor used to calculate the cumulative rewards, and it is set to $\gamma = 0.95$; $\varepsilon$ is the greedy factor (the exploration rate of the $\varepsilon$-greedy policy), and it is set to $\varepsilon = 0.6$. The upper limit of iterations Max_episode is 500. The parameters of the problem $FF_s \| C_{\max}, T$ are set as follows: the processing times of the jobs at each stage are drawn from uniform distributions, with the processing times of the electrical performance testing $p_{i1}, p_{i3}, p_{i5} \sim U[1, 10]$ and the processing times of the visual testing $p_{i2}, p_{i4}, p_{i6} \sim U[11, 20]$.
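As an illustration of these settings, the following sketch collects the algorithm parameters in a dictionary and draws the processing times of a random instance. It is written under stated assumptions: the names `HYPERPARAMS` and `generate_processing_times` are hypothetical, and the uniform distributions are assumed to be over integers, which the text does not state explicitly.

```python
import random

# Algorithm parameters as listed in the text above.
HYPERPARAMS = {
    "alpha": 0.001,      # learning rate
    "gamma": 0.95,       # discount factor
    "epsilon": 0.6,      # greedy factor
    "max_episode": 500,  # upper limit of iterations
}


def generate_processing_times(n_jobs, n_stages=6, seed=None):
    """Return p[i][j], the processing time of job i at stage j + 1.

    Stages 1, 3, 5 (electrical performance testing) use U[1, 10];
    stages 2, 4, 6 (visual testing) use U[11, 20].
    Integer-valued draws are an assumption of this sketch.
    """
    rng = random.Random(seed)
    times = []
    for _ in range(n_jobs):
        row = []
        for stage in range(1, n_stages + 1):
            if stage % 2 == 1:
                row.append(rng.randint(1, 10))
            else:
                row.append(rng.randint(11, 20))
        times.append(row)
    return times


processing_times = generate_processing_times(n_jobs=5, seed=0)
```

Fixing the seed makes a generated instance reproducible, so that different algorithms can be compared on identical data.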

The Experimental Results of the Model and the D3QN Algorithm
To validate the effectiveness of the model and the D3QN algorithm, the Gurobi optimization software (Gurobi Optimizer version 10.0.2 build v10.0.2rc0 (win64)) is employed to solve five groups of fifteen experimental instances of different sizes. Each instance is subject to a maximum allowable runtime of one hour. The due dates of the jobs at each stage follow a uniform distribution, denoted as $d_i \sim U[12, 40]$. The comparison of the solutions generated using Gurobi and the D3QN algorithm is shown in Table 3. Columns 4-8 report the incumbents, the best bounds, and the runtimes obtained using Gurobi for the instances of the multi-objective mixed integer programming model. The symbol "-" denotes that Gurobi is unable to obtain the global optimal solution within one hour. The $C_{\max}$ generated by the D3QN algorithm is longer than that of Gurobi by 6.061% and shorter than that of Gurobi by 0. The total tardiness $T$ generated by the D3QN algorithm is longer than that of Gurobi by 10.618% and shorter than that of Gurobi by 0. As the size of the problem increases, the D3QN algorithm runs faster than Gurobi, and Gurobi cannot solve the instance of the problem with a size of $n = 9$ within one hour.
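The percentage differences reported above can be read as relative gaps with respect to the Gurobi solution. The sketch below shows one common way to compute such a gap; the exact formula used for Table 3 is not stated in the text, so both the formula and the function name `relative_gap_percent` are assumptions, and the numbers in the example are illustrative rather than taken from Table 3.

```python
def relative_gap_percent(d3qn_value, gurobi_value):
    """Relative gap of the D3QN objective value with respect to Gurobi, in percent.

    A positive gap means the D3QN value is larger (worse for a minimization
    objective); a gap of 0 means both methods reach the same value.
    """
    if gurobi_value == 0:
        # With a zero reference value, the gap is 0 only if both values are 0.
        return 0.0 if d3qn_value == 0 else float("inf")
    return 100.0 * (d3qn_value - gurobi_value) / gurobi_value


# Illustrative values only.
gap = relative_gap_percent(d3qn_value=105, gurobi_value=100)  # 5.0
```

The zero-reference case matters for the total tardiness, which can be 0 when every job meets its due date.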

The Results for Large-Size Instances
To verify the effectiveness of the D3QN algorithm, experiments are conducted with problems of various sizes. The parameters of the problem $FF_s \| C_{\max}, T$ are set as follows: the number of stages is $s = 6$, and the numbers of machines $m_j$ at stages $j = 1, \ldots, 6$ are 8, 2, 8, 1, 4, and 1, respectively. The due dates of the jobs at each stage follow a uniform distribution, $d_i \sim U[36, 90]$. The number of jobs $n$ is 15, 30, 50, 100, or 200. The makespan $C_{\max}$ and the total tardiness $T$ obtained by the different heuristic algorithms, the GA, and the D3QN algorithm are shown in Table 4. The algorithms Rule1-Rule14 in Table 4 are constructed according to the heuristic rules corresponding to the actions of the job and the machine. As shown in Table 4, the D3QN algorithm is compared with several heuristic algorithms on the different instances of the scheduling problem: $C_{\max}$ is reduced by at least 0.19% and at most 26.17%, and $T$ is reduced by at least 0.18% and at most 42.92%. The experimental results illustrate that the D3QN algorithm can effectively obtain better solutions for the problem $FF_s \| C_{\max}, T$ than the 14 heuristic rules and the GA.
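For reference, the two objectives compared in Table 4 can be computed from the completion times of the jobs at the last stage. The sketch below uses the standard definitions, with the makespan as the largest completion time and the total tardiness as the sum of the positive parts of completion time minus due date. The function names and the three-job example are hypothetical and are not data from the experiments.

```python
# Machines per stage for the large-size instances, as given in the text.
MACHINES_PER_STAGE = [8, 2, 8, 1, 4, 1]


def makespan(completion_times):
    """C_max: the completion time of the last job to finish."""
    return max(completion_times)


def total_tardiness(completion_times, due_dates):
    """T: the sum over jobs of max(0, C_i - d_i)."""
    return sum(max(0, c - d) for c, d in zip(completion_times, due_dates))


# Illustrative three-job example (not taken from Table 4).
completion = [40, 95, 60]
due = [36, 90, 70]
c_max = makespan(completion)          # 95
tard = total_tardiness(completion, due)  # 4 + 5 + 0 = 9
```

A job that finishes before its due date contributes 0 to the total tardiness, so early completion of one job cannot offset the lateness of another.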

Ablation Experiment Results of the D3QN Algorithm
The purpose of an ablation experiment is to assess the influence of individual elements on performance by systematically eliminating or modifying them. Based on the DQN algorithm, the DDQN algorithm is constructed by incorporating a target network, while the D3QN algorithm is formulated by integrating both a target network and a dueling network. Ablation experiments are designed for these three algorithms to assess the influence of incorporating a target network and a dueling network when solving instances of the problem $FF_s \| C_{\max}, T$ of different sizes, as shown in Table 5.
Based on the outcomes presented in Table 5, the DDQN algorithm, which incorporates only the target network, offers no evident improvement over the original DQN algorithm. However, with the incorporation of both the target network and the dueling network in the D3QN algorithm, significant improvements in performance are achieved when solving scheduling problems of five different sizes. The D3QN algorithm not only yields superior results in terms of the makespan and the total tardiness, but also exhibits enhanced convergence stability. Taking 100 jobs as an example, the variation trends of the makespan with respect to the number of iterations are shown in Figure 4a, and those of the total tardiness are shown in Figure 4b. The experimental results indicate that the target network and the dueling network reinforce each other and together achieve better performance.
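To make the role of the dueling network concrete, the sketch below combines a state value and per-action advantages into q values. It assumes the standard mean-subtracted aggregation, $Q(s, a) = V(s) + A(s, a) - \frac{1}{|A|}\sum_{a'} A(s, a')$; whether Formula (18) of this paper takes exactly this form cannot be confirmed from the present section, and the function name `dueling_q_values` is hypothetical.

```python
def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a) with mean-subtracted advantages.

    Subtracting the mean advantage makes the decomposition identifiable:
    the mean of the resulting q values equals the state value.
    """
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + adv - mean_adv for adv in advantages]


# Illustrative numbers only.
q_values = dueling_q_values(state_value=2.0, advantages=[1.0, 0.0, -1.0])
# q_values == [3.0, 2.0, 1.0]
```

Because the state value is shared by all actions, every training sample updates the estimate for the whole state rather than for a single action, which is one commonly cited reason for more stable convergence.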

Results Analysis of Scheduling Problem Based on D3QN Algorithm
Taking the instance with $n = 15$ jobs as an example, the D3QN algorithm obtains the optimal scheduling solution with a makespan of 272 and a total tardiness of 1478 for the problem $FF_s \| C_{\max}, T$. The Gantt chart of the optimal schedule for the instance with $n = 15$ is shown in Figure 5.

Action Selections Based on the D3QN Algorithm
To analyze the usage frequency of the various actions involved in the optimal strategy, a usage frequency distribution diagram of the 14 job-machine combination rules is generated from the experimental results. As shown in Figure 6, the actions that have been used more than 5000 times include ODD-FCFS, SRP-FCFS, EDD-WINQ, and SRP-WINQ. These actions contribute significantly to achieving the optimal solutions. The other actions are used with relatively even frequency, and their performance is not particularly noteworthy.
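A usage frequency distribution of this kind can be obtained by counting how often each action appears in the log of selected actions. The following sketch assumes that the selected rule names are recorded in a list during training; the function name `frequent_actions` and the short log are hypothetical, and the threshold in the example is scaled down from the 5000 uses mentioned above.

```python
from collections import Counter


def frequent_actions(action_log, threshold):
    """Return the actions used more than `threshold` times, with their counts."""
    counts = Counter(action_log)
    return {name: count for name, count in counts.items() if count > threshold}


# Illustrative log only; the rule names follow the text, the counts do not.
log = ["EDD-WINQ"] * 3 + ["SRP-FCFS"] * 2 + ["ODD-FCFS"]
top = frequent_actions(log, threshold=2)
# top == {"EDD-WINQ": 3}
```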

Conclusions
In this paper, the problem $FF_s \| C_{\max}, T$, arising from the production process of electronic control modules in the digital electronic detonator industry, is considered. The objective is to minimize both the makespan and the total tardiness. The scheduling problem is formulated as a multi-objective MIP model. The D3QN algorithm, which builds on the DQN algorithm by integrating a target network, a dueling network, and an experience replay buffer, is designed to solve the proposed scheduling problem. The experiments comparing the D3QN algorithm with the heuristic rules and the GA illustrate that the incorporation of the target network, the dueling network, and the experience replay buffer speeds up the solution process, improves the quality of the near-optimal scheduling solutions, and enhances the effectiveness of the algorithm. The ablation experiments validate the significant advantages of the D3QN algorithm in terms of both solution quality and convergence rate when it is compared with the DQN and DDQN algorithms on the problem $FF_s \| C_{\max}, T$. An interesting direction for future work is to consider the problem under uncertainty, such as dynamically arriving jobs and random processing times. It would also be interesting to take other objective functions into account.

Figure 1. Production flow chart of electronic control modules.

Figure 2. Implementation of the D3QN algorithm for solving the problem $FF_s \| C_{\max}, T$.

Figure 3. The flow chart of the D3QN algorithm.

Figure 4. Convergence curves of the different objectives: (a) convergence curve of the makespan; (b) convergence curve of the total tardiness.

Figure 5. The Gantt chart of the optimal schedule.

Figure 6. The distribution of the actions in the D3QN algorithm.

Figure 7. Variation trend of the reward.

Figure 8. Variation trends of the different objectives: (a) variation curve of the makespan; (b) variation curve of the total tardiness.

Table 1. Parameters used in the proposed model.
Algorithm parameters: the learning rate $\alpha$, the exploration rate $\varepsilon$, the discount factor $\gamma$, the sample batch size batch_size for updating the network, the number of steps $C$ between target network updates, and the maximum iteration number Max_episode. Initialize the dueling network $Q$ and the target network $Q^-$ by using the random dueling network parameters.

Table 3. $C_{\max}$, $T$, and the runtime obtained by Gurobi and the D3QN algorithm.

Table 3 lists the scheduling solutions of the problem instances solved by Gurobi and the D3QN algorithm.

Table 4. $C_{\max}$ and $T$ obtained by the different heuristic algorithms, the GA, and the D3QN algorithm.

Table 5. Comparison of objective function values for the different algorithms.