Deep Reinforcement Learning for Distributed Flow Shop Scheduling with Flexible Maintenance

A common situation arising in flow shops is that the job processing order must be the same on each machine; this is referred to as a permutation flow shop scheduling problem (PFSSP). Although many algorithms have been designed to solve PFSSPs, machine availability is typically ignored. Healthy machine conditions are essential for the production process because they ensure productivity and quality; thus, machine deteriorating effects and periodic preventive maintenance (PM) activities are considered in this paper. Moreover, distributed production networks, which can manufacture products quickly, are of increasing interest to factories. To this end, this paper investigates an integrated optimization of the distributed PFSSP with flexible PM. With the introduction of machine maintenance constraints in multi-factory production scheduling, the complexity and computation time of solving the problem increase substantially for large-scale instances. In order to solve it, a deep Q network (DQN)-based solution framework with a diminishing greedy rate is designed in this paper. The proposed solution framework is compared to the DQN with a fixed greedy rate, in addition to two well-known metaheuristic algorithms, namely the genetic algorithm and the iterated greedy algorithm. Numerical studies show that the proposed approach exhibits strong solution performance and generalization abilities on the studied production-maintenance joint scheduling problem. Moreover, a suitable maintenance interval is also obtained, in addition to some managerial insights.


Introduction
The permutation flow shop scheduling problem (PFSSP) has been receiving considerable attention from academia and industry as a consequence of its broad applications. Earlier studies have demonstrated it to be an NP-complete problem when the number of machines involved is more than three. Beyond that, various heuristic or meta-heuristic algorithms have been developed [1-8]. However, those studies simply assumed that all scheduling tasks are done in one factory [9]. In the contemporary context of the decentralized and globalized economy, distributed PFSSPs (DPFSSPs) in multi-factory production networks are becoming an increasingly significant means to increase productivity and achieve lower production costs and higher product quality [10]. In recent years, many models and algorithms have been proposed to solve DPFSSPs [9,11-14], in which two key decisions are considered: allocating jobs to suitable factories and scheduling operations on machines.
A common assumption in most research on DPFSSPs is that all the machines are continuously available during the whole production process [14]. Nevertheless, machine deterioration is inevitable as operating times increase, causing an ever-increasing probability of machine failures. Thus, developing a proper maintenance strategy is critical [15]. In contrast to the corrective maintenance (CM) strategy, which has to be carried out immediately after machine failures, two types of preventive maintenance (PM) strategies, time-based maintenance (TBM) and condition-based maintenance (CBM), have become more popular in common environment configurations such as single machine, flow shop and flexible job shop configurations. For TBM, two kinds of assumptions concerning flexible PM strategies appear in the literature: (I) PM must be carried out within a predetermined interval [u, v] whose duration is longer than the PM time [16-18]; (II) the interval between two PMs cannot exceed a maximum allowed continuous processing time [19,20]. For CBM, PM is not limited to a specific duration, but typically depends on the machine's age or multi-state degradation process [21-23]. In this paper, a TBM strategy is integrated into the DPFSSP to minimize the makespan of the whole distributed production system.
As a result of the complexity of the integrated optimization of multi-factory production scheduling and PM, the application of exact algorithms has limitations. The limited number of existing studies have employed only heuristic or meta-heuristic optimization algorithms to solve large-size instances within a short time. For instance, Chan et al. [24] and Chung et al. [25] studied the distributed flexible job shop scheduling problem with maintenance using improved genetic algorithms, in which the maintenance has to be carried out if the machine's age reaches a given maximum. Lei and Liu [26] proposed an improved artificial bee colony algorithm to solve a distributed unrelated parallel machines problem with flexible PM. Wang et al. [27] investigated a DPFSSP considering an event-driven policy and right-shift schedule repair, and developed a fuzzy logic-based hybrid estimation of distribution algorithm. More recently, Miyata and Nagano [28] proposed a multi-level flexible maintenance strategy in a distributed no-wait flow shop in which sequence-dependent setup times were considered as well. An iterated greedy algorithm with variable search neighborhood was designed for solving small-sized and large-sized instances with the aim of achieving a minimal makespan. Similarly, Mao et al. [14] assumed that the same types of machines have the same PM intervals in a DPFSSP, and a multi-start iterated greedy algorithm was proposed to achieve production-maintenance joint optimization. Jafar-Zanjani et al. [29] developed robust and resilient scheduling approaches in a multi-factory network with periodic maintenance and uncertain disturbances, in which the proposed model was solved by CPLEX for small and medium instances, and by a heuristic method based on the genetic algorithm for large-sized instances.
Some reinforcement learning algorithms have been applied to the scheduling field to satisfy the requirements of real-time scheduling in actual scenarios. One of them is the Q-learning (QL) algorithm. For instance, Wang and Usher [30] applied QL to address a dispatching rule selection problem on a single machine with different system objectives. After that, Wang et al. [15] applied a scheduling rules-based QL approach to jointly optimize single-machine scheduling and flexible maintenance, in which both deteriorating effects and machine failures were considered. In addition, some researchers combined QL with metaheuristics. For instance, Cheng et al. [31] proposed a multi-objective hyper-heuristic algorithm based on QL with four heuristic update operators as the action set for mixed scheduling. Lin et al. [32] used multiple heuristic update rules as QL actions for semiconductor test scheduling problems. Shahmardan et al. [33] used QL to learn the neighborhood deconstruction of simulated annealing (SA) to solve a truck scheduling problem. Long et al. [34] focused on the flexible job shop scheduling problem, using QL to learn the number of randomly updated dimensions of nectar sources to improve the neighborhood search of the artificial bee colony algorithm. More recently, Wang et al. [35] integrated QL with the well-known artificial bee colony algorithm to efficiently solve distributed three-stage assembly scheduling with maintenance.
However, in the practice of Q-learning, the dimensionality of the state space is usually large for complex large-scale scheduling problems, resulting in Q-tables that are too large for fast convergence. To address this, neural network-based reinforcement learning has been applied to production scheduling problems [36]; however, maintenance activities were not considered. A deep Q network (DQN) approach is employed in this paper, in which the system state is defined by a two-tuple consisting of the interval between the current time and the next latest start time of PM, and the number of available jobs, and the action space is a set of available jobs. In order to evaluate the solving performance of the DQN-based solution framework, several numerical studies are conducted over many instances. It is observed that the DQN-based optimization approach has greater solution potential compared to two metaheuristic approaches.
The remainder of this paper is organized as follows: Section 2 describes the proposed problem in detail; a DQN-based optimization approach is presented in Section 3; numerical studies and discussions are conducted in Section 4; conclusions are provided in Section 5, along with research limitations and future research directions.

Problem Description
The application scenario of the studied DPFSSP with PM is described as follows: there are n jobs to be processed in f factories, each of which has m machines; each job is available at time zero and its number of operations is equal to the number of machines in one factory, and all the operations of a job must be processed in the order from machine 1 to machine m according to the precedence constraint; different jobs have the same priority, and there is no priority limit between operations of different jobs; the normal processing time of each operation of each job is known in advance and is slightly different on different available machines; however, it may be extended as a consequence of machines' wear and tear, which is treated as deteriorating processing time in this paper. Inspired by [20], the deteriorated processing time of a job is expressed as a linear function of the machine's age at the start time of processing the job, where the slope is the defined deterioration rate and the intercept is the normal processing time. With increasing machine age, the probability of machine failures increases; thus, proactive maintenance strategies become increasingly important, one of which is the time-based PM strategy. Following the flexible maintenance schedule defined in our previous studies [15,37], the expected PM schedule kT (k = 1, 2, ...) with a fixed interval T can be shifted within flexible time windows, where kT − ∆1 is the earliest time at which the machine may start its PM and kT + ∆2 is the latest time at which it must finish its PM. This paper assumes that each machine is assigned periodic flexible maintenance time windows from the time it is powered on, and that PM times on different machines are identical. However, PM activities consume production time; thus, the trade-off between production and maintenance is critical. Hence, the minimal makespan is selected as the optimization objective. Some additional assumptions in this paper are stated as follows: (1) if the assigned machine is occupied, some jobs have to wait in the buffer, and the buffer space is assumed to be sufficient; (2) the setup time of machines is ignored and the transfer time between operations from one machine to another is negligible; (3) no activity during production or maintenance can be interrupted; (4) the proposed PM is perfect, i.e., it restores a deteriorating machine's age to zero.
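As a baseline for the objective described above, the following minimal Python sketch computes the makespan of a distributed permutation flow shop using the standard completion-time recursion. It deliberately ignores deterioration and PM, and the names `flow_shop_makespan`, `distributed_makespan` and the small processing-time table are hypothetical, introduced only for illustration.

```python
from typing import Dict, List, Sequence


def flow_shop_makespan(sequence: Sequence[int], proc: Dict[int, List[float]]) -> float:
    """Makespan of one factory for a given job permutation.

    proc[j][k] is the normal processing time of job j on machine k.
    Deterioration and PM are NOT modelled in this sketch.
    """
    if not sequence:
        return 0.0
    m = len(proc[sequence[0]])
    finish = [0.0] * m  # finish[k]: completion time of the latest job on machine k
    for job in sequence:
        for k in range(m):
            # The job's previous operation (on machine k-1) must be finished first.
            ready = finish[k - 1] if k > 0 else 0.0
            finish[k] = max(finish[k], ready) + proc[job][k]
    return finish[-1]


def distributed_makespan(assignment: Sequence[Sequence[int]],
                         proc: Dict[int, List[float]]) -> float:
    """Makespan of the whole system: the largest factory makespan."""
    return max(flow_shop_makespan(seq, proc) for seq in assignment)


# Hypothetical two-job, two-machine data (not taken from Table 1).
proc = {1: [3, 2], 2: [1, 4]}
print(flow_shop_makespan([1, 2], proc))        # 9.0
print(flow_shop_makespan([2, 1], proc))        # 7.0
print(distributed_makespan([[1], [2]], proc))  # 5.0
```

The last line illustrates why distributing jobs can shorten the makespan: each factory handles one job, so the system finishes at time 5 rather than 7.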
In a similar manner to the study by Mao et al. [14], the following illustrative case is provided to evaluate the importance of the proposed DPFSSP with PM: there is a set of ten jobs to be assigned to two factories, each of which consists of three machines. The normal processing times of the jobs are shown in Table 1, in which FiMj represents machine j in factory i, and the deteriorating rate is set to 0.1. As for the maintenance activities, T, ∆1 and ∆2 are set to 30, 3 and 5, respectively, and the PM time is equal to 4.
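The flexible maintenance windows implied by these settings can be enumerated with a short Python sketch. It assumes, as the problem description states, that the windows are counted from the time the machine is powered on; the function name `pm_windows` and the power-on time of 8 used in the example are illustrative choices.

```python
from typing import List, Tuple


def pm_windows(power_on: float, T: float, delta1: float, delta2: float,
               count: int) -> List[Tuple[float, float]]:
    """First `count` flexible PM windows [kT - delta1, kT + delta2], k = 1, 2, ...,
    shifted by the time at which the machine is powered on."""
    return [(power_on + k * T - delta1, power_on + k * T + delta2)
            for k in range(1, count + 1)]


# T = 30, delta1 = 3, delta2 = 5 as in the illustrative case.
print(pm_windows(0, 30, 3, 5, 3))  # [(27, 35), (57, 65), (87, 95)]
print(pm_windows(8, 30, 3, 5, 3))  # [(35, 43), (65, 73), (95, 103)]
```

Each window is 8 time units long while the PM itself takes 4, so the PM start can slide by up to 4 time units inside a window.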
Firstly, we refer to the mixed integer linear programming (MILP) model presented by [38] and use the software CPLEX to solve for an optimal solution to the example without considering deteriorating effects and PM activities. As shown in Figure 1, Job 10, Job 1, Job 3, Job 6 and Job 9 are sequentially assigned to Factory 1, and Factory 2 needs to manufacture Job 4, Job 7, Job 8, Job 5 and Job 2 in sequence. In this allocation, the maximum completion time for each factory is 83, which also implies an optimal makespan of 83.
Next, deteriorating effects and PM are considered in the DPFSSP; however, we do not change the optimal schedule of the original DPFSSP and only insert the proposed flexible PM constraints into it while accounting for deteriorating effects. As shown in Figure 2, the processing time of a job is extended whenever the machine's age at the start time of processing the job is not zero. For instance, the deteriorating processing time of Job 1 on F1M1 can be calculated as 10 + 0.1 × 8 = 10.8, in which 10 is the normal processing time, 0.1 denotes the deteriorating rate, and 8 is the machine's age immediately prior to Job 1 on F1M1. It is also observed that the insertion of PM activities disrupts the initially tight schedule. There is a lot of idle time resulting from machine unavailability, which causes a longer makespan. Specifically, take F1M2 as an example to illustrate this phenomenon. The first step is the initialization of the flexible maintenance time windows on Machine 2 of Factory 1. Since the first job, i.e., Job 10 on F1M2, is processed immediately after its previous operation on Machine 1 of Factory 1, which is finished at time 8, and T, ∆1 and ∆2 are set to 30, 3 and 5, respectively, the set of time windows of F1M2 is {[35, 43], [65, 73], [95, 103], ...}. The next step is to determine whether the insertion of each job conflicts with the PM, and to update its actual start time and completion time accordingly. Job 10 is finished at time 22 with a normal processing time of 14. Then, the starting time of Job 1 on F1M2 is 22, which is the larger of the completion time of the last job on Machine 2 of Factory 1 and the completion time of the job's previous operation on Machine 1 of Factory 1. As a result of the deterioration effect, the deteriorating processing time of Job 1 on F1M2 is equal to 12.4 and its completion time is 34.4. Obviously, processing the first two jobs does not conflict with the first flexible time window, i.e., [35, 43]. However, the next job, Job 3, cannot be processed immediately after Job 1, since the insertion of Job 3 would make the completion time greater than the difference between the upper bound of the maintenance time window and the PM time, i.e., it conflicts with the time window, and more importantly, the previous operation of Job 3 has not yet been completed. Therefore, after the processing of Job 1 on F1M2, the machine is idle for 0.6 time units and PM is performed at the lower bound of the time window, i.e., it starts at time 35 and is finished at time 39. The machine then remains idle until time 43, when the first operation of Job 3 is finished on Machine 1 of Factory 1 and the second operation of Job 3 is started on Machine 2 of Factory 1. Similarly, the scheduling of subsequent jobs on F1M2 generates more idle time as a result of the maintenance time window constraints and the completion time constraints of the corresponding previous operations. The same rule is applied to all the machines and a makespan of 123 is obtained.
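The arithmetic of this walk-through can be checked with a small Python sketch. The function names are hypothetical; the normal processing time of 11 for Job 1 on F1M2 is inferred from the stated result (12.4 with a machine age of 14 and a rate of 0.1), and the duration used for Job 3 is a placeholder because Table 1 is not reproduced here.

```python
def deteriorated_time(normal_time: float, age: float, rate: float) -> float:
    """Linear deterioration: normal time plus rate times the machine's age at start."""
    return normal_time + rate * age


def conflicts_with_pm(start: float, duration: float,
                      window_upper: float, pm_time: float) -> bool:
    """A job conflicts with the upcoming PM if it would finish later than the
    latest possible PM start, i.e. the window's upper bound minus the PM time."""
    return start + duration > window_upper - pm_time


# Job 1 on F1M1: normal time 10, machine age 8, rate 0.1 (values from the text).
print(f"{deteriorated_time(10, 8, 0.1):.1f}")  # 10.8

# Job 1 on F1M2: starts at 22 with machine age 14; normal time 11 is inferred.
p = deteriorated_time(11, 14, 0.1)
print(f"{p:.1f}", f"{22 + p:.1f}")             # 12.4 34.4
print(conflicts_with_pm(22, p, 43, 4))         # False: 34.4 <= 39

# Job 3 on F1M2 cannot start before 43 (its previous operation ends then),
# so any positive duration (10 is a placeholder) violates the window [35, 43].
print(conflicts_with_pm(43, 10, 43, 4))        # True
```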
As the problem scales up, it becomes difficult to obtain an optimal (or even a feasible) solution of the above MILP model for the original problem within a reasonable time using CPLEX; solving the integrated optimization of the flexible PM and the DPFSSP with deteriorating effects using traditional mathematical programming methods is even harder. To this end, a deep reinforcement learning approach is developed in the next section to address this issue. With this model-free learning approach, integrated optimization results can be obtained in shorter times, and the best one found within a limited number of learning episodes is provided in Figure 3. The integrated optimization result consists of a job sequence {Job 10, Job 2, Job 1, Job 7, Job 9} with a maximum completion time of 95.9 in Factory 1, and a job sequence {Job 4, Job 3, Job 8, Job 5, Job 6} with a maximum completion time of 95 in Factory 2. Due to the flexibility of the periodic maintenance times, there is less idle time and there are fewer PM executions. The makespan is equal to 95.9, which is lower than the makespan of 123 obtained for the solution in Figure 2.
Among the various reinforcement learning (RL) based algorithms, Q-learning is a typical model-free RL algorithm that interacts directly with the environment using a trial-and-error approach. In order to find a balance between exploration (of uncharted territory) and exploitation (of current knowledge), the ε-greedy method is commonly used. Specifically, a random variable τ ∈ (0, 1) is generated and compared with the predefined value ε; if τ < ε, then the action corresponding to the best Q-value will be selected, otherwise an action a will be selected randomly from the action set at state s. The detailed procedure of the general QL algorithm is shown as Algorithm 1. Each episode is treated as one learning process. In each learning process, an action a is selected by the agent located in state s, and then a new state s′ and an immediate reward are received, followed by updating the Q-value for the state-action pair (s, a). Through continuous learning, better Q-values will be obtained along with an optimal action-selection policy for the agent [15].

Algorithm 1
Initialize Q(s, a) arbitrarily for all state-action pairs
Repeat (for each episode):
  Initialize a state s
  Repeat (for each episode step):
    Choose an action a from the state s using the ε-greedy policy derived from Q
    Take action a, and observe the new state s′ and reward r(s, a)

As the size of the problem increases, a greater number of states and actions need to be stored, i.e., the Q-table becomes increasingly large, making it difficult to obtain a convergent result in a limited learning time. Fortunately, the introduction of a deep Q network (DQN) can solve this problem by taking states as inputs to a neural network [39,40] and generating the Q-function value of each state-action pair from the network, without taking up storage space for a table. This approach can deal with complex decision-making involving large and continuous state spaces. Specifically, two main improvements are included in the DQN. The first is an experience replay memory D with a capacity of N_ that stores the transition (φ_t, a_t, r_t, φ_{t+1}) at each time-step t. The parameter update depends on a minibatch of transitions randomly selected from the replay memory, which breaks the correlation between experiences and makes the neural network update more efficient. An old experience is not replaced by a new one until the replay memory reaches its capacity N_. The second is the use of a target action-value function Q̂, which is used to calculate the target values and the loss function with which the parameter θ is updated. In addition, Q̂ is reset to the action-value function Q every C steps, which can improve the stability of the algorithm. The detailed procedure of the DQN is shown as Algorithm 2 [36].
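The two DQN ingredients described above, the replay memory and the target values computed from the target action-value function, can be sketched in plain Python without a neural network library. This is a minimal illustration under stated assumptions: the `done` flag in each transition, the class name `ReplayMemory`, the helper `td_targets` and the callable `q_target` (which stands in for the target network) are all hypothetical and are not taken from Algorithm 2.

```python
import random
from collections import deque
from typing import Callable, Deque, List, Sequence, Tuple

# (state, action, reward, next_state, done); the `done` flag is an added assumption.
Transition = Tuple[tuple, int, float, tuple, bool]


class ReplayMemory:
    """Fixed-capacity memory: once full, the oldest transition is dropped."""

    def __init__(self, capacity: int) -> None:
        self.buffer: Deque[Transition] = deque(maxlen=capacity)

    def push(self, transition: Transition) -> None:
        self.buffer.append(transition)

    def sample(self, batch_size: int) -> List[Transition]:
        # Uniform random minibatch, which breaks the temporal correlation.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self) -> int:
        return len(self.buffer)


def td_targets(batch: Sequence[Transition],
               q_target: Callable[[tuple], Sequence[float]],
               gamma: float) -> List[float]:
    """Target y = r for terminal transitions, else r + gamma * max_a Q_target(s', a)."""
    targets = []
    for _, _, reward, next_state, done in batch:
        bootstrap = 0.0 if done else gamma * max(q_target(next_state))
        targets.append(reward + bootstrap)
    return targets


memory = ReplayMemory(capacity=3)
for i in range(5):
    memory.push(((i,), 0, 1.0, (i + 1,), i == 4))
print(len(memory))  # 3: only the three most recent transitions are kept
```

In a full implementation, the online network would be fitted to these targets on each sampled minibatch, and the target network would be overwritten with the online network's parameters every C steps.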
To the best of our knowledge, little research has been undertaken thus far to apply this DQN approach to the integrated optimization of production scheduling and preventive maintenance.To this end, this paper employs the DQN optimization approach to solve the DPFSSP considering deteriorating effects and flexible maintenance activities.The detailed definitions of states, actions and rewards are provided in Section 3.2.

Definition of Key Elements
In the studied DPFSSP with flexible PM, each machine is faced with the judgement of whether production and maintenance are in conflict with each other. In order to portray this feature, the array formed by the difference between the current completion time of the job on each machine and the upper bound of the upcoming flexible maintenance time window is treated as the system state. In order to prevent the input of the DQN from varying within a wide range, we map the above differences to a number of small intervals. In this difference-based definition approach, it is possible to traverse the defined states several times in a single learning session due to the periodicity of maintenance, which can improve the learning efficiency of the DQN.
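One possible way to realize this difference-based state is sketched below in Python. The interval width, the number of intervals and the clamping of out-of-range values are assumptions made for illustration; the paper does not specify them, and `state_features` is a hypothetical name.

```python
from typing import List, Sequence


def state_features(completion_times: Sequence[float],
                   window_uppers: Sequence[float],
                   bin_width: float, num_bins: int) -> List[int]:
    """Map, for each machine, the gap between the upper bound of its upcoming
    PM window and its current completion time to a small integer interval index."""
    features = []
    for completion, upper in zip(completion_times, window_uppers):
        gap = upper - completion          # time left before the window closes
        index = int(gap // bin_width)     # which interval the gap falls into
        features.append(min(max(index, 0), num_bins - 1))  # clamp to valid range
    return features


# Hypothetical values: two machines, intervals of width 5, ten intervals.
print(state_features([34.4, 12.0], [43, 35], bin_width=5, num_bins=10))  # [1, 4]
```

Because the windows repeat with period T, the same interval indices recur across maintenance cycles, which is the property the text relies on for revisiting states within one learning session.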
As a result of the property of the permutation flow shop, assigning a job to a factory is an action in this paper. This means that the capacity of the action space is the product of the number of factories and the number of jobs. Compared to the case where scheduling rules are used as actions [15,36,37], our solution space can be more diverse, and the combination of Q-learning and neural networks can guarantee the solving speed. Moreover, the action space is constantly changing, as an action can only be selected from those corresponding to jobs that have not yet been assigned. Once an action is selected and a job is thereby assigned to a factory, the actions that assign that job to other factories are removed from the action space available for this learning round.
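The shrinking action space can be sketched in Python as a set of (job, factory) pairs. The representation and the function names `build_action_space` and `apply_action` are illustrative assumptions; only the sizes and the removal rule come from the text.

```python
from typing import Set, Tuple

Action = Tuple[int, int]  # (job, factory)


def build_action_space(num_jobs: int, num_factories: int) -> Set[Action]:
    """All job-to-factory assignments: num_jobs * num_factories actions."""
    return {(job, factory)
            for job in range(1, num_jobs + 1)
            for factory in range(1, num_factories + 1)}


def apply_action(available: Set[Action], action: Action) -> Set[Action]:
    """After assigning a job, drop every action that refers to that job,
    including its assignments to the other factories."""
    job, _ = action
    return {a for a in available if a[0] != job}


actions = build_action_space(num_jobs=10, num_factories=2)
print(len(actions))                    # 20
actions = apply_action(actions, (3, 1))
print(len(actions))                    # 18: both actions for Job 3 are gone
```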
Since the objective of the proposed problem is to minimize the makespan of all the factories, it is critical to maintain a balance of maximum completion times across factories in the selection of actions. To this end, the change in the variance of the factories' maximum completion times before and after the execution of an action is used to determine the immediate reward. If the variance becomes smaller, i.e., the loads of the factories become more similar, a positive immediate reward is obtained, which is calculated as the reciprocal of the current system maximum completion time. This implies that a smaller production cycle corresponds to a larger positive reward. Conversely, a larger variance reflects that the chosen action does not balance the loads of the factories, and a negative reward is received, calculated as the difference between the maximum completion time before and after the action is executed. Additionally, a final reward is received after each learning process, reflecting the difference between the makespan obj* in the current learning episode and the historically optimal makespan obj_min. A smaller difference means a bigger final reward. The detailed procedure to achieve a round of interaction between the agent and the environment is provided in Algorithm 3.
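A minimal Python sketch of the immediate reward follows. It takes the positive reward to be 1 divided by the current maximum completion time, which is an interpretation of the text, uses the population variance, and treats an unchanged variance like an increased one; all three choices, as well as the name `immediate_reward`, are assumptions. The final reward is omitted because its exact form is not given here.

```python
from statistics import pvariance
from typing import Sequence


def immediate_reward(before: Sequence[float], after: Sequence[float]) -> float:
    """Reward for one action, given each factory's maximum completion time
    before and after the action is executed."""
    if pvariance(after) < pvariance(before):
        # Loads became more balanced: positive reward, larger for shorter schedules.
        return 1.0 / max(after)
    # Loads became less balanced (or unchanged): non-positive reward equal to
    # the maximum completion time before minus the one after.
    return max(before) - max(after)


# Hypothetical completion times for two factories.
print(immediate_reward([40, 30], [40, 38]))  # 0.025
print(immediate_reward([40, 30], [52, 30]))  # -12
```

Note that the two branches live on very different scales (fractions versus whole time units), so in practice some rescaling may be needed; the source does not say whether any is applied.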

Algorithm 3 Interaction Process between the Agent and the Environment at Each Decision Point t
Reserve the completion time list [C 11

Overall Algorithm Framework
The training method is based on the framework of DQN, which consists of one input layer and one output layer. The numbers of nodes in the input and output layers are equal to the numbers of states and available actions, respectively. In the training process, the decision point t is defined as every time an action a_t is about to be selected, followed by the execution of Algorithm 3. However, this algorithm does not cover the scenario where t = 1, which means that t ≥ 2. This is because the first action is selected without a prior sequence of jobs for comparison and the state of the system has not yet been initialized, for which we design Algorithm 4. To some extent, the selection of the first action is crucial to the quality of the solution. Completely random selection, although it guarantees the diversity of solutions, does not take advantage of the first action in the historical optimal solution. In our algorithm, if the makespan learned in a certain round is better than the minimum of historical learning, the first action of this round will be inherited for the next generation of learning; otherwise, the greedy strategy of random exploration is used again. The procedure for initializing the flexible maintenance time window list for each machine is shown in Algorithm 5.
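The inheritance rule for the first action can be sketched in Python as follows. This is only an illustration of the rule stated above: the function name `choose_first_jobs`, its parameters, and the choice to draw distinct random jobs (one per factory) are assumptions rather than details of Algorithm 4.

```python
import random
from typing import Iterable, List, Optional, Sequence


def choose_first_jobs(prev_first_jobs: Optional[Sequence[int]],
                      prev_makespan: float, best_makespan: float,
                      candidate_jobs: Iterable[int], num_factories: int,
                      rng=random) -> List[int]:
    """Return the first job of each factory for the next learning round."""
    if prev_first_jobs is not None and prev_makespan < best_makespan:
        # The previous round beat the historical best: inherit its first jobs.
        return list(prev_first_jobs)
    # Otherwise explore: draw a distinct random first job for each factory.
    return rng.sample(list(candidate_jobs), num_factories)


# Hypothetical usage: the previous round (makespan 95.9) improved on 123.
print(choose_first_jobs([10, 4], 95.9, 123.0, range(1, 11), 2))  # [10, 4]
```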

Algorithm 4
Input (Optional): The Job List which includes the first job of each factory in the previous solution
Initialize the list of machines in all the factories, i.e., ML =
  If the machine index is equal to 1 then
    Let the starting time of the machine be 0 and initialize the time window list using Algorithm 5
    If the input is empty, i.e., the previous solution is worse than the historical optimal solution, then
      Select the first job randomly from the optional job list for the machine
    Else
      Select the job of the corresponding machine in the entered Job List
    End If
  Else
    Update the current machine and machine index
    Calculate the starting time of the machine, i.e., the completion time of the previous machine
    Initialize the time window list using Algorithm 5
    Identify and reserve the selected operation which depends on the last operation of the previous machine
  End If
  Update the available action space, remaining operation number
  Update the lists of the starting time, processing time, machine's age, completion time
  Initialize the system state referring to the difference calculation in Algorithm 3
  Remove the selected operation from job lists for all the factories
End For
Reserve the maximum completion time for all the factories
Output: System state features, available action space

Regarding the selection of subsequent actions based on the observed state features, an improved version of the ε-greedy policy presented in Algorithm 1 is implemented. Specifically, an initial period of learning experience is explored using a completely random strategy, followed immediately by a linear reduction in the greedy rate to ensure that the subsequent learning process relies more on prior experience than on randomness. Due to the setting of flexible maintenance time windows and precedence constraints between operations, the selected action may not be performed immediately. Specifically, if the direct execution of an action conflicts with the flexible maintenance constraint or the completion time of the last operation, the implementation of the action will be postponed. The detailed judgement and coordination of the production actions and the implementation of PM are given in Algorithm 6. The overall framework is presented in Algorithm 7.
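The diminishing greedy rate can be illustrated with the Python sketch below. It treats the greedy rate as the probability of choosing a random action, an interpretation consistent with the stated aim of reducing randomness over time; the warm-up length, decay length, floor value and the function names are hypothetical parameters, not settings from Table 3.

```python
import random
from typing import Dict, Hashable


def greedy_rate(episode: int, warmup: int, decay_episodes: int,
                rate_min: float = 0.0) -> float:
    """Rate 1.0 (fully random) during warm-up, then linear decay to rate_min."""
    if episode < warmup:
        return 1.0
    progress = min(1.0, (episode - warmup) / decay_episodes)
    return 1.0 + progress * (rate_min - 1.0)


def select_action(q_values: Dict[Hashable, float], rate: float, rng=random):
    """Pick a random available action with probability `rate`, else the best one."""
    actions = list(q_values)
    if rng.random() < rate:
        return rng.choice(actions)
    return max(actions, key=q_values.get)


print(greedy_rate(0, warmup=100, decay_episodes=400))    # 1.0
print(greedy_rate(300, warmup=100, decay_episodes=400))  # 0.5
# Hypothetical Q-values keyed by (job, factory) actions.
print(select_action({(1, 1): 0.2, (2, 1): 0.7}, rate=0.0))  # (2, 1)
```

A fixed-rate variant, such as the DQN with a greedy rate of 0.1 used as a baseline, corresponds to passing a constant `rate` to `select_action`.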

Algorithm 6 Judgement of the Start Time of Actions
Input: Selected action
Decode the selected action, i.e., which job is to be assigned to which factory
For machineIndex =

Numerical Experiments
In this section, parameter settings are provided first, followed by the description of several competing algorithms and the analysis of comparative experiments. Last but not least, different maintenance cycle settings are analyzed to derive a better makespan. All algorithms are coded in Python 3.6 and run on a PC with an Intel(R) Core(TM) i7-8700 CPU (3.20 GHz, 16.00 GB RAM).

Parameter Settings
To the best of our knowledge, there is no existing instance of the proposed problem; thus, some benchmarks are generated based on the parameters in Table 2, where the normal processing time of a job on each machine follows a uniform distribution from 5 to 20 and is slightly different for different factories. We refer to the research of Mao et al. [14] to set the variation range of the maintenance interval T from 50 to 150. Unlike them, this paper assumes that T is the same for each machine and uses simulation experiments to explore the T that yields the best makespan. The time window parameters ∆1 and ∆2 and the length of maintenance are fixed as 3, 5 and 4, respectively, based on our previous research [37]. In addition, the parameters related to the deep reinforcement learning are presented in Table 3.

Performance Evaluation of the Developed Algorithm
In this section, the maintenance cycle is fixed as 50, and 30 scenarios with different numbers of factories, machines and jobs are designed to evaluate the performance of the proposed DQN with diminishing greedy rate (DQND) by comparing it with three optimization algorithms: a state-of-the-art genetic algorithm (GA) [41], the iterated greedy algorithm (IGA) [14] and the DQN with a fixed greedy rate of 0.1 (DQNF). The termination condition of all the algorithms is a maximum elapsed CPU time which is estimated by the formula t = C_ × m × n [14], in which C_ is set to 200 for the comparison experiments. The detailed experimental comparison results are shown in Table 4, in which the symbols 'mean' and 'std', respectively, denote the means and standard deviations of 20 trials for each algorithm. In order to show these results more clearly, a one-tailed t-test with 38 degrees of freedom at a 0.05 level of significance is employed to conduct statistical analysis. The comparative results are shown as '+', '−' or '~', respectively, when the proposed DQND is significantly better than, significantly worse than, or statistically equivalent to its competitors. The proposed DQND performs better than the others within limited time resources in almost all the scenarios, particularly in complicated scenarios. Besides, the advantage of distributed manufacturing is also confirmed in the experiments, i.e., the greater the number of factories, the smaller the makespan for the same numbers of jobs and machines.
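The statistical comparison can be sketched with the Python standard library. With 20 trials per algorithm, a pooled-variance two-sample t-test has 20 + 20 − 2 = 38 degrees of freedom, which matches the text. The way the three symbols are assigned (two one-tailed tests against the tabulated critical value of about 1.686), the function names and the sample data are assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence

# Tabulated one-tailed critical value of Student's t at 0.05 with 38 degrees of freedom.
T_CRIT = 1.686


def pooled_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample t statistic with pooled variance (df = len(a) + len(b) - 2)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))


def compare(proposed: Sequence[float], competitor: Sequence[float],
            t_crit: float = T_CRIT) -> str:
    """'+' if the proposed algorithm has a significantly smaller makespan,
    '-' if significantly larger, '~' otherwise."""
    t = pooled_t_statistic(competitor, proposed)  # positive when proposed is smaller
    if t > t_crit:
        return "+"
    if t < -t_crit:
        return "-"
    return "~"


# Hypothetical makespans from 20 trials each (not taken from Table 4).
dqnd = [100 + (i % 5) for i in range(20)]
ga = [110 + (i % 5) for i in range(20)]
print(compare(dqnd, ga))    # +
print(compare(dqnd, dqnd))  # ~
```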
Some interesting managerial insights are also obtained by varying T under a specific environment configuration of F = 2, M = 3 and J = 40. As shown in Table 5, the objective function does not vary strictly monotonically with increasing T, but there is still a gradual deterioration, which implies that too large a T amplifies the deterioration effect on the jobs within a maintenance cycle. By comparing all the extreme value points, the optimal maintenance cycle can be identified.

Conclusions
Distributed manufacturing has attracted a lot of attention from scholars and practitioners in recent years; however, it is rarely studied in conjunction with preventive maintenance. This paper investigated a joint production-maintenance problem in the context of a distributed permutation flow shop. A DQN-based solution framework was applied to minimize the makespan of the proposed problem. The performance of the developed solution framework was validated by comparing it against three other optimization approaches within the same solving time. Some managerial implications regarding the determination of the maintenance interval were also provided for practitioners.
In the near future, we will continue focusing on distributed manufacturing with deterioration effects and preventive maintenance. Firstly, other distributed manufacturing scenarios can be addressed, for instance, distributed flexible job shops. Secondly, dynamic factors such as new job arrivals and stochastic failures can be considered. Last but not least, the single-objective optimization should be extended to multiple objectives, and efficient multi-objective optimization algorithms should be studied in depth.

Figure 1. Gantt chart for an optimal solution of the DPFSSP without deteriorating effects and PM.

Figure 2. Gantt chart after considering deteriorating effects and PM constraints in the optimal DPFSSP solution.

Figure 3. Gantt chart for an optimal solution of the integrated scheduling of the DPFSSP and the PM.

3. Solution Approach Design
3.1. Background of General Q-Learning and Deep Q Networks

Algorithm 5. Initialization of Flexible Maintenance Time Windows
timeWindowList = [ [ ] for machine in ML ]
Input: Machine type (e.g., F1M1), machine's starting time st
For index = 1: N + 2 do
    Append [st + index * T − ∆_1, st + index * T + ∆_2] in the corresponding list in the timeWindowList
End For
Output: timeWindowList
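The window initialization in Algorithm 5 can be sketched in Python as follows. This is a minimal illustration, assuming the loop range 1: N + 2 is inclusive and that every machine has its own starting time; the function name `init_time_windows`, the dictionary layout and the numeric values in the usage line are hypothetical.

```python
def init_time_windows(machines, start_times, T, delta1, delta2, n):
    """Build flexible PM time windows for each machine.

    machines    : list of machine labels, e.g. ["F1M1", "F1M2"]
    start_times : dict mapping machine label -> starting time st
    T           : maintenance cycle
    delta1/2    : allowed deviation before / after each nominal PM time
    n           : the quantity N of Algorithm 5 (n + 2 windows are created)
    """
    time_windows = {machine: [] for machine in machines}
    for machine in machines:
        st = start_times[machine]
        # index runs over 1, 2, ..., n + 2 (inclusive range assumed).
        for index in range(1, n + 3):
            lower = st + index * T - delta1
            upper = st + index * T + delta2
            time_windows[machine].append((lower, upper))
    return time_windows

# Usage with illustrative values: T = 50, tolerance of 5 on each side, N = 1.
windows = init_time_windows(["F1M1"], {"F1M1": 0}, T=50, delta1=5, delta2=5, n=1)
print(windows)
```

Each window is stored as a (lower bound, upper bound) pair, so a PM activity may start anywhere between the two bounds around the nominal time st + index * T.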

Table 1. Normal processing times of the jobs.

Algorithm 2. Deep Q-Learning Algorithm with Experience Replay
Initialize replay memory D to capacity N_
Initialize action-value function Q with random weights θ
Initialize target action-value function Q̂ with weights θ− = θ
For episode = 1: M_ do
    Initialize sequence s_1 = {x_1} and preprocessed sequence φ_1 = φ(s_1)
    For t = 1: T_ do
        With probability ε select a random action a_t, otherwise select a_t = argmax_a Q(φ(s_t), a; θ)
        Execute action a_t in emulator and observe reward r_t and image x_{t+1}
        Set s_{t+1} = s_t, a_t, x_{t+1} and preprocess φ_{t+1} = φ(s_{t+1})
        Store transition (φ_t, a_t, r_t, φ_{t+1}) in D
        Sample random minibatch of transitions (φ_j, a_j, r_j, φ_{j+1}) from D
        Set y_j = r_j if episode terminates at step j + 1; otherwise y_j = r_j + γ max_{a'} Q̂(φ_{j+1}, a'; θ−)
        Perform a gradient descent step on (y_j − Q(φ_j, a_j; θ))^2
        [...]

[...] immediately prior to a_t and corresponding maximum completion time list [maxC_{1m}(t − 1), maxC_{2m}(t − 1), ...]
Update the completion time list [C_{11}(t), ..., C_{1m}(t), C_{21}(t), ..., C_{2m}(t), ...] immediately after a_t and corresponding maximum completion time list [maxC_{1m}(t), maxC_{2m}(t), ..., maxC_{fm}(t)]
Find the latest maintenance start time in upcoming time windows on the above difference list
Remove all the actions that relate to the assigned job in a_t from the available action space
If the system state is terminated, i.e., there are no more jobs that can be assigned to any factory, then
    Calculate the final reward, i.e., (obj_min − obj*)/obj_min
Else
    Calculate the variance of [maxC_{1m}(t − 1), ...] and [maxC_{1m}(t), ...] as ϑ_1 and ϑ_2, respectively
    If ϑ_1 ≤ ϑ_2 then receive a negative immediate reward, i.e., max[maxC_{1m} [...]

[...] 1: M do
    If the machine index is equal to 1 then
        The earliest start time of the action is initialized as the completion time of the last job on this machine
    Else
        The earliest start time depends on the maximum completion time of the previous job as well as operation
    End If
    Update the available action space, remaining operation number
    Calculate the actual processing time and remove the selected operation from job lists for all the factories
    If the completion time of the previous job does not conflict with the adjacent time window and the insertion of the action still does not conflict with the time window constraint then
        The starting time of the action is equal to the earliest start time of the action
    Else if the PM is performed immediately after the previous job then
        The starting time of the action is equal to the maximum value of the action's earliest start time and the completion time of PM
    Else if the completion time of the previous job does not conflict with the adjacent time window but the insertion of the action conflicts with the time window constraint then
        The machine keeps idle until the lower bound of the time window and PM is performed
        Calculate the starting time of the action by evaluating the maximum value of the action's earliest start time and the completion time of PM
    End If
    Update the lists of the starting time, processing time, machine's age, completion time
    Update the system state referring to the difference calculation in Algorithm 3

[...] Learning number, learning rate α, greedy rate and its rate of change, capacity N_ of replay memory, minibatch size, discount factor γ, iteration interval C of updating the target network Q̂
For episode = 1: Learning number do
    Select the first action using Algorithm 4 to initialize the system state features
    While True:
        Choose an action based on the observed state features using Algorithm 2
        Judge the actual start time of the chosen action using Algorithm 6
        Update new state features and calculate immediate reward using Algorithm 3
        Store the transition process in D using Algorithm 2
        If there are no more jobs that can be assigned to any factory then calculate the final reward using Algorithm 3
        [...]
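Two ingredients of the procedure above, action selection with a diminishing greedy rate and the target value y_j, can be sketched in Python. This is a minimal illustration under stated assumptions: the greedy rate is read as the exploration probability ε of Algorithm 2, the multiplicative decay schedule and its floor are hypothetical choices (the source only states that the greedy rate has a rate of change), and Q values are held in a plain dictionary instead of a neural network.

```python
import random

def diminishing_epsilon(eps0, decay, eps_min, episode):
    """Exploration probability after `episode` episodes (hypothetical schedule)."""
    return max(eps_min, eps0 * decay ** episode)

def select_action(q_values, available, eps, rng=random):
    """With probability eps pick a random available action, otherwise the greedy one.

    q_values  : dict mapping action -> estimated Q value for the current state
    available : list of actions still in the available action space
    """
    if rng.random() < eps:
        return rng.choice(available)
    return max(available, key=lambda a: q_values[a])

def td_target(reward, next_q_target, gamma, terminal):
    """Target y_j: r_j at a terminal step, else r_j + gamma * max over target-network Q."""
    if terminal or not next_q_target:
        return reward
    return reward + gamma * max(next_q_target)

# Usage with illustrative numbers.
eps = diminishing_epsilon(eps0=1.0, decay=0.5, eps_min=0.1, episode=2)
action = select_action({0: 1.0, 1: 3.0, 2: 2.0}, available=[0, 2], eps=0.0)
y = td_target(reward=1.0, next_q_target=[0.5, 2.0], gamma=0.9, terminal=False)
print(eps, action, y)
```

Restricting the maximization to the available actions mirrors the step that removes the actions of an already assigned job from the action space; a fixed greedy rate corresponds to holding eps constant across episodes.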

Table 5. Comparison of four approaches to the makespan objective when F = 2, M = 3 and J = 40.