A Q-Learning Rescheduling Approach to the Flexible Job Shop Problem Combining Energy and Productivity Objectives

Abstract: The flexible job shop problem (FJSP) has been studied in recent decades due to its dynamic and uncertain nature. Responding to a system's perturbation in an intelligent way and with minimum energy consumption variation is an important matter. Thanks to the development of artificial intelligence and machine learning, many researchers are using these new techniques to solve the rescheduling problem in a flexible job shop. Reinforcement learning, a popular approach in artificial intelligence, is often used in rescheduling. This article presents a Q-learning rescheduling approach to the flexible job shop problem combining energy and productivity objectives in a context of machine failure. First, a genetic algorithm is adopted to generate the initial predictive schedule; then, rescheduling strategies are developed to handle machine failures. As the system should be capable of reacting quickly to unexpected events, a multi-objective Q-learning algorithm is proposed and trained to select the optimal rescheduling methods that minimize the makespan and the energy consumption variation at the same time. The approach is evaluated on benchmark instances.


Introduction
Energy consumption control is a growing concern in all industrial sectors. Controlling energy consumption and realizing energy savings are goals of many manufacturing enterprises. Therefore, the scheduling of a manufacturing production system must now take into account aspects relating to sustainability and energy management [1]. To implement such measures, researchers have focused on developing more energy-efficient scheduling approaches that strike a balance between energy consumption and system stability. In addition, manufacturing systems constitute dynamic environments in which several perturbations can arise. Such disturbances have negative impacts on energy consumption and system robustness and make the scheduling process much more difficult. In the literature, many researchers solve the job shop problem (JSP) under different types of perturbations using metaheuristic approaches such as genetic algorithms [2] or particle swarm optimization [3]. Other researchers use rescheduling approaches, such as dispatching rules, that repair the initial disrupted schedule.
Recently, many researchers have designed reactive, dynamic, and robust rescheduling approaches using artificial intelligence. These learning-based approaches acquire knowledge of the manufacturing system to be used in the decision-making process. In this case, the rescheduling can adapt to the system's disruptions at any time. Research on reducing energy consumption in job shops has focused on energy consumption optimization in the predictive phase, when building the initial schedule. The main contribution of this article is, first, to develop a new approach where energy consumption reduction is taken into account in both the predictive and reactive phases. Second, the developed approach integrates a multi-objective machine learning algorithm to react more quickly in case of disruptions, i.e., to select the best rescheduling method rapidly. In the predictive phase, a genetic algorithm builds the initial schedule, taking into consideration both energy consumption and completion time optimization. Then, to obtain a responsive and energy-efficient production system, a multi-objective Q-learning algorithm was developed. This algorithm selects, in real time and depending on energy availability, the best rescheduling strategy that minimizes both the completion time and the energy consumption.
The remainder of this article is organized as follows: the next section provides a literature review on energy-aware scheduling and rescheduling methods, as well as rescheduling approaches using artificial intelligence techniques. Section 3 contains the FJSP problem formulation and the description of rescheduling methods. The Q-learning algorithm and selection of the optimal rescheduling approach are described in Section 4. The experiments and the evaluation of the approach on FJSP benchmarks are presented in Section 5. Finally, a conclusion and some future directions are provided.

Related Works
This section is divided into two parts. The first part presents some of the recent energy-efficient methods for scheduling and rescheduling in manufacturing systems. The second part focuses on rescheduling methods using artificial intelligence (AI) techniques. A discussion then analyzes the related works and highlights their limits.

Energy-Efficient Scheduling
The approaches that can be found in the literature are very often related to job shops or flexible job shops. The next subsections present a short overview of both problems.

Job Shop Energy-Efficient Scheduling
One of the most studied production scheduling problems in the literature is the job shop scheduling problem (JSSP), in which jobs are assigned to resources at particular times. In recent years, due to rising energy costs and environmental concerns, researchers have started working on energy-efficient scheduling problems as a main feature of the JSSP. Two integer programming models were, for example, used in [4], namely a disjunctive and a time-indexed formulation, to solve the JSSP in order to minimize electricity cost. A scheduling model with the turning off/on of machines was introduced in [5], and a multi-objective genetic algorithm based on the non-dominated sorting genetic algorithm (NSGA-II) was developed to minimize the energy consumption and the total weighted tardiness simultaneously. A metaheuristic to solve the JSSP under a power threshold that must not be exceeded over time was also developed [6], with two power requirements considered for operations: a peak consumption at the beginning of the machining and a nominal consumption afterwards. The aim of this work was to minimize the makespan while respecting the power threshold. Decentralized systems, in which the decision making is distributed over several autonomous actors, attract the interest of many other researchers. For example, an agent-based approach was proposed for measuring, in real time, the energy consumption of resources in a job shop manufacturing process [7]; the energy consumption was individually measured for each operation, and the optimization problem was implemented using IBM ILOG OPL in order to minimize the makespan and the energy consumption.

Flexible Job Shop Energy-Efficient Scheduling
Another type of scheduling in job shops is the flexible job shop scheduling problem (FJSSP), an extension of the JSSP which has been given widespread attention due to its flexibility. An energy-efficient scheduling approach in an FJSSP environment was designed by [8], with an enhanced evolutionary algorithm based on genetic and simulated annealing algorithms incorporating three objective functions: minimizing the total completion time, maximizing the total availability of the system, and minimizing the total energy cost. Similarly, an integrated energy and labor perception multi-objective FJSSP scheduling approach that considers makespan, total energy cost, total labor cost, and maximal and total workload was proposed in [9]. In order to solve the optimization problem, the non-dominated sorting genetic algorithm-III (NSGA-III) was used. Likewise, in [10], a hybrid metaheuristic algorithm based on an artificial immune algorithm (AIA) and a simulated annealing algorithm (SA) was developed to consider simultaneously the maximal completion time and the total energy consumption.
The aforementioned research handled static scheduling, but few works focused on the FJSSP under a real-life environment, considering disturbances such as machine failures, random new job arrivals, unexpected processing times, or unavailability of operators. The accurate detection and control of these events is becoming a topic of concern on shop floors. The job shop scheduling problem under disruptions that can occur at any time was solved by [11]. To achieve this, they used a match-up technique to determine the rescheduling zone and its feasible reschedule. Then, a memetic algorithm was proposed to find a schedule that minimizes the energy consumption within that zone. A rescheduling method based on a genetic algorithm to address dynamic events (i.e., new job arrivals and machine breakdowns) was introduced by [2]. The objective of their work was to optimize the energy consumption and the productivity simultaneously. Another form of unpredictable event that has received much attention lately is new job arrivals: [12] developed an energy-conscious FJSSP with new job arrivals, where the minimization of makespan, energy consumption, and instability was considered. To solve the scheduling problem, they proposed a discrete improved backtracking search algorithm (BSA), and for the rescheduling they used a novel slack-based insertion algorithm. In [13], the authors designed a heuristic template for dispatching rules with the potential to make better routing decisions. As a solution, they developed a genetic programming hyper-heuristic with delayed routing (GPHH-DR) method for solving multi-objective DFJSS that optimizes the mean tardiness and the energy efficiency simultaneously. Within this context, and to deal with new job arrivals, [14] provided a dynamic energy-aware job shop scheduling model which seeks a trade-off among the total tardiness, the energy cost, and the disruption to the original schedule.
An adequate renewed scheduling plan, obtained in a reasonable time and based on a parallel GA, was presented. Scheduling of the energy-efficient FJSSP can also be settled with distributed approaches: [15] proposed a negotiation- and cooperation-based information interaction and process control method, which combines IoT and energy-efficient scheduling methods, to quickly handle machine breakdowns and urgent order arrivals. In this study, a new metaheuristic algorithm, denoted PN-ACO, based on timed transition Petri nets (TTPN) and ant colony optimization (ACO) algorithms, was introduced. An alternative metaheuristic for scheduling in the FJSP is the particle swarm optimization method (PSO), which was used to minimize the makespan and the global energy consumption under machine breakdowns in [3]. In [16], an evolved version of the PSO was presented, as well as a multi-agent architecture named EasySched for the predictive and reactive scheduling of production based on renewable energy availability.

Job Shop Scheduling Using Artificial Intelligence
After the emergence of artificial intelligence (AI) and machine learning (ML) techniques, intelligent and automated scheduling and rescheduling have become possible, and methods based on ML techniques began to arise. In general, there are three types of machine learning: supervised learning, unsupervised learning, and reinforcement learning. Starting with supervised learning techniques, the training data generally include examples of the input vectors along with their corresponding target vectors [17]. In other terms, it is the learning of a function that maps an input to an output based on example input-output pairs. The decision tree (DT) is a well-known supervised technique used in the literature: the scheduling knowledge can, for example, be modeled through data mining to identify a rule set [18]. Three modules were designed there, namely optimization, simulation, and learning: (i) optimization provides efficient schedules based on tabu search (TS), (ii) simulation transforms the solution provided by the optimization module into a set of dispatching rules, and (iii) learning extracts the scheduling knowledge. The authors of [19] applied a data mining module based on DT knowledge extraction. Here, timed Petri nets were used to describe the dispatching processes of the JSSP, a Petri net-based branch-and-bound algorithm was used to generate efficient solutions, and finally the extracted knowledge was formulated as DTs and produced a new dispatching rule. This solution solved the conflicts between operations by predicting which operation should be dispatched first. Another machine learning technique, which combines several decision trees, is the random forest (RF). The authors in [20] started by generating and processing data samples of machine failures, then designed an RF-based rescheduling model that decides which rescheduling strategy has to be applied (no rescheduling, right-shift rescheduling, or total rescheduling). In [21], a comparison between several machine learning techniques was made.
They developed a model for the FJSSP with sequence-dependent setup and limited dual resources, solved the scheduling problem through a hybrid metaheuristic approach based on GA and TS to minimize the makespan, then trained the ML classification models such as support vector machines (SVM) and RF for identifying rescheduling patterns when machines and setup workers are not available.
A subset of supervised learning in the literature is deep learning. In [22], a GA was used to solve the scheduling problem in a job shop in order to minimize the makespan, coupled with an artificial neural network (ANN) employed to predict the total energy consumption. A GA was also used in [23] to minimize the makespan, but the authors handled the dynamic events and perturbations in a job shop environment; they therefore designed a back-propagation neural network (BPNN) to describe machine breakdowns and new job arrivals. Thanks to its feedback adjustments, the BPNN can generate a feasible solution for the JSP by resolving the conflicts. In [24], the cumulative time error was used as the quantitative index of implicit disturbance; supervised locally linear embedding (SLLE) and general regression neural networks (GRNN) were applied to reduce and map the data, and then a least squares support vector machine (LS-SVM) was used to select the best rescheduling mode.
Other works treated the new job arrival disturbance. The authors of [25] presented a scheduling and dispatching rule-based approach for solving a realistic FJSSP, through a combination of a discrete event simulation (DES) model and a BPNN model to find optimal or near-optimal solutions while favoring the fast reactivity to unexpected new arrival jobs. An appropriate management of both methods in the GA optimization process (GA-Opt) was achieved to minimize the makespan.
Compared with supervised learning, unsupervised learning operates upon only the input data without outputs or target variables. The goal in such problems may be to discover groups of similar examples within the data, in an operation called clustering [17]. K-means, an unsupervised technique, was used in [26]. They developed the modified variable neighborhood search (MVNS) method in the optimization process to minimize the mean flow time. This method was combined with the k-means algorithm as a cluster analysis algorithm. It was used to place similar jobs according to their processing time into the same clusters, then jobs in the farther clusters have greater probability to be selected in the replacement mechanism.
The third type of machine learning is reinforcement learning (RL), which has been widely used to solve the scheduling problem in job shops. It describes a class of problems where an agent operates in an environment and must learn to act using feedback. The use of an environment means that there is no fixed training dataset. In other words, reinforcement learning is learning what to do, i.e., how to map situations to actions, so as to maximize a numerical reward signal. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them [27]. There are different types of reinforcement learning, such as Q-learning, deep Q-learning, SARSA, policy gradient, and prioritized experience replay. The authors of [28] were among the first to use reinforcement learning in this context. They proposed an approach to learn local dispatching policies in a job shop with the aim of reducing the summed tardiness. They applied an ANN-based agent, trained by Q-learning, to each resource. This approach demonstrated a better performance than common heuristic dispatching rules. The authors of [29] developed a rule-driven dispatching method. To do so, they used reinforcement learning to train an intelligent agent to obtain the knowledge to set appropriate weight values of elementary rules, in order to solve the work-in-process fluctuation of a machine. The objective of their work was to minimize the mean flow time and the mean tardiness time in the JSSP. In a different way of using RL, [30] used a policy gradient method for autonomous dispatching to minimize the makespan. They designed a multi-agent system where each machine was attached to an agent which employed probabilistic dispatching policies to decide which currently waiting operation should be processed. In the same context, to select the best dispatching rule, in [31] the rescheduling strategy was acquired by the agent of the proposed Q-learning.
The agent-based approach can then select the best strategy under different machine failures. In [32], the Q-learning algorithm was applied to update the parameters of the variable neighborhood search (VNS) at any rescheduling point. New job insertion was also handled using Q-learning. In [33], six composite dispatching rules were developed to select an unprocessed operation and assign it to an available machine when an operation is completed or a new job arrives. Later, a deep Q-learning agent was trained to select the appropriate dispatching rules. In a distributed way, [34] used a Q-learning algorithm associated with Intelligent Products (IP) which collected data to pinpoint the current scheduling context, and then determined the most suitable machine selection rule and dispatching rule in a dynamic flexible job shop scheduling problem with new job insertion. The authors of [35] proposed a multi-agent system containing machine, buffer, state, and job agents for dynamic job shop scheduling to minimize the earliness and tardiness punishment. A weighted Q-learning algorithm based on a dynamic greedy search was adopted to determine the optimal scheduling rules.
A comparison between all the above-mentioned studies is summarized in Table 1. The first column indicates the reference of the works, the second column specifies the type of problem studied, the third column defines the type of perturbation considered. In the fourth column, the scheduling or rescheduling method is presented. In the fifth and sixth column the solving method architecture is mentioned: centralized, which means that only one actor handles the scheduling problem, or distributed, through different communicating agents. In the seventh and eighth columns, the nature of the objective function and the objectives to minimize are presented. Finally, in the last column, the artificial intelligence techniques used in relevant works are presented.

Discussion
Most works in the literature consider energy-efficient scheduling as a multi-objective strategy, which includes reducing the energy consumption or the energy cost alongside traditional scheduling objectives, e.g., makespan, mean tardiness, mean flow time, maximal workload, and many others. Considering energy-related strategies together with the traditional objectives has proved to be a good way to increase scheduling efficiency; this approach is inspiring a lot of research and has become an important topic.
To reduce energy consumption, many aspects have been reviewed: processing, machine idle time reduction, machine speed, transportation, maintenance, setup, and switching energy are examples of energy consumption aspects. Many articles handle energy efficiency in scheduling but do not clearly outline the energy consumption aspects, or only consider one aspect, mainly the processing energy, and ignore the rest, which can have a great impact on energy consumption. Regarding rescheduling, many methods are dynamically used in job shops, but these methods depend on the state of the system at a particular moment. Due to the changing and uncertain nature of job shops, rules have to be modified dynamically and at the right time. Therefore, rescheduling can be handled using machine learning algorithms. In that case, the system is able to select the best method and adapt to the system's perturbations. The learning methods are trained to acquire the system's knowledge, which is then used in the decision-making process. From the literature review, many works applied these learning-based approaches using inductive learning, neural networks, or reinforcement learning, especially RL, which has been widely used and has proved to have high performance in selecting the best approaches for rescheduling or in modifying existing approaches. However, these works have not integrated energy efficiency and are usually only interested in minimizing the operations' execution time. In this article, both makespan and energy consumption reductions are considered in the learning process.
A classical GA was chosen for the initial solving of FJSSP (predictive phase). GAs have already been successfully adopted to solve FJSSP, as proven by the growing number of articles on the topic. Genetic algorithms might not be the best solution in a generic context in terms of solving time. However, this solving is performed in an offline phase that is not penalizing in the context of this work. Moreover, a different choice can be made by a practitioner according to a specific context, without questioning the validity of the overall approach.
For the reactive phase of rescheduling, as no prior knowledge of the environment is considered (because no coherent pre-trained data of the manufacturing system were available to use in the learning process), Q-learning was chosen in this work. The literature provides many works that have used Q-learning for a single objective, usually the optimization of productivity, whereas this article develops a multi-objective optimization that also considers energy consumption. In addition, the learning is generally performed on classical dispatching rules; this article presents a learning phase on actual multi-objective optimization methods of rescheduling.
In addition, Q-learning is an agent-based approach which facilitates its integration in distributed approaches that can be developed on embedded systems which is the topic of possible future works.

A Dynamic Flexible Job Shop Scheduling with Energy Consumption Optimization
The FJSSP has been widely researched in recent decades due to its complexity. On top of that, dynamic events can occur frequently and randomly in job shop systems, which further increases this complexity. Many metaheuristics have been proposed in the literature to solve this problem. In this section, a solution to the FJSSP considering energy consumption optimization is proposed. Then, corresponding rescheduling methods are proposed to handle the dynamic nature of the system.

Description of FJSSP
In the FJSSP, there are n jobs to be processed on M machines. Each job consists of a predetermined sequence of n_j operations which must be processed in a certain order. The objective of the FJSSP is to assign each operation to a suitable machine and to arrange the sequence of operations on each machine [36].
The FJSSP is a generalization of the job shop scheduling problem in which an operation can be processed on several machines, usually with varying costs. The notation used in this article to model the FJSSP includes: • J = {J_1, . . . , J_n}, the set of n independent jobs to be scheduled. The characteristics of the FJSP are the following:

1. Jobs are independent, and no priorities are assigned to any job type.
2. Operations of different jobs are independent.
3. Each machine can process only one operation at a time.
4. Each operation is processed without interruption on one of the machines in its set.
5. There are no precedence constraints among operations of different jobs.
Two assumptions are considered in this work:
6. All machines are available at time 0.
7. The transportation time is neglected.
An example of a small FJSSP instance, with 3 jobs and 4 machines and the processing machines and times of each operation, is presented in Table 2. A full mathematical mixed integer programming (MIP) formulation of the FJSP considering energy consumption has been proposed in [37].
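Such an instance can be represented compactly in code. The values below are hypothetical (the actual contents of Table 2 are not reproduced here); each operation maps the machines able to process it to its processing time:

```python
# A hypothetical FJSP instance in the spirit of Table 2: 3 jobs, 4 machines.
# Each job is a list of operations; each operation maps every machine able
# to process it to its processing time (flexibility: several machines per op).
instance = {
    "J1": [{"M1": 3, "M2": 5},            # O11 can run on M1 (3 t.u.) or M2 (5 t.u.)
           {"M2": 4, "M3": 2, "M4": 6}],  # O12
    "J2": [{"M1": 2, "M4": 4},            # O21
           {"M3": 5}],                    # O22 can only run on M3
    "J3": [{"M2": 3, "M3": 3},            # O31
           {"M1": 4, "M4": 1}],           # O32
}

n_jobs = len(instance)
machines = sorted({m for ops in instance.values() for op in ops for m in op})
print(n_jobs, machines)
```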

Genetic Algorithm (GA)
In this article, we propose to use a classical GA for the initial solving of FJSSP [38]. It is an optimization method based on an evolutionary process. The performance validation of the proposed algorithm is detailed in Section 5.1.
The aim of the FJSSP is to find a feasible schedule that minimizes makespan and energy consumption at the same time. Therefore, makespan and energy consumption are integrated into one objective function (F) using a weighted sum approach. The relative importance of each objective can be modified in F, which represents the fitness of the GA. Since the values of energy consumption and makespan are not on the same scale, both measures have to be normalized [39]. As presented in Equation (1), the makespan is divided by MaxMakespan, the maximum makespan value for the given problem, and the energy consumption is divided by MaxEnergy, the sum of the energy needed to execute all tasks of the problem. λ is the weight that reflects the importance of each objective function, λ ∈ [0, 1]. In this work, this weight is set statically. A dynamic evolution of λ is out of the scope of this article; future perspectives may consider an agent that monitors energy availability and triggers a rescheduling order when a threshold is reached.
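Based on the description above, the weighted-sum objective can be written as follows (a reconstruction; the symbol names are assumptions taken from the surrounding text):

```latex
F = \lambda \cdot \frac{\text{Makespan}}{\text{MaxMakespan}}
  + (1 - \lambda) \cdot \frac{\text{Energy}}{\text{MaxEnergy}},
\qquad \lambda \in [0, 1]
```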
A flow chart illustrating the process of the genetic algorithm is represented in Figure 1. The overall structure of GA can be described in the following steps:

1. Encoding: each chromosome represents a solution to the problem. The genes of the chromosome describe the assignment of operations to the machines, and the order in which they appear in the chromosome describes the sequence of operations.
2. Tuning: the GA includes some tuning parameters that greatly influence the algorithm performance, such as the size of the population, the number of generations, etc. Despite recent research efforts, the selection of the algorithm parameters remains empirical to a large extent. Several typical choices of the algorithm parameters are reported in [40,41].
3. Initial population: a set of initial solutions is selected randomly.
4. Fitness evaluation: a fitness function is computed for each individual; this value indicates the quality of the solution represented by the individual.
5. Selection: at each iteration, the best chromosomes are chosen to produce their progeny.
6. Offspring generation: the new generation is obtained by applying genetic operators such as crossover and mutation.
7. Stop criterion: when a fixed number of generations is reached, the algorithm ends and the best chromosome, with its corresponding schedule, is given as output.
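The steps above can be sketched as a minimal GA loop. This is an illustration, not the article's implementation: all data, parameter values, and names are assumptions, and for brevity only the machine assignment is evolved (the operation sequence is fixed), whereas the actual encoding also carries the sequencing.

```python
import random

random.seed(0)

# Hypothetical flattened instance: (job, op index, {machine: (time, energy)}).
OPS = [
    ("J1", 0, {"M1": (3, 9), "M2": (5, 6)}),
    ("J1", 1, {"M2": (4, 8), "M3": (2, 7)}),
    ("J2", 0, {"M1": (2, 5), "M3": (4, 4)}),
    ("J3", 0, {"M2": (3, 6), "M3": (3, 5)}),
]
LAMBDA = 0.5  # weight of the makespan objective in Equation (1)

def decode(chrom):
    """Greedy list scheduling: operations in fixed order, machines as assigned."""
    machine_free, job_free, makespan, energy = {}, {}, 0, 0
    for (job, _idx, alts), mach in zip(OPS, chrom):
        t, e = alts[mach]
        start = max(machine_free.get(mach, 0), job_free.get(job, 0))
        machine_free[mach] = job_free[job] = start + t
        makespan = max(makespan, start + t)
        energy += e
    return makespan, energy

# Normalization bounds: worst-case (sequential, slowest machine) values.
MAX_MS = sum(max(t for t, _ in alts.values()) for _, _, alts in OPS)
MAX_EN = sum(max(e for _, e in alts.values()) for _, _, alts in OPS)

def fitness(chrom):  # lower is better: Equation (1)-style weighted sum
    ms, en = decode(chrom)
    return LAMBDA * ms / MAX_MS + (1 - LAMBDA) * en / MAX_EN

def random_chrom():
    return [random.choice(list(alts)) for _, _, alts in OPS]

def evolve(pop_size=20, generations=50):
    pop = [random_chrom() for _ in range(pop_size)]       # initial population
    for _ in range(generations):                          # stop criterion
        pop.sort(key=fitness)                             # fitness evaluation
        parents = pop[: pop_size // 2]                    # selection
        children = []
        while len(children) < pop_size - len(parents):    # offspring generation
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(OPS))           # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.1:                     # mutation
                i = random.randrange(len(OPS))
                child[i] = random.choice(list(OPS[i][2]))
            children.append(child)
        pop = parents + children
    return min(pop, key=fitness)

best = evolve()
```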


Disturbances in FJSSP
FJSSP considers a large variety of disturbances. These perturbations are random and uncertain and bring instability to the initial schedule. In this work, one of the most common and frequent disruptions in production scheduling is considered: machine failures. These events are handled using rescheduling methods, discussed in the next section, which try to maintain the stability of the system.
To simulate a machine failure [3], three elements have to be selected:
• The moment when the failure occurs (rescheduling time): failures occur randomly, with a uniform distribution between 0 and the makespan of the original schedule generated with the GA.
• The failing machine.
• The breakdown duration, which follows a uniform distribution between 25% and 50% of the makespan.
To simplify the problem, some assumptions about machine failures are considered:
1. There is only one broken-down machine at a time.
2. The time taken to transfer a job from the broken-down machine to a properly functioning machine is neglected.
3. Machine maintenance starts immediately after the failure.
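The failure-generation procedure above can be sketched as follows (the function and field names are assumptions; the distributions are those stated in the text):

```python
import random

def simulate_failure(makespan, machines, rng=random):
    """Draw one random machine failure following the assumptions above:
    failure time uniform in [0, makespan], duration uniform in
    [25%, 50%] of the makespan, one failing machine at a time."""
    return {
        "time": rng.uniform(0, makespan),               # rescheduling time
        "machine": rng.choice(machines),                # the failing machine
        "duration": rng.uniform(0.25, 0.50) * makespan, # breakdown duration
    }

failure = simulate_failure(makespan=120, machines=["M1", "M2", "M3", "M4"])
```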

Rescheduling Strategies
One question can arise when dealing with the system disturbances, or the changed production circumstances: what kind of rescheduling methodologies should be used to produce a new schedule for the disturbance scenario? In the literature, many rescheduling methodologies were reported. Researchers classified these methods into two categories: (i) repairing a schedule that has been disrupted and (ii) creating a schedule that is more robust with respect to disruptions [42,43].
Three methods are commonly used to repair a schedule that is no longer feasible due to disruptions: right-shift rescheduling (RSR), partial rescheduling (PR), and total rescheduling (TR). Their definitions are, respectively, as follows [24]: RSR postpones the affected operations by shifting them later in time while keeping machine assignments and operation sequences unchanged; PR reschedules only the operations affected, directly or indirectly, by the disruption; TR rebuilds the entire remaining schedule from scratch. The choice of the most appropriate methodology depends on the nature of the perturbation and is generally made by experts. Rescheduling methods have different advantages and drawbacks: RSR and PR can quickly respond to machine breakdowns, while TR can offer a high-performance rescheduling, but with excessive computational effort. In this work, the targeted rescheduling strategy is the optimal one that minimizes the makespan and the energy consumption.
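As an illustration of the simplest repair method, a right-shift step can be sketched as follows. This is a simplified, hypothetical sketch: it delays every operation on the failed machine that has not finished at the failure time, and omits the propagation of delays through job-precedence constraints.

```python
def right_shift(schedule, failure_machine, failure_time, duration):
    """Right-shift rescheduling (RSR) sketch: operations on the failed
    machine that overlap or follow the breakdown are delayed by its
    duration; machine assignments and sequences are kept unchanged."""
    repaired = []
    for op in schedule:  # op: {"job", "machine", "start", "end"}
        op = dict(op)
        if op["machine"] == failure_machine and op["end"] > failure_time:
            op["start"] += duration
            op["end"] += duration
        repaired.append(op)
    return repaired

schedule = [
    {"job": "J1", "machine": "M1", "start": 0, "end": 3},
    {"job": "J2", "machine": "M1", "start": 3, "end": 5},
    {"job": "J3", "machine": "M2", "start": 0, "end": 3},
]
new = right_shift(schedule, failure_machine="M1", failure_time=2, duration=4)
```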

Proposed Multi Objective Q-Learning Rescheduling Approach
The proposed Q-learning-based rescheduling is described in Figure 2. The system is composed of two modes:
• An offline mode: first, the predictive schedule is obtained using a genetic algorithm; it represents the environment of the Q-learning agent. By interacting with this schedule and simulating machine-failure experiments, the agent learns how to select the optimal rescheduling solution for different states of the system.
• An online mode: when a machine failure occurs, the state of the system at the time of the interruption is delivered to the Q-learning agent, which responds by selecting the optimal rescheduling decision for this particular type of failure.
A key aspect of RL is that an agent has to learn a proper behavior. This means that it modifies or acquires new behaviors and skills incrementally [44]. An improvement of the Q-learning algorithm was also made to consider different criteria (multi-objective Q-learning). The next sections detail this algorithm.


Q-Learning Terminologies
In order to be more accurate in the description of the algorithm, some terminologies of Q-learning are recalled below [45]:
• Agent: the agent interacts with its environment, selects its own actions, and responds to those actions;
• Environment: everything the agent interacts with; here, the simulated schedule and shop floor;
• State: the description of the situation of the environment at a given moment;
• Action: a decision the agent can take in a given state; here, the choice of a rescheduling method;
• Reward: the feedback signal that evaluates the consequence of an action taken in a state.
To sum up, the agent makes decisions using experience: it takes an action in a particular state and evaluates its consequences based on a reward. This process is repeated until the agent becomes able to choose the best decision.
Q-learning is a value-based learning algorithm; it updates the value function based on the Bellman equation. The 'Q' here stands for the quality of an action. The agent maintains a table of values Q(s, a), updated over time based on Equation (2):

Q(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α (r_t+1 + γ max_a Q(s_t+1, a))    (2)

where r_t+1 is the reward received when the agent transfers from state s_t to state s_t+1, α is the learning rate (0 < α ≤ 1), representing the extent to which the Q-values are updated in every iteration, and γ is the discount factor (0 ≤ γ ≤ 1), determining the importance given to future rewards. The Q-learning procedure is detailed in Algorithm 1.

Algorithm 1 Q-Learning
  Initialize Q(s, a) randomly
  Repeat for each episode:
    Initialize s
    Repeat for each step of the episode:
      Choose an action a using a policy derived from Q (ε-greedy)
      Take action a and observe the reward r and the next state s′
      Q(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α (r_t+1 + γ max_a Q(s_t+1, a))
      s ← s′
    until s is terminal
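Algorithm 1 can be sketched in Python as follows. The toy environment (a small deterministic chain rewarding moves to the right) and the parameter values are illustrative assumptions, not part of the rescheduling model:

```python
import random

random.seed(0)  # reproducible toy run

def q_learning(n_states, n_actions, step, episodes=1000,
               alpha=0.5, gamma=0.9, epsilon=0.8):
    """Tabular Q-learning (Algorithm 1). `step(s, a)` must return
    (next_state, reward, done); epsilon is the exploitation rate."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0                                   # initial state
        done = False
        while not done:
            # epsilon-greedy: exploit with probability epsilon, else explore
            if random.random() < epsilon:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            else:
                a = random.randrange(n_actions)
            s2, r, done = step(s, a)
            # Bellman update: Q(s,a) <- (1-α)Q(s,a) + α(r + γ max Q(s',·))
            Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (r + gamma * max(Q[s2]))
            s = s2
    return Q

# Toy chain: action 1 moves right (+1 reward), action 0 stays (0 reward);
# the episode ends at state 4.
def step(s, a):
    s2 = min(s + 1, 4) if a == 1 else s
    return s2, float(a == 1), s2 == 4

Q = q_learning(5, 2, step)
```

After training, the Q-values of the rewarding action dominate in every non-terminal state, which is what the ε-greedy policy then exploits.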

Multi-Objective Q-Learning
In this case the agent has to optimize two objective functions at the same time. Here, the reward is transformed from a scalar value into a vector whose size is the number of objective functions:

R(s, a) = [R_1(s, a), R_2(s, a), …, R_m(s, a)]

where m is the number of objective functions. The same transformation applies to the action-state value Q(s, a), which also becomes an m-dimensional vector:

Q(s, a) = [Q_1(s, a), Q_2(s, a), …, Q_m(s, a)]

where every component corresponds to a reward value from the reward vector.
In this article, a multi-objective Q-learning approach with a single policy is used. This means that the dimensionality of the multi-objective function is reduced to a single function that fairly represents the importance of all objectives. Many methods have been proposed for the single policy approach; the most well-known is the weighted sum, where a scalarizing function is applied to Q(s, a) to obtain a scalar value SQ(s, a) that considers all the objective functions. The linear scalarizing function is used here and is defined as follows:

SQ(s, a) = ∑_{i=1}^{m} w_i Q_i(s, a)

where 0 ≤ w_i ≤ 1 is the weight that specifies the importance of each objective function and must satisfy ∑_{i=1}^{m} w_i = 1. The multi-objective Q-learning procedure is detailed in Algorithm 2.

Algorithm 2 Multi-Objective Q-Learning
  Initialize Q(s, a) randomly
  Repeat for each episode:
    Initialize s
    Repeat for each step of the episode:
      Choose an action a using a policy derived from Q (ε-greedy)
      Take action a and observe the rewards R_1 and R_2 and the next state s′
      Q_1(s_t, a_t) ← (1 − α) Q_1(s_t, a_t) + α (R_1,t+1 + γ max_a Q_1(s_t+1, a))
      Q_2(s_t, a_t) ← (1 − α) Q_2(s_t, a_t) + α (R_2,t+1 + γ max_a Q_2(s_t+1, a))
      s ← s′
    until s is terminal
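The weighted-sum action selection over the two Q-tables of Algorithm 2 can be sketched as follows. The function names and the illustrative Q-vector values are assumptions for the example, not values from the paper:

```python
def scalarized_q(q_vectors, weights):
    """Linear scalarization: SQ(s, a) = sum_i w_i * Q_i(s, a)."""
    return [sum(w * q for w, q in zip(weights, qa)) for qa in q_vectors]

def select_action(q_vectors, weights):
    """Greedy action under the scalarized Q-values."""
    sq = scalarized_q(q_vectors, weights)
    return max(range(len(sq)), key=lambda a: sq[a])

# Q-vectors for one state and three actions, with two objectives
# (makespan reward, energy reward) -- illustrative numbers only.
q = [[3.0, -1.0],   # action 0: good makespan, poor energy
     [2.0,  2.5],   # action 1: balanced
     [-2.0, 1.0]]   # action 2: poor makespan
print(select_action(q, [1.0, 0.0]))  # pure makespan weight -> action 0
print(select_action(q, [0.5, 0.5]))  # equal weights -> action 1
```

Changing the weight vector changes the greedy action, which is exactly how the single-policy approach trades the two objectives off against each other.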

State Space Definition
The state space is the set of all possible situations the agent can inhabit. The number of states that will give the optimal solution has to be selected, along with the way these states are defined. In this article, two indicators are used to establish the state space:
• s1 indicates the moment when the perturbation happens, e.g., at the beginning, in the middle, or at the end of the schedule. For this purpose, the initial makespan is divided into 3 intervals, so s1 can take the values 0, 1, or 2.
• s2 is defined by the indicator SD, which is the ratio of the processing duration of the operation directly affected by the machine breakdown to the total processing time of the remaining operations on the failed machine:

SD = d(O_aff) / RT

where d(O_aff) is the processing duration of O_aff, the operation directly affected by the machine breakdown, and RT is the total processing time of the remaining operations on the failed machine. s2 is an integer between 0 and 9 depending on the value of SD.
The couple (s1, s2) represents the state of the system at a particular time, given the rescheduling time, the failed machine, and the breakdown duration. In total there are 30 states, since 0 ≤ s1 ≤ 2 and 0 ≤ s2 ≤ 9 (s1 and s2 are integers).
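The state encoding above can be sketched as a small function. The equal-width bucketing of SD into ten intervals is an assumption (the paper only states that s2 is an integer between 0 and 9), and the argument names are illustrative:

```python
def encode_state(t_fail, makespan, p_affected, remaining_time):
    """Map a machine failure to the state couple (s1, s2).
    s1: third of the schedule horizon in which the failure occurs (0, 1 or 2).
    s2: the SD ratio d(O_aff) / RT, discretized into ten buckets (0..9);
    equal-width buckets are an assumption here."""
    s1 = min(int(3 * t_fail / makespan), 2)
    sd = p_affected / remaining_time        # 0 < SD <= 1
    s2 = min(int(10 * sd), 9)
    return s1, s2

# Failure at t = 20 on a schedule of makespan 42, affected operation of
# duration 6, with 15 time units of work remaining on the failed machine.
print(encode_state(t_fail=20, makespan=42, p_affected=6, remaining_time=15))
# -> (1, 4)
```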

Actions and Reward Space Definition
The agent encounters one of the 30 states and takes an action. The action in this case is one of the rescheduling methods (PR, TR or RSR). The definition of the reward plays an important role in the algorithm, since the Q-learning agent is reward-motivated: it selects the best action by evaluating the reward. In this work, the reward is a vector of two scalars R(s, a) = [R_1(s, a), R_2(s, a)], where R_1(s, a) depends on the delay time (the longer the delay, the smaller the reward) and R_2(s, a) depends on the difference in energy consumption between the initial schedule and the schedule after rescheduling (the bigger this difference, the smaller the reward). The rewards are set between −5 and 5, based on how much delay time and how much energy consumption difference the action causes.
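One way to map delay and energy deviation onto the [−5, 5] range is a linear scaling against the worst observed case; the linear form and the normalizing bounds are assumptions, since the paper only specifies the range:

```python
def rewards(delay, energy_dev, max_delay, max_energy_dev):
    """Map delay time and energy deviation to rewards in [-5, 5]:
    zero delay/deviation -> +5, the worst case -> -5 (linear in between).
    The linear mapping is an assumption; only the range is from the paper."""
    r1 = 5 - 10 * min(delay / max_delay, 1.0)
    r2 = 5 - 10 * min(energy_dev / max_energy_dev, 1.0)
    return r1, r2

print(rewards(delay=0, energy_dev=50, max_delay=20, max_energy_dev=100))
# -> (5.0, 0.0): no delay gets the maximum reward, half the worst
#    energy deviation gets a neutral reward
```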

Experiments and Results
In order to evaluate the performance of the proposed model, benchmark problems are used. To the best of the authors' knowledge, there are currently no benchmarks available in the literature considering energy in an FJSP. Therefore, instances had to be created in order to test and validate this work. The choice was made to extend classical problems from the literature to support energy consumption. The chosen problems are taken from Brandimarte [46]. This benchmark consists of 10 problems (mk1 to mk10), with 10 to 20 jobs, 6 to 15 machines, and 5 to 15 operations per job. An energy consumption value was added randomly for every operation, following a uniform distribution between 1 and 100. Thus, for each instance, the machining energy consumption and the idle power of the machines are specified as inputs.
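The extension of a classical instance with random energy values can be sketched as follows. The instance representation (job → list of operations, each operation mapping candidate machines to processing times) is a simplified assumption for illustration:

```python
import random

def add_energy(instance, low=1, high=100, seed=0):
    """Extend a classical FJSP instance with a machining energy value per
    (operation, machine) alternative, drawn uniformly from [low, high].
    `instance` maps job -> list of operations, each operation being a
    dict machine -> processing time (a simplified representation)."""
    rng = random.Random(seed)
    energy = {}
    for job, ops in instance.items():
        for i, op in enumerate(ops):
            for m in op:
                energy[(job, i, m)] = rng.randint(low, high)
    return energy

# Tiny illustrative instance: 2 jobs, 2 operations each.
inst = {0: [{0: 3, 1: 5}, {1: 2}], 1: [{0: 4}, {0: 6, 1: 1}]}
e = add_energy(inst)
assert all(1 <= v <= 100 for v in e.values())
```

Fixing the seed keeps the extended instances reproducible across experiments.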
In this article, the makespan is expressed in units of time and the energy consumption in kWh.

Predictive Schedule Based on GA
Initially, the optimal scheduling scheme is acquired with the GA. The proposed method is implemented in Python using the Distributed Evolutionary Algorithms in Python (DEAP) framework, a novel evolutionary computation framework. The parameters of the GA are set as follows: the size of the initial population is 50 and the number of generations is 500.
To validate the GA, a comparison was made with other methods from the literature, such as the PSO proposed by [47] and the TS proposed by [48]. The makespan results of these different algorithms on the Brandimarte instances are presented in Table 3. The weight of the objective function of the genetic algorithm is set to 1, to give importance to the makespan rather than to energy reduction. Italics in the table identify the most effective algorithm, i.e., the one with the lowest makespan.
As can be seen from Table 3, the proposed GA gives results similar to the PSO and TS algorithms when the weight is set to 1. Therefore, this proposition is considered satisfactory.
In the next step, more importance is given to energy reduction, therefore the weight of the objective function is modified. The Gantt chart of the predictive schedule using GA of Mk01 for different weight values is shown in Figure 3.
The makespan and energy consumption values for the different cases are described in Table 4, which shows that the two objective functions are antagonistic. When the weight is set to 1, importance is given to the makespan; in this case the GA provides the best makespan (42) but the highest energy consumption (2812). Conversely, when the weight is set to 0, importance is given to energy reduction; in this case the GA provides the worst makespan (73) but the best energy consumption (2229). It may be noted that as the weight decreases, the makespan increases but the energy consumption decreases.
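The role of the weight can be illustrated with a weighted-sum objective. Normalizing each term by the best value observed in Table 4 is an assumption about the exact fitness form; the paper only specifies a single weight w in [0, 1]:

```python
def fitness(makespan, energy, w, ms_ref=42.0, ec_ref=2229.0):
    """Weighted-sum objective to minimize. Normalizing by the best
    observed values (ms_ref, ec_ref, taken from Table 4) is an
    assumption; the paper only specifies the weight w in [0, 1]."""
    return w * makespan / ms_ref + (1 - w) * energy / ec_ref

# Table 4 extremes: w = 1 favors the makespan-optimal schedule,
# w = 0 favors the energy-optimal schedule.
assert fitness(42, 2812, w=1.0) < fitness(73, 2229, w=1.0)
assert fitness(73, 2229, w=0.0) < fitness(42, 2812, w=0.0)
```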

Rescheduling Strategies
To illustrate the differences between the rescheduling methods presented in Section 3.4, the predictive schedule of instance MK01 with the weight set to 1 is taken as an example. A random perturbation (machine failure) is applied, assuming that machine 1 breaks down at time t = 20 and that the breakdown lasts t′ = 6 time units. The new schedules acquired by the three rescheduling methods (PR, TR and RSR) are presented in Figure 4, the red lines representing the starting and ending times of the machine failure.
In PR, the operations directly affected by the failure are rescheduled (Figure 4b). In TR, all the remaining jobs are rescheduled using the GA after the breakdown (Figure 4c). As for RSR, all the remaining jobs are postponed by the breakdown duration (Figure 4d). The performance of the rescheduling methods is described in Table 5. As can be seen from Table 5, the three rescheduling methods give different results. Both makespan and energy consumption increase due to the machine failure, which affects a set of operations. In terms of makespan, TR gives the best result (42), but in terms of energy consumption, RSR gives the best result (2887). This result can be explained by the date of the failure, which happened close to the end of the initial schedule.
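The simplest of the three policies, RSR, can be sketched as follows. The schedule representation (a list of (job, machine, start, end) tuples) is a simplification, and precedence handling between operations is omitted:

```python
def right_shift(schedule, t_fail, duration):
    """Right-shift rescheduling (RSR) sketch: every operation still
    unfinished at the failure time is postponed by the breakdown
    duration. Precedence constraints are omitted; `schedule` is a list
    of (job, machine, start, end) tuples -- a simplified representation."""
    return [(j, m, s + duration, e + duration) if e > t_fail
            else (j, m, s, e)
            for j, m, s, e in schedule]

# Failure at t = 20 lasting 6 time units: finished operations keep their
# dates, unfinished ones are shifted right by 6.
sched = [(0, 1, 10, 18), (1, 1, 18, 24), (2, 0, 22, 30)]
print(right_shift(sched, t_fail=20, duration=6))
# -> [(0, 1, 10, 18), (1, 1, 24, 30), (2, 0, 28, 36)]
```

PR and TR are costlier because they re-run an optimization (the GA) on the affected or remaining operations, which is exactly the execution-time gap exploited later by the Q-learning selection.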

Rescheduling Based on Q-Learning
To test the performance of the proposed Q-learning algorithm, simulation experiments of machine failures were designed. The parameters are set as follows:
• α = 1: a learning rate of 1 means that the old value is completely discarded; the model converges quickly and no large number of episodes is required.
• γ = 0: the agent considers only immediate rewards. In each episode, one state is evaluated (the initial state of the system at a particular time, given the rescheduling time, the failed machine and the breakdown duration).
• ε = 0.8: the balance factor between exploration and exploitation. Exploration refers to searching over the whole sample space, while exploitation refers to exploiting the promising areas already found. In the proposed model, 80% is given to exploitation, so in 80% of cases the agent chooses the action with the biggest reward, and in 20% of cases it randomly chooses an action to explore more of its environment.
• The number of episodes is 1000, for the model to converge.
In each episode the Q-table is updated depending on the value of the rewards (Figure 5).
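With α = 1 and γ = 0, the Bellman update collapses to Q(s, a) ← r: the Q-table simply stores the most recent reward for each state-action pair. A minimal sketch, with the dictionary-based table and the reward value being illustrative assumptions:

```python
def update(Q, s, a, r, s_next, alpha=1.0, gamma=0.0):
    """General Q-update; with alpha = 1 and gamma = 0 (the settings used
    in the experiments) it reduces to Q[s][a] = r."""
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (
        r + gamma * max((Q.get((s_next, b), 0.0) for b in range(3)),
                        default=0.0))
    return Q

Q = {}
# State (0, 7), action 1, an illustrative reward of 4.5:
update(Q, (0, 7), 1, 4.5, (0, 7))
assert Q[((0, 7), 1)] == 4.5  # alpha=1, gamma=0: the reward is stored directly
```

This explains why no large number of episodes is needed per state: each visit fully overwrites the previous estimate with the observed reward.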

The Single Objective Q-Learning
Two types of Q-learning algorithm are proposed in this article: the single objective Q-learning and multi-objective Q-learning.
The aim of the single-objective Q-learning is to minimize the makespan, which means minimizing the delay time. The curves of the reward and the delay time over the first 50 episodes are described in Figure 6. It can be seen that the longer the delay time, the lower the reward value.

To show how the Q-values are updated in each episode, the state (0, 7) is taken as an example. Figure 7 describes the variation of the Q-values of each action. The agent first selects action 0 and gets a positive reward, so its Q-value increases. After a few episodes, action 0 is chosen again because it has the biggest Q-value, but it gets a negative reward. Its Q-value thus decreases, giving action 1 the chance to be selected. After that, action 1 is chosen in every episode because it gets a positive reward each time, so its Q-value increases. Action 2 is selected in the 100th and 800th episodes due to the ε-greedy policy, where the agent still has a 20% probability of exploring, but its Q-value decreases because it gets negative rewards.

The Multi-Objective Q-Learning
The goal of the multi-objective Q-learning approach is to minimize the makespan and the energy consumption at the same time. In this case, two rewards are considered: reward R1, which depends on the delay time, and reward R2, which depends on the energy consumption deviation. Figure 8 describes the variation of the rewards over the first 50 episodes. It can be seen that R1 increases when the delay time decreases and R2 increases when the energy consumption deviation decreases.
This time, state (1, 9) is taken as an example and the weight of the objective function of the multi-objective Q-learning algorithm is set to 0.5 (which means that makespan and energy consumption have the same importance). Throughout the episodes, action 1 gets positive rewards and its Q-value increases, so it is selected most of the time; on the other hand, actions 0 and 2 get negative rewards, so their Q-values decrease and they are chosen only in the exploration phase. The Q-value prediction for state (1, 9) is presented in Figure 9.

Models Validation
The results of the optimal rescheduling methods for the Brandimarte [46] instances and the solutions given by the Q-learning agent are presented in Appendix A. In Table 6, an extract of Appendix A corresponding to the instance MK01 is taken as an example. The first column is the name of the instance, followed by its size and its level of flexibility. In the fourth column, the weight of the objective function of the GA and of the multi-objective Q-learning is defined. In the fifth column, the makespan and energy consumption of the predictive schedule are given. In the sixth column, the different types of machine failures are defined by their failure time, the reference of the failing machine and the failure duration. Next comes the state definition, then the rescheduling methods and their performance. In the last column, the evaluated Q-learning approach is presented by giving the makespan (MK) and the energy consumption (EC) of the selected optimal rescheduling solution using single-objective and multi-objective Q-learning.

In the predictive schedule, when the weight decreases, the makespan increases but the energy consumption decreases. This is normal, because more importance is given to energy consumption each time the weight is decreased. After randomly simulating different types of failures, it can be seen that the Q-learning agent is able to choose the best rescheduling method each time: the single-objective Q-learning selects the method that minimizes the makespan, whereas the multi-objective Q-learning selects the method that minimizes both the makespan and the energy consumption, depending on the value of the weight of the objective function.
When this weight is set to 1, the single-objective and multi-objective Q-learning give the same results: they both choose the method that minimizes the makespan regardless of the energy consumption. From Table 7, in the case of MK01, TR proved to have the highest performance and was selected by both algorithms. When the same importance is given to energy consumption, which implies setting the weight to 0.5, the selected method changes to make a compromise between the two objectives, and the results of the single-objective and multi-objective Q-learning differ. Taking state (0, 9) as an example, PR and TR give makespans of 56 and 57 respectively, and energy consumptions of 2890 and 2724 respectively; PR is selected by the single-objective Q-learning because it generates the minimum makespan, but TR is selected by the multi-objective Q-learning because it gives a better result than PR in terms of energy consumption. By further decreasing the weight to 0.2, more prominence is given to energy consumption. Taking the example of state (0, 4), PR and TR give makespans of 75 and 79 respectively, and energy consumptions of 2797 and 2757 respectively. Here PR is selected by the single-objective Q-learning because it minimizes the makespan, but TR is selected by the multi-objective Q-learning because it better optimizes the energy consumption, which was given more importance. Once the weight is set to 0, the multi-objective Q-learning selects the method that optimizes the energy consumption regardless of the makespan, as in state (0, 9), where PR gave the best makespan (91) and was selected by the single-objective Q-learning, but TR was selected by the multi-objective Q-learning because it gave the best energy consumption (2612).
Considering all the instances of the Brandimarte benchmark in Appendix A, it can also be deduced that right shift rescheduling turned out to have the worst performance; this is due to the postponement of the remaining tasks, which increases both the makespan and the energy variation. Another deduction is that TR generally has the best performance for early failures, while PR gives better results when failures occur in the middle or at the end of the schedule, especially with instances that have high flexibility.
The Q-learning algorithm not only selects the optimal rescheduling method but also responds immediately to perturbations. Table 7 gives the CPU time comparison between the time spent executing the three rescheduling methods (PR, TR, RSR) and selecting the best one, and the time spent by the Q-learning algorithm selecting the best method from the Q-table. The reported values were evaluated on a laptop computer with an Intel Core i5-8250U processor at 1.8 GHz and 12 GB of memory. The offline training of the Q-learning algorithm can take minutes or even hours depending on the instance size, but it can be seen that, in online execution, the learning-based selection of the optimal rescheduling solution takes only one millisecond; this time corresponds to the computation of the state of the system after the perturbation and the selection of the method with the highest Q-value from the corresponding Q-table. By contrast, the execution of the three rescheduling methods and the selection of the best one can take several seconds, or even minutes when the instance is large.

Conclusions
This work deals with the flexible job shop scheduling problem under uncertainties. A multi-objective Q-learning rescheduling approach is proposed to solve the FJSP under machine failures. Two key performance indicators are used to select the best schedule: the makespan and the energy consumption. The idea is not only to maintain effectiveness but also to improve energy efficiency. The approach is hybrid and combines predictive and reactive phases. The originality of this work is to combine AI and scheduling techniques to rapidly solve a bi-objective rescheduling problem (makespan and energy consumption) in the context of the FJSP.
First, a genetic algorithm was developed to provide an initial predictive schedule that minimizes the makespan and energy consumption simultaneously. In this predictive phase, different types of machine failures were simulated and classical rescheduling policies (RSR, TR, PR) were executed to repair the predictive scheduling and to find new solutions. Based on these results, the Q-learning agent is trained. To consider the energy consumption even in the rescheduling process, a multi-objective Q-learning algorithm was proposed. A weighting parameter is used to make a tradeoff between the makespan and the energy consumption. In the reactive phase, the Q-learning agent is tested on new machine disruptions. The Q-learning agent seeks to find the best action to take given the current state. In fact, the main goal of using AI tools is to be able to react quickly facing failures while rapidly selecting the best rescheduling policy related to the state of the environment. In order to assess the performance of the developed approach, the Brandimarte [46] benchmark was extended to support energy consumption. On this new benchmark, the Q-learning based rescheduling approach was tested to respond to unexpected machine failures and select the best rescheduling strategy.
The results of this study show that the approach is effective in responding quickly and accurately to unexpected machine failures. The Q-learning algorithm provided appropriate strategy choices based on the state of the environment, with various balances between the objectives of energy consumption and productivity. The learning phase was therefore efficient enough to enable these choices. The choices of the genetic algorithm and the Q-learning algorithm proved their efficiency on the extended classical Brandimarte instances in this work. Nevertheless, the approach leaves users the possibility of integrating their own choice of algorithm according to their specific context.
Future works are oriented toward taking into consideration other types of disruptions, such as new job insertions, variation in energy availability, urgent job arrivals, etc. Another future perspective is the evaluation of the proposed approach against other types of learning techniques, in order to compare them with the Q-learning algorithm. From a more global perspective, this work contributes to the development of efficient rescheduling approaches for the control of future industrial systems. Such systems are meant to integrate more and more flexibility, and the performance evaluation of this work on an FJSP shows the compatibility of the approach with this objective. This work also contributes to the integration of multi-objective rescheduling strategies in industry, which is especially relevant for sustainability concerns.

Funding: This research was funded by the PULSAR Academy sponsored by the University of Nantes and the Regional Council of Pays de la Loire, France.
Institutional Review Board Statement: Not applicable.

Conflicts of Interest:
The authors declare no conflict of interest.