A Hybrid Machine Learning and Population Knowledge Mining Method to Minimize Makespan and Total Tardiness of Multi-Variety Products

: Nowadays, the production model of many enterprises is multi-variety customized production, and the makespan and total tardiness are the main metrics for enterprises to make production plans. This requires us to develop a more effective production plan promptly with limited resources. Previous research focuses on dispatching rules and algorithms, but the application of the knowledge mining method for multi-variety products is limited. In this paper, a hybrid machine learning and population knowledge mining method to minimize makespan and total tardiness for multi-variety products is proposed. First, through offline machine learning and data mining, attributes of operations are selected to mine the initial population knowledge. Second, an addition–deletion sorting method (ADSM) is proposed to reprioritize operations and then form the rule-based initial population. Finally, the nondominated sorting genetic algorithm Ⅱ (NSGA- Ⅱ ) hybrid with simulated annealing is used to obtain the Pareto solutions. To evaluate the effectiveness of the proposed method, three other types of initial populations were considered under different iterations and population sizes. The experimental results demonstrate that the new approach has a good performance in solving the multi-variety production planning problems, whether it is the function value or the performance metric of the acquired Pareto solutions. is proposed to assign each operation a priority class. This sorting method overcomes the problem of misalignment of the sequence of operations and overcomes the problem of insufficient or overflowing operations owned by the priority class. After generating an initial population set based on the data mining and ADSM, we compared it with the NSGA- Ⅱ hybrid SA method of the mixed, random, and Nasiri’s initial population. Considering the different numbers of iterations and the sizes of the initial population, a series of comparative and computational experiments were conducted. The results show that the new proposed method is better than the traditional method in terms of the magnitude and relative error of each objective function, and the Cov and Spacing performance index of Pareto solutions.


Introduction
In recent years, digital integration and artificial intelligence have accelerated at an explosive rate. This progress inevitably leads to the rapid development of two technologies: big data [1][2][3] and intelligent algorithms [4]. For enterprise production planning or scheduling, although there are many machine learning algorithms [5][6][7][8][9] for this problem, there are still not many data mining methods, especially for multi-objective job shop scheduling problems (MOJSSP) [10].
MOJSSP is an important and critical problem for today's enterprises. A MOJSSP model can typically be described as a set of machines and jobs under multiple objectives, where each job has different sequential operations and each job should be processed on machines in a given order. This pattern is in line with the production planning problem for multi-variety products with different routings. Herein, the problem is transformed into how to rationally arrange the order of operations of multi-variety products on various machine tools, such that two or more objectives are optimal, e.g. makespan and total tardiness.
The characteristics of multi-objective problems increase the computational difficulty of traditional single-objective scheduling, while data mining [11] could extract rules from a large dataset without too much professional scheduling knowledge, which has more research space in solving MOJSSP and obtaining the optimal values of the objectives.
For the benefits of data mining, a multi-objective production planning method combining machine learning and population knowledge mining is proposed. Through the relation hierarchies mapping [12], attributes related to processing time and due date are selected. The machine learning algorithm used here is a hybrid metaheuristic algorithm that combines nondominated sorting genetic algorithm Ⅱ (NSGA-Ⅱ) and simulated annealing. It provides training data for the mining of population knowledge, and the rules obtained can be used as the criterion for sorting operations. Each operation gets its priority through rules, but it sometimes fails to meet the requirements of the rules (the number of operations per priority is fixed). Besides, the sequence of operations should also be considered. Therefore, this paper proposes a comprehensive priority assignment procedure called addition-deletion sorting method (ADSM) which can meet not only the optimized priority but also the requirements of rules and the order of operations. The main contributions of this paper are highlighted as follows: (1) A hybrid machine learning and population knowledge mining method is proposed to solve multi-objective job shop scheduling problems. (2) Five attributes, namely operation feature, processing time, remaining time, due date, and priorities, are selected to mine initial population knowledge. (3) The ADSM method is designed to reprioritize operations after the population knowledge mining. (4) Three populations (rules, mixed, and random) with different iterations and population sizes are compared, and three performance metrics are defined to explore the effectiveness of the proposed method.

Literature Review
MOJSSP is increasingly attracting scholars and researchers. Most methods are based on a multiobjective evolutionary algorithm (MOEA) [13], which first generates an initial population of production planning and then iterates continuously to obtain the best results. Huang [14] proposed a hybrid genetic algorithm to solve the scheduling problem while considering transportation time. Souier et al. [15] used NSGA-Ⅱ for real-time scheduling under uncertainty and reliability constraints. Ahmadi et al. [16] used NSGA-Ⅱ and non-dominated ranking genetic algorithm (NRGA) to optimize the makespan and stability under the disturbance of machine breakdown. Zhou et al. [17] presented a cooperative coevolution genetic programming with two sub-populations to generate scheduling policies. Zhang et al. [18] utilized a genetic algorithm combined with enhanced local search to solve the energy-efficiency job shop scheduling problems. A high-dimensional multi-objective optimization method was proposed by Du et al. [19]. To improve the robustness of scheduling, constrained nondominated sorting differential evolution based on decision variable classification is developed to acquire the ideal set.
In addition to the MOEA method, there are many other approaches to solve MOJSSP. Sheikhalishahi et al. [20] studied the multi-objective particle swarm optimization method. In their paper, makespan, human error, and machine availability are considered. Lu et al. [21] employed a cellular grey wolf optimizer (GWO) for the multi-objective scheduling problem with the objectives of noise and energy, while Qin et al. [22] tried to overcome the premature convergence of the GWO in solving MOJSSP by combining an improved tabu search algorithm. Zhou et al. [23] proposed three agent-based hyper-heuristics to solve scheduling policies with constrains of dynamic events. Fu et al. [24] designed a multi-objective brain storm optimization algorithm to minimize the total tardiness and energy consumption. Their goal is to minimize multiple objectives through MOEA or other metaheuristic algorithms, without exploiting a data mining method to improve the performance of scheduling.
With the development of industrial intelligence, data mining could be applied to management analysis and knowledge extraction. For MOJSSP, it could play a key role to solve the scheduling problem in the future. Recently, some researchers are trying to apply data mining to overcome the difficulties of job shop scheduling problems. Ingimundardottir and Runarsson [25] used imitation learning to discover dispatching rules. Li and Olafsson [26] introduced a method for generating scheduling rules using the decision tree. In their research, data mining extracts the rules that the attributes of jobs determine which one is better to be processed first. Similarly, Olafsson and Li [27] applied a decision tree to learn directly from scheduling data where the selected instances are identified as preferred training data. Jun et al. [28] proposed a random-forest-based approach to extract dispatching rules from the best schedules. This approach includes schedule generation, rule learning, and discretization such that they could minimize the average total weighted tardiness for job shop scheduling. Kumar and Rao [29] applied data mining to extract patterns in data generated by the ant colony algorithm, which generate a rule set scheduler that approximates the ant colony algorithm's scheduler. Nasiri et al. [30] gained a rule-based initial population via GES/TS method, and then employed PSO and GA to verify the advantage of this method, but lacked the consideration of operations sequence. Most of them considered data mining as a tool to extract dispatching rules, while only a few of them utilized a data mining approach to generate the initial population of metaheuristic methods, not to mention multi-objective job shop scheduling problem.
This paper uses data mining to generate the rule-based initial population for MOJSSP. The operation sequence is considered and an addition-deletion sorting method is presented to reprioritize the order of operations of each job. The advantage of the proposed method is reflected by the NSGA-Ⅱ with simulated annealing method of the random initial population under different criteria of iteration numbers and different initial population sizes.

Problem Description and Goal
The MOJSSP can be described as n jobs {J0, J1, …, Jn} under multiple objectives processed on m machines {M0, M1, …, Mm} with different routes [31]. The processing order of each part and the processing time of each operation oij on a machine is determined. To illustrate the problem, the encoding of a benchmark of 10 × 10 job shop (LA18) [32] here can take an example as {9, 8, 5, 4, 2, 1, 3, 6, 7, 0, …,5, 8, 9}. Each element in the operation sequence code represents a job. The ith occurrence of the same value means the ith operation of this job. Here, the MOJSSP aims to find an order to optimize the above two objectives simultaneously. The constraints include each job should be processed on only one machine at a time, each machine can process only one job at a time, and the preemption is not allowed [33].
Notations used are listed below: 1, if is machined on = 0, otherwise or -, = 0, 1, ..., ; = 0, 1, ..., ; = 0, 1, Equations (1) and (2) are used to minimize the makespan (F1) and total tardiness (F2), respectively. Equation (3) is the variable restriction. Equation (4) ensures the sequence of operations. Equation (5) is used to constrain operations so that they do not overlap. Equation (6) indicates an operation of a job could only be processed on one machine, that is, an operation cannot be divided, and it can only be processed on one machine from the beginning to completion.

The Hybrid Machine Learning and Population Knowledge Mining Method
The main process of the approach for MOJSSP (PAMOJSSP) is briefly illustrated in Figure 1. The process is started from the operations' attributes assignment. In this step, the attributes of each operation are deduced from the optimized objective functions, corresponding to the process information of each operation. These attributes' information is embedded into the individual class and then, under the calculation of metaheuristic algorithm, the Pareto frontier solutions presented as the optimal results among these non-dominated solutions are obtained. This series of Pareto frontier can be used as training data to obtain potential rules in these non-dominated solutions through data mining. After the acquisition of rules, an operation priority table is obtained by the rule correspondence and the ADSM. The rule-based initial population is eventually acquired by crossing the genes of the parent individual in the operation priority table. In the last step, the rule-based initial population combined with the hybrid metaheuristic algorithm obtains the new better Pareto frontier solutions for MOJSSP. The following part introduces the concrete implementation of the proposed method.

Attributes
On the basis of attribute-oriented induction [34], combined with the optimization objectives of this problem, five attributes are finally determined: priority, operation feature, processing time, remaining time, and due date. Each attribute is classified into two or more types and explained in the following subsections.

Priorities
Priorities mean the position of each operation in the final sequence code. From the benchmark of 10 × 10 job shop problem, 100 positions could be acquired and needed to be sequenced to optimize the makespan and total tardiness. Ten classes of priorities are divided, and each priority has ten positions. That is, we divide the operation sequence code into 10 equal parts from front to back, with the priority of 0, 1, 2, …, 9, respectively. The number 0 indicates the highest priority level, while number 9 indicates the lowest priority level.

Operation Feature
Operation feature means the position of each operation in its route process. Referring to Nasiri et al. [30], the first operation is classified as "first". The second and third operations are labeled with "secondary". The fourth, fifth and sixth operations are classified as "middle". The seventh, eighth, and ninth operations mean "later". Finally, the tenth is labeled with "last".

Processing Time
Processing time is the required processing time for each operation on its predetermined machine. For LA18, we classified the processing time into three equal part as "short", "middle", and "long". As illustrated in Figure 2, the processing time with less than 37 is classified as "short". The time between 37 and 67 is "middle" and the left time is labeled with "long".

Remaining Time
Remaining time refers to the accumulative processing time for the remaining operations to be processed after the current operation. As with the classification of processing time, its histogram is shown in Figure 3. The remaining time with less than 202 is defined as "short", the time between 202 and 402 is classified as "middle", and the time with more than 402 is labeled as "long".

Due Date
Due date is the time when a job must be delivered. For optimizing the total tardiness, it is a critical attribute which needs to be considered as an important input of data mining. Because the benchmark LA18 only provides the processing time information, here we add the due date for each job. According to the benchmark, the shortest makespan is 848. Therefore, we could set the due date of Jobs 0-9 as a random number from 1100 to 1290 and arrange them in order. The due date for Job 0 is the most urgent and the due dates of the remaining jobs increase in order, which means that the due dates for jobs become looser. Table 1 shows the values of due date attribute for jobs.

Data Preparation
First, the operation feature, processing time, remaining time, and due date of each operation in LA18 are matched one by one using the above methods. Table 2 lists some of its sample data. The difference of attributes can be obtained by the significant test (α = 0.05) [35]. The null hypothesis is that the significant difference of each column (each attribute) does not exist. The number of samples before matching is 100, while the p-value is 0.000, which is much less than 0.05. That is, the difference between columns is significant. After this, in the 9088 datasets obtained by 30 independent runs of NSGA-Ⅱ combined with simulated annealing, 6645 non-dominated solutions were obtained after removing the repeated solutions. Among them, 493 sets of Pareto frontier solutions were obtained, and 66 sets of solutions were finally selected for the computational complexity, which forms 6600 transactions in a relational database. Table 3 shows the sample data of the trained data.

Rule Mining
Rule mining refers to mining association rules between attribute set {operation feature, processing time, remaining time, due date} and attribute {priority}. The decision tree is an effective means of mining classification. Here, we can get 33 rules from the theory of entropy, which is shown in Appendix І (in the Supplementary Materials). However, the mined rules may have different priorities for the same attribute set. To discriminate this difference and enhance the richness of excellent populations, the weight of a priority class [12] in this paper is used to mine the rules behind the training data. The minimum value of the weight we set here is 0.01, which is the lower bound of the weight of the priority class. When the weight of a rule set under a priority class is less than this value, the rule set is considered not to belong to the priority class. The obtained 37 dominant rules are shown in Table 4.
The rules also could be expressed in the form of "if-then". For example, Rule 3 is given as: If (operation feature = "first", and processing time = "middle", and remaining time = "long", and due date = "slack") Then, (priority = "0", weight = "0.71"; priority = "1", weight = "0.28"; priority = "2", weight = "0.01") It should be mentioned that there may be a case where the sum of weights is not equal to 1, which is due to rounding and does not affect the results of rules.

Initial Population Generated Using ADSM
In this section, a rule-based initial population is finally obtained through the proposed ADSM. First, each operation obtains its possible position according to the mined rules. For example, from Table 2, attributes of Operation 00 are first, middle, long, and tight, which are consistent with the rule set with ID 3 in Table 4. It means that the priority weight of the first operation of the Job 0 is {priority = 0, weight = 0.98; priority = 1, weight = 0.02}. The highest priority level is initially defined as the possible position of the operation. Therefore, the initial priority of Operation 00 is 0. In Appendix ІІ (in the Supplementary Materials), these maximum values are labeled and all other priority weights can be obtained by traversing all operations. Next, the operations in each priority are readjusted to satisfy the requirement of ten operations per priority. To gain this purpose, ADSM is proposed to overcome problems that may arise during the sorting process.
As shown in Figure 4, we define the marked weight as the priority class of each operation. In the process, there are three main situations. The first situation is that the priority order of operations may be incorrect because operations are processed sequentially. That is, the weight of the marked class of the previous operation may be higher than the next operation. While confronting this problem, these two marked weights of the operations should be compared. Through calculating the difference between the marked weight of each operation and the weight of the current row corresponding to the column of the other marked class, the smaller one should remove the mark from the current weight to the new weight of the current row where the column corresponds to the other marked weight. After adjusting the order of all operations, the next problem is to assign the operations between each adjacent priority group to meet the requirement of ten operations per priority. There are two situations, namely ">10" and "<10", which are the remaining two situations mentioned above. The second situation (the number of marked class in the current priority column >10) takes the deletion sorting method, that is removing the extra marked weights to the next priority column. The mark with the smallest difference is removed, which here refers to the difference between the weight of the last marked operation for each job in the current column and the adjacent weight in the next priority column. In the third situation (the number of marked class in the current priority column <10), the addition sorting method is used to add marked classes from the next priority column to the current priority column. The mark with the smallest difference is added first, and the standard is the difference between the weight of the first marked operation for each job in the next column and the corresponding weight in the current priority column. In the cases that the minimum difference is the same, the deletion sorting method first removes the marked operations with the higher encoded number, while the addition sorting method first provides the marked operations with the smaller encoded number.

Experiment
The proposed method was applied to a well-known multi-objective metaheuristic algorithm, which runs 10 times per calculation in Eclipse. To better explain this method, a rule-based initial population and a mixed initial population mined by this method were compared with the random population under different criteria based on the benchmark LA18.

Selected Algorithm
For the multi-objective scheduling problem, the most popular algorithm is NSGA-Ⅱ. However, it lacks a better local search capability. It means that, when solving some cases, its results usually fall into a locally optimal solution. To make experimental results more accurate and better, this study used the NSGA-Ⅱ combined with simulated annealing (SA). The following introduces the NSGA-Ⅱ and SA algorithm and the parameters used.

Nondominated sorting genetic algorithm Ⅱ (NSGA-Ⅱ)
Based on NSGA, developed by Deb et al. [36], NSGA-Ⅱ has the advantage of fast running speed, good robustness, and fast convergence. The encoding method is consistent with the above described. Each gene means an operation. Each chromosome/individual represents a solution or a sequence of all the operations waiting to be scheduled. NSGA-Ⅱ combines parent and offspring into a new population and then obtains the non-dominated solutions by fast non-dominated sorting. The parent is generated by disrupting genes in the chromosome. The offspring is generated by the crossover and mutation method. The parameters are shown in Table 5.

Simulated annealing (SA)
Simulated annealing is an approximate method based on Monte Carlo design. It was first introduced by Kirkpatrick et al. [37] to solve the optimization problem. SA accepts a solution that is worse than the current solution by the Metropolis criterion, thus it is possible to jump out of this local optimal solution to reach the global optimal solution. Table 6 lists the parameters used in SA.  The non-dominated solutions were obtained by NSGA-Ⅱ, and then SA was used to local search. Because of the multi-objective reasons, an objective function was randomly selected as the direction of a search, and the obtained non-dominated solutions were stored in the external archive. SA was applied to local search every 50 generations in this experiment.
To obtain better initial population rules, the data source of data mining must be the optimal or near-optimal solution set, so that accurate and reliable rules can be discovered. In Tables 7 and 8, we can see that the data obtained by NSGA-Ⅱ hybrid with SA are more effective, thus the results obtained by this algorithm were selected as the data source of knowledge mining. More comparisons between this algorithm and other algorithms can be seen in [38].  To verify the knowledge mining method, three different initial populations of NSGA-Ⅱ combined with simulated annealing were considered as follows: • Knowledge mining heuristic optimization method (rule) The initial population was generated using the proposed hybrid machine learning and population knowledge mining method.

•
Heuristic optimization method (random) The initial population of this method was completely randomized.

•
Hybrid population optimization method (mixed) Half of the initial population was generated by knowledge mining and the other half was randomly generated.

Performance Metrics
Three popular metrics were employed to evaluate the performance of the proposed method for MOJSSP: Relative Error (RE), Coverage of two sets (Cov) [39], and Spacing [40]. They can be expressed as follows:

Relative Error (RE)
Extending the method of Arroyo and Leung [41], we analyzed the performance of the two acquired objectives using the relative error (RE) metric. The formulation is as follows: where is the mean value of makespan or total tardiness and is the best makespan or total tardiness obtained among the iterations.

Coverage of Two Sets (Cov)
This indicator was used to measure the dominance between two sets of solutions. The definition is as follows: where X and Y are two non-dominated sets to be compared. x and y are subsets of X and Y, respectively. " ≺= " represents x dominates y or x = y. The value Cov(X, Y) = 1 means that all solutions in Y are dominated by or equal to solutions in X. The opposite, Cov(X, Y) = 0 represents that no solutions in Y are dominated by the set X. It should be mentioned that there is not necessarily a relationship between Cov(X, Y) and Cov(Y, X). Therefore, these two values should be calculated independently. Cov(X, Y) > Cov(Y, X) means the set X is better than the set Y.

Spacing
It measures the standard deviation of the minimum distance from each solution to other solutions. The smaller the Spacing value is, the more uniform the solution set is. The expression is as follows: where n is the number of solutions in the obtained Pareto frontier, is the minimum distance between the solution and its nearest solution, and ̅ is the average value of all .

Results and Discussion
In this section, we first compare the new rule method, mixed method, and traditional random method under different iteration times with the same initial population size. Then, under the same number of iterations and different initial population sizes, the performance of the three methods are compared. Finally, compared with Nasiri's rule-based initial population [30], the effectiveness of the ADSM is proved.
To analyze the values of the two objective functions and the average distribution from the optimal solution, the result of different iterations of 100 initial population is shown in Table 9, where F1 and F2 represent the values of makespan and total tardiness, respectively, and the bold contents are better values. The rules method represents the new approach for population knowledge mining applied to the hybrid NSGA-Ⅱ to solve MOJSSP. The traditional random method is the NSGA-Ⅱ combined with SA for MOJSSP. The mixed method is a combination of the two methods. Figure 5 shows the box distribution of its function values of which the values in the rule method and the mixed method are overall smaller than the traditional method. To some extent, the smaller are the values, the better is the performance. Similarly, the results of different iterations of 50 and 25 initial population are shown in Tables 10 and 11. Figures 6 and 7 are the box diagrams of the respective function values. From the results, we can find that the population mining method has more excellent results than the traditional random method because the rule population contains more excellent information than the traditional random population. Although the new method has no absolute advantage in the test of 25 initial population, it should be explained that, when the initial population size is too small, the performance of the method will inevitably be reduced. In general, there is no obvious advantage or disadvantage between rule method and mixed method in terms of function values, but, in term of the index RE, the rule method is superior to the mixed method, and both are superior to the random method. That is, the more rule-based individuals there are in the population, the more excellent information they contain, and thus it is easier to obtain excellent solutions under a limited number of iterations. Table 9. Values of makespan, total tardiness, and RE of 100 initial population.    Since this paper optimizes two objectives at the same time, the values of makespan or total tardiness cannot be separately compared. Thus, the efficiency of the method was determined by the performance metrics of the acquired Pareto solutions, especially the dominance metric (Cov). Here, the performance metrics of the three methods are compared under different initial population sizes with the same number of iterations, as shown in Tables 12-14, where IP represents the size of the initial population and the bold contents are the better values. In Table 12, the Cov value of the rule method is generally higher than that of the traditional method. The larger is the Cov value, the higher is the dominant ability. That is, in terms of dominance, the rule method is superior to the traditional random method in comparing the size of different initial populations. At the same time, the Spacing values of the new method is generally smaller than that of the traditional method. From Equation (9), it can be seen that the distribution of the rule method is also superior to the random method in considering the initial population size as a variable. Figure 8 shows the box diagram of the Cov of the rule method vs. the random method. From the box diagram, we can find more intuitively that the Pareto solution of the rule method is superior to that of the traditional random method. Similarly, from the results in Tables 13 and 14, we can conclude that the performance of rule method in Cov is also "rules > mixed > random". For Spacing performance, the mixed method is better than the rule method, which shows that the mixed method can more easily obtain a uniformly distributed solution set with a limited number of iterations. Overall, the Spacing metrics of both the rule and mixed methods are superior to the random method.   To better illustrate the proposed ADSM, here we compare it with the Nasiri's method. Compared with the operation sequence and knowledge constraints that are not considered, Table 15 shows that the proposed ADSM has an impact on the value of the single objective function. Although the advantage of the ADSM is not obvious in terms of single objective value, this is because the influence of local adjustment of operations on a single objective value is relatively small under multiple iterations. From the perspective of multi-objective, it significantly improves the performance metrics of dominance and distribution, as shown in Table 16. The results show that, under different iterations and initial population sizes, the Cov values and Spacing values of the ADSM are better, which demonstrates that it is more effective in solving multi-objective problems. On the other hand, the validity of the method ADSM is proved, which shows that the new proposed method can overcome the problem of sequencing the operations under knowledge mining as well as meet the requirement of high performance.

Conclusions
This paper presents a population knowledge mining approach combined with a hybrid machine learning algorithm to solve the MOJSSP with makespan and total tardiness criteria. The purpose of this method is to find a rule-based initial population from good production planning, which can produce better Pareto solutions than the traditional stochastic methods. Many optimal or nearoptimal solutions are created by the method of NSGA-Ⅱ integrated with SA on a benchmark of 10 × 10 job shop problem (LA18), and a training dataset with selected operations' attributes is therefore acquired. For mining knowledge, the weight of priority is used to extract 37 potential rules, which map the relationship between the attribute set and priority.
To form the rule-based initial population, a complicated ADSM is proposed to assign each operation a priority class. This sorting method overcomes the problem of misalignment of the sequence of operations and overcomes the problem of insufficient or overflowing operations owned by the priority class. After generating an initial population set based on the data mining and ADSM, we compared it with the NSGA-Ⅱ hybrid SA method of the mixed, random, and Nasiri's initial population. Considering the different numbers of iterations and the sizes of the initial population, a series of comparative and computational experiments were conducted. The results show that the new proposed method is better than the traditional method in terms of the magnitude and relative error of each objective function, and the Cov and Spacing performance index of Pareto solutions.
The future works of our study will mainly focus on other methods of initial population formation and the application of the proposed method to other cases. Besides, considering other multi-objective problems and how to choose a mining algorithm to better improve the effect of rules need further research.

Conflicts of Interest:
The authors declare no conflict of interest.