A Data-Independent Genetic Algorithm Framework for Fault-Type Classification and Remaining Useful Life Prediction

Featured Application: We propose a data-independent framework based on an ensemble of genetic algorithms for fault-type classification and remaining useful life prediction.

Abstract: Machinery diagnostics and prognostics usually involve the prediction of the fault-type and the remaining useful life (RUL) of a machine, respectively. The process of developing a data-driven diagnostics and prognostics method involves some fundamental subtasks such as data rebalancing, feature extraction, dimension reduction, and machine learning. In general, the best performing algorithm and the optimal hyper-parameters suitable for each subtask vary with the characteristics of the dataset. Therefore, it is challenging to develop a general diagnostic/prognostic framework that can automatically identify the best subtask algorithms and the optimal involved parameters for a given dataset. To resolve this problem, we propose a new framework based on an ensemble of genetic algorithms (GAs) that can be used for both the fault-type classification and RUL prediction. Each GA is combined with a specific machine-learning method and tries to select the best algorithm and optimize the involved parameter values in each subtask. In addition, our method constructs an ensemble of the various prediction models found by the GAs. Our method was compared to a traditional grid-search over three benchmark datasets of the fault-type classification and the RUL prediction problems and showed a significantly better performance than the latter. Taken together, our framework can be an effective approach for the fault-type and RUL prediction of various machinery systems.


Introduction
In a machinery system, diagnostics and prognostics usually involve two kinds of problems: a fault-type classification and a remaining useful life (RUL) prediction problem. In particular, prognostics has been applied to the field of machinery maintenance as it allows industries to better plan logistics, as well as save cost by conducting maintenance only when needed [1]. Various approaches have been proposed for each problem, and they can be divided into three categories: physics-based, data-driven, and hybrid approaches. Physics-based approaches incorporate prior system-specific knowledge from an expert, as shown in previous studies of fault-type classification [2][3][4][5] and RUL prediction [6][7][8][9] problems. Alternatively, data-driven approaches are based on statistical and machine-learning techniques using historical data (see, for example, studies on the commercial modular aero-propulsion system simulation (C-MAPSS) dataset). In our experiments, our method showed a significantly better and more robust performance than a traditional grid-search, with a practically acceptable running time.
The remainder of this paper is organized as follows. Section 2 introduces the backgrounds on the diagnostics and prognostics problem and the performance evaluation metrics. Section 3 explains the details of our approach and Section 4 presents the experimental results along with discussion. Section 5 includes the concluding remarks and suggestions for future work.

Backgrounds
In the fault-type classification and RUL prediction problems, data-preprocessing has a great impact on the performance of machine-learning methods, and it is usually implemented by rebalancing (in a classification problem), filtering, and dimension-reduction methods. These are introduced in the following subsections, and the last subsection explains the performance evaluation metrics used in this study.

Data Rebalancing Methods
In the practical fault-type classification problem, the proportion of samples of the minority class is often severely lower than that of the majority class, which restricts the learning performance. To resolve this problem, data rebalancing methods are commonly used. They can be classified into over-sampling methods, which add samples of the minority class, and under-sampling methods, which reduce samples of the majority class. In general, the resampling process is repeated until the balancing ratio, which is defined as the ratio of the number of samples in the minority class over that in the majority class, is equal to or greater than a threshold parameter value r (0 < r ≤ 1). In the following, we introduce some representative rebalancing methods that were included in our framework.

• Random duplication (RDUP)-A sample of the minority class is randomly selected and then duplicated.

• Synthetic minority over-sampling technique (SMOTE) [34]-A sample of the minority class is randomly selected, and the weighted mean of its nearest neighbors is used to produce a new sample of the minority class.
• Borderline-SMOTE [35]-Two SMOTE variants, borderline-SMOTE1 (BSMOTE1) and borderline-SMOTE2 (BSMOTE2), were further developed. They are the same as SMOTE except that a new sample of the minority class is produced near the borderline between classes. In addition, BSMOTE1 chooses the nearest neighbor from only the minority class, whereas BSMOTE2 does so from any class.
• Support vector machine (SVM)-SMOTE (SSMOTE) [36,37]-Similar to the borderline-SMOTE methods, the new minority-class sample is produced near the borderline, but the borderline is determined by support vector machine classifiers.
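As an illustration of the simplest of these strategies, the following sketch (function names are ours, not from the paper) duplicates randomly chosen minority-class samples until the balancing ratio reaches the threshold r:

```python
import random

def rebalance_rdup(X, y, minority, majority, r, seed=0):
    """Random duplication (RDUP): duplicate randomly chosen minority-class
    samples until n_minority / n_majority >= r (0 < r <= 1)."""
    rng = random.Random(seed)
    X, y = list(X), list(y)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    n_majority = sum(1 for label in y if label == majority)
    while len(minority_idx) / n_majority < r:
        i = rng.choice(minority_idx)   # pick a minority sample at random
        X.append(X[i])                 # duplicate it
        y.append(minority)
        minority_idx.append(len(y) - 1)
    return X, y
```

Over-sampling variants such as SMOTE differ only in how the new sample is synthesized (a weighted mean of nearest neighbors instead of an exact copy).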

Under-Sampling Methods
Under-sampling methods eliminate samples of the majority class. This may cause a loss of information, which has made under-sampling less popular than over-sampling, as discussed by Batista et al. [38].

• Random removal (RREM)-A sample of the majority class is randomly selected for removal.

• Neighborhood cleaning rule (NCL) [39]-A sample of the majority class is selected for removal by Wilson's edited nearest neighbor rule [40] or a simple 3-nearest-neighbors search [36].

Filtering Methods
A filtering method is employed to remove noise from an original signal, and we herein introduce five well-known filtering methods. In the following, let f_t be the value of the feature f at time t.

• Simple moving average (SMA)-SMA is the unweighted average of values over the past time points as follows:
SMA(f)_t = (1/n) (f_t + f_{t−1} + ... + f_{t−n+1}),
where n is the number of past time-points.

• Central moving average (CMA)-SMA causes a shift in a trend because it considers only the past samples. On the other hand, CMA is the unweighted average of values over both the past and future time points as follows:
CMA(f)_t = (1/n) (f_{t−(n−1)/2} + ... + f_t + ... + f_{t+(n−1)/2}),
where n is an odd number specifying the number of time points to be averaged.

• Exponential moving average (EMA)-EMA, which is also known as an exponentially weighted moving average (EWMA), is a type of infinite impulse response filter with an exponentially decreasing weighting factor. The EMA of a time-series of the feature f is recursively calculated as follows:
EMA(f)_t = α · EMA(f)_{t−1} + (1 − α) · f_t,
where, given the total number of observations N, α = e^{−1/N} is a constant factor.
• Exponential smoothing (ES)-Similar to EMA, ES is another weighted recursive combination of signals with a constant weighting factor α as follows:
ES(f)_t = α · f_t + (1 − α) · ES(f)_{t−1}.
• Linear Fourier smoothing (LFS)-LFS is based on the Fourier transform, which decomposes a signal into its frequency components. By suppressing the high-frequency components, one can achieve a denoising effect:
LFS(f) = F^{−1}(χ_{[0,λ]} · F(f)),
where F(·) and F^{−1}(·) denote the forward and inverse Fourier transform, respectively, and χ_A is the characteristic function of the set A (λ is the cut-off frequency parameter). We used the standard fast Fourier transform algorithm to compute the one-dimensional discrete Fourier transform of a real-valued feature f.
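A minimal NumPy sketch of three of these filters (function names are ours, for illustration; SMA uses a shortened window at the start of the series):

```python
import numpy as np

def sma(f, n):
    """Simple moving average over the past n points (shorter window at the start)."""
    return np.array([f[max(0, t - n + 1):t + 1].mean() for t in range(len(f))])

def ema(f, alpha):
    """Exponential moving average: s_t = alpha * s_{t-1} + (1 - alpha) * f_t."""
    s = np.empty(len(f))
    s[0] = f[0]
    for t in range(1, len(f)):
        s[t] = alpha * s[t - 1] + (1 - alpha) * f[t]
    return s

def lfs(f, lam):
    """Linear Fourier smoothing: zero all frequency components at or above
    the cut-off index lam, then transform back."""
    F = np.fft.rfft(f)
    F[lam:] = 0.0            # suppress high-frequency components
    return np.fft.irfft(F, n=len(f))
```

Each function maps a 1-D signal to a smoothed signal of the same length, so the filters can be chained or applied per feature column.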

Dimensionality Reduction Methods
A reduction method is used to reduce the p-dimensional input space into a lower k-dimensional feature space (k < p).

• Principal component analysis (PCA) [41][42][43]-PCA extracts k principal components by using a linear transformation of the singular value decomposition (SVD) to maintain most of the variability in the input data.
• Latent semantic analysis (LSA) [44]-Contrary to PCA, LSA performs the linear dimensionality reduction by means of the truncated SVD.

• Feature agglomeration (FAG) [45]-FAG uses Ward hierarchical clustering, which groups features that look very similar to each other. Specifically, it recursively merges a pair of features in a way that increases the total within-cluster variance as little as possible. The recursion stops when the remaining number of features is reduced to k.
• Gaussian random projection (GRP) [41]-GRP projects the high-dimensional input space onto a lower dimensional subspace using a random matrix whose components are drawn from the normal distribution N(0, 1/k).

• Sparse random projection (SRP) [46]-SRP reduces the dimensionality by projecting the original input space using a sparsely populated random matrix introduced in [47]. The sparse random matrix is an alternative to a dense Gaussian random projection matrix that guarantees a similar embedding quality while saving computational cost.
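As a sketch of the Gaussian variant, reducing p input columns to k amounts to one matrix multiplication with entries drawn from N(0, 1/k), i.e., standard deviation 1/sqrt(k) (an illustration of ours, not the paper's implementation):

```python
import numpy as np

def gaussian_random_projection(X, k, seed=0):
    """Project the rows of an (n, p) matrix X onto k dimensions using a random
    matrix with i.i.d. entries from N(0, 1/k) (variance 1/k)."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    R = rng.normal(loc=0.0, scale=1.0 / np.sqrt(k), size=(p, k))
    return X @ R
```

The sparse variant replaces R with a mostly zero matrix, which makes the product cheaper while giving a similar embedding quality.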

Performance Evaluation Metrics
In this paper, we used the F1-score and the mean-squared error to evaluate the performance of the fault-type classification and the RUL prediction, respectively.

F1-Score
For a classification task, the precision and the recall with respect to a given class c are defined as Precision_c = TP_c / (TP_c + FP_c) and Recall_c = TP_c / (TP_c + FN_c), respectively, where TP, FP, and FN denote true positives, false positives, and false negatives, respectively. The macro-averaged F1-score is the average of the harmonic means of precision and recall of each class, as follows:
F1-macro = (1/|C|) Σ_{c∈C} 2 · Precision_c · Recall_c / (Precision_c + Recall_c),
where C is the set of all classes.
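These definitions translate directly into code; a small self-contained sketch (ours, for illustration):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: mean over classes of the harmonic mean of
    per-class precision and recall."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)
```

Because every class contributes equally to the average, the macro-F1 is sensitive to the minority classes, which is why it suits the imbalanced fault-type datasets.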

Mean Squared Error (MSE)
MSE is a general performance measure used in RUL prediction problems. It is defined as follows:
MSE = (1/N) Σ_{i=1}^{N} (RUL_i − R̂UL_i)²,
where RUL_i and R̂UL_i are the observed and the predicted RUL values of the i-th sample among a total of N samples, respectively.

The Proposed Method
In this work, we propose a novel problem-independent framework for both the fault-type classification and the RUL prediction based on a GA. As we mentioned, the GA was employed to select the close-to-optimal set of data-processing algorithms and optimize the involved parameters in a robust way for a given dataset. As shown in Figure 1a, we first outlined the general process of the data-driven diagnostics and prognostics, which consists of four subtasks: data rebalancing (for classification problems), feature extraction, feature reduction, and learning. We did not explicitly include a feature selection in our framework, although it is a frequently used technique [48]. In fact, an implicit feature selection was already employed in the feature extraction stage because the inclusion and exclusion of created features are dynamically determined by a chromosome in the genetic algorithm (see Section 3.1.1 for more details). As we explained in Section 2, a variety of algorithms in each subtask can be considered, and the diagnostics/prognostics performance is likely to be highly dependent on the selected algorithm and the specified parameter values. In this regard, it is necessary to select the optimal data-preprocessing algorithms and specify the optimal values of the parameters involved in those algorithms. Hence, we propose a data-independent diagnostic/prognostic genetic algorithm (DPGA) to resolve this. In addition, our DPGA can be easily extended to generate an ensemble result [49,50] because it runs along with various learning methods, as shown in Figure 1b. We note that four representative machine-learning methods, namely the multi-layer perceptron network (MLP), k-nearest neighbor (kNN), support vector machine (SVM), and random forest (RF), were employed in this study.
As shown in Figure 1b, our DPGA runs to search the optimal data-processing algorithms for data rebalancing, feature extraction, and feature reduction subtasks and the relevant parameter values over the training dataset for each learning method in the learning phase. Then, a set of best solutions found by each DPGA are integrated into an ensemble to predict the fault-type or the RUL value over the test dataset in the prediction phase. In the following subsections, we explain the details of DPGA and the employed ensemble approach.

DPGA
DPGA is a steady-state genetic algorithm, and the overall framework is depicted in Figure 2. It first creates a random initial population of solutions, P, and evaluates the fitness of each solution. It selects two parent solutions s 1 and s 2 among the population according to the fitness values and generates two offspring solutions x 1 and x 2 by a crossover operation. These new solutions can be mutated with a low probability. After evaluating the fitness values of the offspring solutions, the GA replaces some old solutions in the population with them. This process is repeated until a stopping condition is satisfied. The specified values of parameters of the GA are summarized in Table 1. In the following subsections, we introduce the details of each part in DPGA.

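The steady-state loop can be sketched as follows (a simplified illustration with hypothetical callback names, not the DPGA code itself; it assumes strictly positive fitness values, as DPGA does):

```python
import random

def steady_state_ga(init, fitness, crossover, mutate,
                    pop_size=20, patience=50, seed=0):
    """Steady-state GA: each generation selects two parents by roulette wheel,
    creates two offspring via crossover and mutation, and replaces worse
    solutions; stops after `patience` generations without improving the best."""
    rng = random.Random(seed)
    population = [init(rng) for _ in range(pop_size)]
    fits = [fitness(s) for s in population]
    best, stall = max(fits), 0
    while stall < patience:
        # fitness-proportional (roulette wheel) parent selection
        parent1, parent2 = rng.choices(population, weights=fits, k=2)
        for child in crossover(parent1, parent2, rng):
            child = mutate(child, rng)
            f = fitness(child)
            worst = min(range(pop_size), key=fits.__getitem__)
            if f > fits[worst]:                 # replace the worst solution
                population[worst], fits[worst] = child, f
        stall = 0 if max(fits) > best else stall + 1
        best = max(best, max(fits))
    return population[max(range(pop_size), key=fits.__getitem__)]
```

Unlike a generational GA, only the worst individuals are replaced each step, so good solutions are never lost between generations.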

Parameters and values (Table 1):
• The number of solutions in a population (|P|): 20
• Stopping-patience (T): 50
• The crossover probability (p_c): 0.5
• The mutation probability (p_m): 0.1
• The algorithm-change mutation probability: 0.3
• The parameter-change mutation probability: 0.7

Chromosome Representation
In a GA, a solution is represented by a chromosome. Table 2 shows a chromosome in DPGA, which is implemented by a one-dimensional list consisting of categorical and continuous variables to represent algorithm-selection and parameter-specification. Specifically, it is composed of four parts corresponding to data rebalancing, feature extraction, feature reduction, and learning subtasks as follows:


For the learning-method part, the fields of a chromosome (Table 2) take values in the following ranges:
• In case of kNN: the number of neighbors n_NN ∈ [2, 10] and the type of weight function wf_NN ∈ {uniform, distance}
• In case of SVM: the penalty parameter C ∈ [1, 8]
• In case of RF: the number of trees n_tree ∈ [2, 10]

• Data rebalancing-This part is only applicable in the fault-type classification. As explained in Section 2.1, the 'DR algo.' field in a chromosome indicates one among five over-sampling and two under-sampling algorithms, or none of them. In addition, the 'DR para.' field represents the threshold parameter of the rebalancing ratio (see Section 2.1 for details).
• Feature extraction-To generate latent features, our GA employed two groups of approaches, filtering-based (available only for time-series datasets, see Section 2.2) and reduction-based (see Section 2.3) approaches. The 'FFE algo.' field represents the subset of five filtering-based feature extraction algorithms (SMA, CMA, EMA, ES, and LFS). In addition, the 'FFE para.' field includes the corresponding parameters that are necessary to run the selected feature extraction algorithms (for example, the number of time points in SMA or CMA). Similar to filtering-based feature extraction, the 'RFE algo.' and 'RFE para.' fields represent the combinatorial selection among five reduction-based feature extraction algorithms (PCA, LSA, FAG, GRP, and SRP) and the corresponding parameters (for example, the number of principal components), respectively. We note that if none are selected in 'FFE algo.' and 'RFE algo.,' only the original variables are used as input variables in the learning algorithm. In addition, when the 'PCA flag' is turned on, the set of highest-order principal components that account for more than a specified percentage (the 'PCA para.' field) of the data variability is selected as the final input variables to be fed into a learning method.

• Learning method-As explained before, we employed four machine-learning algorithms in this study. Therefore, the 'LM para.' field represents the corresponding parameters that are necessary to run the learning method as follows:
-MLP: The MLP of a single hidden layer is assumed and n HN denotes the number of hidden nodes. In addition, the type of the activation function (act f HL ) is selected between the hyperbolic tan function ("tanh") and the rectified linear unit function ("relu"). The solver for weight optimization (sv WO ) is also selected between an optimizer in the family of quasi-Newton methods ("lbfgs") [51] and a stochastic gradient-based optimizer ("adam") [52].
-kNN: n NN denotes the number of nearest neighbors. In addition, the weight function (w f NN ) is selected between "uniform" and "distance." In the former, the neighbors are weighted equally, whereas the neighbors are weighted by the inverse of the distance to the query in the latter.
-SVM: C denotes the penalty parameter for the misclassification.
-RF: n tree denotes the number of trees in the forest.

Fitness Calculation
To evaluate a chromosome s, the F1-score and MSE measures (see Section 2.4 for details) are used for the fault-type classification and the RUL prediction, respectively, as follows:
fitness(s) = F1-macro(s) (fault-type classification),
fitness(s) = A − MSE(s) (RUL prediction),
where F1-macro(s) and MSE(s) are the results of the learning method using the algorithms and the parameter values encoded in s. In addition, A denotes a constant large enough to make the fitness a positive real value. Consequently, the higher the fitness value, the better the solution in both the fault-type classification and the RUL prediction problems. To avoid over-fitting, we used d-fold cross-validation in computing the fitness over the training data. More specifically, the whole training dataset was randomly divided into d disjoint subsets. Then, each subset was held out for evaluation while the rest (d − 1) of the subsets were used as the training data. For a more stable fitness evaluation, we repeated the cross-validation l times. Accordingly, the fitness of s is the average over d × l trials.
In this work, we set d to 5 and l to 3.
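The cross-validation scheme can be sketched as below; `train_and_score` is a hypothetical callback that fits the model on the training indices and returns the validation score for the held-out indices:

```python
import random

def cv_fitness(n_samples, train_and_score, d=5, l=3, seed=0):
    """Average score over l repetitions of d-fold cross-validation:
    each of the d disjoint folds is held out once per repetition,
    giving d * l trials in total."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    scores = []
    for _ in range(l):
        rng.shuffle(indices)
        folds = [indices[i::d] for i in range(d)]     # d disjoint folds
        for hold_out in folds:
            hold = set(hold_out)
            train = [i for i in indices if i not in hold]
            scores.append(train_and_score(train, hold_out))
    return sum(scores) / len(scores)
```

Averaging over repeated shuffles reduces the variance of the fitness estimate, which keeps the GA's selection pressure from chasing a lucky split.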

Selection
To choose a parent solution from the population P, we employed the roulette wheel selection, where the selection probability of a chromosome x is proportional to its fitness value as follows:
P(x) = fitness(x) / Σ_{y∈P} fitness(y).
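A minimal sketch of this selection rule (assuming positive fitness values; the function name is ours):

```python
import random

def roulette_select(population, fitnesses, rng):
    """Roulette-wheel selection: a chromosome is chosen with probability
    proportional to its (positive) fitness."""
    pick = rng.uniform(0.0, sum(fitnesses))
    running = 0.0
    for chromosome, fit in zip(population, fitnesses):
        running += fit
        if pick <= running:
            return chromosome
    return population[-1]   # guard against floating-point round-off
```

Fitter chromosomes occupy a larger slice of the wheel, so they are drawn more often without the less fit ones being excluded entirely.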

Crossover
Two new offspring solutions are generated by a crossover with a probability p c , or they are duplicated from the parent solutions with a probability 1 − p c . The employed crossover is a block-wise uniform crossover, as shown in Figure 3. Specifically, there are five blocks ('DR,' 'FFE,' 'RFE,' 'PCA,' and 'LM'), all of which, except for the last, consist of 'algo. (or flag)' and 'para.' fields, as explained in Section 3.1.1. For each block, the first offspring chromosome inherits it from one of the two parent chromosomes chosen uniformly at random, and the second offspring chromosome inherits it from the remaining parent chromosome. For example, the first offspring inherited the DR, PCA, and LM blocks from the first parent, whereas the FFE and RFE blocks were inherited from the second parent in Figure 3.
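The block-wise uniform crossover can be sketched as follows, with a chromosome represented as a dict keyed by block name (a representation we assume for illustration):

```python
import random

BLOCKS = ("DR", "FFE", "RFE", "PCA", "LM")

def blockwise_uniform_crossover(parent1, parent2, rng):
    """For each block, offspring 1 inherits the whole block from a parent
    chosen uniformly at random; offspring 2 inherits it from the other."""
    child1, child2 = {}, {}
    for block in BLOCKS:
        if rng.random() < 0.5:
            child1[block], child2[block] = parent1[block], parent2[block]
        else:
            child1[block], child2[block] = parent2[block], parent1[block]
    return child1, child2
```

Inheriting whole blocks keeps an algorithm choice together with its parameter values, so crossover never pairs, say, a SMOTE choice with a parameter tuned for another algorithm.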

Mutation
The offspring chromosome created by the block-wise crossover is mutated with a small probability p m , whereas the offspring created by the duplication is surely mutated to create a new chromosome that is not identical to the parent chromosome. Only one among four blocks, 'DR,' 'FFE,' 'RFE,' and 'PCA,' in the offspring is randomly selected, and it is mutated in one of the following two ways:
• Algorithm-change mutation-The selected algorithm is changed. In other words, the current choice in the 'DR algo.', 'FFE algo.', 'RFE algo.', or 'PCA flag' field is replaced with an alternative uniformly at random.

• Parameter-change mutation-The parameter value specified for the corresponding algorithm is mutated. In other words, the 'DR para.,' 'FFE para.,' 'RFE para.,' or 'PCA para.' field is replaced with a new value.
In this work, the parameter-change mutation probability was set to a larger value (0.7) than the algorithm-change mutation probability (0.3) considering that the range of values in the former case is much wider than that in the latter case.
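A sketch of this two-way mutation; the chromosome layout below (a dict of blocks with 'algo', 'options', 'para', and 'range' entries) is our illustrative assumption, not the paper's data structure:

```python
import random

def mutate_block(chromosome, rng, p_algo=0.3):
    """Pick one of the 'DR', 'FFE', 'RFE', 'PCA' blocks at random, then apply
    an algorithm-change mutation with probability 0.3 or a parameter-change
    mutation with probability 0.7."""
    block = rng.choice(["DR", "FFE", "RFE", "PCA"])
    mutated = dict(chromosome)
    fields = dict(mutated[block])
    if rng.random() < p_algo:
        # algorithm-change: replace the current algorithm with an alternative
        alternatives = [a for a in fields["options"] if a != fields["algo"]]
        fields["algo"] = rng.choice(alternatives)
    else:
        # parameter-change: draw a new parameter value from the allowed range
        lo, hi = fields["range"]
        fields["para"] = rng.uniform(lo, hi)
    mutated[block] = fields
    return mutated
```

Weighting the parameter-change branch more heavily matches the paper's 0.7/0.3 setting, since the continuous parameter ranges are much larger than the handful of algorithm choices.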

Replacement and Stop Criterion
When the offspring solution is better than the worst solution in the population, the latter is replaced with the former. For an efficient stopping criterion, we set a patience parameter T. Our GA stops when the best solution in the population has not been improved during the past T consecutive generations.

Ensemble Methods
As shown in Figure 1b, the DPGA can produce many prediction models, which can constitute an ensemble of solutions. Herein, we employed the voting ensemble and the Kalman filter ensemble [53] for the fault-type classification and the RUL prediction, respectively. For the former case, we applied a soft voting rule to achieve the combined results of multiple optimal classifiers. The voting ensemble is based on the sums of the predicted probabilities from well-calibrated classifiers. The Kalman filter ensemble can provide a mechanism for fusing multiple model predictions over time for a stable and high prediction performance.
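The soft-voting rule itself is short; a sketch (ours) over per-classifier probability matrices:

```python
import numpy as np

def soft_vote(probability_matrices):
    """Soft voting: sum the predicted class probabilities of all classifiers
    and return, for each sample, the class with the largest total."""
    total = np.sum(probability_matrices, axis=0)   # shape (n_samples, n_classes)
    return np.argmax(total, axis=1)
```

Each element of `probability_matrices` is an (n_samples, n_classes) array produced by one calibrated classifier, so a confident classifier can outvote several uncertain ones.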

Results
To validate the performance of our method, we compared it to the traditional grid-search approaches over the following two fault-type classification benchmark datasets and one RUL prediction benchmark dataset.


Steel Plates Faults Dataset
This dataset is provided by the Semeion Research Center of Sciences of Communication (www.semeion.it). Each observation is classified into seven different types of steel plate faults, namely, Pastry, Z-Scratch, K-Scratch, Stains, Dirtiness, Bumps, and Other Faults [54,55]. The numbers of observations corresponding to each fault type are shown in Table 3. As shown in the table, the numbers of observations vary a lot from one category to another. The total number of observations is 1941, and each observation is made up of 27 features representing the geometric shape of the defect and its contour.

SECOM Dataset
The dataset provided by the UCI Machine Learning Repository (http://archive.ics.uci.edu/mL) is related to a semiconductor manufacturing process. The dataset consists of 1567 observations, and each observation is made up of 591 features representing manufacturing operations of a single semiconductor [56]. The data were collected from the continuous monitoring process using sensors and metrology equipment along the semiconductor manufacturing line. At the end of the manufacturing operation, functional testing was performed to ensure that the semiconductor meets the specification for which it was designed. If the result met the expectation, the semiconductor was classified as an accepted product; otherwise, it was rejected. There are only 104 rejected cases, whereas there are 1463 accepted cases. Due to its high imbalance ratio, it is difficult to get a high classification performance on this dataset.

NASA C-MAPSS Dataset
The NASA commercial modular aero-propulsion system simulation (C-MAPSS) dataset is generated by using a model-based simulation program [57,58]. It is further divided into four sub-datasets, as shown in Table 4. Each trajectory within the train and test trajectories is assumed to be the life-cycle of an aircraft gas turbine engine, and starts with different degrees of initial wear and manufacturing variation, which are unknown to the data analyzer. All engines operate in normal condition at the start, and then begin to degrade at some point. The degradation in the training set grows in magnitude until failure, while the degradation in the test set ends prior to failure. Thus, the main objective is to predict the correct RUL value for each engine in the test set. The data are arranged in an N-by-26 matrix where N corresponds to the number of data points in each dataset. Each row is a snapshot taken during a single operational cycle and includes 26 different features: Engine number, time step (in cycles), three operational settings, and 21 sensor measurements (temperature, pressure, fan/core speed, and so on). The three features of operational settings specify the flight condition or operational mode of an engine, which have a substantial effect on engine performance [53,59,60]. There is a single operational mode in FD001 and FD003 sub-datasets, whereas there are six operational modes in FD002 and FD004 sub-datasets. Therefore, the operational mode was included as a feature by using six real variables, each of which represents the number of cycles spent in the corresponding operational mode since the beginning of the series [53]. In addition, they were normalized as in [53].

Performance Comparisons between DPGA and Grid-Search Approaches
We compare the prediction performance of our method to the traditional exhaustive grid-search (EGS) approaches applied to MLP, kNN, SVM, and RF (we call them EGS-MLP, EGS-kNN, EGS-SVM, and EGS-RF, respectively). In an EGS approach, 5-fold cross-validation is conducted to find an optimal set of parameters. In addition, we further compared the performance of the ensemble of the prediction models of all EGS approaches (we call this EGS-E). As in our method, the voting ensemble and the Kalman filter ensemble were used for EGS-E for the fault-type classification and the RUL prediction, respectively. We first scatter-plotted the relation of the performance between the training and test sets for DPGA and the five EGS approaches (Figure 4). Unfortunately, a positive relation was not observed among the results of the EGS approaches in any figure. This implies that a better solution in the training set can show a worse performance over the test set. Therefore, it is not efficient to simply select the best grid-search approach based on the training set. Interestingly, our approach (DPGA) was the best over the test set, whereas it was not the best over the training set in any dataset. Figure 5 shows the result, where the Y-axis values are the average and the standard deviation of the F1-score or MSE values on the test dataset over 50 trials. As shown in the figure, our DPGA achieved significantly better results than the examined methods in all datasets of the fault-type classification and the RUL prediction problems (all p-values < 0.02). Specifically, the second-best methods were EGS-kNN, EGS-E, and EGS-MLP for the C-MAPSS, steel-plate, and SECOM datasets, respectively. This implies that the performance of a learning algorithm varies across datasets, but our method consistently outperformed the EGS approaches. In addition, we investigated the best solutions found by EGS (Table 5) and DPGA (Tables 6 and 7), and observed that they are very different from each other.
The best solutions found by the EGS approaches almost always included only an algorithm of the FFE (filter-based feature extraction) part in the RUL prediction and of the RD (rebalancing data) part in the fault-type classification. In other words, the RFE and the PCA parts were not useful in their search. On the other hand, the best solutions found by DPGA included valid algorithms in all of the RD, FFE, RFE, and PCA parts. Specifically, the RD (in fault-type classification), FFE (in RUL prediction), and PCA parts were effective in all best-found solutions. This means that DPGA efficiently searched a variety of combinations of all subtasks in the fault-type classification and the RUL prediction. Finally, we compare the running time between the approaches on a system with a four-core Intel® Core™ i7-6700 Processor 3.40 GHz and 16 GB of memory. As the execution time of the Kalman filter or voting ensemble is very small (less than 1 min), we compared the running time of DPGA to that of EGS-E only (Figure 6). Note that the running time is measured for the learning phase in Figure 1b. As shown in the figure, the running time of DPGA is even shorter than that of EGS-E for the small-sized dataset (steel plates faults). For the large datasets (FD001-FD004, and SECOM), the running time of the DPGA approaches was, at most, 1.9 times longer than that of the EGS-E ones, which is practically acceptable considering the performance improvement.

Conclusions
In this study, we proposed DPGA, a novel framework to predict the RUL and fault-types. It is a self-adaptive method that selects a close-to-optimal set of data-preprocessing algorithms and optimizes the involved parameters in each subtask of data rebalancing, feature extraction, feature reduction, and learning. Although DPGA used four machine-learning methods in this study, namely the multi-layer perceptron network, k-nearest neighbor, support vector machine, and random forest, it can easily be extended to incorporate other kinds of machine-learning methods. In addition, our method seems robust because it can generate an ensemble of prediction models. In the performance comparison of DPGA with the traditional grid-search framework over three benchmark datasets, the former showed significantly better accuracies than the latter in a comparable running time. This implies that our genetic search was efficient in solving large-scale diagnostics and prognostics problems. It was interesting that the best solutions found by DPGA involve many filtering- or reduction-based feature extraction algorithms to generate various feature variables. As shown in the results, a notable advantage is that DPGA can be applied to other machinery systems without a priori knowledge about the most suitable machine-learning method or feature-processing algorithm. In a future study, a parallel and distributed version of DPGA can be developed to reduce the execution time. It is also promising to further validate the usefulness of our approach by employing other kinds of machine-learning models, such as the recurrent neural network. Finally, it will be another interesting future study to design a more robust ensemble approach than the one employed in DPGA.

Acknowledgments: This work was supported by a National IT Industry Promotion Agency (NIPA) grant funded by the Korea government (MSIP) (S1106-16-1002, Development of smart RMS software for ship maintenance based fault predictive diagnostics).