A Genetic Algorithm Optimized RNN-LSTM Model for Remaining Useful Life Prediction of Turbofan Engine

Abstract: Understanding the remaining useful life (RUL) of equipment is crucial for optimal predictive maintenance (PdM), which addresses the equipment downtime of run-to-failure maintenance and the unnecessary maintenance checks of preventive maintenance. Both the feature extraction and the prediction algorithm play crucial roles in the performance of RUL prediction models. A benchmark dataset, namely the Turbofan Engine Degradation Simulation Dataset, was selected for performance analysis and evaluation. The proposed combination of complete ensemble empirical mode decomposition and wavelet packet transform for feature extraction reduced the average root-mean-square error (RMSE) by 5.14–27.15% compared with six other approaches. When it comes to the prediction algorithm, the output of an RUL prediction model may indicate that the equipment needs to be repaired or replaced within either a shorter or a longer period of time; incorporating this characteristic can enhance the performance of the RUL prediction model. In this paper, we propose an RUL prediction algorithm that combines a recurrent neural network (RNN) and long short-term memory (LSTM). The former takes advantage of short-term prediction whereas the latter manages better in long-term prediction. The weights combining RNN and LSTM were designed by the non-dominated sorting genetic algorithm II (NSGA-II). The model achieved an average RMSE of 17.2, improving the RMSE by 6.07–14.72% compared with the baseline models, stand-alone RNN and stand-alone LSTM. Compared with existing works, the RMSE improvement by the proposed work is 12.95–39.32%.


Introduction
Maintenance management has long been a core part of business management. It aims at maximizing functionality and minimizing breakdowns. Traditional run-to-failure maintenance and preventive maintenance have become infeasible for meeting the smart maintenance perspective [1]. Run-to-failure maintenance schedules maintenance only when the equipment malfunctions. Preventive maintenance schedules regular maintenance to examine the status of equipment. With the advent of the Internet of things (IoT) [2,3], sensors have been attached to equipment for continuous monitoring. Further data analytics via artificial intelligence (AI) techniques provide valuable insights for better maintenance management. Built on the solid foundation of IoT architecture, predictive maintenance (PdM) has started to replace traditional maintenance approaches. Different from traditional run-to-failure maintenance and preventive maintenance, predictive maintenance helps businesses plan the maintenance of equipment right before equipment failure; it is an optimum between preventive maintenance and run-to-failure maintenance, and it reduces the unnecessary maintenance checks of preventive maintenance. There are various success stories in the literature, summarized in the review articles [4,5].
The Airline Maintenance Cost Executive Commentary Edition 2019 reported that the annual maintenance, repair, and overhaul cost was 9% ($69 billion) of the total operational cost [6]. In recent years, many remaining useful life (RUL) prediction algorithms [7][8][9][10][11][12][13][14][15][16] have been proposed to estimate the time of failure of turbofan engines. Thus, optimal PdM can be scheduled to reduce the maintenance cost and avoid equipment downtime.
In Section 1.1, the methodologies and performance of existing works of RUL prediction algorithms for turbofan engine are summarized. The limitations of existing works and motivations of our work are followed in Section 1.2. The research contributions of this paper are explained in Section 1.3.

Related Works
RUL prediction algorithms can be categorized into shallow learning-based [7][8][9][10][11] and deep learning-based approaches [12][13][14][15][16]. Deep learning has become a preferred choice when the performance gain of the prediction model justifies the additional computational power. In the following, for fair and consistent comparison, all existing works considered are related to the RUL prediction of turbofan engines using the same benchmark dataset [17,18]. A brief summary of the methodology of related works [7][8][9][10][11][12][13][14][15][16] is presented. A detailed discussion and comparison between the proposed work and related works will be shared in Section 3.4.
Shallow learning-based approaches are summarized first. In Mosallam et al. [7], a hybrid discrete Bayesian filter and k-nearest neighbors approach was proposed. It achieved an average root-mean-square error (RMSE) of 27.57. Another work proposed a median-distance-to-the-k-nearest-neighbors-based transfer learning approach for feature extraction [8], which was then applied to a random forest regression model; results revealed an average RMSE of 26. Zhao et al. [9] proposed a back propagation neural network as a preliminary study of RUL prediction, which yielded an average RMSE of 42.6. An autoregressive integrated moving average-based support vector regression was proposed in Ordóñez et al. [10], with a genetic algorithm (GA) applied to fine-tune the parameters of the regression model; the performance had an average RMSE of 47.63. Apart from traditional machine learning algorithms, an innovative approach incorporating a Rao-Blackwellized particle filter, the kernel two-sample test, and maximum mean discrepancy was proposed by Cai et al. [11]. The average RMSE was 18.2.
Attention is now drawn to deep learning-based approaches. It can be seen from the literature that a significant portion of articles adopted long short-term memory (LSTM), as researchers appreciated its effectiveness in long-term prediction. A GA-optimized restricted Boltzmann machine-based two-layer LSTM model was proposed in Ellefsen et al. [12]; it obtained an average RMSE of 19.8. The Adam adaptive learning rate optimization algorithm with a single-layer Vanilla LSTM model achieved an average RMSE of 28.4 [13]. Likewise, the Adam adaptive learning rate optimization algorithm was applied to optimize the LSTM model in Wu et al. [14]; evaluation showed that the optimal performance (average RMSE of 19.1) was achieved with 5 layers and 100 neurons per layer. Other deep learning approaches were the autoencoder gated recurrent unit [15] and deep convolutional neural networks [16], which achieved average RMSEs of 20.07 and 19.9, respectively.
In general, RUL prediction models using deep learning approaches outperform shallow learning-based approaches, as the problem can be learnt more effectively via multi-layer architectures. It is agreed that shallow learning-based approaches offer low training time and light-weight models; however, computing power has become a less important concern in today's digital era, and a reduction in the average RMSE of the prediction model outweighs the computing power. In particular, the computing power required for RUL prediction of turbofan engines is not excessive compared with image-based applications.
To address these limitations, we have the following point-to-point considerations:
i A 10-fold cross-validation is adopted for performance evaluation of the RUL prediction model.
ii We combine a recurrent neural network (RNN) and LSTM, which take advantage of both short-term and long-term RUL predictions.
iii NSGA-II is adopted to optimally design the RNN-LSTM model.

Research Contributions
The research contributions of this paper are summarized as follows.
i The combined RNN-LSTM model takes advantage of both short-term and long-term conditions in the RUL prediction of turbofan engines. Results reveal that it reduces the RMSE by 6.07-14.72% compared with stand-alone RNN and stand-alone LSTM.
ii The combination of complete ensemble empirical mode decomposition (CEEMD) and wavelet packet transform (WPT) as a two-step decomposition for feature extraction captures both time and frequency information. It reduces the RMSE by 5.14-27.15% compared with CEEMD, EEMD, EMD, WPT, EEMD-WPT, and EMD-WPT.
iii Non-dominated sorting genetic algorithm II (NSGA-II) optimally designs the RNN-LSTM for optimal performance on short-term and long-term predictions.
iv The proposed NSGA-II optimized RNN-LSTM model outperforms related works by 12.95-39.32% in terms of RMSE.

Methodology of Proposed NSGA-II Optimized RNN-LSTM Model
This section is organized as follows. Figure 1 shows the system overview of the proposed NSGA-II optimized RNN-LSTM model via CEEMD-WPT for feature extraction. The feature extraction is firstly discussed. This is followed by the NSGA-II optimized RNN-LSTM model.

Feature Extraction
Features are extracted based on the combination of complete ensemble empirical mode decomposition (CEEMD) and wavelet packet transform (WPT), namely CEEMD-WPT. The rationale behind the combination is as follows.
Compared with empirical mode decomposition (EMD) and ensemble empirical mode decomposition (EEMD), CEEMD reduces the residual noise and thus achieves a smaller reconstruction error [19,20]; furthermore, the requirement of computational power is lowered. In general, CEEMD decomposes the nonlinear turbofan engine signals into intrinsic mode functions (IMFs) and a residual, capturing the frequency and temporal resolutions of the signals. The next step is to further decompose the IMFs into detail and approximation coefficients using WPT. It is noted that WPT takes advantage of retaining localization properties, smoothness, and orthogonality [21,22]. Therefore, the proposed hybrid CEEMD-WPT captures both time and frequency information; in particular, it is a two-step decomposition.
The mathematical formulations of merging traditional CEEMD and WPT as CEEMD-WPT are illustrated as follows.

The noise-masked realizations of the signal are generated as
x_i(t) = x(t) + σ_0 w_i(t), i = 1, . . . , I,
and the first IMF is obtained by averaging the first EMD mode over the I realizations:
IMF_1(t) = (1/I) Σ_{i=1}^{I} EMD_1(x_i(t)),
where x_i(t) is the temporary signal of the original signal x(t) masked by Gaussian noise w_i(t) with noise standard deviation σ_0, and EMD(·) denotes the fundamental EMD function. The j-th residual is r_j(t) = r_{j−1}(t) − IMF_j(t), and subsequent IMFs are extracted from the noise-masked residuals in the same manner.
Equations (5)-(7) are repeated until r_j(t) has only one extreme, which cannot be further decomposed.
The final residual can be calculated using:
R(t) = x(t) − Σ_{j=1}^{J} IMF_j(t),
where j = 1, . . . , J. The original signal can be reconstructed by rearranging Equation (8):
x(t) = Σ_{j=1}^{J} IMF_j(t) + R(t).
WPT further decomposes each of the IMF_j(t) of length L into the extended version IMF_j(t)_ext = [IMF_{j,0}, IMF_{j,1}, . . . , IMF_{j,L_x}] using the mother wavelet. Denote L_x as the length of the extended IMF_j(t), which is given by:
L_x = L + M − 1,
where M is the number of low-pass filter coefficients of the low-pass filter h_low = [h_low(0), h_low(1), . . . , h_low(M − 1)]. The approximation coefficients of WPT on IMF_j(t) with h_low are given by:
A_j(k) = Σ_{m=0}^{M−1} h_low(m) IMF_j(2k − m). (12)
For simplicity, we may rewrite Equation (12) as convolution with h_low followed by dyadic downsampling. Define the high-pass filter h_high of length M by the quadrature mirror relation:
h_high(m) = (−1)^m h_low(M − 1 − m).
The detail coefficients of WPT on IMF_j(t) with h_high are given by:
D_j(k) = Σ_{m=0}^{M−1} h_high(m) IMF_j(2k − m), (15)
which may similarly be rewritten as convolution with h_high followed by dyadic downsampling. In this article, we have chosen various types of wavelet for analysis, including the Haar wavelet, Daubechies wavelets (D2, D4, D6, D8, and D10), and Coiflets (C1, C2, C3, C4, and C5). Figure 2 shows an example of CEEMD-WPT which includes the plots for IMFs 1-6 and the residue, as output of CEEMD, as well as the approximation coefficients and detail coefficients, as output of WPT.
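As an illustration of the filter-bank form of Equations (12) and (15), the sketch below performs one level of wavelet packet decomposition with the Haar low-pass filter in plain NumPy. The function name `wpt_level` and the toy signal are illustrative assumptions, not from the paper; in practice a library such as PyWavelets would be used.

```python
import numpy as np

def wpt_level(signal, h_low):
    """One level of wavelet packet decomposition: filter with the
    low-pass/high-pass pair, then downsample by 2 (cf. Equations (12), (15))."""
    M = len(h_low)
    # Quadrature mirror relation: h_high(m) = (-1)^m * h_low(M - 1 - m)
    h_high = np.array([(-1) ** m * h_low[M - 1 - m] for m in range(M)])
    approx = np.convolve(signal, h_low)[1::2]   # approximation coefficients
    detail = np.convolve(signal, h_high)[1::2]  # detail coefficients
    return approx, detail

# Haar low-pass filter (M = 2)
haar_low = np.array([1.0, 1.0]) / np.sqrt(2.0)

x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])  # toy "IMF"
approx, detail = wpt_level(x, haar_low)
```

With the orthonormal Haar pair, the total coefficient energy equals the signal energy, which is a quick sanity check on the implementation.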
Inspired by Plaza et al. [23], who applied WPT to vibration signals, we computed statistical features, namely the Shannon entropy, kurtosis, skewness, peak-to-peak amplitude, standard deviation, and mean of the coefficients, for feature construction.
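The six statistics can be computed directly from each coefficient vector. The sketch below is a minimal NumPy version; the function name `wpt_features` is illustrative, and taking the Shannon entropy over the normalized energy distribution of the coefficients is a common choice but an assumption on our part, not necessarily the definition used in [23].

```python
import numpy as np

def wpt_features(coeffs):
    """Six statistics of a WPT coefficient vector for feature construction."""
    c = np.asarray(coeffs, dtype=float)
    mu, sigma = c.mean(), c.std()
    p = c ** 2 / np.sum(c ** 2)                      # normalized energy distribution
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))   # Shannon entropy
    return {
        "entropy": entropy,
        "kurtosis": np.mean((c - mu) ** 4) / sigma ** 4,
        "skewness": np.mean((c - mu) ** 3) / sigma ** 3,
        "p2p": c.max() - c.min(),                    # peak-to-peak amplitude
        "std": sigma,
        "mean": mu,
    }

feats = wpt_features([1.0, -2.0, 3.0, -1.0, 2.0, 0.5])
```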

NSGA-II Optimized RNN-LSTM Model
The stand-alone RNN and stand-alone LSTM models are first trained independently. This is followed by optimally designing the RNN-LSTM model via the introduction of weighting factors, which are solved by NSGA-II. Figure 3 shows the architectures of the proposed RNN and LSTM models, which are illustrated in Sections 2.2.1 and 2.2.2, respectively. NSGA-II is applied to optimally design the RNN-LSTM model in Section 2.2.3.

Formulation of RNN
The architecture of an RNN relies on previous information (t − 1) to generate the output at the current time (t). A standard three-layer Elman network is employed. The input is fed forward to the hidden layer, and a context unit retains the previous state of the hidden unit. The formulation is given by:
h_t = ϕ_h(U_in x_t + V_in h_{t−1} + b_h),
y_t = ϕ_y(W_out h_t + b_y),
where h_{t−1} and h_t are the vectors for the hidden layer at the previous and current time, respectively; ϕ_h and ϕ_y are the activation functions for the hidden layer and the output layer, respectively; U_in is the weight matrix between the input and the hidden layer; V_in is the weight matrix between the hidden layers; b_h and b_y are the bias vectors of the hidden layer and the output layer; and W_out is the weight matrix between the hidden layer and the output layer. Traditional activation functions such as sigmoid, tanh, and rectified linear unit (ReLU) may suffer from slow convergence, so other nonlinear functions, the power-sigmoid and bipolar-sigmoid activation functions, have been utilized for RNN implementation [24,25]. The formulations are defined as:
• Bipolar-sigmoid activation function: ϕ(u) = (1 − e^{−ξu})/(1 + e^{−ξu}), with design parameter ξ > 0;
• Power-sigmoid activation function: ϕ(u) = u^p for |u| ≥ 1, and ϕ(u) = ((1 + e^{−ξ})/(1 − e^{−ξ})) · ((1 − e^{−ξu})/(1 + e^{−ξu})) for |u| < 1, with ξ > 0 and odd integer p ≥ 3.
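For concreteness, a minimal NumPy sketch of the Elman forward pass and the bipolar-sigmoid activation follows; the layer sizes, function names, the default ξ, and the choice of tanh for the hidden layer are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def bipolar_sigmoid(u, xi=4.0):
    """Bipolar-sigmoid activation with design parameter xi > 0."""
    return (1.0 - np.exp(-xi * u)) / (1.0 + np.exp(-xi * u))

def elman_forward(xs, U_in, V_in, W_out, b_h, b_y, phi_h=np.tanh):
    """Forward pass of a three-layer Elman RNN: the context unit feeds
    h_{t-1} back into the hidden layer at every step."""
    h = np.zeros(U_in.shape[0])                # initial hidden/context state
    outputs = []
    for x in xs:
        h = phi_h(U_in @ x + V_in @ h + b_h)   # hidden-state update
        outputs.append(W_out @ h + b_y)        # linear output layer
    return np.array(outputs)

rng = np.random.default_rng(0)
U_in = 0.1 * rng.standard_normal((4, 2))
V_in = 0.1 * rng.standard_normal((4, 4))
W_out = 0.1 * rng.standard_normal((1, 4))
ys = elman_forward(rng.standard_normal((5, 2)), U_in, V_in, W_out,
                   np.zeros(4), np.zeros(1))
```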

Formulation of LSTM
The LSTM network is well suited to time-series data because of its ability to map between input and output sequences with contextual information. The workflows of the forget gate, the input gate, and the output gate of the LSTM network are summarized as follows. Define W_f, W_in, W_out, and W_c as the weight matrices of the forget gate, the input gate, the output gate, and the memory cell, respectively. The corresponding bias vectors are b_f, b_in, b_out, and b_c, respectively.

• Forget gate: the forget gate f_t takes the latest input x_t and the previous output of the memory block h_{t−1}. Its activation function ϕ_a, chosen to be the logistic sigmoid as common practice, determines how much information is retained from the previous cell state:
f_t = ϕ_a(W_f [h_{t−1}, x_t] + b_f).
• Input gate: the information flowing into the cell is controlled by the input gate i_t:
i_t = ϕ_a(W_in [h_{t−1}, x_t] + b_in).
• Output gate: the output of the memory cell is regulated by the output gate o_t:
o_t = ϕ_a(W_out [h_{t−1}, x_t] + b_out).
• Memory cell: a tanh layer creates a vector of new candidate values C̃_t that could be added to the state:
C̃_t = tanh(W_c [h_{t−1}, x_t] + b_c).
The state of the old memory cell C_{t−1} is updated to the new memory cell C_t:
C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t, h_t = o_t ⊙ tanh(C_t).
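The gate workflow above can be sketched as a single NumPy step; the dictionary-based weight layout and the dimensions are illustrative assumptions, not the trained model.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def lstm_cell(x_t, h_prev, C_prev, W, b):
    """One LSTM step; W and b are dicts keyed by 'f', 'in', 'out', 'c',
    and each weight matrix acts on the concatenation [h_prev, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])      # forget gate
    i_t = sigmoid(W["in"] @ z + b["in"])    # input gate
    o_t = sigmoid(W["out"] @ z + b["out"])  # output gate
    C_tilde = np.tanh(W["c"] @ z + b["c"])  # candidate memory values
    C_t = f_t * C_prev + i_t * C_tilde      # memory-cell update
    h_t = o_t * np.tanh(C_t)                # block output
    return h_t, C_t

rng = np.random.default_rng(1)
n_h, n_x = 3, 2
W = {k: 0.1 * rng.standard_normal((n_h, n_h + n_x)) for k in ("f", "in", "out", "c")}
b = {k: np.zeros(n_h) for k in ("f", "in", "out", "c")}
h_t, C_t = lstm_cell(rng.standard_normal(n_x), np.zeros(n_h), np.zeros(n_h), W, b)
```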

Optimal Design of RNN-LSTM Using NSGA-II
Each RNN model and LSTM model produces predictions, namely ŷ_RNN and ŷ_LSTM, respectively. We optimally join the models as RNN-LSTM by introducing weighting factors ω_RNN and ω_LSTM. The predictions of RNN-LSTM become:
ŷ_RNN-LSTM = ω_RNN ŷ_RNN + ω_LSTM ŷ_LSTM.
NSGA-II is adopted to optimally design ω_RNN and ω_LSTM by setting the objective function F1 as minimization of the RMSE and F2 as minimization of the mean absolute error (MAE). The objective functions are defined in a multidimensional space, also named the objective space. Crossover is the operation that combines the genetic material of two or more solutions. In nature most species have two parents, but in GA crossover can be extended to more than two parents. N-point crossover is popular for bit-string representations: two solutions are split at n positions and alternately assembled into new ones. For instance, assume one-point crossover between 0011100101 and 1111010111. Randomly selecting a position (say 4) gives the offspring candidate solutions 0011-010111 and 1111-100101. For continuous representations, numerical operations are utilized to define crossover operators. Arithmetic crossover is one of the popular operations; it computes the arithmetic mean of all parental solutions component-wise. For example, the two parents (3, 8, 5) and (2, 6, 4) generate the offspring (2.5, 7, 4.5).
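The two crossover operators described above can be sketched as follows; the function names are illustrative, and the 10-bit parents are chosen to mirror the one-point example in the text.

```python
import numpy as np

def one_point_crossover(p1, p2, point):
    """Split two equal-length bit strings at `point` and swap the tails."""
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def arithmetic_crossover(p1, p2):
    """Component-wise arithmetic mean of the parental solutions."""
    return (np.asarray(p1, dtype=float) + np.asarray(p2, dtype=float)) / 2.0

c1, c2 = one_point_crossover("0011100101", "1111010111", 4)
child = arithmetic_crossover((3, 8, 5), (2, 6, 4))
```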
The second operation of GA is mutation, which applies random changes that disturb the solution. The strength of the disturbance is quantified by the mutation rate (bit-string representation) or the step size (continuous representation). There are three design principles for mutation operators. The first is reachability: every point in the solution space must be reachable from any other point. Adding constraints to the optimization problem can reduce reachability because the solution space shrinks to a feasible subset. Second, the unbiasedness principle prohibits the mutation from drifting the search in any particular direction in unconstrained solution spaces; in constrained solution spaces, bias may be allowed when advantageous. Scalability is the third design principle: each mutation operator should be adaptable, offering a degree of freedom in its strength. A probability distribution is usually adopted for mutation operators. Consider Gaussian mutation using the Gaussian distribution: the standard deviation can scale the samples across the entire solution space.
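A minimal sketch of Gaussian mutation for the continuous representation follows; the optional clipping to bounds reflects the remark that constraints restrict the solution space to a feasible subset, and the specific bounds and step size are illustrative assumptions.

```python
import numpy as np

def gaussian_mutation(solution, step_size, rng, bounds=None):
    """Disturb a continuous solution with zero-mean Gaussian noise; the
    step size (standard deviation) scales the disturbance strength."""
    sol = np.asarray(solution, dtype=float)
    mutant = sol + rng.normal(0.0, step_size, size=sol.shape)
    if bounds is not None:                  # keep the mutant feasible
        mutant = np.clip(mutant, bounds[0], bounds[1])
    return mutant

rng = np.random.default_rng(42)
m = gaussian_mutation([0.5, 0.5], step_size=0.1, rng=rng, bounds=(0.0, 1.0))
```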
The quality of the phenotype of a solution is governed by the fitness function, whose design is an essential component of the algorithm. In a multi-objective optimization problem, the fitness contributions of the single objectives are combined, and the fitness values of the individuals are ranked based on the values of the objective functions. The rationale behind minimizing both RMSE and MAE is that minimizing RMSE corresponds to predicting the mean whereas minimizing MAE corresponds to predicting the median [26]. NSGA-II is a well-known optimization algorithm which improves NSGA by introducing an elite strategy, crowding degree, and a fast non-dominated sorting technique [27][28][29]. The best offspring solutions are selected as parents of the new parental population, which aims at converging toward optimal solutions. The selection process is based on the fitness values in the population; as we are handling a minimization problem, low fitness values are preferred.
In general, the multi-objective optimization problem aims at obtaining trade-off optimal solutions. There exist many possible solutions of ω_RNN and ω_LSTM with corresponding predicted values ŷ_RNN-LSTM. The design of the weightings determines how the RUL prediction model exploits the advantages of RNN and LSTM for short-term and long-term RUL prediction. Obtaining the output, i.e., the optimal set of ω_RNN and ω_LSTM, requires high computing power; therefore, a trade-off between the computing power and the convergence of the RUL prediction model is expected.
Within the objective space, a Pareto optimal solution is a solution that cannot be improved in one objective without worsening another, and the Pareto front is defined as the set of Pareto optimal solutions. Attributed to the errors of stochastic selection in a finite population, a small group of the Pareto optimal solutions (but not all of them) determines the convergence of the population.
The algorithm of NSGA-II can be found in Algorithm 1.

Algorithm 1. NSGA-II for the optimal design of RNN-LSTM.
1: Initialize the population;
2: Allocate non-dominated ranks for the individuals;
3: while generation g <= max_generation do
4: Perform the operations including selection, crossover, and mutation;
5: Merge the population (initial population and offspring) and allocate non-dominated ranks for the individuals in the merged population;
6: Follow the truncation mechanism that retains non-dominated solutions with higher crowding distance, where the maximum number of solutions is equal to the population size;
7: Allocate non-dominated ranks for the individuals;
8: Extract the Pareto-optimal front;
9: Determine the optimal solution based on the values of the objective functions;
10: g = g + 1;
11: end while
12: Model ← Pareto optimal solutions
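A minimal sketch of the non-dominated ranking used repeatedly in Algorithm 1 is shown below for a minimization problem; this is the straightforward O(n²)-per-front version rather than the fast non-dominated sorting of [27], and the function name is illustrative.

```python
import numpy as np

def non_dominated_ranks(F):
    """Assign non-dominated ranks (0 = Pareto front) for minimization;
    F is an (n_solutions, n_objectives) array of objective values."""
    F = np.asarray(F, dtype=float)
    remaining = set(range(len(F)))
    ranks = np.full(len(F), -1)
    rank = 0
    while remaining:
        # A solution is in the current front if nothing left dominates it.
        front = [i for i in remaining
                 if not any(np.all(F[j] <= F[i]) and np.any(F[j] < F[i])
                            for j in remaining if j != i)]
        for i in front:
            ranks[i] = rank
        remaining -= set(front)
        rank += 1
    return ranks

# Objectives (F1, F2) = (RMSE, MAE): the first two points are mutually
# non-dominated; the third is dominated by both.
ranks = non_dominated_ranks([[1.0, 4.0], [2.0, 3.0], [3.0, 5.0]])
```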

Results
The benchmark dataset is first presented for the performance evaluation of the RUL prediction model. Analysis was conducted based on (i) a comparison between stand-alone RNN, stand-alone LSTM, and the proposed NSGA-II optimized RNN-LSTM; (ii) a comparison of the feature extraction approaches EMD, EEMD, CEEMD, WPT, EMD-WPT, and EEMD-WPT with the proposed CEEMD-WPT; and (iii) a comparison between the proposed NSGA-II optimized RNN-LSTM and existing works.

Commercial Modular Aero-Propulsion System Simulation Turbofan Degradation (C-MAPSS-TD) Dataset
In the field of RUL prediction of turbofan engines, the Commercial Modular Aero-Propulsion System Simulation Turbofan Degradation (C-MAPSS-TD) dataset is a well-known benchmark dataset [17,18]. Table 1 summarizes the sub-datasets FD001-FD004. It is worth noting that the total numbers of engine units in FD002 and FD004 are not divisible by 10, so the settings of the training and testing datasets may not be identical in all folds. Each datum contains 26 columns covering engine identity (ID), timestamp, 3 operational settings, and 21 sensor measurements.
We have analyzed the performance of RUL prediction models in existing works [7][8][9][10][11][12][13][14][15][16] and suggested the level of difficulty among FD001-FD004. The ranks are given by FD004 > FD002 > FD003 > FD001. The level of difficulty increases with the increase in the number of operating conditions and the number of fault modes.

Comparison between Stand-Alone RNN, Stand-Alone LSTM, and Proposed NSGA-II Optimized RNN-LSTM
To illustrate the results of the RUL prediction problem, Figure 4 shows two examples of the predicted RUL and true RUL for engine unit numbers 5 and 82. It can be seen that the proposed NSGA-II optimized RNN-LSTM model takes advantage of both short-term and long-term prediction, in which the deviations between true and predicted RUL are reduced across the whole range of time (cycles). To reveal the necessity of merging RNN and LSTM, i.e., the proposed NSGA-II optimized RNN-LSTM model, evaluation and comparison are made with the baseline models, the stand-alone RNN and the stand-alone LSTM. For fair comparison, the analysis is based on feature extraction using CEEMD-WPT. K-fold cross-validation is chosen as a common way of performance evaluation, in which K = 10 is adopted, as supported by many real-world applications [30][31][32]. Table 3 summarizes the RMSE and MAE of the three approaches on datasets FD001-FD004. The RMSE is based on the average of the 10 results in the 10-fold cross-validation.
Both shallow learning [7][8][9][10][11] and deep learning [12][13][14][15][16] have been reported for RUL prediction in the literature. Researchers may consider adopting shallow learning when they prefer models with a lower requirement on computational power and faster analysis, at the cost of only fair model performance (e.g., RMSE in RUL prediction). On the other hand, deep learning approaches require larger computational power and slower analysis but achieve a smaller RMSE.
Since RUL prediction models using shallow learning may not yield favorable performance, research works usually build the prediction model with hybrid techniques, as in Mosallam et al. [7], Fan et al. [8], Ordóñez et al. [10], and Cai et al. [11]. The rationale is to take advantage of different techniques to further enhance the performance, because no one method fits all applications. Likewise, for deep learning-based approaches, some works (Wu et al. [14] and the proposed work) introduce hybrid techniques for RUL prediction.
The stability of the RUL prediction model can also be examined. In general, a trained model must be tested with unseen data to confirm its effectiveness once deployed. The trained model is expected to capture the characteristics of the data without being influenced too much by noisy data; technically speaking, the model should have low variance and low bias. K-fold cross-validation is adopted. The higher the value of k, the lower the bias but the higher the variability; conversely, the lower the value of k, the higher the bias. Therefore, k = 10 is chosen as a typical value, as supported by various works [30][31][32]. In the literature, some works [9][10][11]14,16] did not employ cross-validation, so their results are less convincing in reflecting model performance in practice. The others [7,8,12,13,15] and the proposed work employed k-fold cross-validation (with various choices of k = 3, k = 4, k = 5, and k = 10) for performance evaluation. Although the selected value of k may differ, comparison can be made between these works given that k-fold cross-validation has been adopted.
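A minimal sketch of the 10-fold split over engine units follows; the function name and seed are illustrative, and in practice library utilities such as scikit-learn's KFold serve the same purpose.

```python
import numpy as np

def kfold_splits(n_units, k=10, seed=0):
    """Split engine-unit indices into k train/test folds; when n_units is
    not divisible by k (as for FD002 and FD004), np.array_split gives the
    first n_units % k folds one extra unit, so folds differ slightly."""
    idx = np.random.default_rng(seed).permutation(n_units)
    folds = np.array_split(idx, k)
    return [(np.concatenate([folds[j] for j in range(k) if j != i]), folds[i])
            for i in range(k)]

splits = kfold_splits(100, k=10)
```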
The reasons for the improvements by the proposed work are two-fold: (i) the two-step decomposition via CEEMD-WPT captures both time and frequency information for feature extraction; and (ii) NSGA-II optimally merges the results of RNN and LSTM as RNN-LSTM, which takes advantage of RNN for short-term and LSTM for long-term prediction, reducing the prediction errors across all time (cycles).

Conclusions
Optimal predictive maintenance can be scheduled to reduce the maintenance cost and avoid equipment downtime. In this paper, an innovative NSGA-II optimized RNN-LSTM algorithm was proposed for RUL prediction of turbofan engines. A hybrid decomposition, CEEMD-WPT, was introduced to enhance the feature extraction process. The benchmark C-MAPSS-TD dataset was chosen to evaluate the performance of the proposed work. It achieved an average RMSE and average MAE of 17.2 and 16.3, respectively, over the datasets FD001-FD004. To evaluate the effectiveness of hybrid decomposition CEEMD-WPT and RNN-