Deep Neural Network Feature Selection Approaches for Data-Driven Prognostic Model of Aircraft Engines

Abstract: Predicting the Remaining Useful Life (RUL) of systems plays an important role in many fields of reliability engineering analysis, including aircraft engines. RUL prediction is a critically important part of Prognostics and Health Management (PHM), the reliability science aimed at increasing the reliability of a system and, in turn, reducing its maintenance cost. The majority of the PHM models proposed during the past few years reflect a significant increase in data-driven deployments. While more complex data-driven models are often associated with higher accuracy, there is a corresponding need to reduce model complexity. One possible way to reduce the complexity of a model is to apply feature (attribute or variable) selection and dimensionality reduction methods prior to model training. In this work, the effectiveness of multiple filter and wrapper feature selection methods (correlation analysis, Relief, forward/backward selection, and others), along with Principal Component Analysis (PCA) as a dimensionality reduction method, was investigated. A basic deep learning algorithm, the Feedforward Artificial Neural Network (FFNN), was used as a benchmark modeling algorithm. All of these approaches can also be applied to the prognostics of aircraft gas turbine engines. In this paper, aircraft gas turbine engine data from the NASA Ames prognostics data repository was used to test the effectiveness of the filter and wrapper feature selection methods, not only for the vanilla FFNN model but also for a Deep Neural Network (DNN) model. The findings show that applying feature selection methods helps to improve overall model accuracy and significantly reduces the complexity of the models.


Introduction
Modern computational capability has become more powerful over the past decades. This has induced a new trend of employing various data-driven models in many fields. Despite the fact that modern computers can complete complex tasks, researchers are still searching for solutions to reduce the computational time and complexity of data-driven models, to increase the likelihood that the models can be employed in real-time operation.
The same challenge also applies to a certain type of aerospace data, which in this case is the estimation of the Remaining Useful Life (RUL) of aircraft gas turbine engines. The main purpose of this work is to prove the theory that a particular group or set of prognostic features (attributes or variables) from the aircraft gas turbine engine data can be selected prior to the training phase of Artificial Neural Network (ANN) modeling in order to reduce the complexity of the model. The same assumption is also believed to be applicable to the Deep Neural Network (DNN) model. It might also apply to other complex deep learning models, i.e., the Convolutional Neural Network (CNN), the Recurrent Neural Network (RNN), and their variations as well.
In order to validate the aforementioned theory, the prognostics dataset of aircraft gas turbine engines, or Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) dataset, derived from the NASA Ames Prognostics Center of Excellence (PCoE) [1], was used to develop preliminary vanilla ANN models with features selected by different feature selection methods. Furthermore, to prove that similar assumptions can also be deployed to other deep learning algorithms, Deep Neural Network (DNN) models were also developed based on selected features derived from the ANN validation models. The final goal was to determine which feature selection method was the most suitable for a deep learning model in general to predict the prognostic state or Remaining Useful Life from aircraft gas turbine engine data. End results from the various feature selection methods were compared against the model that uses the original features. The ANN and DNN models with selected features were studied and compared based on their performance.
Based on the aforementioned goal, the main contributions of this work are: 1. Extract meaningful features for neural network-based and deep learning data-driven models from the C-MAPSS dataset. 2. Suggest a novel neural network-based feature selection method for aircraft gas turbine engine RUL prediction. 3. Develop deep neural network models from the selected features. 4. Show how the developed methodology can improve the RUL prediction model by comparing its performance/error and complexity to the model derived from the original features.

Neural Network for RUL Prediction
Prognostics and Health Management (PHM) is aimed at improving reliability and reducing the maintenance cost of a system's elements [2]. Remaining Useful Life (RUL) in PHM is defined as the amount of time left before a system or element can no longer perform its intended function. Therefore, RUL prognostics are used to evaluate the equipment's life status in order to plan future maintenance [3]. With enough condition monitoring data, data-driven machine learning methods can learn degradation patterns directly from data in order to generate predictive prognostic models. Data-driven models using machine learning have an advantage over physics-based models [4] and traditional statistical data-driven models [5]. For example, machine learning models can be implemented without prior degradation knowledge [6]. Neural network algorithms in particular have been receiving more attention than other machine learning algorithms, as they have outperformed other algorithms and can approximate high-dimensional nonlinear regression functions directly from raw data [7].
The Artificial Neural Network (ANN) model is fundamentally based on biological neural networks. Sigmoid functions are applied at the nodes of the ANN to connect and sum the total weights of the network. Here the sigmoid is a Gaussian spheroid function, which can be expressed as:

φ(x) = exp(−‖x − c‖² / (2σ²))  (1)

The hidden neurons in the ANN measure the distance between the input vector x and the centroid c of the data cluster. The measured values are the output of the ANN. In Equation (1), the σ parameter represents the radius of the hypersphere, determined by iteratively selecting the optimum width. The weights of the neural network are updated at the neural nodes using error backpropagation, which is a stochastic gradient descent technique. The weights of each individual neural node are then fed forward to the next layer. This architecture is often referred to as a Feedforward Neural Network (FFNN). This is how the ANN "learns" the data pattern through its weights [8].
In 2006, Geoffrey Hinton suggested the early design of deep learning algorithms based on the aforementioned FFNN [9]. The vanilla FFNN generally consists of only the hidden layer with the sigmoid activation function described in Equation (1). Multiple configurations of deep learning algorithms, such as the Deep Neural Network (DNN), Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), etc., have been widely used as data-driven modeling algorithms. Most of them have outperformed other well-known data-driven algorithms of the past.
One aspect to keep in mind before employing any deep learning algorithm is that each deep learning algorithm might be suitable for different tasks. This heavily depends upon the different data characteristics and the type of target model. The deep learning algorithms also include different types of activation functions and optimizers. These are the key differences between deep learning algorithms and the vanilla ANN or FFNN that were proposed in the early years [10].
In this work, we employed only a DNN with auto-encoder as the modeling algorithm. All encoding and decoding processes happen inside the hidden layers of the network through parameterized functions [9,10]. The construction of the DNN with auto-encoder is briefly illustrated in Figure 1. Unlike the ANN, which uses the sigmoid function as its activation function, our DNN layers used Rectified Linear Units (ReLU) as the activation function. The ReLU function can be simply expressed as:

f(x) = x⁺ = max(0, x)  (2)

where x is the input to a neuron and x⁺ represents the positive part of its argument. The ReLU function has been demonstrated to achieve better training on general regression tasks for deeper networks compared to other activation functions such as the logistic sigmoid and the hyperbolic tangent (tanh) [10]. Therefore, the ReLU function was chosen for modeling the Remaining Useful Life (RUL) prediction for our PHM data, while the ANN with the sigmoid function was used as a validation algorithm for the feature selection methods. The estimation of the RUL or "health state" of a system or its components is one of the main tasks of prognostics analysis. RUL estimation often involves the prediction of the life span based on time or cycles, which is also known as a regression task. In PHM, the RUL is determined using historical data collected from the system's sensors or signals. ANN-based or deep learning data-driven models have been proven to work relatively well with these types of PHM tasks [11]. However, one of the challenges is to reduce the complexity of the neural network prior to the training stage. This might be done by reducing the input training data. One possible way to help reduce the complexity of the model is to select only meaningful features or attributes from the raw dataset before model training.
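To make the activation choice concrete, the following is a minimal NumPy sketch (illustrative only, not the H2O internals) comparing the ReLU used in our DNN layers with the sigmoid used in the vanilla ANN:

```python
import numpy as np

def relu(x):
    """ReLU: the positive part of the input, max(0, x)."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Logistic sigmoid used by the vanilla ANN layers."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # negative inputs map to 0, positives pass through
print(sigmoid(0.0))  # 0.5
```

Note that ReLU passes positive inputs through unchanged, which avoids the vanishing-gradient behavior of the saturating sigmoid in deeper networks.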

Related Works
Multiple deep learning algorithms have been used to generate data-driven models to predict RUL for the C-MAPSS aircraft gas turbine engine data. It can be observed from the literature [12][13][14][15][16][17][18][19][20] that the most suitable deep learning algorithm for training high-accuracy C-MAPSS models is the Long Short-Term Memory Recurrent Neural Network (LSTM). Hybrid deep neural network layers with LSTM are also an ongoing area of investigation and experimentation on the C-MAPSS dataset. This approach is believed to achieve the highest accuracy among the algorithms that have been employed. The most important drawback of the hybrid models is the high complexity of their architectures. These models can also have limitless variations and architectural structures. It is best to reduce the complexity of the model as much as possible, and one way to achieve that is to limit the number of input nodes. This is the area where feature selection methods can be brought in.
There are many publications on applying ANN-based or deep learning algorithms to the C-MAPSS aircraft gas turbine engine data. However, previous works have not introduced feature selection approaches into their model architectures. Also, the usefulness of any particular feature selection method has not been addressed in prior works.
The next paragraph summarizes the contributions of past publications on such approaches. We specifically include only the works that employed deep neural network algorithms for prognostic modeling of the C-MAPSS aircraft gas turbine engine data. It is worth noting that there are other research works that used other data-driven or machine learning algorithms, which are not mentioned here.
Chen Xiongzi et al. (2011) conducted a comprehensive survey of the three main data-driven methods for aircraft gas turbine engines, namely particle filtering methods, neural networks, and relevance vector machine methods [12]. Mei Yuan et al. (2016) applied RNN methods for fault diagnosis and estimation of the remaining useful life of engines [13]. Faisal Khan et al. (2018) used particle filter algorithms to generate arbitrary input data points before training their models with neural networks. Unlike the vanilla neural network algorithm, their models employed a radial basis function (RBF) as the activation function instead of the original sigmoid function [14]. Xiang Li et al. (2018) applied the Convolutional Neural Network (CNN) with a time window approach to generate a feature extraction model of the engine data [15]. Ansi Zhang et al. (2018) proposed a supervised domain adaptation approach that exploits labeled data from the target domain to fine-tune a bidirectional Long Short-Term Memory Recurrent Neural Network (LSTM) previously trained on the source domain [16]. Zhengmin Kong et al. (2019) also very recently developed models based on the CNN. They employed CNN layers as part of the network in their experiment and proposed hybrid models by combining the CNN layers with LSTM layers. Their approach has been proven to achieve the highest accuracy over the other standard methods [17]. Other previously published works [18][19][20] mostly focused on adopting the LSTM network and proposing new models without addressing complexity reduction in their approaches. While each work proposed a different network architecture and the performance of the models has improved over time, what they failed to address is whether complexity reduction in ANN-based models can play a role in improving the models. This work aims to address that issue with a feature selection approach to reduce learning times.
The rest of the paper is organized as follows: Section 2 covers the methodology, outlining all methods and approaches used for the defined problem. Section 3 describes the experimental setup with a detailed data description and compares the final results from all models. Section 4 discusses and compares the results from all models. Lastly, conclusions and highlights of possible future work are discussed in Section 5.

Methodology
In this section, all essential details of the auto-encoder deep neural network used in our experiment will be discussed. The problem definition and all notations will also be clearly defined, along with an illustration of how our proposed deep neural network architecture can be applied to RUL prediction for aircraft gas turbine engines within the feature selection and neural network modeling framework.

Deep Neural Network Architecture
While existing deep learning algorithms have been proposed to accommodate PHM modeling of aircraft gas turbine engine data [12][13][14][15][16][17][18][19][20], this work focuses on using a deep neural network with auto-encoder, with a specific use case and specifications that fit the problem definition previously identified.
The DNN used in this work follows the feedforward architecture provided by the H2O package through its Python API [21]. H2O is based on multi-layer feedforward neural networks for predictive modeling [22]. Among the H2O DNN features used for this experiment is its supervised training protocol for regression tasks. In the proposed DNN model, deep neural network layers are used to extract the temporal features from the time length T. The hidden state units of the network consist of the hidden state vector h ∈ ℝᵐ, the input vector (as defined in the problem definition) x ∈ ℝⁿ, and the activation function f. All operations in the DNN layers can be written as:

h = f(Wx + b)  (3)

y = Uh + b′  (4)

where x and y represent the input and output states, W and U are the matrices of updated weights and weights from the hidden state, and b and b′ are the bias vectors. Unlike in the vanilla ANN, in the proposed DNN the activation function f is the Rectified Linear function [23] instead of the sigmoid function. The DNN activation function can be represented as:

f(z) = z⁺ = max(0, z)  (5)

where, in this case, z represents the state functions (Equations (3) and (4)) that feed into the input neuron.
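A single feedforward operation of the form h = f(Wx + b) can be sketched as follows (a NumPy illustration with made-up layer sizes, not the H2O implementation):

```python
import numpy as np

def dense_relu(x, W, b):
    """One feedforward layer: h = f(Wx + b) with ReLU activation f."""
    return np.maximum(0.0, W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=21)          # e.g., 21 sensor inputs
W = rng.normal(size=(8, 21))     # 8 hidden units (illustrative size)
b = np.zeros(8)
h = dense_relu(x, W, b)
print(h.shape)  # (8,)
```

Stacking several such layers, with the output of one layer feeding the next, yields the multi-layer feedforward network described above.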
Another important aspect of the DNN model architecture is the loss function, denoted by ℒ. For this work, the Huber loss function was selected because it has proven to work best in terms of accurately projecting the RUL, y ∈ Yₛ, of the source domain S [24]. The Huber loss function, with threshold δ, can be described as:

ℒ_δ(y, ŷ) = ½(y − ŷ)²  if |y − ŷ| ≤ δ;  ℒ_δ(y, ŷ) = δ(|y − ŷ| − ½δ)  otherwise  (6)

where ŷ is the RUL prediction from the source domain. Here the target input is mapped through the feature extraction layers into a new space representation, and the domain regression space is generated by a logistic regressor [24].
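A minimal sketch of the Huber loss follows (NumPy; the threshold value delta = 1.0 is an arbitrary illustration, not the value used in our training):

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones."""
    r = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return np.where(np.abs(r) <= delta,
                    0.5 * r ** 2,
                    delta * (np.abs(r) - 0.5 * delta))

print(huber_loss(10.0, 9.5))  # 0.125 (quadratic regime)
print(huber_loss(10.0, 7.0))  # 2.5   (linear regime)
```

The linear regime makes the loss less sensitive to occasional large RUL errors than a pure squared error would be.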
The objective in training the DNN is to minimize the prediction loss ℒ_p, which can be described by:

min_θ ℒ_p(θ) = (1/N) Σᵢ₌₁ᴺ ℒ_δ(yᵢ, ŷᵢ)  (7)

The DNN model used in this work is depicted in Figure 2. This DNN model architecture is trained to predict, for each input xᵢ, the real value yᵢ and its domain label for the source domain, and only the domain label for the target domain. The first part of the DNN architecture is the feature extractor, G_f, which decomposes the inputs and maps them into the hidden state h ∈ ℝᵐ. The model then embeds the output space as a feature space f of the deeper layers and repeats this process as needed.
As previously detailed, the vector space parameter resulting from the feature mapping is f, i.e., f = G_f(x; θ_f). This feature space f is first mapped to a real-valued variable y by the function G_y(f; θ_y), which is composed of fully-connected neural network layers with parameters θ_y. A dropout layer with a rate of 0.4 was applied to avoid overfitting [25].
Another goal is to find a feature space that is domain invariant, i.e., a feature space f in which the source features f_s(x) and the target features f_t(x) are similar. This is one of the challenges in training, which can be improved by applying "feature selection" prior to training (detailed in a later section). Another objective is to minimize the weights of the feature extractor in the direction of the regression loss, ℒ_y. In more detail, the model loss function can be used to derive the final learning function G through the parameters θ, which yields the RUL prediction result (described in Equation (6)), ŷ = G_y(G_f(x; θ_f); θ_y).
The DNN algorithm updates its learning weights θ through a gradient descent update [26] of the form:

w ← w − λ (∂ℒ/∂w)  (8)

b ← b − λ (∂ℒ/∂b)  (9)

Usually, the Stochastic Continuous Greedy (SCG) estimate is used to compute the updates in Equations (8) and (9). The learning rate, λ, represents the size of the learning steps taken by the SCG as training progresses.
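The weight update rule can be illustrated with a one-parameter example (plain gradient descent on a toy quadratic; the learning rate is arbitrary and chosen only for the demonstration):

```python
def gd_step(w, grad, lr):
    """One gradient-descent update: w <- w - lr * grad."""
    return w - lr * grad

# Minimize L(w) = (w - 3)^2; its gradient is 2 * (w - 3).
w, lr = 0.0, 0.1
for _ in range(100):
    w = gd_step(w, 2.0 * (w - 3.0), lr)
print(round(w, 6))  # converges to the minimizer, 3.0
```

In the actual network, the same step is applied to every weight matrix and bias vector, with the gradients supplied by backpropagation.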

Feature Selection Methods for Neural Network Architectures
In prognostic applications, feature extraction occurs after receiving raw data from sensors. Feature extraction usually involves signal processing and analysis in the time or frequency domain. The purpose is to transform raw signals into more informative data that represent the system well [27]. In other words, feature extraction is the process of translating sensor signals into data. In contrast, the purpose of feature selection is to select a particular set of features in the dataset believed to be more relevant for modeling. Feature selection always executes after feature extraction and occurs between the pre-processing and the training or pre-training phase of the data modeling framework.
Three common feature selection strategies have been discussed in the literature: (1) the filter approach, (2) the wrapper approach, and (3) the embedded approach. This paper discusses only the filter and wrapper approaches. Figure 3 shows the process flow and the different roles of feature extraction and feature selection in the data modeling process. Filter methods employ statistics, correlation, and information theory to identify the importance of features. The performance measurement metrics of filter methods usually use local criteria that do not directly relate to model performance [28].
There are currently multiple baseline filter methods popularly employed for feature selection. However, the results of our experiments showed that only the correlation-based methods were suitable for the case study data. This is due to the fact that correlation-based methods evaluate each feature by its direct correlation with the target variable. In other words, the correlation-based filter methods make selections based on the modeling objectives, which implies that these methods are more suitable for data with a target variable. The correlation-based filter method included in this work is Pearson correlation [29,30]. Additionally, results from other statistical methods, namely the Relief algorithm, deviation selection, SVM selection, and PCA selection [31], were also included to provide a complete comparison.
Wrapper methods use a data-driven algorithm that performs the modeling on the dataset in order to select the set of features yielding the highest modeling performance [32]. Wrapper methods are typically more computationally intensive than filter methods. There are four main baseline wrapper methods [32]: (1) forward selection, (2) backward elimination, (3) brute force selection, and (4) evolutionary selection.
Forward selection and backward elimination are search algorithms with different starting and stopping conditions. Forward selection starts with an empty set of selected features, then adds an attribute in each search round. Only the attribute that provides the highest increase in performance is retained. Afterwards, a new search cycle is started with the modified set of selected features. The forward selection search stops when the attribute added in the next round does not further improve the model performance.
In contrast, the backward elimination method performs the reverse process. Backward elimination starts with the set of all attributes, and the search then continues to eliminate attributes until the next elimination does not provide any further improvement in modeling performance. The brute force selection method uses search algorithms that try all combinations of attributes. Evolutionary selection employs a genetic algorithm to select the best set of features based on a fitness function measurement [33]. Because of computational and time limitations, brute force selection could not be included in this experiment. Only forward selection, backward elimination, and evolutionary selection were implemented [34].
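The greedy forward-selection loop described above can be sketched generically. Here `score_fn` stands in for the cross-validated ANN performance (higher is better); the toy scorer below is purely illustrative:

```python
def forward_select(features, score_fn):
    """Greedy forward selection: start empty, add the single feature that
    most improves score_fn; stop when no addition improves the score."""
    selected, best = [], float("-inf")
    remaining = list(features)
    while remaining:
        top_score, top_f = max((score_fn(selected + [f]), f) for f in remaining)
        if top_score <= best:
            break  # no candidate improves the score: stop searching
        best = top_score
        selected.append(top_f)
        remaining.remove(top_f)
    return selected, best

# Toy scorer: features 'a' and 'b' are useful, every extra feature costs 0.1.
score = lambda s: len(set(s) & {"a", "b"}) - 0.1 * len(s)
print(forward_select(["a", "b", "c"], score))  # selects 'a' and 'b' only
```

Backward elimination follows the same pattern with the loop reversed: start from the full feature set and drop the attribute whose removal improves the score most.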

Neural Network Data-Driven Modeling Framework
In general, the modeling framework for this experiment is similar to a data-driven modeling framework developed from the cross-industry standard process for data mining (CRISP-DM) [35]. The standard construction consists of five phases: (1) the definition phase, (2) the preprocessing phase, (3) the training phase, (4) the testing phase, and (5) the evaluation phase [36]. In addition to the standard construction, a feature engineering phase and a pre-training phase might be important prior to the training phase.
The feature engineering phase was introduced in "Features selection procedure for prognostics: An approach based on predictability" [34], and the pre-training phase was introduced in "The difficulty of training deep architectures and the effect of unsupervised pre-training" [37] to overcome issues in training deep learning models, while also helping to improve some aspects of model performance. The details of these two additional phases have been described by others [34,37].
As mentioned in Section 2.1, one of the challenges of training the deep learning model is to seek a feature space f in which the source features f_s(x) and the target features f_t(x) are similar. Selecting only the meaningful features is believed to help reduce the dissimilarity in the feature space that affects the predictability of the model. This is also a way to reduce the complexity of the model architecture, and it might also improve the prediction accuracy of the deep learning models. One possible framework that incorporates the feature engineering phase and pre-training phase into the CRISP-DM standard is illustrated in Figure 4.

Experimental Setup and Results
The first part of the experiment was designed to compare the effectiveness of different feature selection and filtering methods for ANN modeling of the prognostics dataset. The aircraft gas turbine engine dataset with 21 attributes was fed into different filter and wrapper feature selection methods to identify particular sets of features prior to the model training phase. The selected sets of features were then used as training features or training attributes for the ANN model. The second part tested the features selected through ANN modeling with the DNN architecture. The results from different sets of features were compared in order to determine the most suitable set of selected features. Finally, the best DNN model for predicting the RUL of aircraft gas turbine engines was determined.

C-MAPSS Aircraft Engines Data
Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) is a simulation tool used to generate the turbofan engine degradation run-to-failure test dataset. This test dataset was derived from the NASA Ames prognostics data repository [1]. The C-MAPSS dataset is one of the most popular benchmark datasets in the prognostics and diagnostics research community. The dataset provides a set of editable input parameters to simulate various operational conditions for aircraft gas turbine engines [38]. The operational conditions include sea-level temperature, Mach number, and altitude. The C-MAPSS dataset includes four sub-datasets, described in Table 1. Each sub-dataset (FD001, FD002, FD003, and FD004) contains a number of training engines with run-to-failure information and a number of testing engines with information terminating before failure is observed. As for operating conditions, each dataset can have one or six operational conditions based on altitude (0-42,000 feet), throttle resolver angle (20-100°), and Mach number (0-0.84). As for fault modes, each dataset can have one or two modes: HPC degradation and Fan degradation.
Sub-datasets FD002 and FD004 are generated with six operational conditions, which are believed to be a better representation of general aircraft gas turbine engine operation than FD001 and FD003, which are generated from only one operational condition. Therefore, either FD002 or FD004 can be selected for a complete experiment. In this study, the FD002 set was selected as the training dataset. With our current model validation set-up (described in Section 3.2), the wrapper methods required roughly 2 to 3 weeks to complete a run. We also kept the amount of data points consistent between the feature selection validations and the model trainings, i.e., in both the ANN feature selection validation and the DNN model training. Our experiments were designed this way in order to clearly demonstrate the effectiveness of the feature selection methods used for neural network-based algorithms.
There are 21 features included in the C-MAPSS dataset for every sub-dataset. These attributes represent the sensor signals from different parts of the aircraft gas turbine engine, as illustrated in Figure 5 [39]. Short descriptions of the features and the plots of all 21 sensor signals of sub-dataset FD002 are shown in Figure 6.
It has been suggested in multiple literature references to normalize the raw signal before performing modeling and analysis [13][14][15]. Figure 7 shows the data signals before and after applying z-normalization:

x̂ⱼⁱ(t) = (xⱼⁱ(t) − μⱼ) / σⱼ  (10)

where xⱼⁱ(t) denotes the original i-th data point of the j-th feature at time t, and μⱼ and σⱼ are the mean and standard deviation of the vector of all inputs of the j-th feature. Each attribute value was normalized individually and scaled to the same range across all data points. In the dataset, the aircraft gas turbine engines start with various initial wear levels, but all are considered to be in a "healthy state" at the start of each record. The engines begin to degrade at some point at higher operation cycles, until they can no longer function normally. This is considered the time when the engine system is in the "unhealthy state". The training datasets have been collected over run-to-failure histories to cover the entire life until the engines fail. It is also reasonable to estimate the RUL as a constant value while the engines operate in normal conditions [38]. Therefore, a piece-wise linear degradation model can be used to define the observed RUL value in the training datasets. That is, after an initial period with constant RUL values, it can be assumed that the RUL targets decrease linearly. Figure 8 illustrates the RUL curves of all unseen or test datasets containing testing engines from the FD002 and FD004 datasets. Figure 9 shows example RUL curves from one degrading engine from each of the FD002 and FD004 datasets. The same degradation behavior also applies to the training set. These RUL curves represent the health state or prognostic of the aircraft gas turbine engines over cycles until the end-of-life, the point at which the engines can no longer operate normally.
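The two preprocessing steps above, z-normalization and the piece-wise linear RUL target, can be sketched as follows (NumPy; the constant-RUL cap of 130 cycles is an assumed illustration, since the actual critical point Rth is predefined by the data source):

```python
import numpy as np

def z_normalize(X):
    """Per-feature z-normalization: (x - mean) / std, column-wise."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def piecewise_rul(total_cycles, r_th=130):
    """Piece-wise linear RUL target: capped at r_th while 'healthy',
    then decreasing linearly to 0 at failure."""
    rul = total_cycles - 1 - np.arange(total_cycles)
    return np.minimum(rul, r_th)

print(piecewise_rul(5, r_th=3))  # [3 3 2 1 0]
```

Each engine's sensor matrix is normalized column by column, and its label vector follows the capped linear ramp shown above.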
The degradation behavior of the aircraft gas turbine engines can be observed more clearly in Figure 9. We presume that the RUL is constant until the engine reaches the critical point at which its performance starts to degrade. In the degradation phase, the RUL is represented by a linear function. Hence, the entire RUL curve is identified as a piece-wise linear degradation function. The critical point, Rth, is the point where the aircraft engine starts to degrade. The critical points of the aircraft gas turbine engines were predefined based on the conditions described by the data source, the NASA Ames prognostics data repository [1]. To measure and evaluate the performance of the models with selected features, the root mean square error (RMSE) and the scoring algorithm suggested in [39] were used.
RMSE is commonly used as a performance indicator for regression models. The RMSE formula is:

RMSE = sqrt( (1/N) Σᵢ₌₁ᴺ (yᵢ − ŷᵢ)² )  (11)

where N is the number of prediction data points, yᵢ is the real value, and ŷᵢ is the prediction value. In this case, the N data points are those of the RUL curve, yᵢ is the actual RUL value, and ŷᵢ is the RUL value predicted by our models. The scoring algorithm is described by the formula below:

s = Σᵢ₌₁ⁿ (e^(−dᵢ/a₁) − 1) for dᵢ < 0;  s = Σᵢ₌₁ⁿ (e^(dᵢ/a₂) − 1) for dᵢ ≥ 0  (12)

where s is the computed score, n is the number of units under test (UUT), dᵢ = Estimated RUL − True RUL, a₁ = 13, and a₂ = 10. In other words, dᵢ is the difference between the predicted and observed RUL values, and the score is summed over all examples.
Note from the formula that the scoring metric penalizes positive errors more than negative errors, as these have a higher impact on maintenance policies. Also note that a lower score means better prediction performance of the model [39].
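Both evaluation measures can be sketched directly from their definitions (NumPy; a₁ = 13 and a₂ = 10 as stated above):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error over N prediction points."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def phm_score(y_true, y_pred, a1=13.0, a2=10.0):
    """Asymmetric PHM score: late predictions (d >= 0) are penalized more."""
    d = np.asarray(y_pred, float) - np.asarray(y_true, float)  # est. - true RUL
    per_unit = np.where(d < 0, np.exp(-d / a1) - 1.0, np.exp(d / a2) - 1.0)
    return float(np.sum(per_unit))

print(rmse([100, 90], [103, 86]))  # about 3.54
print(phm_score([100], [100]))     # 0.0 for a perfect prediction
```

Because a₂ < a₁, over-estimating the RUL by k cycles always scores worse than under-estimating it by the same k, matching the maintenance-policy rationale above.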

Training Procedure and Hyperparameters Selection
For training, the input sensor data, operational settings, and labeled RUL values from the source data, and only the sensors and settings from the target dataset, were used. The raw data were normalized, and feature selection was applied before the start of all model training. For the training process, the training dataset (as the source) from dataset FD002 was used. The FD002 and FD004 test datasets were used to validate the models and calculate the prediction errors (RMSE and Score).
As for the wrapper methods, we used the ANN as the validation algorithm. Cross-validation within the FD002 training data was employed to measure the performance of the wrapper algorithms. The set-up parameters for ANN validation were fine-tuned based on the best model derived from the complete set of attributes. For the DNN hyperparameter selection, the model parameters in the H2O DNN algorithm were varied as described in Table 2. A grid search to identify the range of the learning rate, λ, was performed after fine-tuning the remaining parameters manually. Additionally, the training samples per iteration were set to auto-tuning, and the batch size was set to 1 for all variations.

Experimental Setup and Results
All experiments were implemented on an Intel® Core™ i7-10510U (10th generation) processor with 4 cores, 8 MB cache, a 1.8 GHz clock speed, up to 4.9 GHz boost speed, 16 GB RAM, and Intel® UHD integrated graphics. The DNN architecture was implemented using Python 3.6 with the H2O library/package [21]. The experimental results presented in this section are broken down into three parts: (1) features selected using the feature selection methods, (2) results and models from the ANN with the selected features, and (3) the proposed DNN model. All RMSE values and other performance measurements of the DNN models reported in this paper are averages over 20 trials. Table 3 shows the ranking of attributes based on the coefficients and weights calculated by each filter feature selection method. It is important to note that the ranking of the attributes by the different methods depends upon the statistical measures or weights obtained from each method.
For the Pearson correlation, attributes were not selected if the coefficient was less than −0.01 [29,30]. For PCA, features were selected based on weight (selected if the weight exceeded 0.2) and the PCA matrix [31]. For the Relief algorithm, attributes were not selected if the calculated weight was below zero [31]. For deviation selection, a feature was selected if its weight was higher than 1 [31]. It is important to note that the weights calculated by the Relief algorithm were unacceptably low (less than 10⁻¹²), with very large gaps between the calculated weights. Similar results were observed with other filter selection methods, including SVM selection. We found that when a filter method produced such statistically low weights for feature selection, the models trained on those features could not provide usable prediction results.
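The correlation-based filter can be sketched as follows. This is a minimal illustration on deterministic toy data; the threshold of 0.1 on the absolute correlation is chosen here for clarity and is not the paper's −0.01 cut-off.

```python
import numpy as np

def pearson_select(X, y, threshold=0.1):
    # Keep features whose |Pearson r| with the target exceeds the threshold.
    selected = []
    for j in range(X.shape[1]):
        r = np.corrcoef(X[:, j], y)[0, 1]
        if abs(r) > threshold:
            selected.append(j)
    return selected

# Toy target and two candidate "sensors".
y = np.arange(100, dtype=float)    # stand-in for RUL labels
f0 = 2.0 * y + 1.0                 # perfectly correlated, r = 1
f1 = np.tile([1.0, -1.0], 50)      # alternating noise, r close to 0
X = np.column_stack([f0, f1])
print(pearson_select(X, y))  # → [0]
```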
The following are the features selected based on these two filtering methods. In addition to the feature weights from the Pearson correlation selection and PCA selection in Table 3, the Pearson correlation matrix and PCA matrix are provided in Appendices A and B. With regard to the wrapper methods, below are the sets of features selected by each method. It is important to note that for the wrapper methods, the ANN validation with the modeling set-up described in Section 3.2 was used. Figure 10 shows the validation process using the ANN for evolutionary selection. Unlike the forward selection and backward elimination methods, which are both based on search algorithms [32], evolutionary selection is based on a genetic algorithm [40]. However, instead of a fitness function from genetic theory, the evolutionary selection method used the ANN validation as its fitness measurement. The parameters in our evolutionary selection experiment were: population size = 10, maximum number of generations = 200, tournament selection with size 0.25, initial probability for attributes (features) to be switched = 0.5, crossover probability = 0.5 with uniform crossover, and mutation probability = .
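The evolutionary selection loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: a cheap least-squares RMSE stands in for the ANN-validation fitness, the generation count is reduced for speed, and the mutation probability (not given in the text) is assumed to be 0.05 here.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    # Proxy fitness: RMSE of a least-squares fit on the selected columns.
    # The paper validates each candidate subset with an ANN instead.
    if not mask.any():
        return np.inf
    A = X[:, mask]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sqrt(np.mean((A @ coef - y) ** 2)))

def tournament(pop, scores, frac=0.25):
    # Tournament selection: sample a fraction of the population, keep the fittest.
    k = max(2, int(len(pop) * frac))
    idx = rng.choice(len(pop), size=k, replace=False)
    return pop[idx[np.argmin(scores[idx])]]

def evolve(X, y, pop_size=10, generations=30, p_init=0.5,
           p_cross=0.5, p_mut=0.05):
    n = X.shape[1]
    pop = rng.random((pop_size, n)) < p_init  # random initial feature masks
    for _ in range(generations):
        scores = np.array([fitness(m, X, y) for m in pop])
        children = []
        for _ in range(pop_size):
            a, b = tournament(pop, scores), tournament(pop, scores)
            cross = rng.random(n) < p_cross   # uniform crossover
            child = np.where(cross, a, b)
            child ^= rng.random(n) < p_mut    # bit-flip mutation
            children.append(child)
        pop = np.array(children)
    scores = np.array([fitness(m, X, y) for m in pop])
    return pop[np.argmin(scores)]

# Toy data: only the first two of five "sensors" carry signal.
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=300)
best_mask = evolve(X, y)
```

With a strong signal in the first two columns, the surviving mask should retain both informative features while the uninformative ones may or may not be pruned.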


It is also important to note that the brute force algorithm was not used in this case. Brute force is the selection approach that is guaranteed to find the best feature set, since it evaluates every possible subset. However, with limited computational capability, it cannot be used in real-time, so we did not include it in this experiment.
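The infeasibility is easy to quantify: with 21 candidate attributes there are 2²¹ − 1 non-empty subsets, each requiring a full train-and-validate cycle. The per-fit cost below is an assumed figure for illustration only.

```python
n_features = 21
subsets = 2 ** n_features - 1  # every non-empty feature subset
print(subsets)                 # → 2097151

minutes_per_fit = 5  # assumed cost of one ANN train-and-validate run
days = subsets * minutes_per_fit / 60 / 24
print(round(days))             # ≈ 7282 days of sequential compute
```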

DNN Models and Results
Table 4 summarizes the RMSE and prediction score results from all DNN models. The complete best-fit RUL prediction curves for the test data of all feature selection methods are illustrated in Figure 11 for the FD002 test data and in Figure 12 for the FD004 test data, respectively. The blue curves represent the actual RUL from the dataset, and the red lines/dots are the prediction points from our feature selection DNN models. For illustration purposes, Figures 13 and 14 show the prediction curve for one engine from each of the FD002 and FD004 test data, to demonstrate how the DNN predicts the RUL over one degradation cycle. Additionally, Table 5 lists all DNN models and all prediction error measures computed from them on the FD002 test dataset, i.e., absolute error, relative error, relative error lenient, relative error strict, normalized absolute error, root relative squared error, squared error, correlation, squared correlation, prediction average, Spearman rho, and Kendall tau. The number of hidden nodes in the DNN layers was chosen based on the best models fine-tuned from the one-layer ANN models for each feature selection method; we used the same number of hidden nodes from the best ANN models to construct the DNN model layers. Note that we only present the DNN models from feature selection methods that provided usable prediction results; therefore, the results from the Relief algorithm and SVM selection are not included. Due to fluctuations in the prediction results from the DNN algorithm, we ran our experiments (training and testing) 100 times for each model, and Table 4 reports the best prediction results. The fluctuations across the 100 iterations for FD002 and FD004 are presented in Figure 15. In addition to the best prediction, we include the mean RMSE and error distributions from the 100 test runs, as illustrated in Table 6 and Figure 16. These fluctuations in prediction errors are common in most deep learning algorithms due to the random initial training weight assignment and the amplification effect of the optimizer function in deeper networks. The fluctuations in the prediction results become more pronounced as models grow more complex and take a larger number of input attributes. We discuss this further in Section 4.
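The two error measures reported throughout can be sketched as below. The asymmetric scoring function shown is the one commonly used with the C-MAPSS/PHM08 benchmark; we assume this is the Score the tables refer to, and the toy RUL values are for illustration only.

```python
import numpy as np

def rmse(pred, actual):
    # Root mean square error of the RUL predictions.
    return float(np.sqrt(np.mean((pred - actual) ** 2)))

def phm_score(pred, actual):
    # Asymmetric penalty: late predictions (d > 0) cost more than early ones,
    # since predicting failure too late is riskier in maintenance planning.
    d = pred - actual
    return float(np.sum(np.where(d < 0, np.exp(-d / 13.0), np.exp(d / 10.0)) - 1.0))

actual = np.array([110.0, 80.0, 50.0])
early = actual - 10.0  # failure predicted 10 cycles too early
late = actual + 10.0   # failure predicted 10 cycles too late

print(rmse(early, actual), rmse(late, actual))             # identical RMSE (10.0)
print(phm_score(early, actual) < phm_score(late, actual))  # → True
```

Note that RMSE cannot distinguish early from late predictions, which is exactly why the score is reported alongside it.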

Discussion
As mentioned in the related works (Section 1.2), there have been a number of efforts to develop deep learning models for the C-MAPSS aircraft gas turbine engine dataset [12][13][14][15][16][17][18][19][20]. Currently, the deep learning model with the highest accuracy is that proposed by Zhengmin Kong et al. [17]. Their architecture consists of combined CNN and LSTM-RNN layers and achieves an RMSE of 16.13, while our best Evolutionary DNN model achieves an RMSE of 44.71. This indicates that our DNN models perform worse than the modern hybrid deep learning models developed in recent years.
However, to the best of our knowledge, no prior work has addressed the complexity of these models or the computational burden of training them. Hybrid deep neural network architectures are generally far more complex and require substantially more computational time and resources than our proposed Evolutionary DNN. The models proposed in recent years also took all features from the C-MAPSS dataset and disregarded any benchmarking of feature performance. In contrast, our approach applies feature selection prior to the model training phase to reduce the number of input attributes and, as a result, the model complexity. The reduction in complexity from using fewer input features is even more pronounced for highly complex hybrid deep neural network architectures.
Additionally, as illustrated in Figures 15 and 16, fluctuations in the prediction errors can be observed when training deep learning models. This effect occurs not only in DNNs but also in other network types, such as LSTM-RNNs, CNNs, and other modern hybrid architectures. Based on the results in Table 4 and Figures 12-16, the key observations are as follows: (1) Training the model with fewer features lowers the error distribution range compared to using more features, because the random initial weights assigned to the hidden nodes are smaller when fewer features are used in training. In other words, the models are more robust and reliable when using fewer features. The same observation applies to the fluctuation of the prediction errors: the predictions are more stable when fewer features are used in training. (2) In terms of model performance and accuracy, although using selected features does not always guarantee better results, the feature selection methods still reduce the computational burden while offering competitive prediction performance. In our experiments, evolutionary selection achieved both better performance and reduced complexity.
We emphasize that our current goal is not to outperform existing works; rather, we aim to provide baseline results and demonstrate the significant effect of feature selection on deep learning models, which has not been addressed before. We believe the end results can be further improved by applying our feature selection results within modern hybrid deep neural network architectures.
For our experimental results in general, as mentioned, the best accuracy based on the RMSE results in Table 4 was obtained with the evolutionary method. The complexity of the model was also significantly reduced by the smaller feature set, from 21 attributes down to only 14.
When considering complexity and computational time, the filter methods were less complex and faster to run because they do not require training and testing multiple ANN models for validation. In this study, most of the filter methods completed the selection process in only 5-10 min, whereas the wrapper methods required between 10 h and 10 days.
It is also important to note that the curve fitting and pattern recognition improved substantially, as can be seen by comparing the RUL prediction curves in Figures 11-14. In greater detail, the DNN models built from most of the selected feature sets can reasonably capture the trend both before and after the aircraft gas turbine engines' degradation intervals.
In summary, our Evolutionary DNN model architecture performs best as a simplified deep neural network data-driven model for the C-MAPSS aircraft gas turbine engine data. The feature selection phase (as described in the modeling framework in Figure 4) should be included as a standard step in the modeling framework for such PHM datasets. This is one way to potentially improve the overall RUL prediction performance for the prognostics of aircraft gas turbine engine data as well as other prognostic datasets.

Conclusions and Future Work
Even though this work already includes deep neural network algorithms and proposes a new DNN model architecture, the selected features should still be tested with other new deep learning algorithms and methods. As demonstrated in the related works [12][13][14][15][16][17][18][19][20], RNN, LSTM, and CNN models have been proven to yield more accurate RUL predictions than shallow DNN models, so further improvements can be achieved by applying such algorithms to the selected features. One benefit the selected features offer is a reduction in model complexity: reducing the input features when employing more complex deep learning algorithms can cut the model training time significantly, possibly from days to hours. This work aims to serve as a baseline for using selected features to build data-driven neural network models for the prognostics of aircraft gas turbine engine data. More complex deep learning algorithms, however, still need to be run and tested to assess the effectiveness of this feature selection technique. Additionally, a dimensionality reduction technique such as PCA could be used to transform the data from the selected features, which may further improve prediction accuracy and reduce complexity. These are the key aspects to be tested and experimented with in the future.
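The proposed follow-up, applying PCA on top of the selected features, can be sketched as below. This is an illustration on synthetic stand-in data (the 14-column matrix and the injected sensor correlation are assumptions, not the paper's data), using scikit-learn's variance-fraction form of `n_components`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Stand-in for a selected-feature matrix (e.g., 14 attributes after selection);
# two columns are made strongly correlated, as redundant sensors would be.
X = rng.normal(size=(500, 14))
X[:, 1] = 0.9 * X[:, 0] + rng.normal(scale=0.1, size=500)

pca = PCA(n_components=0.95)  # keep enough components for 95% of the variance
Z = pca.fit_transform(X)
print(Z.shape)  # fewer than 14 columns feed the downstream DNN
```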
Lastly, we believe that our studies will be of great benefit to the aviation community. We aim to raise awareness and discussion of how each aircraft gas turbine engine feature can significantly help improve the overall life-span of the engines. Although we only provide insights from a data science perspective, we strongly believe that further studies in the aviation community will build on the results achieved in this work.

Figure 2. Proposed Deep Neural Networks Model Architecture.

Figure 3. Role of feature extraction and feature selection in the prognostics modeling process.

Figure 4. The prognostic data-driven framework for neural network algorithms.

Figure 6. Example of sensor signals (NRc and Ps30) and all feature descriptions.

3.3.1. Feature Selection for Aircraft Engine Dataset
All possible feature selection methods were performed on the C-MAPSS dataset. The filter methods include deviation selection, PCA selection, Relief algorithm selection, SVM selection, and Pearson correlation selection. For the wrapper methods, only three methods were implemented: forward selection, backward elimination, and evolutionary selection.

Figure 13. (a-g) RUL prediction points for one engine of the FD002 test data.

Figure 14. (a-g) RUL prediction points for one engine of the FD004 test data.

Table 2. Hyperparameter values evaluated in the proposed Deep Neural Network (DNN) model.

Table 3. Attribute values from different filter methods.

Table 4. Best root mean square error (RMSE) and prediction score results of RUL prediction from all DNN models.

Table 5. The best DNN models for the FD002 test data.

Table 6. Mean RMSE from all DNN models.