An Autoencoder Gated Recurrent Unit for Remaining Useful Life Prediction

Abstract: With the development of smart manufacturing, in order to detect abnormal conditions of the equipment, a large number of sensors have been used to record the variables associated with production equipment. This study focuses on the prediction of Remaining Useful Life (RUL). RUL prediction is part of predictive maintenance, which uses the development trend of the machine to predict when the machine will malfunction. High accuracy of RUL prediction not only reduces the consumption of manpower and materials, but also reduces the need for future maintenance. This study focuses on detecting faults as early as possible, before the machine needs to be replaced or repaired, to ensure the reliability of the system. It is difficult to extract meaningful features from sensor data directly. This study proposes a model based on an Autoencoder Gated Recurrent Unit (AE-GRU), in which the Autoencoder (AE) extracts the important features from the raw data and the Gated Recurrent Unit (GRU) selects the information from the sequences to forecast RUL. To evaluate the performance of the proposed AE-GRU model, an aircraft turbofan engine degradation simulation dataset provided by NASA was used and a comparison made of different recurrent neural networks. The results demonstrate that the AE-GRU is better than other recurrent neural networks, such as Long Short-Term Memory (LSTM) and GRU.


Introduction
Industry 4.0 depends on automated, smart factories, and employs sensor data and the methods of big data analysis, in the expectation that the equipment in the factory will operate automatically and self-correct to improve the product yield [1]. Sensor-related data are collected during the manufacturing processes, such as temperature, pressure, power, humidity, and chemical analysis for equipment monitoring [2,3]. These temporal patterns represent the equipment condition, and poor product quality is often associated with abnormal changes of environment or inappropriate operation settings [4]. Equipment condition can be recorded by sensor data from the past and kept as time series data. When analyzing equipment sensor data, not only the large amount of data recorded should be considered, but also the time series characteristics. The idea of a smart factory is to link and intellectualize the manufacturing process. In the past, automation merely used machines to improve production efficiency, yield, and reduce costs, but intelligence further applies technologies such as the IoT sensor, to monitor and control production lines. The combination of cloud computing, data analysis, and software and hardware integration is also an important part of intelligence. Machines communicate with automation equipment [5]. However, there are numerous parameters in the smart factory, and the data are often highly correlated between equipment and processes. Therefore, the analysis must go beyond a single stage in the equipment or process, and be comprehensive.
Equipment maintenance methods are divided into the following three types: (1) corrective (repair) maintenance [6]; (2) preventive maintenance [7]; (3) predictive maintenance [8]. Corrective maintenance is performed after some or all of the equipment fails. The most common equipment maintenance application is preventive maintenance, also known as scheduled maintenance, which carries out machine inspections, component replacement and other maintenance at prescribed times. Predictive maintenance is an equipment maintenance method based on the condition of the machine. It predicts when the machine may be damaged from its past development trend.
Predictive Maintenance (PdM) has been used to monitor the historical health status of equipment and make timely adjustments to the equipment, which is quite different from the methods of routine maintenance employed in the past. PdM not only saves unnecessary costs, but also allows early repairs when the equipment is about to break down. It can avoid unpredictable machine stoppages caused by unexpected breakdown and improper operation. Remaining Useful Life (RUL) is an important indicator in PdM. The RUL of a device or system is defined as the period from the current state to the time when the device begins to operate abnormally [9]. There are two types of RUL prediction: model-based and data-driven [10]. The model-based method establishes the model based on the historical trend of equipment. Generally, model-based methods perform better than data-driven methods when there is little historical data. Data-driven methods use the health status data and data collected by sensors. For modeling, the collected data must be closely related to the device status. The advantage of data-driven methods is that the algorithm is based on learning the correlation between input and output data. If there are sufficient data, the stability and accuracy of data-driven methods are better than those of model-based methods.
RUL prediction can be described as a time series forecasting problem. The main purpose of time series forecasting is to build models from historical records and predict future situations. Time series forecasting methods can be divided into statistical models, machine learning models and deep learning models. Common statistical models include Autoregression (AR), Moving Average (MA), Autoregressive Moving Average (ARMA), and Autoregressive Integrated Moving Average (ARIMA). Machine learning builds a function or model to detect patterns in training samples and then uses these patterns for prediction [11]. Support Vector Machine (SVM) [12], Random Forest (RF) [13], and Extreme Gradient Boosting (XGB) [14] are representative machine learning algorithms for time series forecasting, but it is difficult to design an effective machine learning algorithm without substantial knowledge of the data [15]. In addition, the existing methods for analyzing time series data need pre-processing based on experience, which causes some loss of information. With improved process technology and more sensors, few existing methods can handle multi-sensor data, which is likely to cause misjudgment or missed detection.
Deep learning methods are preferred to predict and analyze time series data without predefined features. Deep learning is a branch of machine learning based on artificial neural networks with representation learning [16][17][18], which can learn the key information needed to make a response by using many hidden layers integrated in the network. The main difference between deep learning and machine learning is feature extraction. Usually, the features of machine learning are manually selected in advance, whereas deep learning identifies features through model training and often produces better results than other algorithms. Makridakis et al. [19] compared several time series analysis methods on the M3-Competition dataset. Most of the machine learning and deep learning algorithms performed better than traditional statistical methods. Alfred et al. [20] demonstrated that algorithms based on neural networks are more efficient in learning time series data than other algorithms.
Feature extraction is an important data pre-processing method in machine learning for high prediction accuracy. In the past, the dimension reduction of most data features was accomplished through Principal Component Analysis (PCA). PCA retains the maximum variation of features by projecting variables into another space, representing the original complete data with fewer features to achieve data dimension reduction. Unlike PCA, the Autoencoder (AE) is a non-linear dimension reduction method [21] that is trained in an unsupervised manner. First, the model compresses the original data through the encoder structure, and then restores the compressed data through the decoder. The AE rebuilds the original data with fewer features for data representation [22]. It can compress the key information of high dimensional input data into low dimensional features. AE is often used for feature extraction and dimension reduction in time series forecasting problems [23]. Moreover, machine learning methods require manual selection of features, and the quality of the features affects the models' results. Deep learning has the ability to extract features automatically. It can reduce the time spent on feature engineering and help experts to make decisions. Most of the relevant studies on RUL prediction use deep learning as a prediction model, which not only saves time in selecting features, but also gives prediction results much better than those of traditional machine learning models.
Recurrent Neural Networks (RNNs) are deep learning models that deal specifically with time series data and can select important features from equipment sensors. The Long Short-Term Memory (LSTM) network is a variant of the RNN [24]. For example, Zhang et al. [25] constructed an LSTM model to predict the RUL of lithium batteries, predicting after how many cycles of full charge and discharge the capacity will fall below the normal threshold, based on the historical decline rate of capacity. Heimes [9] set the maximum remaining engine life at 130 on the PHM08 data set and predicted the RUL using an RNN model. Zhang et al. [25] also constructed an LSTM model to predict the RUL of jet engines. One hundred jet engines were used as training data and 100 were used as testing data. Each jet engine has 24 features that record the sensor values from normal operation to fault, and the RUL is predicted from the change in these values. Mathew et al. [26] predicted the RUL of jet engines, using 24 parameters of the original data and comparing ten machine learning methods (Linear Regression, Decision Tree, SVM, Random Forest, KNN, K-means, Gradient Boost, Adaboost, Deep Learning, ANOVA), and verified the validity of the model through RMSE. Zheng et al. [27] used an LSTM model to predict the RUL on the C-MAPSS data set, the PHM08 data set, and the Milling data set; the LSTM model performed better than the other models. Chen et al. [28] proposed a two-phase model for RUL prediction using Kernel Principal Component Analysis (KPCA) and a Gated Recurrent Unit (GRU), respectively. To the best of our knowledge, little research has integrated AE and GRU for RUL prediction.
This study develops an Autoencoder Gated Recurrent Unit (AE-GRU) model to predict the RUL of equipment. In particular, the AE is used to extract features, the correlation between sensors is then found through the GRU model, and the RUL is predicted by a Multi-Layer Perceptron (MLP) model. The first part is the Autoencoder (AE) model, which is composed of an encoder and a decoder. The second part is the GRU model, which is a type of RNN that can deal with time series data. The GRU model finds key information in historical sensor data in combination with the MLP model. The MLP model processes the extracted information and is trained with the backpropagation algorithm to predict the RUL effectively.

Literature Review
Time series are statistical data, arranging events or data in order of their occurrence. The main purpose of time series forecasting is to build models from historical records and predict future situations.

Recurrent Neural Networks
The RNN is the most common time series analysis method in deep learning. The main difference between an RNN and a general neural network is that the RNN neurons in the hidden layers are not independent but influence each other, and they operate in sequence. The neurons of an RNN can temporarily store memory: previously input data are held in an internal state, so a neuron can produce different output values depending on the previous state. However, when an RNN is trained, the repeated multiplication of hidden-layer weights across many time steps causes the vanishing gradient problem, which can leave the gradient descent method stuck at a local minimum. A traditional neural network has different parameters to be learned in each hidden layer, whereas an RNN reuses the same parameters at every time step and only the inputs differ. This feature greatly reduces the training burden of the model.
LSTM addresses the vanishing gradient problem that arises when recurrent neural networks are trained with stochastic gradient descent. The biggest difference between the LSTM and the RNN is that each LSTM neuron has three control gates: input, forget, and output. These gates have their own weights. The values computed from these weights after data input determine whether each gate is open or closed. The input gate controls whether data can be written to the memory space, and the forget gate determines whether the contents of the previous memory space are retained. The output gate controls whether the result of the memory space operation can be output. Although adding these gates to each neuron generates more weights to be calculated, it can solve the vanishing gradient problem found in the recurrent neural network. Figure 1 is a diagram of the LSTM network unit architecture, and the related formulas are shown as Equations (1)-(6). The LSTM unit receives the input vector $x_t$ and the previous output $h_{t-1}$. Each unit contains four gates (i, g, f, o) and a cell. $x$ is the input vector of this layer, $h$ is the output vector, and $\odot$ is the element-wise multiplication. $W$ is the weight matrix, $b$ is a bias vector, and (i, f, o, g, c) represent the input gate, forget gate, output gate, cell input, and cell activation, respectively. The forget gate controls how much information each memory unit needs to forget, the input gate controls how much new information each memory unit adds, and the output gate controls how much information each memory unit outputs. The cell input combines the information input at the current time with the information from the previous time. The cell activation determines how much of the information admitted by the input gate and the forget gate is carried in the cell.
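For reference, a standard LSTM formulation that is consistent with the symbols defined above can be written as follows. This is the commonly used textbook form, offered as an assumption about the content of Equations (1)-(6) rather than a reproduction of them: the per-gate subscripts on $W$ and $b$, the concatenation $[h_{t-1}, x_t]$, and the sigmoid function $\sigma$ are notation introduced for this sketch.

```latex
\begin{aligned}
i_t &= \sigma\left(W_i [h_{t-1}, x_t] + b_i\right) \\
f_t &= \sigma\left(W_f [h_{t-1}, x_t] + b_f\right) \\
o_t &= \sigma\left(W_o [h_{t-1}, x_t] + b_o\right) \\
g_t &= \tanh\left(W_g [h_{t-1}, x_t] + b_g\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

In this form the forget gate scales the previous cell content, the input gate scales the new cell input, and the output gate scales what leaves the unit, which matches the roles described in the text.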
$$c_t = f_t \odot c_{t-1} + i_t \odot g_t \quad (5)$$

Processes 2020, 8, x FOR PEER REVIEW 4 of 18
The gated recurrent unit (GRU) is a variant of LSTM [29]. As shown in Figure 2, ($z$, $r$, $H$, $\tilde{H}$) are the update gate, reset gate, activation, and candidate activation, respectively. The detailed formulas are shown in Equations (7)-(10). The update gate controls how much historical information and new information needs to be forgotten in the current state. The reset gate controls how much historical information is retained and made available to the candidate state. The candidate activation can be regarded as the new information at the current time. The activation is generated by the update gate and the candidate activation: the update gate controls how much new and old information is retained. New information and old information have a complementary relationship. If a lot of new information is retained, less of the old information is considered, and vice versa.
GRU simplifies the input gate and forget gate of LSTM into an update gate, and combines the cell states and hidden states together. Therefore, the GRU unit retains the advantages of LSTM, and further reduces the model training time by reducing the parameters in the model. According to the analysis results in [30], the results of the RNN model are poor, while the results of GRU and LSTM are similar to each other and better than those of the RNN. However, the GRU has fewer parameters than LSTM, so the trend in deep learning models applied to time series analysis is toward GRU.
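As an illustration of how the update and reset gates interact, the following minimal sketch implements one GRU step for a scalar input and a scalar hidden state. It assumes the common convention in which the new activation is (1 - z) times the old activation plus z times the candidate; some references swap the roles of z and 1 - z. The parameter names (`wz`, `uz`, `bz`, and so on) and their values are hypothetical and are not taken from this study.

```python
import math


def sigmoid(v):
    """Logistic function used by both gates."""
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(x_t, h_prev, p):
    """One GRU step for a scalar input x_t and scalar previous activation h_prev."""
    z = sigmoid(p["wz"] * x_t + p["uz"] * h_prev + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x_t + p["ur"] * h_prev + p["br"])  # reset gate
    # Candidate activation: the reset gate scales how much history is used.
    h_cand = math.tanh(p["wh"] * x_t + p["uh"] * (r * h_prev) + p["bh"])
    # Complementary mix of old and new information.
    return (1.0 - z) * h_prev + z * h_cand


# Hypothetical parameter values, chosen only for illustration.
params = {"wz": 0.5, "uz": 0.5, "bz": 0.0,
          "wr": 0.5, "ur": 0.5, "br": 0.0,
          "wh": 1.0, "uh": 1.0, "bh": 0.0}

h = 0.0
for x in [0.5, -0.1, 0.3]:  # a short made-up sensor sequence
    h = gru_step(x, h, params)
```

Driving the update gate toward 0 (a large negative `bz`) keeps the old activation almost unchanged, while driving it toward 1 replaces it with the candidate, which is the complementary relationship described above.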
In view of the excellent feature extraction performance of AE, and the fast calculation advantage of GRU, in this study the RUL prediction model works by extracting the important features from the original data using AE. After pre-processing of the extracted features, the GRU model and DNN (deep fully connected neural networks) are applied to predict the RUL.

RNN Applications in Time Series Forecasting
Chen et al. [31] performed a prediction of mechanical state (PMS), and extracted features from the collected sensor data of the machine through empirical mode decomposition. This method makes the unstable signal more stable for building the LSTM model. Zheng et al. [32] predicted the short-term power system load capacity, with the time range collected from a few hours to several weeks. The characteristics of power system load data are non-linear, non-stationary, and non-seasonal, so the research applied an LSTM model to solve this problem. ElSaid et al. [33] applied an LSTM model to predict aircraft engine vibration values in the next 5, 10, and 20 s. Seventy-six parameters are recorded in the aircraft flight data recorder. Fifteen key features were selected by experts with a professional background. The features were normalized to make the values range between 0 and 1, and used to make predictions. Cenggoro and Siahaan [34] constructed a DLSTM (Deep Long Short-Term Memory) neural network to predict traffic flow. The input layer had 5 input values, there were 5 hidden layers, and each hidden layer had 100 neurons. Bao et al. [35] conducted a stock price prediction divided into three stages. Stage one applied a wavelet transform algorithm to remove noise from the original data. Stage two stacked AEs to reduce the data dimensions and pick out features automatically. Stage three applied an LSTM model to make predictions. Zhang et al. [36] divided Sea Surface Temperature (SST) prediction into two parts. The first part is short-term forecasting, which predicts the SST after 1 and 3 days. The second part is long-term forecasting, which predicts the weekly average and monthly average of SST. The model finds the relationship between sequences through LSTM, and then makes the prediction through the fully connected layer. How et al. [37] used the angle changes on different motion sensors of the NAO robot to classify the current actions of the robot through an LSTM model. Truong et al. [38] constructed an LSTM model to identify changes in human type and object type, and even predict possible human actions through the action combinations. Kuan et al. [39] constructed an MS-GRU (multilayered self-normalizing gated recurrent units) model, which has good results in predicting power loading. Zhang and Kabuka [40] used the GRU model to predict the traffic volume. The training data included the historical weather data from the previous 100 h to predict the traffic volume in the next 12 h. It was found that weather conditions can improve the prediction accuracy.

Research Framework
This study constructs an AE-GRU model for RUL prediction to increase equipment life, reduce unexpected harm caused by sudden shutdown of machinery, and improve the reliability of system operation. The proposed AE-GRU includes the steps of data pre-processing, feature extraction, and RUL prediction, as shown in Figure 3. The data pre-processing step first defines the engine life so that the exact RUL of the engine can be predicted, and then standardizes the data to avoid errors caused by different scales of the data. The deep learning autoencoder model is then used to extract the features of the original data. Because the GRU is specialized in processing time series data and considers the changes of historical characteristics, the data are converted from 2D to 3D, where the new dimension is time. The last step of data processing is to change the value range of the RUL: because the original RUL decreases linearly, the results in model training and prediction are not optimal. Finally, the RUL is predicted through the GRU model and a fully connected neural network.
The symbols used in this research are defined as follows:
N: number of engines
S: number of sensors
n: length of sensing data
Y: number of neurons
L: number of hidden layers
W: weight
b: bias vector
Features: new features after extraction

Data Pre-Processing
In the original engine data, there are only usage records for each engine, and there is no specific RUL of the engine recorded. Therefore, this study defines the time from good to bad for each engine, so that we can apply supervised learning methods to the experiment. It is possible to predict the RUL accurately using the parameter changes recorded by the sensors. As shown in Table 1, after finding the available time for each engine record (Equation (11)), the difference between the maximum time and the current time (Equation (12)) is the RUL of the engine.
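The labeling step described above can be sketched as follows. This is a minimal illustration under the assumption that each engine's records are indexed by a cycle counter and that the last recorded cycle marks the failure point; the function name `label_rul` and the toy records are hypothetical.

```python
def label_rul(cycles_by_engine):
    """Map {engine_id: [cycle numbers]} to {engine_id: [RUL for each record]}."""
    labels = {}
    for engine_id, cycles in cycles_by_engine.items():
        last_cycle = max(cycles)  # maximum recorded time of this engine
        # RUL is the maximum time minus the current time of each record.
        labels[engine_id] = [last_cycle - c for c in cycles]
    return labels


# Toy records: engine 1 ran for 5 cycles, engine 2 for 3 cycles.
records = {1: [1, 2, 3, 4, 5], 2: [1, 2, 3]}
rul = label_rul(records)
```

With these toy records, the last record of each engine receives an RUL of 0 and earlier records receive correspondingly larger values, which gives the supervised target used in the rest of the pipeline.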
$$\mathrm{RUL} = T - X_{it} \quad (12)$$

In the original engine data, the scales of the values measured by different sensors, such as pressure, temperature, and humidity, are different. To remove the effect of scale and standardize all the sensor values, the original data are all converted to Z-scores. The average value $\mu$ is the sum of the sensor data $x_i$ divided by $n$, the number of observations (Equation (13)). The standard deviation $\sigma$ is also calculated (Equation (14)). The transformed data $x_i'$ are obtained by subtracting the average value from each sensor value and dividing by the standard deviation (Equation (15)). After the transformation, the average value of $x_i'$ is 0 and its standard deviation is 1.

Feature Extraction
AE is used to reduce the data dimension and achieve the effect of feature extraction. As shown in Figure 4, the encoder reduces the dimension of the data and transforms the original data into a new space. The features in this space can describe the data more concisely than the original features. The new features are computed as in Equation (16), where $f()$ is the activation function, which is ReLU in this model, $W$ is the weight matrix, $h_l$ is the vector passed into the layer, and $b$ is the bias.

$$\mathrm{Features} = f(W * h_l + b) \quad (16)$$
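The standardization and the encoder computation described above can be sketched together as follows. This is an illustrative sketch, not the configuration used in this study: it assumes the population standard deviation for the Z-score, a single encoder layer with ReLU activation, and hypothetical weights and sensor readings.

```python
import statistics


def z_score(values):
    """Standardize one sensor channel to zero mean and unit standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        # A constant sensor carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def relu(v):
    return max(0.0, v)


def encode(x, weights, bias):
    """Single encoder layer: Features = ReLU(W x + b) for one sample x."""
    return [relu(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


# Hypothetical readings of one sensor over eight cycles.
readings = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
standardized = z_score(readings)

# Hypothetical 2x2 encoder weights applied to one two-sensor sample.
features = encode([1.0, 2.0], [[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0])
```

In a trained autoencoder the weights would be learned by minimizing the reconstruction error of the decoder output; only the forward pass of the encoder is shown here.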
The data collected by sensors is a series of continuous data. However, a general model uses only the current sensor value to predict the RUL, and the change in historical sensor values is not considered. In this study, a new data processing method is proposed to take historical information into account. The format of the data is converted from the original two-dimensional (samples, features) to three-dimensional (samples, time steps, features). Zero-padding is often used in signal processing or Convolutional Neural Networks (CNN). Its purpose is to keep or increase the dimension of the data without affecting the information of the data itself. In this study, when the current sensor record has no historical data, the zero-padding method is used to keep the data dimension.
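The conversion from (samples, features) to (samples, time steps, features) can be sketched as follows. This is a minimal illustration that assumes the zeros are placed before the earliest records of an engine, so that every window ends at the current record; the function name `make_windows` and the toy data are hypothetical.

```python
def make_windows(rows, time_steps):
    """Turn one engine's records (time-ordered list of feature lists)
    into one window of shape (time_steps, features) per record."""
    n_features = len(rows[0])
    # Zero rows stand in for the history that does not exist yet.
    pad = [[0.0] * n_features for _ in range(time_steps - 1)]
    padded = pad + rows
    # Window i ends at record i and reaches back time_steps - 1 records.
    return [padded[i:i + time_steps] for i in range(len(rows))]


# Toy engine with three records and two features per record.
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
windows = make_windows(rows, time_steps=3)
```

The number of samples is unchanged, so each window keeps a one-to-one correspondence with the RUL label of its last record.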

Maximum RUL Transformation
The maximum and minimum engine lives are 356 and 127 cycles, respectively, and the average engine life is 209 cycles. The literature related to the prediction of RUL converts the maximum RUL to a specific reasonable value in the experiment. In this study, the transform process refers to the setting of T = 130 in Heimes [3]. An RUL greater than T is set to T, and an RUL less than T is left unchanged (Equation (17)). Figure 5 shows the life transform of one of the engines in the data set. Through this conversion, the RUL changes from the original linear decline to a nonlinear decline. Both the model training and the estimation of the RUL give better results with this setting.
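The transformation described above amounts to capping the label at T. A minimal sketch, assuming T = 130 as stated in the text and using a hypothetical engine that fails after 200 cycles:

```python
def cap_rul(rul_values, max_rul=130):
    """Set every RUL above max_rul to max_rul and leave smaller values unchanged."""
    return [min(r, max_rul) for r in rul_values]


# Hypothetical engine with 200 records: raw RUL counts down from 199 to 0.
raw = list(range(199, -1, -1))
capped = cap_rul(raw)
```

The capped label stays flat at 130 during the early part of the engine's life and then decreases linearly to 0, which is the shape of the transformed curve described in the text.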

RUL Prediction Model
The RUL prediction model of this study combines effective feature extraction with fast computation. The model is divided into four steps, as shown in Figure 6.
Step 1 is the input layer. $X_t$ represents the sensor data collected at the current time, $X_{t-1}$ the sensor data collected at the previous time step, and $X_{t-2}$ the sensor data collected two time steps earlier. The number of historical time points fed to the model is adjustable.
Step 2 is the GRU layer. The number of GRU layers is set to two in this paper. The number of GRU neurons is tuned in the experimental part, and the best parameter combination is selected. The purpose of this step is to capture temporal correlation in the input data. However, the GRU does not output the RUL directly, so a DNN is added to predict it.
Step 3 is the DNN layer. The DNN maps the features extracted by the GRU to the prediction dimension. As in Step 2, the number of neurons is taken from the best parameter combination found in the experiments. The objective function measures the difference between the predicted value and the true value. In this study, Mean Square Error (MSE) is used as the objective function (Equation (18)), and gradient descent is used to minimize it and train the model.
Step 4 is the output layer of the model. The output is the predicted RUL at the current time. The model is evaluated by calculating the Root Mean Square Error (RMSE).
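The two error measures named above can be written out as follows. Equation (18) is not reproduced in this section, so the sketch assumes the standard definition of MSE (the mean of the squared differences); the function names and the sample values are illustrative only.

```python
import math
from typing import Sequence


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean Square Error: average squared difference (assumed standard form)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root Mean Square Error, expressed in the same unit as the RUL (cycles)."""
    return math.sqrt(mse(y_true, y_pred))


# Made-up true and predicted RUL values for four records
true_rul = [130.0, 100.0, 50.0, 0.0]
pred_rul = [120.0, 110.0, 50.0, 10.0]
training_loss = mse(true_rul, pred_rul)
evaluation_score = rmse(true_rul, pred_rul)
```

MSE is convenient as a training objective because it is smooth, while RMSE is easier to interpret because it is measured in cycles.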

Data Collection
To validate the proposed AE-GRU, the Turbofan Engine Degradation Simulation Data Set (Prognostics and Health Management 2008, PHM08) (https://www.nasa.gov/) is used for performance evaluation in this section. The data were generated with C-MAPSS, a NASA tool, to simulate realistic large commercial aircraft turbofan engines. The data consist of many time series of operating cycles from different engines of the same type, and each engine starts with a different degree of initial wear. The first section introduces the engine data collected by the sensors, and the second section describes the parameter settings of the model: the performance of different activation functions and optimizers is compared, and then the number of hidden layers and neurons is compared. In the third section, the best parameters are applied to different models, which are tested by analyzing the root mean square error (RMSE).
There are data for 218 turbofan engines and a total of 45,918 records in the dataset. Every record has 26 original features and the customized RUL. The experimental design and verification adopt k-fold cross-validation to ensure the stability of the model. Figure 7 plots the sensor parameters of the turbofan engines in this study, using Engine 1 as an example. The sensor values vary in range and their trends are not consistent. In this research, the RUL of an engine is predicted from the changes in these sensor parameters.

Hyperparameters Setting
Experience in tuning the model shows that the hyperparameters of a neural network significantly affect its results. This section searches for the best parameter combination to apply in the model for predicting the RUL.
In neural network experiments, the number of hidden layers and the number of neurons are often the most difficult settings to choose. The training results also depend on the choice of batch size and number of epochs. The batch size is the number of samples processed in one iteration. An epoch is one complete pass through the entire training dataset, batch by batch. For example, if a training dataset has 3000 records and the batch size is set to 500, it takes 6 iterations to complete an epoch.
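The relation between records, batch size, and iterations per epoch is a ceiling division. The sketch below assumes that a final, smaller batch still counts as one iteration; the function name is illustrative.

```python
def iterations_per_epoch(n_records: int, batch_size: int) -> int:
    """Number of batches needed to pass over all records once.

    Assumes an incomplete last batch is still processed (rounds up).
    """
    if n_records < 0 or batch_size < 1:
        raise ValueError("n_records must be >= 0 and batch_size >= 1")
    return (n_records + batch_size - 1) // batch_size


example = iterations_per_epoch(3000, 500)    # the example from the text
with_128 = iterations_per_epoch(3000, 128)   # 23 full batches plus one partial
```

A larger batch size therefore means fewer iterations per epoch, which is one reason training time per epoch tends to fall as the batch size grows.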

Activation Function
The main role of the activation function in a neural network is to introduce nonlinearity. Without an activation function, the output is only a linear function of the input and the network cannot handle complex problems, so the activation function is very important in deep learning.
This experiment sets the following parameters to observe the training result of the GRU model with different activation functions:
This experiment uses 5-fold cross-validation to compare a total of seven activation functions and calculates the average loss of each: softmax (3055.5), softplus (517.0), softsign (438.9), relu (434.8), tanh (445.3), sigmoid (990.7), and hard sigmoid (1051.9). According to the training results, softmax performs relatively poorly and the ReLU activation function performs best (Figure 8). The softmax activation function generalizes the sigmoid to multiple classes and is better suited to multiclass classification than to regression. An advantage of ReLU is that a neuron is deactivated only when the output of its linear transformation is not positive, so only a subset of the neurons is active at any one time rather than all of them. The ReLU function can be expressed as [41]:

$$f(x) = \max(0, x)$$

In addition, since ReLU does not require exponential calculations, it converges quickly and has low computational complexity. Therefore, this study adopts ReLU as the activation function.
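To make the computational difference concrete, the sketch below writes out four of the compared activation functions using their standard textbook definitions. It is only an illustration of the formulas, not the code used in the experiment; the numerically stable forms of sigmoid and softplus are a choice made here.

```python
import math


def relu(x: float) -> float:
    """ReLU: a single comparison, no exponential."""
    return max(0.0, x)


def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softplus(x: float) -> float:
    """Softplus log(1 + e^x), a smooth approximation of ReLU."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def softsign(x: float) -> float:
    """Softsign x / (1 + |x|), bounded in (-1, 1) without exponentials."""
    return x / (1.0 + abs(x))


samples = [-2.0, 0.0, 3.0]
relu_values = [relu(x) for x in samples]
```

ReLU and softsign avoid `math.exp` entirely, while sigmoid and softplus call it on every evaluation.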

Optimizer
The purpose of an optimizer is to minimize the loss function, that is, the gap between the predicted value and the real value. This experiment sets the following parameters and observes the training result of the GRU model under different optimizers:
This experiment applies 5-fold cross-validation to compare a total of seven optimization algorithms: SGD (569.8), RMSprop (438.7), Adagrad (4586.9), Adadelta (426.6), Adam (436.6), Adamax (436.8), and Nadam (420.7). The parameters of each optimization algorithm are set to the values recommended in the literature. Kingma and Ba [42] proposed the Adam optimization algorithm, which most current neural networks use to optimize the loss function. The advantages of Adam are fast convergence and robustness to noisy and sparse gradients. The Nadam optimization algorithm proposed by Dozat [43] modifies the momentum component of Adam by incorporating Nesterov momentum, which accelerates the convergence of the model. The experiment confirms that on the PHM08 dataset Nadam converges faster than Adam (Figure 9). Therefore, the optimizer used in this study is Nadam.
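The difference between the Adam and Nadam updates can be sketched on a one-dimensional toy problem. The code below uses the Adam update of Kingma and Ba and a commonly cited simplified form of Nadam (without the momentum schedule of the original proposal). The toy loss f(x) = x², the starting point, and the hyperparameter values are assumptions for illustration; the sketch does not reproduce the comparison in Figure 9.

```python
import math


def grad(x: float) -> float:
    """Gradient of the toy loss f(x) = x**2 (illustrative stand-in)."""
    return 2.0 * x


def run(optimizer: str, x0: float = 5.0, steps: int = 200, lr: float = 0.1,
        beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-8) -> float:
    """Apply `steps` updates of Adam or simplified Nadam and return x."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1.0 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1.0 - beta1 ** t)           # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        if optimizer == "adam":
            direction = m_hat
        elif optimizer == "nadam":
            # Nesterov-style look-ahead: blend the corrected momentum
            # with the bias-corrected current gradient
            direction = beta1 * m_hat + (1.0 - beta1) * g / (1.0 - beta1 ** t)
        else:
            raise ValueError("optimizer must be 'adam' or 'nadam'")
        x -= lr * direction / (math.sqrt(v_hat) + eps)
    return x


x_adam = run("adam")
x_nadam = run("nadam")
```

The only difference between the two branches is the update direction: Nadam gives the current gradient extra weight on top of the accumulated momentum.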

Number of Hidden Layers and Neurons
This experiment compares three different batch sizes: 32 (Table 2), 64 (Table 3), and 128 (Table 4). Early stopping is adopted in the experiment: if the validation MSE does not improve for 10 consecutive epochs, training stops and the best model so far is used on the test dataset. Early stopping is commonly used with gradient descent to avoid overfitting. The GRU and the DNN are each set to two layers, and the RMSE results for different numbers of neurons are compared. The input of this experiment is the 24 parameters of the current raw sensor data, with no historical information (time steps = 1).
The experimental results show that the larger the batch size, the faster the model trains. With a larger batch size, not only is the training time shorter, but the RMSE results are also better. Therefore, the batch size in this study is set to 128, and the network architecture is set to G(64,64)N(8,8).
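The early stopping rule can be sketched as a scan over a sequence of validation losses. This is a simplified illustration with a patience of 10; the function name, the epoch numbering from 1, and the made-up loss values are assumptions, and a real training loop would also save the model weights at the best epoch.

```python
from typing import Iterable, Tuple


def early_stopping(val_losses: Iterable[float],
                   patience: int = 10) -> Tuple[int, float, int]:
    """Return (best_epoch, best_loss, stop_epoch) for a validation curve.

    Training stops once the loss has failed to improve on the best value
    for `patience` consecutive epochs. Epochs are numbered from 1.
    """
    best_epoch, best_loss, stop_epoch = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        stop_epoch = epoch
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss   # new best model
        elif epoch - best_epoch >= patience:
            break                                 # patience exhausted
    return best_epoch, best_loss, stop_epoch


# Made-up validation MSE: improves for three epochs, then stalls
curve = [500.0, 450.0, 440.0] + [445.0] * 20
result = early_stopping(curve, patience=10)
```

In this toy curve the best model comes from epoch 3, and training ends at epoch 13 after ten epochs without improvement.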

Model Comparison
This section uses 5-fold cross-validation to evaluate the validity of the AE-GRU RUL model and compares it with existing models. The 218 engines are divided into five parts of 44, 44, 44, 44, and 42 engines. Each time, four parts are used for training and the remaining part is used as test data. After the data are split, the AE model reduces the dimension of the data, and the GRU then predicts the RUL of the engine. The performance of the models is compared using RMSE. The dimension of the data is reduced from the 24 inputs of the original data to 15, 10, or 5. The AE extracts the characteristics of the original data by nonlinear transformation through multiple hidden layers, and the loss of each feature in the compression and restoration process is observed (Table 5). The experimental results show that dimension reduction works best when the input is reduced to 15 dimensions. With 10 or 5 dimensions, restoration is poor because information in the original data is lost. In this study, therefore, the AE reduces the data dimension from 24 to 15, and the performance is compared with other models.
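The engine-level split described above can be sketched as follows. The text does not say how engines are assigned to folds, so this illustration assumes consecutive engine IDs from 1 to 218 without shuffling; the function names are hypothetical. Chunking by a fold size of 44 reproduces the reported part sizes.

```python
from typing import Iterator, List, Tuple


def engine_folds(n_engines: int = 218, k: int = 5) -> List[List[int]]:
    """Split engine IDs 1..n_engines into consecutive chunks.

    The chunk size is the ceiling of n_engines / k (44 for 218 engines),
    so the last fold holds the remainder. For other values of n_engines
    and k this simple chunking can yield fewer than k folds.
    """
    fold_size = -(-n_engines // k)  # ceiling division
    ids = list(range(1, n_engines + 1))
    return [ids[i:i + fold_size] for i in range(0, n_engines, fold_size)]


def train_test_splits(folds: List[List[int]]) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_ids, test_ids): each fold is the test set exactly once."""
    for i, test in enumerate(folds):
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        yield train, test


folds = engine_folds()
fold_sizes = [len(f) for f in folds]
splits = list(train_test_splits(folds))
```

Splitting by engine rather than by record keeps all cycles of one engine on the same side of the split, so no engine contributes to both training and testing within a fold.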
After 5-fold cross-validation (Tables 6 and 7), it is found that when the input of AE-GRU is reduced from the original 24 features to 15, its RMSE results are better than those of the other models. This confirms that AE-GRU can effectively extract features and accurately predict the RUL. Time steps is the number of historical data points: time steps = 5 means the record at the current time point plus the previous four records, and time steps = 10 means the current record plus the previous nine. The results show that when time steps = 5, the results of the models are quite close. When time steps = 10, because of the large amount of historical data, the RNN suffers from the vanishing gradient problem, while the general DNN has no time series characteristics, so their prediction results are the worst.
From the figure, it can be seen that all five models show unstable fluctuation. Neither DNN nor RNN is ideal in training or testing. LSTM, GRU, and AE-GRU perform well on the training set but are unstable on the testing set, because there is not enough training data; deep learning requires a large amount of data to identify features. While the RUL is stable at 130 cycles, the AE-GRU proposed in this study predicts the RUL of the engine more accurately and does not fluctuate like DNN and RNN. When the RUL begins to decline, AE-GRU can predict it in advance.

Conclusions
This study proposes the AE-GRU model for RUL prediction. First, pre-processing is applied to the data collected by different sensors. Because the scales of the sensors differ, the sensor data are standardized to a common scale while retaining their time series characteristics. Next, the RUL is defined. The purpose of defining the RUL is to allow the user to accurately predict how many more cycles the engine can operate. Feature extraction finds the characteristics of the sensor data by deep learning. These characteristics directly affect the predictive ability of the model: a good model depends on features that fully represent the data. Data dimension conversion takes the historical sensor data into account, because the data have the characteristics of a time series. The last step of data pre-processing is the transformation of the maximum life value: an RUL greater than 130 cycles is set to 130 cycles. In RUL prediction, attention focuses on the RUL when the engine is about to fail; the situation where the RUL is still very large is less important. The GRU is then trained with the back-propagation algorithm to learn the parameters. Because there are a large number of parameters to learn, the experiment uses a GPU for parallel computation to increase processing speed. At the same time, the ReLU activation function and the Nadam optimizer are used so that parameter learning converges more quickly. With the best combination of parameters found in the experimental design, the RUL prediction achieves better results.
The contribution of this study is that it applies the GRU, a streamlined variant of LSTM. The GRU effectively captures the characteristics of the sensor data, and the AE extracts the features of the data before the RUL prediction model is applied. The GRU model has fewer parameters than LSTM but still retains the advantages of LSTM. The proposed AE-GRU model has a shorter training time and also benefits from the features extracted by the AE, so the model converges much faster. The validity of the model is evaluated and verified through 5-fold cross-validation. Its root mean square error is lower than that of other deep learning methods, and the model can identify an engine that is about to fail early enough for the equipment to be maintained in time, reducing unnecessary costs.
The AE-GRU model proposed in this study predicts the RUL with good accuracy. As process technology advances, the cost of equipment will continue to rise. Therefore, more complex models, such as a bidirectional recurrent network, may be needed in future research. This study considers only a unidirectional recurrent network, yet in some cases the output at the current moment is related not only to the previous states but also closely to the states that follow.
The pre-processing method in this study extracts features from the original data. Although the features represent the data set effectively, they may contain some noise. In the future, the pre-processing method might employ an advanced version of the AE, such as a denoising AE (DAE). Combining these two methods could further improve the prediction of the RUL.
If the algorithm can be implemented effectively, predictive maintenance can be used widely in many applications, providing useful information for the early maintenance of machinery, while self-correcting parameters can improve yield and help achieve the goals of smart factories.

Conflicts of Interest:
The authors declare no conflict of interest.