Semi-Supervised Framework with Autoencoder-Based Neural Networks for Fault Prognosis

This paper presents a generic framework for fault prognosis using autoencoder-based deep learning methods. The proposed approach relies upon a semi-supervised extrapolation of autoencoder reconstruction errors, which can deal with the unbalanced proportion between faulty and non-faulty data in an industrial context to improve systems' safety and reliability. In contrast to supervised methods, the approach requires less manual data labeling and can find previously unknown patterns in data. The technique focuses on detecting and isolating possible measurement divergences and tracking their growth to signal a fault's occurrence, while individually evaluating each monitored variable to provide fault detection and prognosis. Additionally, the paper provides an appropriate set of metrics to measure the accuracy of the models, addressing a common disadvantage of unsupervised methods arising from the lack of predefined answers during training. Computational results using the Commercial Modular Aero-Propulsion System Simulation (CMAPSS) monitoring data show the effectiveness of the proposed framework.


Introduction
Incorporating IoT into maintenance has brought new possibilities, including condition-based maintenance (CBM). CBM aims to avoid unnecessary maintenance tasks by taking maintenance actions only when there is evidence of abnormal behavior of physical assets [1,2]. If a CBM program is appropriately established and effectively implemented, it can significantly reduce maintenance costs by reducing the number of scheduled preventive maintenance operations [3].
The main feature of CBM is the condition monitoring (CM) process, in which signals are continuously monitored from certain types of sensors or other appropriate indicators to show the current state of a system or component [4]. Thus, a CBM program consists of three key steps [4]: data acquisition (information collection), data processing (information understanding and interpretation), and decision making (aimed at recommending efficient maintenance policies).
The sequence of steps mentioned above results in two essential forms of analysis within a CBM program: fault diagnosis and prognosis. While diagnosis deals with fault detection and isolation of faulty components, prognosis aims at predicting when the diagnosed fault will turn into a failure, i.e., its goal is to estimate how soon and how likely this failure is to occur [3]. A comprehensive description of prognosis and prognosis modeling is provided by ISO 13381-1 [5], which defines it as "an estimate of the time to failure and risk for one or more existing or future failure modes." Such an estimate is often referred to as remaining useful life (RUL).
Deep learning (DL) methods have recently been gaining ground in prognosis and health management as they are solutions capable of identifying and predicting the equipment condition through large datasets. They are helpful in circumstances where there is little or no investigation into the physics of the failure. However, the current state of the art concerning industrial-scale integrated solutions is still incipient since many works use simulated databases or real data with artificial faults. Thus, some challenges must be surpassed, such as the ability to learn in environments with evolving operating conditions, novelty detection, robustness to changes in the operational conditions, the capacity of generalization, and output interpretability [6].
Deep learning embraces neural network learning models with multiple layers of computational units that are capable of decomposing higher-level abstract features in terms of other more straightforward representations [7]. As an extension of the single-layer networks, it is also suitable for supervised, unsupervised, and semi-supervised types of learning.
DL emerges in prognostics and health management (PHM) as a resource to solve previously intractable problems in the field, improve performance over traditional techniques, and reduce the effort to deploy prognostic systems due to its advantages. Fink et al. [6] listed some of them as the ability to automate the processing of a significant amount of condition monitoring data, extract valuable features from high dimensional, heterogeneous data sources, learn functional and temporal relationships between and within the signal time series, and transfer knowledge between different operating conditions and different units. Furthermore, DL contributes to attenuating the need for feature engineering in datasets composed of many monitored variables, which is demanding, by incorporating it inside its own network [6].
Several studies have applied DL techniques to solve fault diagnosis and prognosis problems. Tao et al. [8], for example, studied the different structures of a two-layer network designed by varying the hidden layer size and evaluated for its impact on fault diagnosis.
Babu et al. [9] built a convolutional neural network (CNN) to predict the RUL of a system using the readings of several sensors as the input. The authors conducted a series of experiments and demonstrated how a CNN-based regression model could outperform three other regression methods, i.e., the multilayer perceptron, the support vector regression, and the relevance vector regression.
Among the authors that explored the signal reconstruction approach, Malhotra et al. [10] combined LSTM layers in an encoder-decoder to attain an unsupervised health index for a system using multi-sensor time-series data. The study concludes that LSTM-ED constructed HI learned in an unsupervised manner can capture the degradation in a system and that this HI can be used to learn a model for RUL estimation with equivalent performance to domain knowledge or exponential and linear degradation model assumptions.
Wu et al. [11] have proposed a semi-supervised diagnosis architecture called "hybrid classification autoencoder," which uses a softmax layer over the encoded features of the autoencoder. In their approach, vibration data are pre-processed into a bi-dimensional input by a short-time Fourier transform (STFT) and subjected to consecutive convolutional layers. Experimental validation has been performed on a publicly available dataset of motor-bearing signals. The authors also presented a practical application in a hydro generator rotor diagnosed with a rub-impact fault between the turbine shaft and turbine guide bearing.
Moreover, several diagnosis and fault prognosis models and frameworks are available in the literature for the most diverse scenarios. One of the first was elaborated by Vachtsevanos et al. [12], and it divides the process into seven steps, starting from sensor data collection, passing through FMEA analysis, operating mode identification routine, feature extractor, and sequential diagnostic and prognostic modules.
Other researchers [13] prefer explicitly declaring the health index (HI) construction as a step of the prognostic scheme and prefer discerning the health stage (HS) of the system by the indicator. The health stage division in the Lei et al. [13] framework shares similarity with the fault detection and diagnostic actions but has the particular goal of splitting a degradation pattern into different health stages according to variations in its characteristics.
Even some standards attempt to generalize a conceptual framework aiming to provide basic CBM and PHM modules from data acquisition and analysis to health assessment, prognostic assessment, and advisory generation. ISO 13381-1 divides this process into four actions (one preprocessing and three prognostic types or levels of increasing complexity): data preprocessing, existing failure mode prognosis, future failure mode prognosis, and post-action prognosis.
However, although many authors conclude that semi-supervised or unsupervised learning-based methods embedded in frameworks are more appropriate for multiple reasons (see [14], for example), supervised approaches that depend on intensive manual intervention to label the data still predominate. Additionally, supervised methods cannot find previously unknown patterns in data, which are not rare in industrial environments, where various causes of failure result in very different behaviors of each signal before different incidents, even of the same type [14]. According to Sikorska et al. [15], most of the research on prognosis has been theoretical and restricted to a small number of models and failure modes; there are few published examples of prediction models applied in complex systems exposed to various operational and business conditions. Thus, the main objective of this work is to provide a semi-supervised framework based on autoencoder deep learning methods for fault detection and prognosis. To overcome a common limitation of unsupervised methods related to the lack of predefined answers during training, this work provides a set of metrics designed to measure the accuracy and effectiveness of the models, ensuring comparability between them for validation and improvement purposes.
Thus, this work not only expands the literature on semi-supervised methods for fault prognosis but also provides a generic framework based on an autoencoder deep learning method. Consequently, the contributions of the proposed approach can be stated as follows:

1. This approach provides a systematic framework for implementing a semi-supervised prognosis method based upon an autoencoder deep learning method;
2. This approach implements a framework designed for application in the industrial scenario, since it considers the system's restrictions, such as data management, the physical behavior of degradation processes, and business specifications;
3. This approach enables the detection of different kinds of faults by evaluating each sensor channel (i.e., variable) individually;
4. This approach proposes a set of metrics to evaluate the accuracy and effectiveness of the fault detection and prognosis models.
The rest of this text is structured in the following way: Section 2 presents the proposed framework in detail; Section 3 shows the results obtained by applying the framework to the CMAPSS database; and Section 4 presents the conclusions and discussions arising from this work.

The Proposed Framework
The proposed approach relies upon generating a prognosis horizon for fault degradation patterns using the reconstruction error extrapolation of a deep autoencoder trained only with the machine's normal operating condition monitoring data. This work focuses on detecting and isolating possible divergences in the monitoring measurements, which may indicate a fault, and extrapolating their growth to predict the machine's RUL. Such extrapolation is conducted using a set of more straightforward univariate functions with known behavior until a limit of divergence, signaling the failure's occurrence.
The reason for using this approach, in contrast with what has recently been adopted in the prognosis literature, is the demand for models capable of following, recording, and interpreting machine behavior in the context of complex engineering systems (CES). Currently, industrial equipment is generally assisted by monitoring systems (either automated or controlled by humans). These systems are usually assisted by programmed fault alarms based on guidelines or empirical knowledge about the process, composing the resources for predictive maintenance. This type of setup is classified as Level 0 in terms of prognostic implementation readiness according to the ISO 13381 classification [5]; i.e., they are CES with monitoring infrastructure capable of performing detection and sometimes fault identification, yet they do not form a strong foundation for more sophisticated prognostic techniques that require intensive and systematic diagnostic capacities.
In line with the above, the proposed model constitutes a framework that has the potential to embrace all the aforementioned requirements (detection, isolation, and identification of the fault) for remaining useful life prediction, reaching Level 1 prognostics according to the same standard. In fact, since the autoencoder is a signal reconstructor and can therefore work as a hidden state reconstructor, diagnosis is possible because each channel can be compared individually to provide a multilabel classification of different kinds of faults.
Another reason for adopting this approach is the unbalanced proportion of monitoring data between faulty and non-faulty conditions in an industrial scenario, since some failure events are rare for certain types of equipment. Hence, the availability of a large volume of data from the normal operating conditions of a CES makes data-driven methodologies attractive. The steps for implementing the prognosis framework and RUL prediction are expressed in the flowchart in Figure 1.

Step 1: Data Preparation
The data selection comprises the procedure of selecting data in normal operational conditions (NOC), and it is necessary to characterize this state beforehand with the help of a specialist or some reference criteria (for example, collecting data immediately after maintenance or an arbitrary time interval before the occurrence of a fault). It is worth pointing out that there is no need to establish a perfect boundary in the transition of the conditions, since the aim is to detect incremental abnormalities.
After that, the data are scaled using the given criteria, which could be achieved by using a value range reference or by removing the mean and scaling to unit variance (standard score), according to the dataset profile in the application example. It is worth emphasizing that only NOC-labeled data are applied to calibrate the scaler to avoid distortions in the set designated to train the networks. Following this, the data are reshaped into a set of subsequences of size n that will supply the models. These subsequences are generated through a moving window with a temporal iteration step of p, thus allowing an overlap of n - p samples. Each sample has shape (n, m), where n is the subsequence size and m is the number of channels (sensor inputs). For this work, the NOC-labeled data are split between train and validation sets to prepare the neural network. The term test set will refer exclusively to data not applied in the DNN training and tuning process, including abnormal-labeled data.
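As an illustration, the NOC-only scaling and moving-window reshaping described above could be sketched as follows; this is a minimal NumPy sketch under our own assumptions (the function name, toy data, and the choice of a 60-sample NOC segment are illustrative, not values from the paper):

```python
import numpy as np

def make_subsequences(data, n, p):
    """Slice a (T, m) signal matrix into overlapping windows.

    Returns an array of shape (num_windows, n, m), where consecutive
    windows overlap by n - p samples.
    """
    T = data.shape[0]
    starts = range(0, T - n + 1, p)
    return np.stack([data[s:s + n] for s in starts])

# Toy monitoring data: 100 time steps, 3 sensor channels.
rng = np.random.default_rng(0)
signals = rng.normal(size=(100, 3))

# Fit the standard scaler ONLY on the NOC-labeled portion (here, the
# first 60 samples) to avoid leaking abnormal statistics into training.
noc = signals[:60]
mean, std = noc.mean(axis=0), noc.std(axis=0)
scaled = (signals - mean) / std

windows = make_subsequences(scaled, n=20, p=5)
print(windows.shape)  # (17, 20, 3)
```

With n = 20 and p = 5, consecutive windows share 15 samples, which is what allows the reconstruction error to be tracked densely in time.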

Step 2: Fault Detection
The DNN autoencoder models are programmed according to the hyperparameter specifications in Table 1, and their structures are illustrated in Figure 2. Three different kinds of layers are used, namely MLP, LSTM, and 1D-convolutional (1D-CNN or Conv-1d), which are commonly used in the literature to set up DL models for monitoring signals. Although the MLP may perform poorly compared with the other layers, it is applied in this study as a reference to analyze the models; thus, a minimum requirement for them is to outperform a classic multilayer perceptron architecture. Recurrent neural networks, especially LSTM, are widely used for PHM applications. Moreover, Conv-1d is an alternative for applying convolutional operations to time-series data without demanding transformations into a bi-dimensional spatial representation, which is time-consuming. In addition, Conv-1d is less computationally expensive because it has fewer parameters.
Rosa et al. [23] investigated the sensitivity of the AE architecture hyperparameters on its abnormality detection performance. The study concluded that some specific hyperparameters influence the model outcomes more than others, thus serving as a reference for defining the search space. Although easy to implement, grid search is an exhaustive procedure that is inefficient without prior knowledge of the search space near optimality. Alternatives include random search and search based on Bayesian optimization theory [24].
The reconstructed signal subsequences outputted from the trained AE models are compared with the actual signal observations, and the reconstruction error (RE) is evaluated. RE is computed as a mean squared error (MSE) function applied to a subsequence for each channel, so it is possible to inspect discrepancies individually. The subsequences are addressed by the time index of the last observation; the RE is then also assigned to this position. Thus, the reconstruction error follows the notation below:

RE_{i,j} = (1/n) * sum_{k=1..n} ( s_{i,j}^{(k)} - s_{i,j}^{r,(k)} )^2,    (1)

where RE_{i,j} is the reconstruction error for the subsequence s_{i,j} = {s_{i,j}^{(1)}, ..., s_{i,j}^{(n)}} with temporal index i, channel index j, and size n, and s^r_{i,j} corresponds to the reconstructed subsequence mimicking s_{i,j}.
The abnormality detection procedure comes afterward, employing the reconstruction error matrix RE, whose entries are defined by Equation (1), to build a set of error threshold functions f_th(t) used to classify whether or not a data entry is abnormal. First, it is important to note that the RE_{1:n,j} series are subject to local variability due to outliers that may come from the sensors' readings. For example, in machines with more than one operational mode or with intermittent operation, the working routine is cyclical, exhibiting unstable behavior during state transitions or due to variations in the cycle periods. Examples are the take-off and landing of aircraft or the switch between generation and motorization modes in hydro-generators. Moreover, characterizing state transitions is not a trivial task, even for experts in the process, and it is part of the data selection step of this study. This fluctuation could severely affect the method's abnormality detection capacity and must be considered in the definition of f_th(t) and in the interpretation of the entries of RE.
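A minimal sketch of the per-channel RE computation, assuming the windowed arrays from Step 1 have shape (num_windows, n, m); the function name and the toy "reconstruction" are illustrative assumptions:

```python
import numpy as np

def channel_reconstruction_error(s, s_rec):
    """Per-channel MSE between actual and reconstructed subsequences.

    s, s_rec: arrays of shape (num_windows, n, m). The result RE has
    shape (num_windows, m): one error per window and per channel, so
    each sensor can be inspected individually. Each RE row is addressed
    by the time index of the last observation in its window.
    """
    return ((s - s_rec) ** 2).mean(axis=1)

# Sketch: pretend the autoencoder reproduced the signal with small noise.
rng = np.random.default_rng(1)
s = rng.normal(size=(17, 20, 3))
s_rec = s + 0.1 * rng.normal(size=s.shape)
RE = channel_reconstruction_error(s, s_rec)
print(RE.shape)  # (17, 3)
```

Keeping the channel axis separate is what later allows fault isolation, since a divergence in a single sensor shows up in only one column of RE.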
The main part of Step 2 is summarized in the flowchart presented in Figure 3 and exemplified by the graphics in Figure 4. The non-conformities are detected in this work by using a set of continuous threshold functions f_th(t) = c_j, where c_j is the maximum value among the post-processed RE samples labeled as NOC for the jth channel. Samples of the post-processed RE_{1:n,j} that exceed c_j are labeled as abnormalities. Sometimes the post-processing of RE alone is not enough to avoid false positives, which are caused by pointwise or small-cluster addressing. To highlight the cumulative abnormality resulting from the monotonic growth of the degradation pattern, an offset of consecutive abnormal-labeled points is used as a requirement for pointing out the beginning of the degradation. It is worth mentioning that several other techniques can be used in each step of the proposed framework. Specifically, regarding fault detection, the objective of the proposed method is similar to that of multivariate statistical process control (MSPC).
However, despite the usefulness of MSPC for multivariate surveillance in industrial practice, it has some disadvantages when it comes to establishing what happened in the process. The need for a mathematical background is another drawback for applying MSPC in real scenarios [25].

Figure 4. In (a), there is a temporal progression of the reconstruction errors for an arbitrary variable. The train set (from zero to the first green line) and the validation set (between the green lines) correspond to NOC, while the test set is in the degraded condition. E(y_r, y_p) (dashed red line) is the maximum error among the samples labeled as normal. The localization of E in the samples' distribution is displayed in (b) as a continuous red line. As the monotonic pattern evolves, it exceeds E, and if a certain quantity of consecutive REs stays above the limit, the abnormality is registered.
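The consecutive-point requirement for declaring a degradation onset could be sketched as follows; the function name, the threshold value, and the minimum run length used below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def detect_degradation_onset(re_series, c_j, min_consecutive):
    """Return the index where degradation starts for one channel, or None.

    re_series: post-processed reconstruction errors RE[:, j] over time.
    c_j: threshold (maximum post-processed RE over NOC-labeled samples).
    min_consecutive: number of consecutive above-threshold points
    required before an onset is declared, suppressing isolated
    false positives caused by pointwise outliers.
    """
    above = re_series > c_j
    run = 0
    for i, flag in enumerate(above):
        run = run + 1 if flag else 0
        if run >= min_consecutive:
            return i - min_consecutive + 1  # first point of the run
    return None

# A single spike at index 1 is ignored; the sustained run starting at
# index 4 triggers the detection.
re = np.array([0.1, 0.9, 0.2, 0.3, 0.8, 0.9, 1.1, 1.2, 1.3])
print(detect_degradation_onset(re, c_j=0.7, min_consecutive=3))  # 4
```

The run-length requirement is what encodes the "offset of consecutive abnormal labeled points" mentioned above: a monotonic degradation keeps the RE above c_j, while sensor glitches do not.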

Step 3: Fault Prognosis
The RUL estimation is developed from the samples inside the degraded state intervals (I_d) provided by the abnormality detection procedure. I_d is defined as a set of consecutive reconstruction error samples, meaning that:

I_d^{i,j} = { RE_{t_i,j}, RE_{t_{i+1},j}, ..., RE_{t_{i+l},j} },

where I_d^{i,j} is the ith interval with cardinality l + 1 for channel j, and {t_i, t_{i+1}, ..., t_{i+l}} is the ordered set of temporal indexes addressed to the REs. These samples could either be subjected to another post-processing routine specially designed for prognosis or to the same routine already used for abnormality detection. Hereupon, the initial goal of the RUL evaluation is to determine the prognostic error threshold for each channel, which is equivalent to the failure limit of a built health index or measured quantity. As the RE cannot be directly related to future variations in the input channels (unless explainability techniques are coupled to the DNN), it is necessary to take past failure events as references to determine those thresholds. Thus, the prognostic error threshold PE_th^{(j)} for the jth channel is given as an average of k reconstruction error samples over n_fault observations before the failure. The next step is iteratively fitting curves and executing extrapolations, from the first detected abnormality until the error threshold for each channel, to obtain the RUL prediction at time t. At the instant t, there can be more than one estimation because the degradation evolution of each channel is treated independently and fitted with a univariate function. Therefore, a decision criterion is required to provide a single prediction, which is achieved by observing curve-fitting metrics, prognostic threshold variability, and the monotonicity of the generated profiles, together with the values of the produced estimations. The pseudocode in Algorithm 1 systematizes the details of this step.
Algorithm 1. Estimation of the RUL for an experimental fault event, under the assumption that the real remaining life is known for study purposes.
Algorithm 1 has three main loops that permeate the prognosis procedure in a given inspection interval as long as an abnormality is detected. The loops, from the most to the least nested, iterate through curves (Loop 1), channels (Loop 2), and time (Loop 3), respectively. The first adjusts the function shape for a channel m at time t_i using the least-squares method and estimates t_eol and thus the RUL. The second, in turn, evaluates the RUL and the curve-fitting metrics for each of the channels with degradation labeled at t_i using the decision function D_f1. Finally, the third loop decides whether a prediction is made at time t_i and its value using the decision function D_f2. Decision functions are subroutines that hierarchically rank the most likely remaining life prediction(s) from an input list by analyzing a set of curve-fitting metrics. D_f1 employs only R^2 in the sorting and eliminates nonsensical outcomes and those below an established fitting limit; the final result comes from the mean of the remaining occurrences. D_f2 considers monotonicity in addition to R^2 and weights the latter at 0.8 out of 1.
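A rough sketch of how a decision function in the spirit of D_f2 might rank candidate predictions, assuming each candidate carries its RUL, R^2, and monotonicity. The dictionary layout, the R^2 cut-off, and the function name are our assumptions; only the 0.8 weighting of R^2 is taken from the text:

```python
def decision_df2(candidates, r2_weight=0.8, r2_min=0.5):
    """Rank candidate RUL predictions from different channels/curves.

    candidates: list of dicts with keys 'rul', 'r2', 'mon'
    (fit quality R^2 and monotonicity of the fitted profile).
    A weighted score of R^2 (0.8) and monotonicity (0.2) orders the
    candidates; nonsensical (negative) RULs and fits below r2_min
    are discarded. Returns the RUL of the best-scored candidate,
    or None when nothing plausible remains.
    """
    valid = [c for c in candidates if c['r2'] >= r2_min and c['rul'] > 0]
    if not valid:
        return None
    best = max(valid,
               key=lambda c: r2_weight * c['r2'] + (1 - r2_weight) * c['mon'])
    return best['rul']

preds = [
    {'rul': 120.0, 'r2': 0.95, 'mon': 0.9},
    {'rul': 300.0, 'r2': 0.60, 'mon': 0.2},
    {'rul': -15.0, 'r2': 0.99, 'mon': 1.0},  # nonsensical, discarded
]
print(decision_df2(preds))  # 120.0
```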
Furthermore, the curve fitting is executed as a non-linear least-squares problem with bounds on the variables. The objective is to find a local minimum of the cost function F(c):

F(c) = (1/2) * sum_{i=1..N} rho( f_i(c, x)^2 ),

where c is the vector of estimable parameters, N is the number of available data points, rho(s) is a scalar loss function that reduces the outliers' influence, and f_i(c, x) is the residual of the ith data point for the candidate model function. The model functions are listed in Table 2 and represent common degradation patterns found in mechanical components [13,26].

Table 2. Selected model functions f(c, x), whose residual vector f is minimized in accordance with the presented algorithm.
The value of f^{-1}(PE_th^{(j)}), i.e., the inverse of the fitted f(t) evaluated at the prognostic threshold, gives the component's t*_eol^{(j)}. Thus, the estimated RUL at the instant t_i is r(t_i) = t*_eol^{(j)} - t_i for the fitted curve.
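Under the stated formulation, the fitting-and-inversion step can be sketched with SciPy's bounded, robust non-linear least squares; the exponential model choice, the starting point, the bounds, and the synthetic data below are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from scipy.optimize import least_squares

def fit_and_extrapolate(t, re, pe_th):
    """Fit an exponential degradation model f(t) = a*exp(b*t) + c to the
    reconstruction errors of one channel and invert it at the prognostic
    threshold pe_th to obtain the estimated end of life t_eol.

    Bounds keep the growth rate b positive so the fitted profile is
    monotonic; the robust 'soft_l1' loss damps the outliers' influence.
    """
    def residuals(params):
        a, b, c = params
        return a * np.exp(b * t) + c - re

    fit = least_squares(residuals, x0=[0.1, 0.01, 0.0],
                        bounds=([1e-6, 1e-6, -np.inf], [np.inf, 1.0, np.inf]),
                        loss='soft_l1')
    a, b, c = fit.x
    # Inverse of f at pe_th: t_eol = ln((pe_th - c) / a) / b
    return np.log((pe_th - c) / a) / b

# Synthetic degradation profile with known parameters; the threshold is
# placed where the true curve would hit it at t = 80.
t = np.arange(0, 50.0)
re = 0.05 * np.exp(0.08 * t) + 0.2
t_eol = fit_and_extrapolate(t, re, pe_th=0.05 * np.exp(0.08 * 80) + 0.2)
print(t_eol)
```

The estimated RUL at the last observed instant would then be t_eol minus that instant, matching r(t_i) = t*_eol - t_i above.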

Step 4: Performance Assessment
The performance assessment is realized with dedicated performance metrics for comparing the autoencoders during the training process, abnormality detection, and prognostics. The AE convergence is observed through the train and validation set losses on the last training epoch. Abnormality detection capacity is measured by the detection coverage, d, and the false-positive coverage, f:

d = (samples correctly signaled as abnormalities) / (real degradation occurrences),

f = (NOC samples flagged as abnormal) / (real entries in the normal state).

Thus, d measures the ratio between the samples correctly signaled by the method as abnormalities and the real set of degradation occurrences, whereas f relates the NOC samples wrongly highlighted to the real entries in the normal state.
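The two coverage indicators can be sketched directly from their definitions (the function names and toy labels are ours):

```python
import numpy as np

def detection_coverage(pred_abnormal, true_abnormal):
    """d: fraction of real degradation samples flagged by the method."""
    pred, true = np.asarray(pred_abnormal), np.asarray(true_abnormal)
    return np.sum(pred & true) / np.sum(true)

def false_positive_coverage(pred_abnormal, true_abnormal):
    """f: fraction of real NOC samples wrongly flagged as abnormal."""
    pred, true = np.asarray(pred_abnormal), np.asarray(true_abnormal)
    return np.sum(pred & ~true) / np.sum(~true)

# Last four samples are truly degraded; the method misses one of them
# and raises one false alarm during NOC.
true = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=bool)
pred = np.array([0, 1, 0, 0, 0, 1, 1, 1], dtype=bool)
print(detection_coverage(pred, true))       # 0.75
print(false_positive_coverage(pred, true))  # 0.25
```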
Other indicators used in evaluating performance are the discontinuity index, N_int, which counts the intervals where abnormality was detected, and the time interval between the first spotted abnormal point, t_sp, and the concrete tipping point of the degraded stage, t_d. The prognostic capacity, in turn, is quantified by the root-mean-square error (RMSE), an adaptation of NASA's scoring function (ns-score) [27,28], and the prognostic horizon. The first two are defined as:

RMSE = sqrt( (1/N*) * sum_{k=1..N*} (Delta^(k))^2 ),

ns = (1/N*) * sum_{k=1..N*} ( exp( beta * |Delta^(k)| ) - 1 ),

where N* indicates the total number of RUL estimations, Delta^(k) = r(t_k) - r*(t_k) is the difference between the predicted and the real remaining life of the kth sample, and beta is 1/14 if the RUL is underestimated (but 1/10 otherwise). The ns metric is not symmetric and penalizes overestimation more than underestimation [29].
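A sketch of the RMSE and ns-score computations as described above, assuming (as the text suggests via N*) that the score is averaged over the estimations; function names are ours:

```python
import numpy as np

def rmse(delta):
    """Root-mean-square error of the RUL prediction errors delta."""
    delta = np.asarray(delta, dtype=float)
    return np.sqrt(np.mean(delta ** 2))

def ns_score(delta, beta_under=1 / 14, beta_over=1 / 10):
    """Adaptation of NASA's scoring function: an asymmetric exponential
    penalty, harsher on overestimated RUL (delta > 0, with
    delta = predicted minus real remaining life).
    """
    delta = np.asarray(delta, dtype=float)
    beta = np.where(delta < 0, beta_under, beta_over)
    return np.mean(np.exp(beta * np.abs(delta)) - 1)

delta = np.array([-10.0, 5.0, 10.0])
print(rmse(delta))
# The same error magnitudes cost more when the RUL is overestimated:
print(ns_score(-np.abs(delta)) < ns_score(np.abs(delta)))  # True
```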
The prognostic horizon is defined as the time interval between the end of life and the time t^(C) when a prediction first meets a specified performance criterion C that remains satisfied until t_eol, i.e., for all t^(i) such that t^(C) <= t^(i) <= t_eol:

PH = t_eol - t^(C).

Moreover, some metrics proposed by Saxena et al. [27,28] may be used for auxiliary performance inspection, i.e., they are not designated for a specific tuning or validation purpose within the models' comparison schema of this study. These metrics are the relative accuracy and the cumulative relative accuracy.
The relative accuracy (RA) is defined as an error measurement of the RUL prediction relative to the actual RUL r*(t_lambda) at a specified time t_lambda:

RA_lambda = 1 - |r_l*(t_lambda) - r_l(t_lambda)| / r_l*(t_lambda),

where l is the index of the lth prognostic experiment, r_l*(t_lambda) is the ground-truth remaining life at time t_lambda, and r_l(t_lambda) is an appropriate central tendency point estimate of the predicted RUL distribution at time index t_lambda.
Since the relative accuracy is expressed pointwise, attaining an overall view of the algorithm's behavior over time requires aggregating the measurements as a normalized weighted sum of the relative accuracies of all predictions in one prognosis experiment, resulting in a metric called cumulative relative accuracy:

CRA_lambda = (1/n(p_lambda)) * sum_{l in p_lambda} w(r_l) * RA_lambda,

where w(r_l) is a weight factor as a function of the RUL at all time instances, p_lambda is the ordered set of all time indexes before t_lambda, and n(p_lambda) is the cardinality of the set p_lambda. Apart from the accuracy-based metrics, it is also important to mention the monotonicity criterion applied as an input of the decision function for the RUL discrimination at time t_i, previously elucidated. Lei et al. [13] argue that machinery degradation is an irreversible process and thus should be linked with monotonically increasing or decreasing trends.
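RA and CRA could be sketched as follows, assuming uniform weights when none are supplied (function names and toy values are ours):

```python
import numpy as np

def relative_accuracy(r_true, r_pred):
    """RA at one time index: 1 - |r* - r| / r*."""
    return 1 - abs(r_true - r_pred) / r_true

def cumulative_relative_accuracy(r_true_seq, r_pred_seq, weights=None):
    """Weighted mean of RA over all predictions made up to t_lambda.

    With uniform weights this reduces to the plain average of the
    pointwise relative accuracies.
    """
    ras = np.array([relative_accuracy(rt, rp)
                    for rt, rp in zip(r_true_seq, r_pred_seq)])
    if weights is None:
        weights = np.ones_like(ras)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * ras) / np.sum(weights))

# Three predictions over a degrading unit: RA = 0.9, 0.95, 1.0.
r_true = [100.0, 80.0, 60.0]
r_pred = [90.0, 76.0, 60.0]
print(cumulative_relative_accuracy(r_true, r_pred))  # 0.95
```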
There are monotonicity metrics based on the count of finite differences d/dx = x_{k+1} - x_k of a health index sequence X = {x_k}_{k=1:K}, with x_k being the value of the HI at time t_k [30]. The selected one is described as:

Mon_1(X) = | No. of (d/dx > 0) - No. of (d/dx < 0) | / (K - 1),

where K is the number of elements of the set X, No. of (d/dx > 0) and No. of (d/dx < 0) represent the numbers of positive and negative differences, respectively, and Mon_1(X) quantifies the absolute difference between them, normalized to the interval [0, 1].
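The selected monotonicity metric can be sketched as (function name is ours):

```python
import numpy as np

def mon1(x):
    """Mon_1(X) = |#(positive diffs) - #(negative diffs)| / (K - 1),
    normalized to [0, 1]; 1 means a strictly monotonic health index,
    0 means increases and decreases are balanced.
    """
    d = np.diff(np.asarray(x, dtype=float))
    return abs(int(np.sum(d > 0)) - int(np.sum(d < 0))) / (len(x) - 1)

print(mon1([1, 2, 3, 4, 5]))  # 1.0 (strictly increasing)
print(mon1([1, 3, 2, 4, 5]))  # 0.5 (3 increases, 1 decrease)
```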

Results
This section presents the results of applying the framework to the CMAPSS database. Section 3.1 describes the CMAPSS dataset, while Section 3.2 shows the results obtained.

Application Example in CMAPSS Dataset
The database chosen for the study is a variant of the Commercial Modular Aero-Propulsion System Simulation (CMAPSS), publicly available and recognized as one of the datasets frequently used for benchmarking prediction algorithms. It was recently updated after joint work between NASA and ETH Zurich's intelligent maintenance systems center, so that the sampling rate of the sensor signals has been increased to 1 Hz, making it suitable for studying models oriented toward large data volumes.
The CMAPSS-2 [31] is composed of a set of synthetic run-to-failure (RTF) trajectories, i.e., trajectories with artificial degradation of nine turbofan engines, produced by the simulator from real flight condition inputs, which are characterized by the scenario-descriptor variables: altitude, Mach number, throttle-resolver angle (TRA), and total inlet blade temperature. The base is divided into six units designated for training and another three for testing, with operating conditions slightly different from the others. In this study, only the training data from CMAPSS-2 were used, which does not compromise the feasibility study since the tested model is unsupervised and, therefore, uses only a part of the samples from each unit for training.
The inserted degradation pattern is continuous and divided into four states: the degradation condition at the beginning of operation; the normal state; a transition zone between the normal and abnormal conditions; and an abnormal state. The simulation considers the alternating presence of failure modes in the main sub-components of the engine: fan, LPC, HPC, HPT, and LPT. Their deteriorations are modeled by adjustments in flow capacity and efficiency. More information about the modeling can be found in Chao et al.'s work [32]. Figure 5 outlines the allocation of the main subsystems of a turbofan engine.

In this application example, the units were subjected to high- and low-pressure turbine failure modes with an initial condition of random deterioration of about 10% of the health index implicit in the simulator. Table 3 details the failure modes for each unit and provides additional information on the number of samples, the transition time to abnormality, and the end of life (t_eol) in cycles. Figure 6 details the trajectory imposed on the flow and efficiency modifiers for the tested units.

Table 3. Information about subset samples of each unit (adapted from [32]).

This application example follows the framework with fixed sets of hyperparameters and a fixed neural network architecture, whose feature space is composed of 18 variables, the same condition monitoring signals used by Chao et al. [32]. A detailed description of the CMAPSS simulator variables can be found in [31].

Dataset Fraction | Unit (u) | Rows (10^4) | t_s* (Cycles) | t_eol (Cycles) | Failure Mode
The autoencoder models are subject to a validation procedure that consists of two steps: the first one is to evaluate whether its performance (through an analysis of the metrics presented in Section 2.4) surpasses that of a simplified baseline model, which does not use deep learning, and the second one is to compare it with alternatives presented in the literature that employ similar techniques and databases.
The baseline model is built from a simple regression-extrapolation procedure on the pre-processed original inputs of the database, following this sequence of steps: down-sampling at a rate of 1 sample every 200 (without crossing the limits of operational cycles), then smoothing by a simple moving average of size 500, so that the samples of this model and of the one submitted for validation are similar, and finally the application of the methodology (see Figure 2) and performance evaluation with the metrics of Section 2.4.
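The down-sampling and smoothing steps of the baseline pipeline can be sketched as follows. This is a minimal illustration under the stated rates (1-in-200 down-sampling, moving average of 500); the function and parameter names are assumptions, and the per-cycle boundary handling mentioned in the text is omitted for brevity.

```python
import numpy as np

def baseline_preprocess(signal, down=200, window=500):
    """Down-sample 1-in-`down`, then smooth with a simple moving
    average of size `window` (sketch of the described baseline
    pre-processing)."""
    down_sampled = np.asarray(signal, dtype=float)[::down]
    kernel = np.ones(window) / window
    # mode="valid" keeps only fully covered windows, so the first
    # window - 1 down-sampled points are suppressed
    return np.convolve(down_sampled, kernel, mode="valid")
```

Note that for a unit with 200,000 raw samples this leaves 1000 down-sampled points and 501 smoothed points, which illustrates why large windows can starve the later curve-fitting step of samples.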

Results from Application Example
The MSE loss convergence during the networks' training progression is shown in Figure 7, and the progression of useful-life estimations over the course of the operation of the units is presented in Figures 8 and 9. The time instant t, on the x-axis, is normalized in relation to the total life (t_eol) of the motors and is interpreted as a percentage (0-100%) of t_eol or as normalized cycles. The y-axis indicates the predicted RUL at instant t (also expressed as a percentage of t_eol), and the orange dashed line shows the real value of the RUL (that is, t_eol − t) at that instant. It is noted that the beginning of the forecast differs between the units since it is directly related to the abnormality detection capacity, which is determined by a criterion similar to that used by Rosa et al. [23], wherein an abnormality is flagged when a consecutive set of points exceeds the maximum reconstruction error among the samples in NOC.

For all the analyzed models, the time of the first prediction (t_fpt) occurred after half of the degradation time of the engines. From 50% to 65% constitutes a region of instability in the forecasts, in which there are remaining-life estimates that exceed the value of t_eol near 100% or underestimate it close to 1%. This is because the deterioration trends are incipient and have a low rate of change, which makes it difficult for the algorithm to decide which of the curves is the most appropriate, as some have a very similar fit condition. After 70% of t_eol, a stable convergence zone is formed, and the adherence of the projections to the real RUL curve gradually improves up to 100%, which is the desired behavior. Compared to the baseline model, the proposed models advance to a stable condition much earlier (~65%) than the Baseline (~80%).
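The detection criterion mentioned above (a run of consecutive points exceeding the maximum NOC reconstruction error) can be sketched as below. This is an illustrative reading of the criterion, not the authors' exact rule; the function name and the run length `n_consecutive` are assumptions.

```python
import numpy as np

def first_detection_index(recon_error, noc_error, n_consecutive=5):
    """Return the index of the first sample starting a run of
    `n_consecutive` reconstruction errors above the maximum error
    observed under normal operating conditions (NOC); -1 if the
    run never occurs."""
    threshold = float(np.max(noc_error))
    above = np.asarray(recon_error) > threshold
    run = 0
    for i, flag in enumerate(above):
        run = run + 1 if flag else 0
        if run >= n_consecutive:
            return i - n_consecutive + 1  # start of the run
    return -1
```

Requiring a consecutive run rather than a single exceedance makes the detector robust to isolated noise spikes in the reconstruction error.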
From the three models tested, Conv-1d showed the best result in terms of advancing convergence to the actual prognostic result for all units. It can be seen from Figure 8 that it is the model with the earliest average first prediction time over all the units and that it adheres to the reference line of progression of the RUL at about 75% of t_eol. The MLP model visually manifested a behavior similar to the convolutional one and also presented a zone of forecast instability with high fractional RMSE, but with later time-stamping metrics (t_fpt, H_T(5), and H_T(20)) compared to the convolutional one. The LSTM model did not show a concentrated region of large prediction errors like the previous two, but it did show sparse peaks of high errors lasting two or three cycles in units 2, 16, 15, and 5. Although it may seem that the LSTM provides more stable predictions, in fact, gaps in the forecasts may occur, especially in the region of 60-75% of t_eol, in which large-magnitude discrepancies are suppressed by the restriction of the algorithm to disregard RUL estimates if r(t) + t exceeds t_eol by more than 300 cycles.
For a moving-average subsequence of n = 500, it can be seen that the three autoencoder models outperform the Baseline, which starts to provide consistent forecasts only after 80 normalized cycles have elapsed. An increase in the time window of the moving average could have a positive impact, especially on the base model, as it benefits the most from signal attenuation in regions of instability. However, increasing n reduces the number of samples of each unit available for curve fitting in the prediction algorithm, such that the RUL of some units listed in Table 3 could not be calculated.
The difference between the forecast and the actual value of the RUL, also expressed as a percentage of t_eol, is shown in Figures 10 and 11. The blue dashed lines indicate 20% error limits in Figure 10 and 5% error limits in Figure 11, which are taken as references for calculating the prognostic horizon. The proposed method manages to keep the estimates within the error margin of ±20% of t_eol but has difficulty meeting the goal of ±5% of t_eol, with only a few units achieving this result even after 80% of the machine's life. There are two possible reasons for this behavior: the first, mentioned above, is the absence of a global tuning of the model, including the neural architecture, which is not at its optimal performance in terms of training with NOC samples; the second is the uncertainty regarding the choice of the error threshold for the prognosis, which can inflate the estimates above what was expected.
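The prognostic horizon used with these error bands can be sketched as follows: the time remaining from the first instant after which the prediction error stays within an α-fraction of t_eol until end of life. This is a minimal sketch under that reading; the function name and the requirement of continuous compliance are assumptions.

```python
import numpy as np

def prognostic_horizon(t, rul_true, rul_pred, t_eol, alpha=0.05):
    """H(alpha): t_eol minus the first time t_i such that
    |r*(t) - r(t)| <= alpha * t_eol for every instant from t_i to
    end of life; 0 if the error never settles inside the band."""
    err = np.abs(np.asarray(rul_pred, float) - np.asarray(rul_true, float))
    inside = err <= alpha * t_eol
    for i in range(len(inside)):
        if inside[i:].all():       # error stays in the band from here on
            return t_eol - t[i]
    return 0.0
```

A larger horizon means the model delivers trustworthy predictions earlier, which is why the ±5% band is so much harder to satisfy than the ±20% one.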
The summary of the results obtained for the performance metrics is presented in Table 4, while Table A1 (in Appendix A) presents all the results organized by unit. Both tables show the RMSE; the fractional RMSEs L_1, L_2, and L_3; the time of the first prediction (t_fpt); the n_s (Equation (6)) divided by the total number of estimates; the cumulative relative accuracy; and the prognostic horizons for 5% and 20% errors. It should be noted that L_1, L_2, and L_3 stand for the RMSE fraction computed only for samples inside the first, second, and third thirds of the second half of the normalized t_eol, respectively.

There is no expressive gain in the RMSE of the proposed model when compared to the Baseline (−15.64%, Table 4) due to the rough projections made at the beginning of the degradation process. When these prognostic samples are disregarded, it is possible to notice a performance gain for this metric, which is expressive from the third third (L_3) and improves in the proximity of t_eol, quantitatively corroborating the notion that the proposed model advances to a state of convergence before the Baseline.

Inspection of Table 4 reveals that the Baseline model's global RMSE is lower. The reasons are that the Baseline produced fewer and later estimates in comparison to the autoencoders, as can be seen in Figures 8 and 9, and, closer to the end of life, prediction errors tend to be smaller due to the presence of more information about the pronounced degradation. The models start to equate in performance as they approach the stable convergence zone, and there is a slight divergence between the RMSE L_3 values. Although the Baseline also has a lower RMSE L_3, it should be noticed that it performed fewer predictions even in that region (see Figures 9 and 11).
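The fractional RMSEs L_1, L_2, and L_3 over the thirds of the second half of the normalized life can be computed as sketched below (a minimal sketch; the function name and the inclusive band boundaries are assumptions):

```python
import numpy as np

def fractional_rmse(t_norm, rul_true, rul_pred):
    """RMSE restricted to the first, second, and third thirds of the
    second half of the normalized life, i.e. the bands
    [0.5, 2/3], [2/3, 5/6], and [5/6, 1.0] of t_norm in [0, 1].
    Returns [L1, L2, L3]; NaN where a band holds no samples."""
    t_norm = np.asarray(t_norm, dtype=float)
    err2 = (np.asarray(rul_pred, float) - np.asarray(rul_true, float)) ** 2
    bands = [(0.5, 0.5 + 1/6), (0.5 + 1/6, 0.5 + 2/6), (0.5 + 2/6, 1.0)]
    out = []
    for lo, hi in bands:
        mask = (t_norm >= lo) & (t_norm <= hi)
        out.append(float(np.sqrt(err2[mask].mean())) if mask.any() else float("nan"))
    return out
```

Splitting the RMSE this way separates the unstable early-forecast region from the stable convergence zone near t_eol, which is exactly the distinction the discussion above relies on.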
The prognostic horizon is certainly greater for the autoencoder-based solutions, highlighting the Conv-1d, which has the earliest t_fpt, so a correlation with the prognostic horizon was already expected.
The difference t_fpt − H could be interpreted as a latency of the model or as an acceptable error margin.
Figure 11. Progression of the prediction error relative to the total life of the asset, E_p = 100 × (r*(t) − r(t))/t_eol, over the time of operation of the units for the reconstruction error extrapolation and baseline methods. The prognostic horizon of 5% is represented by the blue dashed lines. In (a), results are presented for the Conv-1d autoencoder, in (b) for the LSTM, in (c) for the MLP, and in (d) for the Baseline model.

Moreover, n_s, another error evaluation metric, differed from the RMSE's outcomes by showing similar values across all models. This fact is justifiable because, even though the models differed significantly in global accuracy, all of them displayed a greater tendency to overestimate predictions, which is penalized by this metric. CRA, in turn, follows the RMSE behavior, as they are almost analogous measurements when a linear weighting w(x) = x (Equation (10)) is adopted.
Generally, the proposed autoencoder models are more stable than the Baseline model, detect abnormalities earlier, and enter a region of stable convergence earlier. They manage to meet the margin of error requirement below 20% of t eol for at least a fourth of the unit's life but struggle to meet the requirement of a 5% forecast horizon.
Finally, the comparison with the literature is based on the publication by Chao et al. [32], who also built deep learning models to estimate the RUL on the CMAPSS-2 basis. This comparison aims to verify if the framework exhibits coherent behavior for the predictions over time. It is made by the qualitative inspection of the prediction errors' progression, see Figures 10 and 11, which is also plotted by Chao et al. [32] for the same three kinds of layers used in this study. Moreover, some performance measurements taken in this work are compared with the results obtained by the cited author. They are the RMSE and prognostic horizon.
The presented models could not overcome the data-driven arrangements programmed by Chao et al. [32], nor is this the intention, as those use supervised learning, thus mapping the channel signature throughout the degradation evolution and not only in the NOC. Even though it is not possible to surpass that work in performance, it is important to note the great proximity between the mean squared error values for the stable convergence zone (RMSE L_3). There is also great similarity between the behavior of the operating-time forecasts plotted by those authors and the one shown in this work: greater uncertainty is demonstrated at the beginning of the forecasting process and gradually reduces until t_eol. It is observed that the use of a supervised technique allows a t_fpt very close to the beginning of the unit's life and that a purely ANN-based supervised method can make inferences almost in real time after being trained, as each new sample arrives (without the computational cost of curve fitting). On the other hand, supervised learning techniques tend to be more specific to the application and failure mode and have a shorter lifespan, requiring retraining to adapt to changes in the operating equipment. Therefore, the advantage of our approach is its capacity to be easily implemented in an industrial context, which has the particularity of an abundance of engineering-system data in NOC with few recorded faults. Furthermore, real scenarios often have lower-quality data labels or unlabeled data; our framework is designed with this observation in mind, since there is no need to attribute labels or even discriminate sets of abnormal samples. Another point is that we elaborate a complete framework that embraces detection and prognostic models, while Chao et al. [32] focus only on training models for RUL estimation without addressing scalability.
In the end, the proposed framework is more suitable for use in different industrial domains and has an extensive application range because it does not require physical information or intensive knowledge about the fault's nature and its signature in the sensors' readings.

Conclusions
The above results allow us to infer that the developed framework can match the performance of a baseline model that uses simple linear regression on pre-processed signals. It is noteworthy that there is still a large margin of adjustment available, through hyperparameter modulation, neural network architectures, and post-processing adjustments of the reconstruction errors, among others, to achieve more significant gains.
One future improvement of the prediction algorithm concerns the predictions at the beginning of the degradation process, in which different curves tend to show a high R^2, overestimating the remaining lifetime value. This behavior is somewhat expected, as predictions tend to improve in accuracy as more information about the condition becomes available. However, the decision functions need to be fixed to avoid outliers that exceed t_eol values by more than 300% and to avoid producing atypical variability of predictions between two consecutive instants. One way to achieve this is to force outliers downward by noting that dRUL_real/dt = −1. In addition, it is also necessary to better calibrate the decision functions so that the other indicators are taken into account in the hierarchy of functions and sensors, which currently relies heavily on R^2 values.
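The suggested fix (forcing outliers downward using dRUL_real/dt = −1) can be sketched as a simple clamp over successive estimates. This is one possible reading of the proposal, not the authors' implementation; the function name and the tolerance parameter are assumptions.

```python
def clamp_rul(rul_preds, times, tol=0.0):
    """Force-downgrade outlier RUL estimates: since the true RUL
    decreases one cycle per elapsed cycle (dRUL/dt = -1), each new
    estimate is capped at the previously accepted estimate minus the
    elapsed time, plus an optional tolerance `tol`."""
    out = [float(rul_preds[0])]
    for i in range(1, len(rul_preds)):
        dt = times[i] - times[i - 1]
        cap = out[-1] - dt + tol   # physically admissible upper bound
        out.append(min(float(rul_preds[i]), cap))
    return out
```

This suppresses upward jumps between consecutive estimates while leaving downward corrections untouched, reducing the atypical variability mentioned above.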
As for the data pre-processing and reconstruction error post-processing routines, it is feasible to explore new trend-smoothing techniques that do not suppress as many samples as the moving average used here. Applying a moving average of n = 2000 points implies losing the equivalent of 30 cycles at the beginning of the abnormality state, contributing to the effect mentioned in the paragraph above. This is a problem for units with low t_eol or that operated for only a few cycles in an abnormal condition before failure, such as unit 14, as few samples remain for the prognostic step. Another alternative would be to merge the labeled inputs with those in normal conditions at an offset of at most n before the detection point.
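One candidate smoothing technique that does not suppress samples is an exponentially weighted moving average, which emits one output per input instead of discarding the first n − 1 points of each window. This is a suggestion in line with the paragraph above, not part of the authors' framework; the smoothing factor alpha is an assumed hyperparameter.

```python
import numpy as np

def ewma(signal, alpha=0.01):
    """Exponentially weighted moving average: recursive smoothing
    y_k = alpha * x_k + (1 - alpha) * y_{k-1}, preserving the full
    sequence length (no samples are suppressed)."""
    signal = np.asarray(signal, dtype=float)
    out = np.empty(len(signal), dtype=float)
    out[0] = signal[0]
    for k in range(1, len(signal)):
        out[k] = alpha * signal[k] + (1 - alpha) * out[k - 1]
    return out
```

Smaller alpha approximates a larger moving-average window, so the trade-off between attenuation and responsiveness remains, but no prognostic samples are lost at the start of the abnormality state.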
For future works, a study on the quantification of uncertainties at the different stages of the method can be carried out. Potential advances can be made by using Bayesian neural networks or generative models or by adopting probabilistic regressors to extrapolate the reconstruction errors. Other sources of uncertainty lie in the attribution of the moment of transition to the degraded state, in the attribution of the prognostic trend limit, and, therefore, in the estimation of the RUL, the latter of which absorbs all the variability arising from the decision-making process within the prognosis algorithm as well as inheriting the epistemic and aleatory remnants of the models and data sets, respectively. On the other hand, quantifying multiple sources of uncertainty considerably impacts performance, especially if Monte Carlo random-sampling techniques are prevalent. The problem of integrating multiple sources of uncertainty within this framework to produce predictive results with safety margins and fulfill the specificities of the current regulations in a computationally efficient manner remains open to the authors. In the absence of the possibility of performing this globally, it is recommended to identify the factors that most contribute to the variability of the prognosis result, such as the definition of the limit error for prognosis, which greatly impacts performance, as observed in the application example.

Funding: This paper presents part of the results obtained with the execution of the project PD-06491-0341-2014 "Methodology for asset management applied to hydro-generators based on mathematical models of reliability and maintainability" carried out by the Federal University of Technology at Parana and the University of Sao Paulo for COPEL Geração e Transmissão S.A within the scope of the Electric Sector's Research and Technological Development Program regulated by the National Agency of Electrical Energy (ANEEL).

Data Availability Statement:
Publicly available datasets were analyzed in this study. These data can be found here: https://www.nasa.gov/content/prognostics-center-of-excellence-data-set-repository (accessed on 4 December 2022).

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A

Table A1 presents the performance metrics for each unit's remaining life evaluation of the tested models.

Table A1. Performance metrics for each unit's remaining life evaluation of the tested models. L_1, L_2, and L_3 stand for the RMSE fraction only for samples inside the first, second, and third thirds of the normalized t_eol second half, respectively.