Uncertainty Quantification for Full-Flight Data Based Engine Fault Detection with Neural Networks

Current state-of-the-art engine condition monitoring is based on a minimum of one steady-state data point per flight. Due to the scarcity of available data points, it is difficult to distinguish between random scatter and an underlying fault, which introduces a detection latency of several flights. Today's increased availability of data acquisition hardware in modern aircraft provides continuously sampled in-flight measurements, so-called full-flight data. These full-flight data give access to sufficient data points to detect faults within a single flight, significantly improving the availability and safety of aircraft. Artificial neural networks are considered well suited for the timely analysis of an extensive amount of incoming data. This article proposes uncertainty quantification for artificial neural networks, leading to more reliable and robust fault detection. An existing approach for approximating the aleatoric uncertainty was extended by an Out-of-Distribution Detection in order to take the epistemic uncertainty into account. The method was statistically evaluated, and a grid search was performed to identify parameter combinations maximizing the true positive detection rates. All test cases were derived based on in-flight measurements of a commercially operated regional jet. Especially when requiring low false positive detection rates, the true positive detections could be improved 2.8 times while improving response times by approximately 6.9 compared to methods only accounting for the aleatoric uncertainty.


Introduction
Engine condition monitoring is considered a key technology for lowering maintenance, repair and overhaul expenses while improving the safety and availability of aircraft [1]. Estimating the current health state of the aircraft engine gained from engine condition monitoring systems by analyzing in-flight measurements provides the foundation for effective maintenance planning. Besides tracking and trending long-term deterioration, engine condition monitoring applications detect, isolate and identify single faults [2].
Current state-of-the-art engine condition monitoring systems, e.g., Refs. [3][4][5][6], are based on analyzing a minimum of one steady-state snapshot per flight. The sparsity of available data negatively impacts fault detection, as it is difficult to distinguish between random scatter and an underlying fault. Depending on the fault type and severity, it can take several flights until a fault is detected [5,7,8]. The resulting latency in fault detection increases the risk of secondary damage. Recently, with the increased adoption of non-mandatory data acquisition equipment, continuously sampled datasets covering whole flights have become available. These continuously sampled datasets are also referred to as full-flight data. Full-flight data provide sufficient data points to detect engine faults based on a statistically relevant sample size within a single flight, enabling faster response times. Despite the advantages of full-flight data, analyzing the corresponding datasets heavily increases the amount of data to be processed.
According to [13], information redundancy is required for fault detection and diagnosis. In current engine condition monitoring applications, this redundancy is typically established by utilizing thermodynamic engine models for computing reference values representing the nominal performance of the aircraft engine. Fault detection compares these reference values with in-flight measurements. Significant deviations between the measurements and model predictions indicate an underlying fault. In general, fast execution times are required to analyze the large number of data points provided by full-flight data. Thermodynamic engine models are generally slow since the solution is determined iteratively. On the other hand, state-of-the-art machine learning approaches are well suited for analyzing full-flight data, as they provide fast execution times and omit the slow iterative computation of thermodynamic engine models.
Depending on the configuration of the data acquisition, full-flight data often include discrete features resembling the position of valves, e.g., for anti-icing and customer bleed extraction. Building a physically sound thermodynamic engine model without profound system information is difficult as a meaningful relationship between discrete parameter setting and mass flow extraction has to be derived. On the other hand, data-driven models can infer the effect of such discrete parameters. The sometimes limited system information, in combination with the requirement for timely data analysis, makes data-driven model building a good alternative for processing full-flight data.
Different data-driven methods such as artificial neural networks [14][15][16][17][18][19], Generalized Additive Models [17,19,20] or Support Vector Regression [21] have already been successfully applied to model the performance of gas turbines. However, one major drawback of data-driven approaches is their black-box characteristic, which makes it difficult to substantiate the results. In particular, the widespread utilization of artificial neural networks, which also covers safety-critical applications such as self-driving cars [22] or medical diagnosis [23], has led to increased research in uncertainty quantification, improving the reliability and robustness of their results.
In general, two types of uncertainty are differentiated in model building: aleatoric uncertainty and epistemic uncertainty [24]. Aleatoric uncertainty defines the inherently probabilistic variability of a dataset caused by measurement uncertainty. On the other hand, epistemic uncertainty defines the uncertainty caused by the insufficient coverage of the relevant value range by the available data. For example, when using artificial neural networks for approximating the input-output characteristic of a technical system, they basically define a high-dimensional curve fit. However, the output of the artificial neural network is essentially only trustworthy in operating regimes for which sufficient data have been available for training. Otherwise, the extrapolation error becomes dominant [25,26]. While the epistemic uncertainty can be minimized by taking additional data points of different operating regimes into account, the aleatoric uncertainty is more or less fixed. Dedicated algorithms handle the approximation of the aleatoric and epistemic uncertainty. The epistemic uncertainty can be approximated, for example utilizing Ensemble Models [27], Out-of-Distribution Detection [28], Dropout [29] or Bayesian Neural Networks [30]. The aleatoric uncertainty can be evaluated by approximating the probability density functions of individual measurements with artificial neural networks [31]. Despite an existing concept for approximating the aleatoric uncertainty for full-flight engine data [32], there is no method taking both the aleatoric and epistemic uncertainty into account.
In the following, artificial neural networks are chosen for approximating the performance of aircraft engines. Correctly assessing the temporal correlations in full-flight data is a prerequisite for approximating the engine performance [33] and is more difficult to achieve with other data-driven modeling methods. Amongst artificial neural networks, there are specific architectures to process time series, such as Long Short-Term Memory (LSTM) [34], Gated Recurrent Units (GRU) [35], or Dilated Convolutional Neural Networks [36]. Apart from the proven capability of the above-listed artificial neural networks to model the steady-state and transient performance of gas turbines, there is also existing research in uncertainty quantification for neural networks. One existing method for approximating the aleatoric uncertainty [32] is extended by an Out-of-Distribution Detection to additionally take the epistemic uncertainty into account. The proposed approach is then tested utilizing full-flight data of a commercially operated regional jet. A comprehensive investigation of the detection rates for different fault cases is provided. The results show that the additional uncertainty quantification leads to higher detection rates with faster response times.

Artificial Neural Networks with Uncertainty Quantification
Since the approximation of the aleatoric uncertainty according to [32] has already been successfully applied to in-flight measurements, it is used as starting point for further improvement. The approximation of the aleatoric uncertainty introduces additional model complexity to the neural network by requiring an increased number of output nodes. Therefore, a complementary method for estimating epistemic uncertainty was chosen, leaving the artificial neural network unchanged. Of the methods listed in the previous section, only the Out-of-Distribution Detection meets these requirements.

Approximating the Aleatoric Uncertainty
For modeling the aleatoric uncertainty, the training data are assumed to be sampled from a given probability density function p(y|Θ) with parameters Θ. The parameters Θ of the probability density function are then estimated by the neural network based on input parameters x.
For example, utilizing a Gaussian probability density function in Equation (1) for approximating the probability distribution of the measurements y requires the mean µ and the standard deviation σ to be approximated by the artificial neural network. The parameters are estimated by defining the corresponding output nodes of an artificial neural network. An example of the resulting architecture of the artificial neural network underlying a Gaussian probability function is visualized in Figure 2.
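$$p(y \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y-\mu)^2}{2\sigma^2}\right) \tag{1}$$

Figure 2. Architecture for approximating a univariate Gaussian probability density function with artificial neural networks.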
An optimization defines the weights and biases of the neural network nodes, maximizing the likelihood of observing the data underlying the chosen probability density function p(y|Θ). Concerning the objective function of the optimization, maximizing the likelihood is equivalent to minimizing the negative log-likelihood NLL. For a flight of length l, the corresponding negative log-likelihood is defined by Equation (2) [31].

$$NLL = -\sum_{t=1}^{l} \log p\left(y_t \mid \Theta(x_t)\right) \tag{2}$$

Especially for long-haul flights with extended cruise segments, there are more data points for cruise than for other flight segments. This imbalance in data can bias the neural network towards approximating the cruise with high accuracy while neglecting the remaining flight segments. In order to ensure that all flight phases are represented with similar accuracy, the negative log-likelihood NLL is first computed for each flight phase separately. The optimization is then based on the negative log-likelihood NLL averaged over the flight phases.
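The phase-wise averaging described above can be illustrated with a minimal sketch. It assumes a univariate Gaussian, takes the mean NLL within each flight phase, and then averages the phase means with equal weight; the within-phase mean and the function names are assumptions made for illustration and are not taken from the original implementation.

```python
import math

def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of one sample under a univariate Gaussian."""
    return 0.5 * math.log(2.0 * math.pi) + math.log(sigma) + 0.5 * ((y - mu) / sigma) ** 2

def phase_averaged_nll(y, mu, sigma, phase):
    """Average the per-phase NLL so that long phases (e.g. cruise) do not dominate."""
    per_phase = {}
    for y_t, mu_t, s_t, ph in zip(y, mu, sigma, phase):
        per_phase.setdefault(ph, []).append(gaussian_nll(y_t, mu_t, s_t))
    # Mean NLL within each phase (assumption), then an unweighted mean across phases.
    phase_means = [sum(v) / len(v) for v in per_phase.values()]
    return sum(phase_means) / len(phase_means)

# Four well-predicted cruise samples and one poorly predicted descent sample.
y = [0.0, 0.0, 0.0, 0.0, 1.0]
mu = [0.0] * 5
sigma = [1.0] * 5
phase = ["cruise"] * 4 + ["descent"]
loss = phase_averaged_nll(y, mu, sigma, phase)
```

In this toy example the single descent sample contributes half of the loss, whereas a plain mean over all five samples would weight it with only one fifth.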
In general, engine condition monitoring requires multivariate datasets to be estimated, to which the approach presented above can easily be extended. However, the approximation of multivariate datasets increases the total number of parameters Θ to be estimated, as additional cross-correlations between variables must be considered. In the present work, the in-flight measurements are approximated assuming multivariate Gaussian probability density functions, which results in a vector of means µ and a correlation matrix Σ to be estimated:

$$p(y \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\left(-\frac{1}{2}(y-\mu)^T \Sigma^{-1} (y-\mu)\right)$$

Even though the correlation matrix Σ is symmetric, i.e., Σ_{i,j} = Σ_{j,i}, the additional cross-correlations Σ_{i,j} increase the complexity of the artificial neural network, as additional output nodes have to be provided for their estimation. To reduce the total number of parameters to be estimated, the in-flight measurements are considered to be sampled independently, leading to uncorrelated measurement noise and, therefore, negligible cross-correlations Σ_{i,j}. This simplification collapses the correlation matrix Σ into a diagonal matrix Σ = diag(Σ_{1,1}, ..., Σ_{n,n}).
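As a short worked step, the effect of the diagonal simplification on the loss can be written out. The derivation below assumes that the diagonal entries are the variances of the individual measurements, $\Sigma_{i,i} = \sigma_i^2$ (a notational assumption), and starts from the standard multivariate Gaussian density for $n$ measurements.

$$-\log p(y \mid \mu, \Sigma) = \frac{n}{2}\log(2\pi) + \frac{1}{2}\log|\Sigma| + \frac{1}{2}(y-\mu)^T \Sigma^{-1} (y-\mu)$$

With $\Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_n^2)$, the determinant is $|\Sigma| = \prod_{i=1}^{n} \sigma_i^2$ and the inverse is $\Sigma^{-1} = \mathrm{diag}(1/\sigma_1^2, \dots, 1/\sigma_n^2)$, so that

$$-\log p(y \mid \mu, \Sigma) = \frac{n}{2}\log(2\pi) + \sum_{i=1}^{n} \left[ \log \sigma_i + \frac{(y_i-\mu_i)^2}{2\sigma_i^2} \right]$$

The multivariate loss therefore decomposes into a sum of univariate Gaussian terms, and the network only needs $2n$ outputs ($n$ means and $n$ standard deviations) instead of $n + n(n+1)/2$ for a full symmetric matrix.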
Correctly assessing the transient performance of aircraft engines requires the previous data points to be considered [33] resulting in an auto-correlation. In order to account for this temporal correlation, a temporal feature extraction utilizing dilated convolutional neural networks [36] is used as a preprocessing step. The resulting architecture of the neural network for approximating the in-flight measurements of aircraft engines is visualized in Figure 3. Input to the artificial neural network is a multivariate time series consisting of continuous and discrete parameters defining the environmental conditions, power settings, and controller settings. In the next step, global feature extraction is conducted by nonlinearly extracting and compressing the temporal information of the provided time series. Finally, the extracted features are processed by individual feed-forward neural networks approximating the measurements' mean µ and standard deviation σ. The feature extraction and the neural network for estimating the probability density function are trained simultaneously. A similar approach for estimating the aleatoric uncertainty applied to full-flight data is discussed in [32].
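To illustrate how dilated convolutions let the model see a long history of previous data points with few parameters, the following sketch implements a single-channel dilated convolution in plain Python. It assumes a causal formulation (only past samples are used), zero padding at the start of the series, and no bias or nonlinearity; the kernel weights and dilation rates are hypothetical and not taken from the trained network.

```python
def dilated_causal_conv1d(x, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - k * dilation], treating x[<0] as zero."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        y.append(acc)
    return y

def receptive_field(kernel_size, dilations):
    """Number of samples (including the current one) seen by a stack of layers."""
    return 1 + (kernel_size - 1) * sum(dilations)

# Two stacked layers with kernel size 2 and dilations 1 and 2.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
h = dilated_causal_conv1d(x, [1.0, 1.0], dilation=1)
out = dilated_causal_conv1d(h, [1.0, 1.0], dilation=2)
```

Doubling the dilation per layer makes the receptive field grow exponentially with depth, e.g., `receptive_field(2, [1, 2, 4])` covers 8 samples with only three layers.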

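[Figure 3. Architecture of the neural network for approximating the in-flight measurements of aircraft engines. Recovered block labels: Multivariate Time Series; Feature Extraction (Dilated Convolutional Neural Network).]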

Approximating the Epistemic Uncertainty
The epistemic uncertainty of neural networks is closely related to the extrapolation error caused by the insufficient coverage of the relevant value range by the available training data. Its effect can be alleviated by providing well-defined input features, reducing the total number of parameter combinations that have to be covered by the model. Using non-dimensional parameters according to [37] is recommended for gas path measurements since they collapse the engine performance to well-defined characteristics, reducing the impact of environmental conditions [38]. These characteristics are mainly affected by controller settings such as bleed positions and airflow towards the cabin. Hence, whether or not a neural network can approximate the engine performance depends on the availability of sufficient data points with the respective controller settings. An example of poor model accuracy related to controller settings that were insufficiently covered during training is displayed in Figure 4. Since the data used for training the neural network were gathered during summer, data points with active anti-icing are scarce and the model's ability to correctly predict those operating regimes is limited. If the anti-icing is turned off, the approximations of the neural network match the in-flight measurements. However, over the portions of the flight with 315 s ≤ t ≤ 2700 s and 4150 s ≤ t ≤ 7100 s, the engine anti-icing is active, and the measurements are close to the upper prediction boundary of µ + 2σ. If the tail anti-icing is turned on as well, deviations between the measurements and the neural network predictions increase further, exceeding the range of µ ± 2σ. These results lead to the conclusion that, in order to prevent false positives originating from epistemic uncertainty, regions with high modeling uncertainty have to be identified and excluded.
For the dataset examined in this article, there are a total of five controller settings available: engine anti-icing (EAI), tail anti-icing (TAI), wing anti-icing (WAI), bleed configuration and airflow towards the cabin (Pack). In order to quantify the availability of a sufficient number of data points within the training dataset, ensuring accurate model building, a confidence score L_setting is defined in Equation (6). The confidence score L_setting is based on the likelihood of occurrence of the controller settings p_i(x(t)), which is derived from the dataset used for training the artificial neural network. In the proposed approach, the confidence score is computed separately for different flight phases PH to account for the impact of the operating conditions on the controller settings, leading to conditional probabilities p_i(x(t)|PH(t)). Additionally, the probabilities are assumed to be statistically independent, neglecting the impact of different setting permutations.

$$L_{setting}(t) = \prod_{i=1}^{5} p_i\left(x(t) \mid PH(t)\right) \tag{6}$$

The resulting confidence score L_setting related to the previously shown flight is visualized in Figure 5. Since the confidence score is defined as the product of multiple probabilities, leading to small values, the logarithmic confidence score L_setting is displayed here. The higher the logarithmic confidence score L_setting, the more data points were available for training and the higher the model accuracy. Therefore, the confidence score L_setting can now be used to effectively exclude regions with high modeling uncertainty by defining an appropriate limit L_lim. For the example flight, the timestamps with active tail anti-icing around 1270 s ≤ t ≤ 2640 s are characterized by a low confidence score L_setting, indicating high model uncertainty.
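A minimal sketch of how such a confidence score could be computed is given below. It assumes that the conditional probabilities are estimated as relative frequencies in the training data, that the natural logarithm is used, and that the product form under independence described above applies; the key names (`"EAI"`, `"pack"`, `"phase"`, ...) and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

SETTINGS = ("EAI", "TAI", "WAI", "bleed", "pack")

def fit_setting_probabilities(training_rows):
    """Count setting values per flight phase in the training data.

    training_rows: iterable of dicts with a 'phase' key plus one key per setting.
    """
    counts = defaultdict(Counter)   # (setting, phase) -> Counter of observed values
    totals = Counter()              # phase -> number of training samples
    for row in training_rows:
        totals[row["phase"]] += 1
        for name in SETTINGS:
            counts[(name, row["phase"])][row[name]] += 1
    return counts, totals

def log_confidence(row, counts, totals):
    """log L_setting = sum_i log p_i(x_i | phase); -inf for unseen settings."""
    n = totals[row["phase"]]
    score = 0.0
    for name in SETTINGS:
        c = counts[(name, row["phase"])][row[name]]
        if n == 0 or c == 0:
            return float("-inf")    # never observed during training
        score += math.log(c / n)
    return score
```

Samples whose logarithmic score falls below a chosen limit (here corresponding to log L_lim) would then be excluded from fault detection. Returning negative infinity for settings never seen in training is one possible design choice; a smoothed estimate would be an alternative.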

Fault Detection
Similar to [32], fault detection is based on the Mahalanobis Distance [39], defining the normalized distance of a test data point y_j from a probability density function:

$$d_M = \sqrt{(y_j - \mu_j)^T \, \Sigma_j^{-1} \, (y_j - \mu_j)}$$

The vector of means µ and the correlation matrix Σ are the output of the neural network. The inverse of the correlation matrix Σ_j^{-1} in the definition of the Mahalanobis Distance d_M essentially weights the distances by the aleatoric uncertainty, ensuring that data points with high uncertainty are weighted less. This weighting directly reduces the risk of false positives in regions of high aleatoric uncertainty. Another advantage of the Mahalanobis Distance d_M is the definition of a single distance measure even for multivariate datasets. The availability of a single distance measure simplifies fault detection since only a single parameter has to be monitored.
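For the diagonal matrix Σ used in this work, the Mahalanobis Distance reduces to the Euclidean norm of the residuals scaled by the predicted standard deviations. The sketch below illustrates this under the assumption that the network outputs one standard deviation per measurement; the function name is hypothetical.

```python
import math

def mahalanobis_diag(y, mu, sigma):
    """Mahalanobis distance for a diagonal matrix diag(sigma_1^2, ..., sigma_n^2)."""
    return math.sqrt(sum(((y_i - m_i) / s_i) ** 2
                         for y_i, m_i, s_i in zip(y, mu, sigma)))

# Same residuals, but the second call uses twice the predicted uncertainty.
d_confident = mahalanobis_diag([1.0, 2.0], [0.0, 0.0], [1.0, 2.0])
d_uncertain = mahalanobis_diag([1.0, 2.0], [0.0, 0.0], [2.0, 4.0])
```

Doubling the predicted standard deviations halves the distance, which is how regions with high aleatoric uncertainty are weighted less.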
The resulting Mahalanobis Distance d_M for a nominal example flight is visualized in Figure 6 alongside the flight profile. Especially for large transients, the artificial neural network has difficulties predicting the engine performance, resulting in singular peaks in the Mahalanobis Distance d_M lasting only a few seconds. The fault detection scheme must account for these singular peaks to prevent false positives. In general, faults are considered to be persistent over a certain period of time, affecting the overall magnitude of the Mahalanobis Distance d_M. In order to avoid false positives triggered by singular events, the peaks in the Mahalanobis Distance d_M are removed by applying a Butterworth low-pass filter [40]. This low-pass filter ensures that only the overall magnitude of the Mahalanobis Distance d_M is considered for fault detection. The resulting Mahalanobis Distance d_M after applying the low-pass filter is additionally visualized in Figure 6.
Fault Detection: The total number of outliers is computed in the last step. Since there will always be a certain number of statistical outliers, a threshold n_lim on the total number of outliers is introduced. If the number of outliers detected exceeds this predefined threshold n_lim, the outliers are no longer considered statistical but systematic, indicating a fault.
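The filtering and counting steps can be sketched as follows. The filter order and cutoff frequency are not stated in this text, so the example assumes a first-order Butterworth low-pass (discretized with the bilinear transform and frequency pre-warping), a cutoff of 0.02 Hz at a sampling rate of 1 Hz, and illustrative thresholds; all names and numbers are hypothetical.

```python
import math

def butterworth_lowpass_1st(x, cutoff_hz, fs_hz=1.0):
    """First-order Butterworth low-pass (bilinear transform with pre-warping)."""
    k = math.tan(math.pi * cutoff_hz / fs_hz)
    b = k / (k + 1.0)                      # b0 = b1
    a1 = (k - 1.0) / (k + 1.0)
    y, x_prev, y_prev = [], x[0], x[0]     # start in steady state at x[0]
    for x_t in x:
        y_t = b * (x_t + x_prev) - a1 * y_prev
        y.append(y_t)
        x_prev, y_prev = x_t, y_t
    return y

def detect_fault(d_m, d_lim, n_lim, cutoff_hz=0.02):
    """Flag a fault if more than n_lim filtered samples exceed d_lim."""
    filtered = butterworth_lowpass_1st(d_m, cutoff_hz)
    n_outliers = sum(1 for d in filtered if d > d_lim)
    return n_outliers > n_lim, n_outliers

# Nominal flight with a three-second peak versus a persistent shift.
nominal = [1.0] * 100 + [6.0] * 3 + [1.0] * 197
faulty = [1.0] * 100 + [5.0] * 200
nominal_result = detect_fault(nominal, d_lim=3.0, n_lim=60)
faulty_result = detect_fault(faulty, d_lim=3.0, n_lim=60)
```

In this toy setup the short peak is attenuated below the distance limit and produces no outliers, while the persistent shift exceeds the limit for most of the remaining samples and is flagged.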

Description of the Database
The proposed fault detection method is trained and tested with in-flight measurements of a commercially operated regional jet [41]. The dataset contains in-flight measurements of 35 aircraft covering a time period of three years. The data were anonymized, so there is no information about the aircraft or engine type. In general, the detection rates in engine condition monitoring depend highly on the model accuracy [42]. Since in-flight measurements vary due to production scatter and different degrees of degradation [43], only data of an individual engine serial number were extracted. Altogether, 300 consecutive flights were extracted from the dataset. Nominal engine performance was ensured by comparing parallel mounted engines according to [44]. Since the in-flight measurements are acquired with different sampling rates, all measurements were first interpolated to a sampling rate of 1 Hz, the minimum sampling rate provided by most airlines [45]. Furthermore, only complete flights were extracted from the provided database.
The dataset covers more than 180 different parameters, mostly related to aircraft dynamics. Concerning gas-path measurements, only the measurements displayed in the cockpit (N1, N2, EGT, W_f) are provided. In order to limit the total number of input parameters to be processed by the artificial neural network, the dataset was manually filtered, extracting parameters that are considered to affect the performance of the aircraft engine. The resulting input and output parameters of the neural network are summarized in Table 1. In order to improve the training of the neural network [46], the discrete controller settings were normalized to x ∈ [0, 1] and the continuous measurements were standardized to zero mean and a variance of one.
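The two scaling steps can be sketched with the standard library as shown below. The example assumes min-max scaling for the discrete settings and a population standard deviation for the standardization; in practice the scaling statistics would typically be derived from the training data only and reused for validation and test data. The function names and example values are hypothetical.

```python
from statistics import fmean, pstdev

def min_max_normalize(values):
    """Scale discrete controller settings to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]   # constant feature: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift and scale continuous measurements to zero mean and unit variance."""
    mean, std = fmean(values), pstdev(values)
    return [(v - mean) / std for v in values]

valve_position = min_max_normalize([0, 1, 2])        # e.g. a three-state setting
egt_scaled = standardize([610.0, 620.0, 630.0, 640.0])
```

Note that `standardize` would divide by zero for a constant signal; a guard similar to the one in `min_max_normalize` would be needed for such features.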

[Table 1. Input and output parameters of the neural network. Table content not recoverable; column headers: Parameter, Description.]

The provided dataset of full-flight data does not provide any information concerning potential faults. For a comprehensive investigation of the detection rates of the proposed fault detection scheme under various fault cases, synthetic datasets were generated by superimposing the in-flight measurements with measurement deviations generated utilizing a calibrated aircraft engine model of a regional jet. The fault cases were imposed by adjusting the capacities Q and efficiencies η of the engine components by the scaling factors ∆Q and ∆η. These scaling factors were chosen according to the OBIDICOTE test cases [47], which provide benchmark test cases for engine condition monitoring applications. The fault cases considered in this study and the corresponding scaling factors ∆Q and ∆η are summarized in Table 2.

Table 2. Definition of the OBIDICOTE test cases according to [47].

Assessment of the Artificial Neural Network
Of the 300 flights extracted from the dataset of full-flight data, the first 100 consecutive flights were used for training and validating the artificial neural network. The flights within the training and validation dataset were randomly sampled, applying a ratio of 85%/15%, where the larger dataset was used for training the neural network. The remaining 200 flights are used to test the neural network and evaluate the detection rates. The training of the neural network was conducted for 1500 epochs utilizing Adam optimization [48] with a learning rate of lr = 0.001. Altogether 100 models were trained to account for randomness caused by the initialization of the neural network or the sampling of flights composing the training dataset. The neural network's architecture is constant for all models and was defined in advance by evaluating the loss functions for different architectures.
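The data split described above can be sketched as follows. The example assumes that flights are identified by their index in chronological order and that the random sampling is a simple shuffle of the first 100 flights; the function name and the seed handling are hypothetical.

```python
import random

def split_flights(n_total=300, n_dev=100, train_ratio=0.85, seed=0):
    """Return (train, validation, test) lists of flight indices."""
    dev = list(range(n_dev))               # first 100 consecutive flights
    test = list(range(n_dev, n_total))     # remaining 200 flights
    rng = random.Random(seed)
    rng.shuffle(dev)                       # random sampling within the first 100
    n_train = round(train_ratio * n_dev)
    return sorted(dev[:n_train]), sorted(dev[n_train:]), test

train, validation, test = split_flights(seed=0)
```

Using a different `seed` for each of the 100 trained models would be one way to reproduce the variation in the sampling of training flights mentioned above.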
The output of the proposed neural network architecture for the corrected exhaust gas temperature EGT_c of an example flight in Figure 7 exemplifies the main advantage of utilizing uncertainty quantification. For neural networks without uncertainty quantification, e.g., trained on minimizing the mean squared error, the output will resemble the predictions for the mean exhaust gas temperature µ_EGT. While the predicted mean µ_EGT matches the measured exhaust gas temperature EGT with high accuracy during climb and cruise, significant deviations are experienced during descent. These large deviations are mainly attributed to hysteresis in controller settings, which is more dominant during descent and landing. Considering the engine's power setting, the shaft speed of the fan N1 is relatively stable during climb and cruise, while fast changes in N1 are dominant during descent and landing. Difficulties approximating the descent and landing are experienced for all flights within the training, validation, and test datasets, as can be seen from the mean squared error MSE in Table 3 and the mean standard deviation σ in Table 4. Since engine faults are identified by comparing the neural network's output with the in-flight measurements, such large deviations can lead to false positives if no uncertainty quantification is considered. On the other hand, the proposed neural networks with uncertainty quantification counteract the significant deviations by increasing the predicted uncertainty, ultimately reducing the risk of false positives.

Detection Rates
The detection rates are evaluated by computing the true positive detection rates (TP) for the different fault cases and the false positive detection rates (FP) for nominal engine performance defined in Equations (10) and (11) [49].
The proposed fault detection algorithm features three thresholds directly affecting its sensitivity for fault detection: the limit on the confidence score L_lim ensuring model quality, the limit on the Mahalanobis Distance d_{M,lim} used for detecting outliers, and the total number of outliers tolerated until fault detection n_lim. To determine the optimal combination of thresholds, a grid search was performed, discretizing the limits and searching for parameter combinations that achieve maximum average true positive detection rates TP for predefined thresholds on the maximum allowable false positive detection rates FP ≤ FP_lim. Here, the average true positive detection rate TP was computed, taking into account the true positive detection rates TP_j of the individual OBIDICOTE test cases.
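A constrained grid search of this kind can be sketched as below. Equations (10) and (11) are not reproduced in this text, so the example assumes that both rates are the fraction of flights flagged as faulty (among faulty flights for TP, among nominal flights for FP). The `evaluate` callback, which would run the detector for one threshold combination, and all numbers in the usage example are hypothetical.

```python
from itertools import product

def detection_rate(flags):
    """Fraction of flights flagged as faulty (TP on faulty, FP on nominal flights)."""
    return sum(flags) / len(flags)

def grid_search(evaluate, l_lims, d_lims, n_lims, fp_lim):
    """Return (best mean TP, thresholds) subject to FP <= fp_lim, or None.

    evaluate(l_lim, d_lim, n_lim) -> (fp, [tp_1, ..., tp_k]) is a
    user-supplied callback with one TP value per fault case.
    """
    best = None
    for l_lim, d_lim, n_lim in product(l_lims, d_lims, n_lims):
        fp, tps = evaluate(l_lim, d_lim, n_lim)
        if fp > fp_lim:
            continue                       # violates the false positive limit
        mean_tp = sum(tps) / len(tps)
        if best is None or mean_tp > best[0]:
            best = (mean_tp, (l_lim, d_lim, n_lim))
    return best

# Toy lookup table standing in for a real detector evaluation.
table = {
    (0, 1, 1): (0.020, [0.9, 0.8]),
    (0, 1, 2): (0.004, [0.7, 0.5]),
    (0, 2, 1): (0.005, [0.6, 0.8]),
    (0, 2, 2): (0.000, [0.3, 0.3]),
}
best = grid_search(lambda l, d, n: table[(l, d, n)], [0], [1, 2], [1, 2], fp_lim=0.005)
```

In the toy table, the most sensitive combination is rejected because its false positive rate exceeds the limit, and the best remaining combination is selected by its mean true positive rate.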
The resulting average true positive detection rates TP and the corresponding limits on the outliers tolerated until fault detection n_lim for the algorithms with and without additional estimation of the epistemic uncertainty are visualized in Figure 8. Since all 100 trained models were evaluated, the results are statistically evaluated and visualized as box plots. The results clearly show the advantage of performing an additional estimation of the epistemic uncertainty. The results with the additional estimation of the epistemic uncertainty require smaller limits on the outliers tolerated until fault detection n_lim while achieving higher average true positive detection rates TP. The difference between the two methods becomes more pronounced when requiring small false positive detection rates FP. Considering FP ≤ 0.5%, the presented method improves the average true positive detection rate TP by a factor of 2.8 compared to the method only accounting for the aleatoric uncertainty. Furthermore, the number of outliers tolerated n_lim and consequently the response time can be improved by approximately 6.9. Especially when requiring low false positive detection rates FP, the resulting true positive detection rates TP are too low for operational application when only the aleatoric uncertainty is approximated. For the presented use case with a sampling rate of 1 Hz, a period of time with on average 7.9 min of faulty engine performance during a 75 min flight is required until fault detection.
Considering the median true positive detection rate TP for the different fault cases with aleatoric and epistemic uncertainty quantification summarized in Table 5 reveals that the poor average detection rates TP mainly result from difficulties identifying fault case f. Since only minimum instrumentation is provided for the available dataset, observability issues exist for certain fault cases.
Incorporating more sensors within the fault detection algorithm can improve the detection rates. Due to its modular architecture, the proposed fault detection approach can be easily extended for different measurement suites.

Sensitivity Study: Fault Initiation
The results presented in the previous section were derived by initiating the fault right at the start of the time series t = 0. Additional examinations were conducted to quantify the sensitivity of the detection rates concerning the point in time when the faults are initiated. For this sensitivity study, the thresholds on the confidence score L_lim, the Mahalanobis Distance d_{M,lim}, and the total number of outliers tolerated until fault detection n_lim obtained in the previous section are kept constant. The faults are initiated relative to the total flight length.
The median average true positive detection rates TP for different relative fault initiation times t_init are displayed in Figure 9. The results show reduced detection rates if the fault happens later during the flight. For example, if a fault is initiated within the last 25% of the flight, the maximum achievable average detection rates are less than 35%. The decreased performance of the fault detection approach in later flight phases is mainly attributed to the increased uncertainty experienced during descent and landing, already described in Section 3.1. Correspondingly, only faults strongly affecting the measurements can be detected. In the worst case, the fault cannot be detected within the current flight. However, the chances are high that the fault can be detected within the next flight, which is still an improvement compared to current state-of-the-art methods [5,7,8].

Discussion
This paper presents a novel approach for estimating the aleatoric and epistemic uncertainty in data-driven engine fault detection. The algorithm can detect arbitrary faults while requiring only datasets representing nominal engine performance. All tests conducted were based on in-flight data of a commercially operated regional jet, ensuring realistic changes in environmental conditions and controller settings. Compared to alternative approaches only accounting for the aleatoric uncertainty, the presented approach results in improved detection rates and faster response times. Especially if low false positive detection rates are required, methods based only on the aleatoric uncertainty yield true positive detection rates that are too low for operational application. Various fault cases could be detected within a single flight, removing the latency of current state-of-the-art fault detection based on steady-state snapshots. For the tests, only minimal instrumentation was provided. Fault detection can potentially be further enhanced by providing additional sensors to improve the observability of the engine.
In the presented use case, the engine model was trained based on datasets of an individual engine to avoid the impact of production scatter and account for engine degradation. To ensure fast coverage of an engine within condition monitoring, the dataset used for training the model covers only a short period of time, limiting the diversity of training data and increasing the epistemic uncertainty.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Publicly available datasets were analyzed in this study. This data can be found here: https://c3.ndc.nasa.gov/dashlink/projects/85/ (accessed on 25 May 2020).

Conflicts of Interest: The authors declare no conflict of interest.

Nomenclature
The following nomenclature is used in this manuscript: