1. Introduction
Effective maintenance of flight-critical components, including gas turbine engines, plays a significant role in the aircraft industry. Engine failures due to poor maintenance schedules result in substantial economic losses and environmental damage. It is therefore crucial to perform regular and effective maintenance with the support of advanced powerplant health management (PHM) technologies. Condition-based maintenance is the key advancement in the field, where maintenance actions are taken based on actual evidence about the current health status of the engine under operation. Potential damage to the gas path components due to fouling, erosion, corrosion, and an increase in tip clearance can be detected and isolated before it becomes severe. This requires relevant and sufficient measurement information from sensors installed on the engine gas path for control and monitoring purposes.
Over the past few decades, gas turbine diagnostics has been extensively studied and several techniques have been developed. Depending on the approaches used, the proposed methods can be broadly categorized into three groups: model-based, data-driven, and hybrid [1]. Gas path analysis (GPA) and its derivatives [2,3], the Kalman filter (KF) and its derivatives [4,5], artificial neural networks (ANNs) and their derivatives [6,7,8], genetic algorithms (GAs) [9], fuzzy logic (FL) [10], and Bayesian networks (BNs) [11,12] are some of the most widely applied techniques.
Diagnostic methods are different in their ability to:
- (1) Handle the non-linearity of the gas turbine performance;
- (2) Provide accurate results with the minimum number of sensors possible;
- (3) Cope with measurement uncertainty;
- (4) Deal with simultaneous faults;
- (5) Perform real-time diagnostics with high computational efficiency;
- (6) Discriminate rapid and gradual degradation;
- (7) Perform qualitative and quantitative fault diagnostics;
- (8) Quickly detect faults with negligible false alarms due to measurement noise;
- (9) Provide easily interpretable results.
In this context, though past studies have contributed significant advancements, the methods under each category still have their own advantages and limitations. Data-driven methods are advantageous in the absence of an accurate mathematical model or detailed expert knowledge about the engine [13]. In addition, they are preferable with regard to reduced sensor requirements, robustness against measurement uncertainty effects, and simultaneous fault diagnostic capability [14]. Model-based solutions are best at interpreting the gas turbine behavior since they consider the real physics of the engine. Moreover, when baseline shifts are needed for the fault diagnostics, updating the model can be performed with less cost and effort than retraining the data-driven methods. However, model-based methods have some accuracy deficiencies due to measurement uncertainty and model smearing effects. Data-driven methods, on the other hand, lack interpretability of their internal working (they are “black-box” models), require a large amount of data for training, and their training process can be excessively time consuming [1,14]. Hybrid techniques that apply a collective problem-solving approach have shown promising performance when the methods are integrated on the basis of their complementary strengths [15,16]. In general, the current literature on gas turbine diagnostics, for instance [17,18,19], shows that advanced diagnostic method development is still a subject of considerable research effort.
The rapid advancement of machine-learning (ML) methods has opened extended research opportunities in the gas turbine application domain. For instance, a deep autoencoder was utilized by Yan and Yu [20] for measurement noise removal and gas turbine combustor anomaly detection. A re-optimized deep-autoencoder-based anomaly detection method was also demonstrated in [21] for fleet gas turbines. In a different study, a long short-term memory-network-based autoencoder (LSTM-AE) framework was developed for gas turbine sensor and actuator fault detection and classification using raw time series data [22]. A combined GPA- and LSTM-based gas turbine fault diagnostics and prognostics method was devised by Zhou et al. [23]. The GPA was dedicated to estimating performance health indices of the target gas path components of the case study engine through a performance adaptation process. The LSTM method was employed to forecast the future degradation profile of the components based on the estimated health indices. In recent years, there has been a rapid rise in the use of convolutional neural networks (CNNs) for rotating machinery diagnostics, driven by their powerful feature learning and classification ability [24]. A considerable number of applications can also be found in gas turbine prognostics, such as [25,26,27]. Nevertheless, there have been only a few attempts in gas turbine diagnostics. Liu et al. [28] proposed a CNN-based technique to monitor the performance of gas turbine engine hot components based on exhaust gas temperature (EGT) profiles. Guo et al. [29] used a 2D-CNN algorithm for gas turbine vibration monitoring using transformed vibration signals as the input. Grouped convolutional denoising autoencoders were used to reduce measurement noise and extract useful features from aircraft communications addressing and reporting system (ACARS) data [8]. A 1D CNN was employed for abrupt fault diagnostics based on time series data [30]. Zhong et al. [31] and Yang et al. [32] evaluated the effectiveness of the transfer learning principle with a CNN for engine fault diagnostics with limited fault samples. In both studies, the authors considered single-fault scenarios only. In contrast, the benchmark studies conducted within the Glenn Research Center at NASA [33] recommended considering multiple faults as well for more reliable diagnostic solutions.
This paper proposes a novel modular CNN-based multiple fault detection and isolation (FDI) method integrated with a nonlinear physics-based trend-monitoring system. The physics-based part is used to monitor the gas turbine performance trend and compute useful measurement deviations induced by gas path faults. The FDI part is composed of four hierarchically arranged CNN modules trained to classify rapid faults at the component level using fault signatures provided by the physics-based part (a minimal sketch of this module hierarchy is given after the list below). Two different comparative investigations were performed in the end to show the benefits of the proposed method and draw concluding remarks. The main contributions of this work are summarized as follows:
- i. A novel physics-assisted CNN framework is proposed for three-shaft turbofan engines’ fault diagnostics. The framework can discriminate between gradual and rapid gas turbine deterioration followed by a successful isolation of gas path faults at the component level. The physics-based scheme can also update itself for baseline changes caused by maintenance events. This avoids the need for retraining the CNN algorithm after every overhaul. However, using the physics-based scheme alone for diagnostics has some accuracy deficiencies due to measurement uncertainty and model smearing effects. Hence, the CNN technique is coupled to offset these limitations and enhance the overall diagnostic accuracy;
- ii. As demonstrated by the experimental results, the proposed method can deal with multiple fault scenarios, which will increase the significance of the method in real-life situations [33];
- iii. The benefits of applying a modular CNN framework for gas turbine FDI were verified through a comparison with a similar LSTM framework and with a single CNN-based FDI scheme. It was shown that the method proposed outperformed the other methods;
- iv. It was also verified that the method proposed is advantageous in handling a considerable disparity between the training and test datasets, which is difficult for most of the traditional data-driven methods [34]. This robustness is important to accommodate engine-to-engine degradation profile differences.
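As a minimal illustration of the modular arrangement mentioned above, the following Python sketch shows one possible routing logic for the four CNN modules. The assignment of roles (CNN1 for fault detection, CNN2 for discriminating single- from multiple-component faults, CNN3 and CNN4 for isolating single- and double-component faults, respectively), the function names, and the label strings are illustrative assumptions rather than the exact implementation used in this work.

```python
# Hypothetical routing of a measurement-deviation signature through the
# hierarchical FDI modules; cnn1..cnn4 stand for the trained classifiers
# and are assumed to expose a predict() method returning a class label.

def diagnose(signature, cnn1, cnn2, cnn3, cnn4):
    """Return the diagnosed health state for one deviation signature."""
    if cnn1.predict(signature) == "healthy":        # fault detection stage
        return {"status": "healthy"}

    category = cnn2.predict(signature)              # assumed: "single" or "double"
    if category == "single":
        component = cnn3.predict(signature)         # e.g., "HPC"
        return {"status": "fault", "components": [component]}

    components = cnn4.predict(signature)            # e.g., ("HPC", "HPT")
    return {"status": "fault", "components": list(components)}
```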
2. Gas Turbine Performance Degradation
The performance of a gas turbine degrades during operation due to internal and external abnormal conditions. Fouling, erosion, corrosion, and increased tip clearance are among the most common gas path problems. As illustrated in Figure 1, gas turbine degradation can be recoverable and non-recoverable [35]. The former is mostly recoverable through effective online and offline compressor washing with major inspection during engine overhaul. Degradation due to fouling is the most prominent example here. The latter refers to the residual deterioration remaining after a major overhaul, which can be considered as a permanent performance loss. Mechanical wear caused by erosion and corrosion may lead to airfoil distortion and untwisting, thereby resulting in non-recoverable performance loss. Both recoverable and non-recoverable degradation should be monitored continuously for optimal maintenance scheduling.
Degradation can also be categorized as short-term and long-term based on its formation and growth rate [36]. Short-term degradation results in fast performance changes and is usually caused by fault events. As shown in Figure 2, these events can manifest themselves as “abrupt” or “rapid” degradation modes. Abrupt faults are fault events that appear instantaneously and remain fixed in magnitude with time, for instance sensor bias, actuator fault, foreign object damage/domestic object damage (FOD/DOD), and system failure. Rapid faults refer to fault events that initiate and grow in magnitude with time. Gradual degradation refers to gradual performance losses that develop slowly and simultaneously in all engine components over time due to mechanical wear, mainly triggered by erosion and corrosion [37]. The performance loss due to gradual degradation increases through time and may reach a non-restorable stage. For a detailed description of different degradation mechanisms, their effects, and the necessary actions to restore performance losses, the interested reader is referred to [1,38].
A diagnostic system for life-cycle assessment should be able to distinguish short-term and long-term degradation modes, followed by detecting and isolating faults successfully [39]. If the diagnostic algorithm is not adaptive to trend shifts due to engine degradation, its effectiveness will eventually decrease with the engine age. This is more problematic for most of the data-driven techniques, since baseline shifts may cause differences between training and test data patterns, which potentially could affect the diagnostic accuracy. One possible solution to overcome this problem is incorporating adaptive gas path analysis (AGPA) in the diagnostic system. In AGPA, the model adapts to performance trend changes with time and updates the baseline when appropriate. Measurement deviations with respect to the deteriorated engine profile can then be used to assess gas turbine faults based on machine-learning techniques. However, using AGPA alone has some accuracy deficiencies due to measurement uncertainty and model smearing effects [40].
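For clarity, the measurement deviations mentioned above are typically formed relative to the adapted baseline; a generic formulation (not necessarily the exact expression used in this work) for the k-th gas path measurement is:

$$\Delta z_k = \frac{z_k^{\mathrm{meas}} - z_k^{\mathrm{base}}}{z_k^{\mathrm{base}}}\times 100\%$$

where the measured value is compared with the value predicted by the adapted (deteriorated) engine baseline at the same operating condition, so that the residual deviations reflect fault effects rather than normal aging.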
4. Results and Discussion
We performed modular-CNN-based engine fault detection and isolation for a triple-spool turbofan engine application. In this framework, four CNN modules were trained individually and connected in such a way that they could first detect a fault and then isolate it to the component level. A series of experiments was carried out aiming to select the optimal CNN structure associated with each classifier. The impacts of the important hyperparameters in the CNN training, including the number and size of filters, the number and arrangement of layers, the type of optimization algorithm, and the number of epochs, were analyzed. For instance, different numbers of filters in the range of 1–256 were investigated.
Based on the training results, a similar CNN architecture was selected for all modules involved in the framework. The only differences were the training data size used and the number of outputs associated with the classification problem to which each module was assigned. For each classification module in the hierarchy, the selected architecture with the best performance consisted of two consecutive convolutional layers followed by a single pooling layer, with ReLU and dropout layers in between. Using a second convolutional layer with more filters of a smaller size than in the first convolutional layer yielded better classification performance. For all modules, 20 epochs (27,640 iterations with 1382 iterations per epoch) were required to reach the maximum accuracy. Adam was the optimization algorithm utilized, primarily due to its simplicity and low memory requirements.
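The selected layer arrangement can be summarized by the following PyTorch sketch: two consecutive convolutional layers with ReLU and dropout in between, one pooling layer, and a fully connected output, as described above. The filter counts, kernel sizes, input dimensions, and class count are illustrative assumptions rather than the tuned values of this study.

```python
import torch
import torch.nn as nn

class GasPathCNN(nn.Module):
    """Sketch of the selected module architecture: two consecutive
    convolutional layers (the second with more, smaller filters), ReLU and
    dropout in between, one pooling layer, and a fully connected output.
    All dimensions below are assumed for illustration only."""

    def __init__(self, n_channels=8, seq_len=50, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Dropout(p=0.3),
            nn.Conv1d(32, 64, kernel_size=3, padding=1),  # more filters, smaller kernel
            nn.ReLU(),
            nn.Dropout(p=0.3),
            nn.MaxPool1d(kernel_size=2),
        )
        self.fc = nn.Linear(64 * (seq_len // 2), n_classes)

    def forward(self, x):                     # x: (batch, n_channels, seq_len)
        return self.fc(self.features(x).flatten(1))

# Training would use the Adam optimizer, as stated above, e.g.:
# model = GasPathCNN(n_classes=2)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# loss_fn = torch.nn.CrossEntropyLoss()      # applies softmax to the logits
```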
4.1. Fault Detection
The best classification results for CNN1 are presented in Figure 11. The results on the main diagonal of the detection decision matrix show the number of correctly detected patterns with respect to the healthy and faulty classes, whereas the off-diagonal values represent incorrect detections. As presented in Table 9, the detection accuracy of the algorithm was then assessed in terms of the standard fault detection accuracy indicators. Ideally, a fault detection system is required to have negligible false alarm and missed detection rates, but this is practically difficult to achieve due to several factors including measurement noise, feature extraction limitations, and insufficient sampling [59]. When the detection system becomes more sensitive to faults, aiming to provide early warnings and avoid unexpected failures, the frequency of false alarms rises. A high false alarm rate is one of the common problems of the traditional threshold-based gas turbine health management technologies [59]. This may be among the reasons why advanced feature extraction techniques have been receiving more attention in recent studies. As shown in Figure 11, a 95.9% overall detection accuracy was achieved with a 0.4% FAR and a 13.2% MDR. The TPR was low because fault levels as low as 0.025% (i.e., ∆η = 0.01% and ∆Γ = 0.02% according to Equation (12)) were considered. About 20% of the fault data were generated from fault magnitudes ≤1%. This indicates that increasing the lower detectable fault limit, for example to 0.25%, would enhance the detection accuracy considerably by increasing the TPR and decreasing the FAR. Hence, considering the measurement noise effects, the obtained FAR and MDR values are encouraging.
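For reference, the detection indicators used above are conventionally defined from the confusion-matrix counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN); the standard definitions (given here for clarity, not quoted from Table 9) are:

$$\mathrm{FAR}=\frac{FP}{FP+TN},\qquad \mathrm{MDR}=\frac{FN}{TP+FN},\qquad \mathrm{TPR}=\frac{TP}{TP+FN}=1-\mathrm{MDR},\qquad \mathrm{ODA}=\frac{TP+TN}{TP+TN+FP+FN}$$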
The effect of the optimization algorithm on the classification performance was also investigated. A comparison was made against two other popular algorithms, namely stochastic gradient descent with momentum (sgdm) and root-mean-squared propagation (RMSProp) [60], under similar circumstances. As shown in Table 9, all three optimizers had similar performances with a maximum overall accuracy difference in the order of 1%. Nevertheless, as Adam is known for its low memory and computational requirements and has some additional advantages in deep-learning applications [60], it was selected for training all CNN models in the proposed framework.
Effect of the Data Distribution
It is known that the accuracy of deep-learning methods often relies on the amount of data available for training, and the degree of dependency differs from case to case [61]. Increasing the training sample size until it is sufficient to represent the distribution of the data that defines the nature of the problem under consideration usually improves the prediction accuracy. However, the memory and computational burden should be taken into consideration; using a high-performance computer (HPC) or a GPU can overcome this problem. Another important factor to be noted is that the distribution of the datasets representing the two classes should not be skewed, because most ML algorithms experience poor classification performance due to bias toward the majority class. However, imbalanced data availability is a common problem in fault detection, since in real-life operations, most of the samples available are for the healthy state of the asset [31]. In this experiment, six different randomly selected datasets of size 72,000 (≈30% H and 70% F), 92,988 (≈46% H and 54% F), 113,976 (≈56% H and 44% F), 134,964 (≈63% H and 37% F), 155,952 (≈68% H and 32% F), and 176,940 (≈72% H and 28% F) were considered for CNN1. In all these datasets, the same 50,400 fault cases were used. A 30% random dropout was applied immediately before the pooling and fully connected layers to control overfitting.
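A minimal sketch of how such proportioned subsets could be assembled is shown below; the array names and the assumption that the healthy and faulty samples are held in two NumPy arrays are illustrative, not taken from the study.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_subset(healthy, faulty, total_size):
    """Keep the full set of fault cases (as in the experiment) and draw
    enough healthy samples at random to reach the requested subset size."""
    n_healthy = total_size - len(faulty)
    picked = healthy[rng.choice(len(healthy), size=n_healthy, replace=False)]
    data = np.concatenate([picked, faulty])
    labels = np.concatenate([np.zeros(n_healthy), np.ones(len(faulty))])  # 0 = H, 1 = F
    return data, labels

# Example (hypothetical arrays): Set-1 with 72,000 samples, 50,400 of them faulty
# X1, y1 = build_subset(healthy_data, fault_data, total_size=72_000)
```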
Table 10 summarizes the detection results obtained based on the six data groups. The results revealed that the overall detection accuracy increased with increasing sample size. For example, increasing the size from Set-1 to Set-2 enhanced the ODA by 1.8% with a 3.2% lower FAR and a 1.5% higher MDR. When Set-6 was applied instead, the ODA rose by 5.2%, while the FAR dropped by 6.1% and the MDR rose by 2.6%. The observed differences in the calculated performance indicators were because of the difference in the healthy-to-faulty data proportion among the six data groups.
4.2. Fault Isolation
The next step in the diagnostic process is root cause determination (or fault classification). Upon fault detection through CNN1, the underlying faults were isolated by applying CNN2 followed by CNN3 and CNN4. The classification results obtained from CNN2, CNN3, and CNN4 are presented in Table 11, Table 12 and Table 13, respectively. The classification accuracy indicators described in Section 3.3 were used. An OCA better than 99% was achieved for CNN2 and CNN3, and better than 98% for CNN4. Similarly, KC values higher than 0.98 were obtained for all three models. The ~1% accuracy difference between the CNN3 and CNN4 modules was expected due to: (1) the difference in the number of faults that they were responsible for and (2) the difficulty of simultaneous fault classification. On the other hand, the similar accuracy observed for the different fault states within a classifier could be attributed to the equal data contribution of each class in the training dataset and their relationship (the slight accuracy differences were expected due to the nature of the training itself). Model complexity and computational requirements were found to decrease from the top to the bottom of the hierarchy. This can be attributed to the difference in the number of classes handled by each module, as well as the learning data size used. Overall, the obtained results demonstrated the excellent potential of CNNs for engine gas path components’ FDI.
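The kappa coefficient (KC) quoted above is, in its standard form, the agreement between predicted and true labels corrected for chance agreement; for completeness (this is the conventional definition, not reproduced from Section 3.3), it can be written from the confusion matrix C with N test samples as:

$$\mathrm{KC}=\frac{p_o-p_e}{1-p_e},\qquad p_o=\frac{1}{N}\sum_{i}C_{ii},\qquad p_e=\frac{1}{N^{2}}\sum_{i}\Big(\sum_{j}C_{ij}\Big)\Big(\sum_{j}C_{ji}\Big)$$

where the observed agreement corresponds to the overall classification accuracy and the chance agreement is obtained from the row and column totals.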
4.3. The Proposed Method vs. a Single Network Classifier
A single CNN model was trained to classify the 22 classes at once, and the results were compared with those of the proposed method. As reported in Table 14, the model achieved a 96% overall accuracy with a 0.9167 KC. Although both the OCA and KC indicators appeared to be high, the accuracy recorded for some of the faults, for instance F3, F6, F12, and F18, showed considerable errors. Using such a single network to assess the health of the entire gas path system has additional disadvantages: (1) the complexity of the model increases with the increasing number of faults; (2) a fault detection should first take place before any further investigation is performed; (3) high computational burden. This may be the reason why recent studies have been paying more attention to multiple-model approaches [62].
4.4. CNN vs. LSTM
As with the CNN, the best architecture for each LSTM classification module in the hierarchy was determined through a training experiment. The hyperparameters considered in the investigation included the number of layers, the number of hidden units in each LSTM layer, the number of dropout layers and their corresponding dropout percentages to avoid overfitting, the batch size, and the number of epochs required to complete the training. The efficient Adam optimization algorithm was used for all models.
Figure 12 shows the influence of the number of epochs, hidden units, and batch size on the classification accuracy and training time of the four ULSTM modules. For an arbitrary batch size and number of hidden units, the classification accuracy increased with increasing epochs up to 10; beyond this point, the accuracy did not change significantly. Similarly, for a given number of epochs and batch size, the accuracy increased with the number of hidden units up to 40 and showed negligible changes from there on. On the other hand, it was observed that both too small and too large batch sizes reduced the accuracy. In all the cases, the highest training time recorded was for Classifier1, due to the data size used for training. However, training time would not be an issue if high-performance hardware, such as a GPU, or parallel computing were used.
Based on the training performances, the number of epochs, the number of hidden units, and the batch size were fixed at 10, 70, and 100, respectively, for all classifiers. The comparison results presented hereafter were based on these hyperparameters.
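A PyTorch sketch of one such ULSTM classification module with the hyperparameters fixed above (70 hidden units) is given below; the input dimension, dropout rate, number of classes, and learning rate are illustrative assumptions. A bidirectional variant (BiLSTM) would simply set bidirectional=True and double the input size of the final fully connected layer.

```python
import torch
import torch.nn as nn

class GasPathULSTM(nn.Module):
    """Sketch of a unidirectional LSTM (ULSTM) classification module with
    70 hidden units, as fixed above; all other dimensions are assumed."""

    def __init__(self, n_channels=8, hidden_units=70, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(n_channels, hidden_units, batch_first=True)
        self.dropout = nn.Dropout(p=0.3)
        self.fc = nn.Linear(hidden_units, n_classes)

    def forward(self, x):                  # x: (batch, time, n_channels)
        _, (h_n, _) = self.lstm(x)         # final hidden state summarizes the sequence
        return self.fc(self.dropout(h_n[-1]))

# Training setup matching the fixed hyperparameters (Adam, batch size 100, 10 epochs):
# model = GasPathULSTM()
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```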
Table 15 and Table 16 present the fault detection and fault classification comparison results obtained, respectively. The detection results revealed that both the CNN and LSTM methods considered in this comparison showed similar accuracy. ULSTM-BiLSTM1 showed some advantages in terms of the FAR (0.2% lower than CNN1, 0.3% lower than ULSTM1 and BiLSTM-ULSTM1, and 0.4% lower than BiLSTM1), but, on the other hand, it showed the lowest TPR (87%) and ODA (96%). CNN1 had the highest ODA (96.5%) and true positive rate (89.2%) and the second lowest FAR (0.6%).
4.5. Evaluation Based on Extrapolation Data
The generalization performance of the CNN and LSTM algorithms was also assessed and compared based on extrapolation data to ensure the stability of the models. The extrapolation data in this case represent performance data of the engine that are not part of the 176,940-sample dataset. Figure 13 illustrates the fault patterns used to generate the 176,940 samples (FM1 and FM2) and the extrapolation data (FM3 to FM8). FM1 refers to the fault magnitude used for the FAN, IPC, and HPC and FM2 to that used for the HPT, IPT, and LPT, based on a 2:1 ratio between flow capacity and efficiency changes. Although the 2:1 ratio is common practice in gas turbine diagnostics [39], this does not seem to be the only possible scenario, according to the reports on gas turbine degradation [1]. Hence, in order to evaluate the influence of the ratio change on the diagnostic accuracy, 3:1, 2.25:1, and 3:2 ratios were considered to generate the extrapolation data.
For the FAN, IPC, and HPC faults, both the flow capacity and efficiency values were reduced, as seen in the patterns below the zero axis. For the HPT, IPT, and LPT faults, the flow capacity increased while the efficiency decreased (i.e., the patterns above the zero axis). Module faults are represented in terms of the root sum square of the two performance parameters.
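Written out explicitly (a restatement of the conventions above rather than an equation quoted from the paper), the fault magnitude of a component follows the root-sum-square form:

$$\mathrm{FM}=\sqrt{(\Delta\Gamma)^{2}+(\Delta\eta)^{2}}$$

where ∆Γ and ∆η denote the flow capacity and efficiency deviations, with |∆Γ|:|∆η| set to 2:1 for the training data (FM1 and FM2) and to 3:1, 2.25:1, and 3:2 for the extrapolation data; both deviations are negative for the FAN, IPC, and HPC faults, whereas ∆Γ is positive and ∆η negative for the HPT, IPT, and LPT faults.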
Since the four LSTM models used in the comparative analysis showed similar accuracy, only the results obtained from the ULSTM algorithm are included in Table 17, due to space limitations. It can be seen that the CNN classifiers had better generalization performance than the LSTM classifiers. When the extrapolation data were used, the single-fault and double-fault classification accuracy of the CNN algorithm reduced by 1.8% and 2.8%, respectively. However, the accuracy higher than 96% achieved for both single- and double-component fault classification was within the acceptable performance range for determining the type of fault that the gas turbine engine had encountered.