3. Training and Testing
The general procedure for identifying the data used to train and test the classifiers is sketched in Figure 1. It should be noted that the term “training” does not rigorously apply to the K-NN classifier, since it does not involve an actual training phase.
First, each piece of field data is assigned its supposed label.
After that, rough acceptability thresholds can be used to filter the labeled field data. This step relies on the fact that, in the GT field, UMIs are currently detected by using physics-based acceptability thresholds: if a given piece of data exceeds such thresholds, it is labeled as a UMI. In practice, such an assumption may lead to misleading results, since anomalous observations may also be caused by feature noise, e.g., sensor faults. Thus, in the current paper, the out-of-range values, i.e., the ones that exceed the acceptability thresholds, are only used for testing, since they are candidate UMIs.
Conversely, data that lie within the minimum and maximum acceptability thresholds are used to both train and test the supervised classifiers. Such data are named “filtered data” in Figure 1.
Alternatively, if the acceptability thresholds are not used to preliminarily filter the candidate UMIs, all labeled field data, namely non-filtered data in Figure 1, are used to both train and test the classifiers.
A fraction of the filtered or non-filtered data is selected to train each classifier, while the remaining data are used for testing. Then, the features are selected and the “Raw data for training” (see Figure 1) are scaled into different UOMs, according to the defined number of classes c. Thus, the training dataset comprises both field and scaled data, which makes all classes inherently balanced. Once the classifiers have been trained, the raw data for testing are used to assess the capability of the learner. For each piece of testing data, the predicted label is provided, together with the classification accuracy and the posterior probability of the prediction.
Each classifier was trained and tested by means of the tools available in the Matlab® environment.
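Although the paper relies on Matlab® built-in tools, the following minimal sketch illustrates the same training pipeline in Python/scikit-learn; the feature values and scale factors are hypothetical placeholders, not the actual fleet data.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical raw training features: one row per observation,
# columns = [mean value, standard deviation] (see Section 4).
rng = np.random.default_rng(0)
raw_train = rng.normal(loc=100.0, scale=5.0, size=(200, 2))

# Illustrative scale factors from the supposed UOM to c - 1 alternative
# UOMs; each factor generates one additional class, so the resulting
# training set is inherently balanced.
scale_factors = {"uom_0": 1.0, "uom_1": 10.0, "uom_2": 0.01}

X_train = np.vstack([raw_train * f for f in scale_factors.values()])
y_train = np.repeat(list(scale_factors.keys()), len(raw_train))

clf = GaussianNB().fit(X_train, y_train)

# Each piece of testing data receives a predicted label together with
# the posterior probability of each of the c classes.
raw_test = rng.normal(loc=100.0, scale=5.0, size=(10, 2))
predicted_labels = clf.predict(raw_test)
posterior_probabilities = clf.predict_proba(raw_test)
```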
4. Field Data
In the current paper, each ML classifier is tested by using one experimental dataset collected by a pressure sensor (see Figure 2).
The field dataset was collected from a fleet of thirty Siemens GTs of the same type, installed in eleven sites worldwide. Thus, coherent comparisons can be performed among the considered GTs. Each site comprises a different number of GTs, which varies from one to ten; the amount of data acquired from each site ranges from 0.7% (i.e., Site #6) to 24.2% (i.e., Site #8) of the total amount of available data (Table 1).
In order to distinguish UMIs from sensor faults [12], field data were processed to extract two features, i.e., mean value and standard deviation, which are exploited to train and test the classifiers. For feature extraction, only steady-state operations were considered. Each piece of data of the dataset was obtained by computing the mean value and standard deviation over sixty consecutive field time points. As a result, 2721 measurements in total were obtained.
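As a sketch of this feature extraction step (assuming non-overlapping windows, which the paper does not state explicitly), each steady-state signal can be reduced as follows; the example signal is synthetic, since the real data are proprietary:

```python
import numpy as np

WINDOW = 60  # consecutive steady-state field time points per observation

def extract_features(signal: np.ndarray) -> np.ndarray:
    """Reduce a steady-state time series to (mean, standard deviation)
    pairs, one pair per non-overlapping 60-point window."""
    n_windows = len(signal) // WINDOW
    windows = signal[: n_windows * WINDOW].reshape(n_windows, WINDOW)
    return np.column_stack([windows.mean(axis=1), windows.std(axis=1)])

# Synthetic stand-in for one site's pressure signal.
rng = np.random.default_rng(1)
features = extract_features(rng.normal(100.0, 2.0, size=6000))
print(features.shape)  # (100, 2): 100 observations, 2 features
```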
The characteristics of the dataset are shown in Figure 2; for each site, box plots report the median, the lower and upper quartiles, as well as the minimum and maximum values of the data. All measurements were normalized with respect to the maximum acceptability threshold for confidentiality reasons. The values of skewness and kurtosis of the two considered features are reported in Table 2. Based on the analysis of such indices, Site #11 is characterized by the highest values of skewness and kurtosis, thus revealing that its data distribution is heavy-tailed.
Based on engineering practice and an in-depth knowledge of the GT type, the nondimensional pressure is expected to lie between 0.65 (i.e., minimum acceptability threshold) and 1.00 (i.e., maximum acceptability threshold).
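A minimal sketch of this acceptability check, applied here to the mean-value feature (an assumption made for illustration), splits the observations into filtered data and candidate UMIs as in Figure 1:

```python
import numpy as np

# Physics-based acceptability thresholds for the nondimensional pressure.
T_MIN, T_MAX = 0.65, 1.00

def split_by_thresholds(mean_values: np.ndarray):
    """Return (filtered data, candidate UMIs): observations within the
    thresholds vs. out-of-range ones, as in Figure 1."""
    in_range = (mean_values >= T_MIN) & (mean_values <= T_MAX)
    return mean_values[in_range], mean_values[~in_range]

# Example: the third value is roughly two orders of magnitude too low,
# mimicking the Site #11 behavior described below.
filtered, candidate_umis = split_by_thresholds(np.array([0.80, 0.95, 0.007, 0.72]))
```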
All data were originally labeled as kPa absolute. However, all data acquired from Site #11 are approximately two orders of magnitude lower than the minimum acceptability threshold; such data represent 13.4% of the entire dataset. Thus, data acquired from Site #11 may be incorrectly classified as out-of-range values [12], whose anomalous behavior may be caused by sensor faults [39]. However, sensor faults generally affect a limited amount of data. Thus, based on the general procedure outlined in Figure 1, all data acquired from Site #11 are candidate UMIs.
Subsequently, an inspection made by Siemens confirmed that all data collected from Site #11 were affected by UMIs and, based on the scale factor between these data and the ones acquired from the other sites, the bar absolute label was assigned.
It is worth mentioning that the use of an experimental dataset strengthens the outcomes of this paper. In fact, experimental noisy labels are included within the dataset, whereas the common practice is to artificially introduce misleading labels [5].
5. Analysis of Classification Performance
Performance of ML classifiers strictly depends on the characteristics of the training data, such as data quality and quantity [40], which can negatively affect the capability of supervised learners. To identify a general classifier able to efficiently detect UMIs, several analyses are documented and discussed, testing each classifier in challenging scenarios. Such analyses evaluate the influence of (i) data quality, (ii) data quantity and (iii) number of classes on the performance of each classifier.
Data quality. In the current paper, the influence of data quality is evaluated by means of different analyses. First, each classifier is trained by means of correctly labeled data only, i.e., filtered data (Figure 1). Then, each classifier is trained by means of non-filtered data (Figure 1), in which all data acquired from Site #11 (approximately 13% of the dataset) are experimentally affected by label noise. The effect of UMIs on classification capability is further investigated by means of a sensitivity analysis in which the rate of UMIs is progressively increased, as outlined in Table 3. At each step, the rate of UMIs is increased by roughly 10%, so that label noise affects all data acquired from Site #11 (experimental UMIs), as well as all data collected from additional sites, in which noisy labels were implanted. As in Site #11, implanted UMIs were originally labeled as kPa absolute, though bar absolute was the correct label; a sketch of this implantation is given below. Such analyses are aimed at assessing the robustness of each classifier as data quality varies.
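One plausible way to implant such UMIs (a sketch, not necessarily the authors' procedure) is to rescale genuine kPa absolute measurements to bar while leaving their label untouched:

```python
import numpy as np

KPA_TO_BAR = 0.01  # 1 kPa = 0.01 bar

def implant_umis(values_kpa, labels, site_ids, noisy_sites):
    """Rescale the measurements of the selected sites to bar while keeping
    the 'kPa absolute' label, making data and label inconsistent (a UMI)."""
    values = np.asarray(values_kpa, dtype=float).copy()
    noisy = np.isin(site_ids, list(noisy_sites))
    values[noisy] *= KPA_TO_BAR
    return values, labels, noisy
```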
Data quantity. The amount of data used for training a classifier usually affects its classification capability. In this paper, the influence of data quantity is assessed by means of two analyses.
In the first analysis, each classifier is trained by using 10% of the filtered or non-filtered data of each site. This choice is in line with the outcomes documented in Manservigi et al. [12], which reports a detailed sensitivity analysis on the amount of data used for training the SVM classifier. In Manservigi et al. [12], the amount of data used for training the classifier was equal to 10%, 25%, 50%, 75%, and 90% of both filtered and non-filtered data. Based on those results, in most cases the SVM classifier was only slightly affected by the amount of training data. Thus, in the current paper, each classifier is trained by means of the lowest rate of data, i.e., 10%.
The second analysis, namely site cross-validation, mimics the practical condition in which all data acquired from a novel site have to be labeled (see the sketch below). In line with the case study reported in Section 4, if filtered data train the classifier, all data acquired from two sites, i.e., Site #11 and one additional site in turn, are tested. Instead, if non-filtered data are used, all data acquired from only one site in turn are tested. Since each site accounts for a different number of data, each training combination is performed by means of a different amount of data, which varies from 62.4% to 99.3% of the data included in the dataset. As a result, the site cross-validation further assesses classifier reliability against the amount of training data.
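A simplified leave-one-site-out sketch (omitting the special handling of Site #11 described above) could read:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def site_cross_validation(X, y, sites):
    """Train on all sites but one and test on the held-out site,
    mimicking the labeling of data acquired from a novel installation."""
    accuracies = {}
    for site in np.unique(sites):
        held_out = sites == site
        clf = GaussianNB().fit(X[~held_out], y[~held_out])
        accuracies[site] = clf.score(X[held_out], y[held_out])
    return accuracies
```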
Number of classes. The classification capability of the classifiers is tested by means of twelve classes. The selected UOMs were identified by means of engineering practice. In fact, since all field data were originally labeled as kPa absolute, five additional UOMs, i.e., mmH2O, mbar, inH2O, psi, and bar, are accounted for in absolute terms. As a result, six absolute UOMs in total are considered, whose scale factors with respect to the kPa absolute are highlighted in Figure 3 (full circles). Such scale factors are independent of the dataset under analysis.
Moreover, the six absolute UOMs are converted into gauge terms, thus obtaining twelve UOMs in total. It has to be mentioned that the relationship between absolute and gauge UOMs is strictly correlated to the dataset under analysis. The position of the gauge UOMs (empty circles) with respect to the absolute UOMs is reported in Figure 3 for the considered case study.
It can be observed that, in the current paper, the identification of the true UOM is particularly challenging. In fact, the conversion factor between kPa absolute and inH2O gauge is equal to just 1.2. In addition, regardless of the number of classes, when non-filtered data are used, each classifier is also trained by means of incorrectly labeled data.
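As an illustration of how the twelve candidate classes can be generated, the sketch below uses the standard conversion factors and assumes standard atmospheric pressure (101.325 kPa) for the gauge conversion; the actual gauge offsets of the case study are dataset-dependent, as noted above:

```python
# Standard factors from kPa to the six considered pressure units.
TO_UNIT = {"kPa": 1.0, "mmH2O": 101.9716, "mbar": 10.0,
           "inH2O": 4.01463, "psi": 0.1450377, "bar": 0.01}
P_ATM_KPA = 101.325  # standard atmosphere, assumed here

def twelve_uoms(p_abs_kpa: float) -> dict:
    """Express one absolute pressure (in kPa) in all twelve candidate
    UOMs: six absolute and six gauge classes."""
    values = {}
    for unit, factor in TO_UNIT.items():
        values[f"{unit} absolute"] = p_abs_kpa * factor
        values[f"{unit} gauge"] = (p_abs_kpa - P_ATM_KPA) * factor
    return values
```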
6. Indices of Classification Performance
The performance of each classifier is evaluated by means of two indices, i.e., classification accuracy and posterior probability.
Classification accuracy. In the field of ML, the confusion matrix is usually exploited to evaluate the effectiveness of supervised classifiers. Based on the classifier prediction, four metrics can be calculated, i.e., the rates of true positives (TPs), false positives (FPs), true negatives (TNs) and false negatives (FNs), which can be used to calculate classification accuracy, precision, recall, and specificity.
In the current paper, accuracy (Equation (10)) meaningfully quantifies the performance of the classifiers. In fact, based on the training and testing procedure outlined in Section 3, for each site, the true label (i.e., the positive class) of all data used for testing is unique. Thus, the true negative and false positive rates are inherently null. As a result, precision is equal to 100%, specificity cannot be calculated, and accuracy corresponds to recall.
Since classification accuracy represents the rate of correctly labeled data, the higher the classification accuracy, the better the classification performance.
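Equation (10) is not reproduced here, but with the standard definition of accuracy the reduction noted above follows in one step:

```latex
% With a unique positive class per site, TN = FP = 0, hence
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
                  = \frac{TP}{TP + FN} = \mathrm{Recall}.
```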
Posterior probability. The posterior probability represents the confidence that a given piece of data belongs to a given class. By considering all classes, for each piece of data the sum of the c posterior probabilities is equal to 100%.
The procedure employed to calculate the posterior probability depends on the considered ML classifier.
In the SVM RO classifier, the score matrix P is calculated for each pair of classes; each element pij (pij ∈ [0,1]) is the confidence that a given piece of data belongs to class i. Consequently, pji = 1 − pij. The comprehensive procedure to calculate the score matrix P is described in Platt [41].
In the NB classifier, the posterior probability is calculated as in Equation (8).
Finally, in the K-NN classifier, the posterior probability of a given class is the ratio between the number of the K neighbors that belong to that class and the K value.
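The three mechanisms can be sketched with scikit-learn equivalents (not the Matlab® tools actually used in the paper): SVC with probability=True applies pairwise Platt scaling [41], GaussianNB evaluates Bayes' rule, and KNeighborsClassifier returns the fraction of the K neighbors per class:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical two-feature training set with c = 3 classes.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 1.0, size=(50, 2)) for m in (0.0, 5.0, 10.0)])
y = np.repeat(["class_0", "class_1", "class_2"], 50)

svm = SVC(kernel="rbf", probability=True).fit(X, y)   # pairwise Platt scaling
nb = GaussianNB().fit(X, y)                           # Bayes' rule posterior
knn = KNeighborsClassifier(
    n_neighbors=int(round(len(X) ** 0.5))).fit(X, y)  # K = Ntr^0.5

x_test = np.array([[4.8, 5.1]])
for clf in (svm, nb, knn):
    print(clf.predict_proba(x_test))  # each row sums to 1, i.e., 100%
```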
The optimal supervised classifier is the one that correctly classifies the “Raw data for testing” (see Figure 1) and that maximizes both classification accuracy and posterior probability. In addition, for the sake of industrial attractiveness, classification results have to be obtained with the lowest computational effort.
Receiver Operating Characteristic curve. The capability of each ML classifier is also assessed by means of the receiver operating characteristic (ROC) curve, which displays the true positive rate (TPR) vs. the false positive rate (FPR). Thus, the ROC curve shows the trade-off between the probability of detecting true positives and that of detecting false positives. For an accurate ML classifier, the ROC curve should climb steeply [42].
Area Under the Curve. The area under the curve (AUC) is the area under the ROC curve, which is in the range from 0 to 1. The higher the AUC value, the better the performance of the ML classifier [26].
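Both indices can be computed directly from the posterior probabilities; a minimal sketch with synthetic scores (one binary problem, whereas the paper evaluates one curve per group of sites) is:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic ground truth (1 = positive class) and posterior probabilities.
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 0])
y_score = np.array([0.95, 0.88, 0.70, 0.40, 0.65, 0.92, 0.10, 0.30])

fpr, tpr, _ = roc_curve(y_true, y_score)   # points of the ROC curve
auc = roc_auc_score(y_true, y_score)       # in [0, 1]; higher is better
```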
7. Results and Discussion
This section compares classification accuracy, posterior probability, and computational time of the supervised classifiers.
Support Vector Machine, NB, and K-NN classifiers can classify data by means of several approaches. For the sake of brevity, for each classifier, only the most promising approach is hereafter reported. In the current paper, the identification of the optimal ML classifier is focused on three alternatives, i.e., SVM RO, NB, and K-NN with K = Ntr^0.5.
In fact, Manservigi et al. [12] demonstrated that SVM RO is the most suitable approach for UMI detection. As in Subasi et al. [42], the σ value (defined in Equation (6)) is set equal to 1. In addition, a specific analysis on the influence of the σ parameter revealed that its value only slightly affected SVM performance [12].
Regarding the NB classifier, as done in Petschke and Staab [32], a Gaussian Naïve Bayes classifier is assumed.
Finally, Manservigi [43] has recently examined the effect of the K value on the capability of the K-NN classifier. By denoting with Ntr the number of training data (Figure 1), three analyses were carried out, in which: (i) K = 1, (ii) K = (Ntr/c)^0.5, and (iii) K = Ntr^0.5. In these analyses, the ratio Ntr/c corresponds to the number of “Raw data for training” (Figure 1). Among the three K value settings, the best results were obtained when K = Ntr^0.5, especially when non-filtered data were accounted for. This outcome confirms the analysis reported in Cheng et al. [22], which suggested setting K = Ntr^0.5.
For the K-NN classifier, neighbors are identified by means of the Euclidean distance.
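The three K settings can be written compactly as follows (a sketch; the paper's experiments were run in Matlab®):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_variants(n_tr: int, c: int) -> dict:
    """The three K settings compared in [43]; K = Ntr^0.5 performed best."""
    ks = {"K = 1": 1,
          "K = (Ntr/c)^0.5": max(1, int(round(np.sqrt(n_tr / c)))),
          "K = Ntr^0.5": max(1, int(round(np.sqrt(n_tr))))}
    return {name: KNeighborsClassifier(n_neighbors=k, metric="euclidean")
            for name, k in ks.items()}
```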
The results are split into five sections. Section 7.1 describes in detail the case in which data are filtered by means of acceptability thresholds, which were previously set thanks to domain knowledge about the considered GT type. For each classifier, Section 7.2 documents the classification performance when raw data are not filtered out. This analysis mimics the situation in which acceptability thresholds cannot be employed (e.g., their boundaries are not known). Thus, the effect of the experimental label noise can be evaluated.
For all sites characterized by the same true UOM, the mean classification accuracy and posterior probability are calculated; posterior probabilities of absolute and gauge UOMs are highlighted by means of full and dotted bars, respectively.
Section 7.3 assesses the classifiers by means of ROC curves and AUC values. Then, a sensitivity analysis on the rate of UMIs is reported in Section 7.4. For the sake of clarity, the sensitivity analysis is discussed separately, since implanted UMIs are also accounted for. Finally, Section 7.5 summarizes the comparison and provides general guidelines.
7.1. Training with Filtered Data
When 10% of the filtered data train the classifiers, the label selection for the kPa absolute sites mainly involves two out of the twelve classes, i.e., kPa absolute and inH2O gauge.
SVM RO and NB achieve approximately the same classification accuracy. In fact, the rate of correctly labeled data only slightly varies, from 95% (i.e., SVM RO) to 97% (i.e., NB). The K-NN classifier achieves the poorest classification accuracy, since only 77% of data were correctly labeled.
The inH2O gauge label slightly challenges the classifiers. In fact, for SVM RO and NB a limited fraction of data (between 3% and 9%) is classified as inH2O gauge; this rate increases up to 23% for the K-NN classifier.
Posterior probability values (Figure 4) confirm these positive results. In fact, the correct UOM is provided by the NB classifier with the highest posterior probability, on average equal to 93%, followed by the SVM RO (i.e., 87%) and K-NN (i.e., 80%) classifiers.
In contrast, all data of Site #11 are always univocally classified; in addition, their posterior probability is always equal to 100%.
Both classification accuracy and posterior probability of kPa absolute sites are usually lower than the ones obtained for Site #11. This is due to the scale factor between the true label of the site under analysis and its nearest UOM. In fact, the conversion factor between kPa absolute and inH2O gauge is only equal to 1.2, while the scale factor between bar absolute and bar gauge is roughly equal to 5. As a result, the assignment of the correct UOM is more challenging for Site #1 through #10 than for Site #11.
Site cross-validation generally confirms the results achieved by using 10% of filtered data for training the classifiers. In fact, classification accuracy and posterior probability (Figure 5) are always higher than 89% and 75%, respectively.
7.2. Training with Non-Filtered Data
The comparison between the exploitation of filtered and non-filtered data for training the classifiers reveals that all ML approaches are only slightly affected by data reliability. In fact, the reduction in terms of classification accuracy is lower than 3%, thus proving the robustness of all classifiers.
As a general comment, the highest and most homogeneous classification accuracy is provided by SVM RO, i.e., 97% and 99% for the kPa absolute and bar absolute sites, respectively. In contrast, NB correctly classifies 94% of the data acquired from Site #1 through #10 and 99% of the data acquired from Site #11. The K-NN classifier confirms the poorest classification capability, since 76% of the data acquired from Site #1 through #10 are correctly labeled.
Though the classification accuracy is generally confirmed, the posterior probability (Figure 6) proves to be more affected by data reliability. Therefore, the label is correctly provided, but classifier confidence decreases. This outcome is clearer for the SVM RO and K-NN classifiers, whose posterior probabilities decrease by up to 24% (Site #1 through #10) and 27% (Site #11), respectively. As a result, SVM RO provides the true UOM with 66% (kPa absolute sites) and 78% (bar absolute sites) confidence, while such rates are equal to 71% (kPa absolute sites) and 73% (bar absolute sites) for the K-NN classifier.
In contrast, the posterior probability of the Naïve Bayes classifier only slightly decreases; in fact, it is equal to 84% for the kPa absolute sites and 96% for the bar absolute site.
A further interesting result, which highlights the theoretical differences among the classifiers, can be inferred by comparing the posterior probability of the twelve UOMs. As a general comment, the posterior probability of SVM RO is significantly scattered. For Site #1 through #10, UOMs at the left-hand side of kPa absolute (Figure 3) usually achieve higher posterior probability values than UOMs placed at the right-hand side; this rule of thumb holds with the exception of the inH2O gauge label, which is the second-most probable label because of the challenging scale factor.
This result may be explained by the fact that the wrongly assumed mmH2O absolute data of Site #11 slightly overlap the kPa absolute data acquired from the other sites. Thus, the area that identifies the mmH2O absolute label is shifted toward the kPa absolute label.
As a consequence, the posterior probability of all UOMs located between mmH2O absolute and kPa absolute is higher than that of the other UOMs. However, this outcome cannot be clearly grasped from Figure 6, because a few outliers are included within the dataset.
Similarly, psi absolute and kPa absolute are the second and third-most probable labels, respectively, for Site #11.
In contrast, for the NB classifier, the posterior probability of each class strictly depends on its proximity to the true UOM. In fact, for Site #1 through #10, the posterior probability of inH2O absolute, mbar absolute, inH2O gauge, and kPa gauge is higher than that of the other UOMs. Similarly, for Site #11, the posterior probability of all incorrect UOMs is null, with the exception of bar gauge and psi gauge.
The K-NN classifier identifies the correct UOM among a limited number of labels. In fact, for Site #1 through #10, the posterior probability of only four classes is not null, i.e., mmH2O absolute, inH2O gauge, mbar gauge, and kPa absolute (the true UOM). This outcome can be explained by considering that wrongly assumed mmH2O absolute data approximately overlap data that are correctly labeled as kPa absolute. In addition, the inH2O gauge and mbar gauge labels are the classes that are nearest and second-nearest to the true UOM (see Figure 3). A similar result is confirmed for Site #11, in which the kPa absolute label is the only incorrect UOM whose posterior probability is not null. In fact, wrongly assumed kPa absolute data numerically overlap data correctly labeled as bar absolute.
The site cross-validation carried out by means of non-filtered data (Figure 7) generally confirms the results provided by training the classifiers with 10% of non-filtered data. In fact, the classification accuracy for Site #1 through #10 is in the range from 86% (i.e., K-NN) to 93% (i.e., SVM RO). However, the true label is assigned by SVM RO with the lowest posterior probability, i.e., 70%.
As previously demonstrated, the true UOM of Site #11 is usually better detected in terms of both classification accuracy and posterior probability. In addition, the site cross-validation further proves that the quality of the training data only slightly affects the classification accuracy, while the posterior probability may significantly decrease, even by 19%, as for SVM RO. In addition, SVM RO confirms the most scattered results in terms of posterior probability, whereas the K-NN classifier concentrates the posterior probability on only a few classes.
7.3. ROC Curve and AUC
The results reported in Figure 4, Figure 5, Figure 6 and Figure 7 are confirmed by analyzing the ROC curves reported in Figure 8 and the AUC values shown in Table 4 for the three ML classifiers. For the Naïve Bayes classifier, the ROC curves of both the kPa absolute sites and the bar absolute site exhibit a stepwise behavior (Figure 8b) and the AUC is always higher than 0.993 (Table 4). The SVM RO classifier is the second-best ML approach, since its AUC is at least 0.992. Finally, the ROC curve of the K-NN classifier is characterized by a smoother trend than those of the NB and SVM RO classifiers, thus achieving a lower AUC value. Therefore, such results point out that the NB classifier is the optimal ML classifier, because it minimizes the false positive rate and simultaneously maximizes the true positive rate.
7.4. Sensitivity Analysis
The effect of UMIs on classification capability is further investigated by means of a sensitivity analysis in which the rate of UMIs varies from 0% to 100%. This analysis is performed by considering twelve UOMs and by training the classifiers by means of 10% of the data.
The sensitivity analysis comprises ten scenarios. In the first scenario, the rate of UMIs is equal to 0%, and the corresponding classification results have already been thoroughly analyzed in Section 7.1, i.e., “Training with filtered data”. In the second scenario, the classifiers are trained by also using the experimentally noisy labels of Site #11, as in Section 7.2, “Training with non-filtered data”. The remaining scenarios account for both experimental and implanted UMIs. The accuracy and posterior probability of incorrectly labeled data obtained by varying the rate of UMIs are reported in Figure 9.
As can be grasped from Figure 9, the sensitivity analysis confirms that NB is the most robust classifier for UMI detection. In fact, both accuracy (Figure 9a) and posterior probability (Figure 9b) only slightly decrease as the rate of UMIs increases, and they are always higher than the values provided by the other two methodologies, especially when the rate of UMIs is high.
Slightly poorer results are achieved by using SVM RO. In particular, the rate of UMIs mainly affects the posterior probability, which exhibits an almost linearly decreasing trend as the noise rate increases. However, the classifier proves to be reliable up to a rate of UMIs roughly equal to 40%.
Finally, the sensitivity analysis further proves that the K-NN classifier can be exploited only when a limited amount of data is incorrectly labeled or when a limited number of outliers is included within the dataset [44]. In fact, both the accuracy and posterior probability of the K-NN classifier significantly decrease when the rate of UMIs is approximately equal to 20%, so that only 66% of the data are correctly labeled.
These results can be explained by considering the theoretical differences among the considered classifiers. In fact, as stated in Section 7.2, the posterior probability of each class provided by the NB classifier strictly depends on its proximity to the true UOM. Thus, the NB classifier proves to be more robust than SVM RO and K-NN. In fact, the SVM RO classifier tends to scatter the posterior probabilities, while the K-NN classifier labels the testing data as outliers.
7.5. Discussion and Guidelines
The capability of SVM RO, NB, and K-NN is directly compared in Table 5 and Table 6. In Table 6, robustness in the presence of UMIs and computational time are qualitatively described as extremely positive (✓✓), positive (✓), or not acceptable (✕).
As demonstrated in Section 7.1, Section 7.2 and Section 7.3, the accuracy of SVM RO is generally slightly higher than that of the NB classifier. However, NB provides the correct label with a higher posterior probability. Thus, since the optimal classifier is the one that maximizes both accuracy and posterior probability, their product is calculated. As can be grasped from Table 5, the NB classifier always provides the highest product of accuracy and posterior probability, which is always higher than 76%.
Furthermore, the computational time of both training and testing is low, taking less than 1 s for the considered dataset (Table 6). In addition, NB is a robust classifier even when the rate of UMIs increases (Table 6). For these reasons, NB proved to be the optimal classifier for UMI detection.
SVM RO may represent an effective alternative to the NB classifier, despite the fact that the computational time for both training and testing the classifier may be up to 30 times higher than that of the optimal methodology. In addition, the robustness of the SVM technique may be significantly compromised in terms of posterior probability when noisy labels are used to train the model.
Finally, despite the promising results achieved in this paper and the low computational time, the exploitation of the K-NN classifier is discouraged for detecting UMIs, since its classification capability is strongly correlated to the K value and the dataset composition. Thus, K-NN may not be a general approach suitable for detecting heterogeneous data, because it may require fine tuning depending on the application. In addition, a limited rate of UMIs, e.g., 20%, may significantly decrease the classifier capability.
As a final comment, it has to be mentioned that the experimental noisy labels of the dataset challenge all classifiers in realistic GT applications; thus, the inferred rules of thumb can be considered of general validity.
8. Conclusions
This paper dealt with the detection of the unit of measure inconsistency, a challenging label noise issue that makes a physical quantity inconsistent with the assigned unit of measure. To this purpose, the capability of three supervised classifiers was compared with the final aim to provide general guidelines for detecting unit of measure inconsistencies in gas turbines. Three well-known classifiers, i.e., Support Vector Machine, Naïve Bayes, and K-Nearest Neighbors, were tested by considering a test case composed of an experimental dataset collected on a large fleet of Siemens gas turbines in operation.
The effectiveness and robustness of each classifier were thoroughly assessed by varying field data reliability for training each classifier.
The analyses revealed that the Naïve Bayes classifier provided the most reliable results, both in terms of classification accuracy and posterior probability. In fact, when experimental label noise affected the dataset, accuracy and posterior probability were equal to 94.4% and 84.0%, respectively. In addition, a sensitivity analysis on the rate of incorrectly labeled data revealed that the NB classifier was only slightly affected by noisy data. The superior capability of the Naïve Bayes classifier was also demonstrated by means of the receiver operating characteristic curve and the related area under the curve, which display the trade-off between the true positive and false positive rates.
Despite the promising results obtained by means of the Support Vector Machine classifier (with Radial Basis Function and One-vs-One decomposition strategy), its classification capability was usually lower than that of NB. In addition, computational time for both training and testing the classifier was higher than that of the other methodologies.
Finally, the exploitation of the K-Nearest Neighbors was discouraged, since its effectiveness may be strongly correlated to the selected K value and dataset composition.
Future work is planned to further test the Naïve Bayes classifier by means of additional experimental datasets.