4.1. Performance on Real Industrial Case
In the present work, the subsets of selected variables were assumed to have a fixed size. An estimate of the number of variables to be selected by the variable selection procedures can be obtained through principal component analysis (PCA). Considering a cumulative explained variance of 95.0%, the number of required principal components corresponded to a total of 20 variables. The complete analysis is shown in Figure A4 in Appendix A.3.
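The PCA-based estimate above can be sketched as follows. This is a generic illustration with invented data (500 samples, 40 variables), not the actual plant data set:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))        # stand-in for the historical process data

# Standardize, fit PCA, and find the smallest number of components
# whose cumulative explained variance reaches the 95.0 % threshold
X_std = StandardScaler().fit_transform(X)
cum_var = np.cumsum(PCA().fit(X_std).explained_variance_ratio_)
n_components = int(np.searchsorted(cum_var, 0.95) + 1)
print(n_components)
```

With the real data, the same computation yielded the 20 variables reported above.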
The performance of the analyzed variable selection approaches is characterized in terms of the following performance metrics: fault detection rate (FDR %), false alarm rate (FAR %), and regression score (R²). To establish a reference point for all the studied faults, the learning models were also trained without any variable/feature selection procedure.
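The two alarm-based metrics can be computed compactly. The sketch below is a generic illustration with invented alarm/label sequences (1 marks a raised alarm or a faulty sample), not the exact evaluation code of this work:

```python
import numpy as np

def fdr_far(alarms, labels):
    """Return (FDR %, FAR %) for binary alarm and fault-label sequences."""
    alarms = np.asarray(alarms, dtype=bool)
    labels = np.asarray(labels, dtype=bool)
    # FDR: fraction of faulty samples that triggered an alarm
    fdr = 100.0 * np.sum(alarms & labels) / max(np.sum(labels), 1)
    # FAR: fraction of normal samples that (wrongly) triggered an alarm
    far = 100.0 * np.sum(alarms & ~labels) / max(np.sum(~labels), 1)
    return fdr, far

alarms = [0, 0, 1, 0, 1, 1, 1, 0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
fdr, far = fdr_far(alarms, labels)
print(fdr, far)   # 75.0 25.0
```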
Table 6 shows the respective results, which are considered as the reference performance values for comparison with the performance of the models trained with variable selection methods. The regressor predictions obtained with these models for Faults I, II, and III are presented in Figure A6, Figure A7 and Figure A8 in Appendix A.4.
Table 7 shows the performance of the regressors when filter-based variable selection methods were used. In general, the regressors were able to detect Fault F-III, but unable to detect Fault F-I. On the other hand, Fault F-II led to the highest detection rates (FDR %) when the variable selection method was based on mutual information. As one can see, the learning models that used variable selection procedures based on linear correlation (Pearson and Spearman) were more likely to present overfitting, as the R² values for the validation set were negative. However, lower R² values for the Fault F-I validation set were expected because this set was much larger than the test set and, chronologically, was the most distant from the fault event, incorporating dynamic behaviors that could not have been captured in the training set. As might already be expected, low R² values were obtained in the test sets because of the presence of many faulty data.
Considering the average performance of the four regressors, the highest FDR values and lowest FAR values were achieved when the mutual information-based variable selection method was used. This can possibly be explained by the fact that the mutual information metric is able to capture nonlinear associations among the variables, while the Pearson and Spearman correlations cannot detect these nonlinear associations.
When compared to the reference performance, the methods based on linear correlations (Pearson and Spearman) led to worse results in the three faults, while the method based on mutual information was better in the three cases.
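A minimal sketch of the mutual information filter discussed above is shown below, on invented data in which the target depends nonlinearly on two of ten candidate variables (a quadratic term that a linear correlation would largely miss). The subset size is fixed beforehand, as in the text:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
# Target depends nonlinearly on variables 0 and 3; Pearson correlation
# is weak for the quadratic term, mutual information is not.
y = np.sin(X[:, 0]) + X[:, 3] ** 2 + 0.1 * rng.normal(size=300)

mi = mutual_info_regression(X, y, random_state=0)
n_select = 2                               # fixed subset size, as in the text
selected = np.argsort(mi)[::-1][:n_select]
print(sorted(selected.tolist()))
```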
The regressor performances obtained when wrapper-based variable selection procedures were used are summarized in Table 8. It is possible to observe that Fault F-III was properly detected with all analyzed wrapper-based variable selection methods. On the other hand, Fault F-I was not detected, except when the Random Forest model was used, while the best detections of Fault F-II were achieved with the variable selection procedure based on forward feature selection (Lasso), followed by backward feature elimination (Lasso). As might be expected, high FDR (%) and R² values were obtained with the training and validation sets when the learning model in the wrapper method coincided with the regressor model (Random Forest). Another aspect that must be highlighted regards the general performance of the wrapper methods, which achieved higher R² values than the filter methods. Regressors trained with wrapper methods presented a better ability to correctly model new data (generalization), as observed in the regression scores of the validation sets. In addition, only the wrapper methods that used the Lasso learning model exceeded the reference performance in all fault detection scenarios.
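The forward and backward wrapper schemes with a Lasso learning model can be sketched with scikit-learn's sequential selector. Data, sizes, and the regularization strength below are illustrative assumptions only:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 8))
y = 2.0 * X[:, 1] - 1.5 * X[:, 5] + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.01)
# Forward selection: start empty, greedily add the feature that most
# improves cross-validated performance of the Lasso model
forward = SequentialFeatureSelector(
    lasso, n_features_to_select=2, direction="forward", cv=3
).fit(X, y)
# Backward elimination: start with all features, greedily remove
backward = SequentialFeatureSelector(
    lasso, n_features_to_select=2, direction="backward", cv=3
).fit(X, y)

print(np.flatnonzero(forward.get_support()))
print(np.flatnonzero(backward.get_support()))
```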
Table 9 presents the regressor performance obtained with embedded variable selection procedures. On the whole, although Fault F-III was always properly identified, these regressors showed lower fault detection rates than described previously for the wrapper-based variable selection approaches. Besides, the selection procedures based on random forest schemes provided poorer models that were subject to overfitting. In general, the learning models that considered a variable selection step based on embedded methods did not show substantial improvements when compared to the reference performances.
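In the embedded approach, the fitted model's own attributes rank the variables; the sketch below (invented data) illustrates this with Lasso coefficients and random-forest importances:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 8))
y = 3.0 * X[:, 2] + X[:, 6] + 0.1 * rng.normal(size=300)

# Lasso: rank by absolute value of the fitted coefficients
lasso_rank = np.argsort(np.abs(Lasso(alpha=0.05).fit(X, y).coef_))[::-1]
# Random forest: rank by impurity-based feature importances
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
rf_rank = np.argsort(rf.feature_importances_)[::-1]

print(sorted(lasso_rank[:2].tolist()), sorted(rf_rank[:2].tolist()))
```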
Although variable selection methods based on causal relationships can be classified as filter methods, Table 10 shows the independent evaluation of the respective fault detection results obtained with these methods. As one can see, the causality-based approaches outperformed the other methods for most of the faults in terms of selecting the subset that produces the best regression accuracy. These approaches also led to the best R² values for the validation set, generating more general learning models and providing, on average, the highest FDR and lowest FAR values among all methods applied here. This better generalization capability proved to be fundamental in the analyzed context because the process is likely to be subject to dynamic changes during the operation time as a function of the variations of the plant operating conditions. In particular, the PCMCI procedure with a PCStable stage using partial correlation and an MCI stage using conditional mutual information metrics proved to be the most suitable procedure for the detection of Faults II and III, while the best Fault I detection performance was achieved using the PCMCI procedure with partial correlation metrics in both of its stages.
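The PCMCI procedure itself is implemented elsewhere (e.g., in the tigramite package); the fragment below is only a deliberately simplified, self-contained stand-in for the idea behind its conditioning stages. It ranks lagged candidates X_{t-1,j} by their partial correlation with the target y_t, conditioning on the target's own past y_{t-1}; all data and coefficients are invented:

```python
import numpy as np
from numpy.linalg import lstsq

def partial_corr(a, b, z):
    """Correlation between a and b after regressing both on the conditions z."""
    z = np.column_stack([np.ones(len(a)), z])
    ra = a - z @ lstsq(z, a, rcond=None)[0]
    rb = b - z @ lstsq(z, b, rcond=None)[0]
    return float(np.corrcoef(ra, rb)[0, 1])

rng = np.random.default_rng(4)
n = 500
x = rng.normal(size=(n, 3))
y = np.zeros(n)
for t in range(1, n):                     # y is driven by variable 0 at lag 1
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1, 0] + 0.1 * rng.normal()

target = y[2:]                            # y_t for t = 2 .. n-1
cond = y[1:-1]                            # y_{t-1}: condition on the own past
scores = [abs(partial_corr(target, x[1:-1, j], cond)) for j in range(3)]
print(int(np.argmax(scores)))             # expected to identify variable 0
```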
Figure 2 shows the predictions of Fault F-I obtained with PCMCI (partial correlation). For all analyzed regressors, it is possible to observe good R² values for the training and validation sets and a clear divergence between measured data and the respective predictions in the test set near the failure event.
Figure 3 presents the respective SPE index plot, where the regression residues in the training and validation sets remained below the control limit, except for some sporadic points, which were responsible for the observed FAR rates. This control limit was exceeded consistently during the reported fault event, proving the capacity of these models for fault detection. As one can see, the abnormality was detected before the fault event reported by the operation, which explains the poor FDR and the uniform FAR values obtained by all regressors, regardless of the variable selection algorithm.
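A minimal sketch of such SPE monitoring is given below. The percentile-based control limit and the synthetic residues (a step fault injected at sample 100 of the test set) are illustrative assumptions, not the exact recipe used in this work:

```python
import numpy as np

rng = np.random.default_rng(5)
resid_train = 0.1 * rng.normal(size=400)              # residues on normal data
resid_test = np.concatenate([0.1 * rng.normal(size=100),      # normal part
                             1.0 + 0.1 * rng.normal(size=100)])  # fault at t=100

spe_train = resid_train ** 2
limit = np.percentile(spe_train, 99)      # illustrative 99th-percentile limit

spe_test = resid_test ** 2
fdr = 100.0 * np.mean(spe_test[100:] > limit)   # alarms in the faulty stretch
far = 100.0 * np.mean(spe_test[:100] > limit)   # alarms on normal test data
print(fdr, far)
```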
Figure 4 and Figure 5 show, respectively, the dimensionless temperature predictions and the SPE index during the Fault II detection. In this case, the PCStable (partial correlation) with MCI (conditional mutual information) algorithm was used as the variable selection procedure. The fault was properly detected according to the reported event and the SPE behavior. On the other hand, the intermittent nature of this failure explains the poorer FDR values obtained.
Finally, the prediction results and the SPE index behavior in the Fault F-III detection scenario are presented in Figure 6 and Figure 7, respectively. As previously pointed out, this fault was detected appropriately, despite the oscillatory character of the predicted variable. Moreover, the event reported by the operation seems to have occurred before the actual manifestation of the failure; consequently, the maximum reachable FDR rate corresponds (approximately) to the value of 63% reported in Table 7, Table 8, Table 9 and Table 10.
An important aspect of the discussion about variable selection methods based on causality is the insertion of lagged variables in the analysis, which derives naturally from the discovery and reconstruction of lagged links. The inclusion of these time-shifted variables can allow for improved modelling of the dynamic behaviour of the process trajectories, while using the same detection model [56,57,58].
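Appending such time-shifted copies to the regression matrix can be sketched as follows (the helper name and lag choices are hypothetical):

```python
import numpy as np

def add_lags(X, lags):
    """Stack X with lagged copies X[t-l] for each l in lags, trimming the start."""
    max_lag = max(lags)
    blocks = [X[max_lag:]]                         # current values, aligned
    blocks += [X[max_lag - l : len(X) - l] for l in lags]   # lagged copies
    return np.hstack(blocks)

X = np.arange(12, dtype=float).reshape(6, 2)       # 6 samples, 2 variables
X_lagged = add_lags(X, lags=[1, 2])
print(X_lagged.shape)                              # (4, 6)
```

The first rows are trimmed because no lagged values exist for them, so the lagged matrix has `len(X) - max(lags)` samples.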
Mutual information, which was applied in the filter methods, is a metric that is similar to those used in the causal methods. However, this methodology determines the relationships between pairs of variables, neglecting the effect of the remaining variables on the pair. Therefore, conditional approaches are more appropriate, as they attempt to isolate the effects of the other variables during the discovery of causal connections. Basically, while one approach looks for (nonlinearly) correlated variables, the other looks for causal variables.
As previously highlighted, lagged conditional independence discovery procedures search for the causal connections of the predicted variable Y. Hence, the use of lagged variables seems natural to define the subset of selected variables.
4.2. Performance on Benchmark Case
As described previously, the size of the subset of selected variables was kept constant, as determined through PCA analysis, with 15 components being required to describe 99.5% of the cumulative variance. The complete PCA analysis is shown in Figure A5 in Appendix A.3.
Table 11 shows the regressor performances when the most prominent variable selection procedures of each class were applied. According to the FDR and FAR metrics, the detection of Fault IDV(1) was better when the PCMCI approach was employed, while Fault IDV(5) was correctly detected with similar performance by the PCMCI and l1-regularization (Lasso) methods. The R² values obtained in the test sets reflect the fact that these sets are composed mostly of non-faulty data.
The better performance of the causal methods for variable selection in this case study can be explained by the inclusion of lagged variables for model training, which, according to the literature [59,60], can exert a determining role in the detection of failures in the TEP process.
It is worth mentioning that the use of variable selection methods (except for the causal methods) did not lead to notable improvements in relation to the reference performance. Hence, the use of variable selection schemes in the TEP case study does not constitute a limiting step for the detection of the analyzed faults, as the process variables are more causally interconnected and the redundant variables do not interfere drastically with the performance of the models. However, the selection of variables allows working with less complex and computationally faster models. Moreover, it must be clear that the use of causal methods for the selection of relevant variables did allow improvement of the analyzed performance, being recommended for more demanding implementations.
4.3. Analysis of Selected Variables
The oil and gas fiscal metering process constitutes an interesting case study because it involves a large number of variables measured along the different sections of the process, making it difficult to define a priori the most relevant variables for the prediction of a particular variable of interest. Intuitively, it is expected that this subset will contain variables from the same plant section to which the prediction target variable belongs and will reflect phenomenological characteristics of the process. In this context, Figure A9, Figure A10, Figure A11 and Figure A12 in Appendix A.5 show the subsets selected by the most outstanding selection methods (by class) according to the previously reported results. These selections correspond to the training set used to detect Fault F-I, where the predicted variable corresponded to FIT-02B-A (gas flow rate in fiscal meter 2B in Section A of Figure 1). The process variables and respective tags are listed in Table A3 in Appendix A.6.
The rankings of relevant variables determined by the distinct variable selection methods show PDIT02B-A (differential pressure in fiscal meter 2B in Section A in Figure 1) as the most important measurement, which is consistent with the inherent physical principle of the fiscal meter measurement. However, it was the causal methods that included in their respective selected subsets the largest number of variables geographically adjacent to the monitored fiscal meter, representing the phenomenological nature of the process.
On the other hand, in systems of high dimensionality, the causal characterization methods are useful not only for fault diagnosis [61,62,63], but also for generating better models for fault detection, as already shown in this work. In addition, the causal networks reconstructed from time series [36] keep some causal properties that can be intuitively extracted from the respective process flow diagram (PFD).
Another representative performance metric is the mean absolute error (MAE). Figure 8 shows the MAE values obtained by the different regressors for Fault F-I in the validation set, considering all the variable selection methods studied here. As one can see, the MAE values were lower when the variable selection methods based on causality were used. It is important to note that better adjustments and performances could possibly be achieved if hyperparameter optimization stages were carried out during the training procedures. However, as the present work emphasized the study of the effect of the variable selection procedures, and not the effect of hyperparameters on the regression model performances during fault detection, optimization of hyperparameters was not pursued.
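For reference, the MAE metric is simply the average absolute deviation between measured and predicted values; the sketch below uses invented data:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([0.9, 1.1, 1.0, 0.8])   # illustrative measured values
y_pred = np.array([1.0, 1.0, 1.2, 0.9])   # illustrative model predictions
mae = mean_absolute_error(y_true, y_pred)
print(mae)   # ≈ 0.125
```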
Finally, Table 12 shows the CPU times demanded by each method during the selection of variables for the detection of Fault F-I. It can be observed that the causal methods were the slowest ones, given the more demanding computation of causal links. However, considering that the variable selection stage must be performed before the training stage, this computational demand should not constitute a limiting factor for eventual online applications.
4.4. Final Considerations
In general, all fault detection metrics showed improvements when any of the variable selection approaches studied in this work was applied. Moreover, these approaches reduced the dimensionality of the fault detection problem, allowing the building of simpler learning models, which is a desired attribute in online monitoring.
Variable selection methods based on causality led to better fault detection performance, since they included time-lagged variables intended to model the dynamic behavior of the process trajectories. Furthermore, as discussed in Section 4.3, the selected variable subsets kept causal associations with respect to the predicted variable, reflecting phenomenological characteristics of the process.
The obtained results showed that the wrapper-based methods prevail over the filter-based methods in terms of prediction accuracy, as similarly observed in the literature [3,6]. However, the causality methods can be classified as filter-based methods because the variable selection engine is independent of the regressor model. This independence explains the homogeneity in terms of fault detection metrics observed for the four learning models along the fault scenarios studied.
The fault detection scenarios corresponding to the real industrial case provided the opportunity to deal with issues rarely found in simulated or benchmark cases, such as high dimensionality, real noisy measurements, and divergences between the reported fault events and the actual manifestation of the failures.