4.1. Performance on Real Industrial Case
In the present work, the subsets of selected variables were assumed to have a fixed size. An estimate of the number of variables to be selected by the variable selection procedures can be obtained through principal component analysis (PCA). Considering a cumulative explained variance of 95.0%, the number of required principal components corresponded to a total of 20 variables. The complete analysis is shown in Figure A4 in Appendix A.3.
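The PCA-based estimate above can be sketched as follows. This is a generic illustration with invented data (500 samples, 40 variables), not the actual plant data set:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))        # stand-in for the historical process data

# Standardize, fit PCA, and find the smallest number of components
# whose cumulative explained variance reaches the 95.0 % threshold
X_std = StandardScaler().fit_transform(X)
cum_var = np.cumsum(PCA().fit(X_std).explained_variance_ratio_)
n_components = int(np.searchsorted(cum_var, 0.95) + 1)
print(n_components)
```

With the real data, the same computation yielded the 20 variables reported above.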
The performance of the analyzed variable selection approaches is characterized in terms of the following performance metrics: fault detection rate (FDR %), false alarm rate (FAR %), and regression score (R²). To establish a reference point for all the studied faults, the learning models were also trained without any variable/feature selection procedure.
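The two alarm-based metrics can be computed compactly. The sketch below is a generic illustration with invented alarm/label sequences (1 marks a raised alarm or a faulty sample), not the exact evaluation code of this work:

```python
import numpy as np

def fdr_far(alarms, labels):
    """Return (FDR %, FAR %) for binary alarm and fault-label sequences."""
    alarms = np.asarray(alarms, dtype=bool)
    labels = np.asarray(labels, dtype=bool)
    # FDR: fraction of faulty samples that triggered an alarm
    fdr = 100.0 * np.sum(alarms & labels) / max(np.sum(labels), 1)
    # FAR: fraction of normal samples that (wrongly) triggered an alarm
    far = 100.0 * np.sum(alarms & ~labels) / max(np.sum(~labels), 1)
    return fdr, far

alarms = [0, 0, 1, 0, 1, 1, 1, 0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
fdr, far = fdr_far(alarms, labels)
print(fdr, far)   # 75.0 25.0
```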
Table 6 shows the respective results, which are considered as the reference performance values for comparison with the performance of the models trained with variable selection methods. The regressor predictions obtained with these models for Faults I, II, and III are presented in Figure A6, Figure A7 and Figure A8 in Appendix A.4.
Table 7 shows the performance of the regressors when filter-based variable selection methods were used. In general, the regressors were able to detect Fault F-III, but unable to detect Fault F-I. On the other hand, Fault F-II led to the highest detection rates (FDR %) when the variable selection method was based on mutual information. As one can see, the learning models that used variable selection procedures based on linear correlation (Pearson and Spearman) were more likely to present overfitting, as the R² values for the validation set were negative. However, lower R² values for the Fault F-I validation set were expected because this set was much larger than the test set and, chronologically, was the most distant from the fault event, incorporating dynamic behaviors that could not have been captured in the training set. As might already be expected, low R² values were obtained in the test sets because of the presence of many faulty data.
Considering the average performance of the four regressors, the highest FDR values and lowest FAR values were achieved when the mutual information-based variable selection method was used. This can possibly be explained by the fact that the mutual information metric is able to capture nonlinear associations among the variables, while the Pearson and Spearman correlations cannot detect these nonlinear associations.
When compared to the reference performance, the methods based on linear correlations (Pearson and Spearman) led to worse results in the three faults, while the method based on mutual information was better in the three cases.
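A minimal sketch of the mutual information filter discussed above is shown below, on invented data in which the target depends nonlinearly on two of ten candidate variables (a quadratic term that a linear correlation would largely miss). The subset size is fixed beforehand, as in the text:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
# Target depends nonlinearly on variables 0 and 3; Pearson correlation
# is weak for the quadratic term, mutual information is not.
y = np.sin(X[:, 0]) + X[:, 3] ** 2 + 0.1 * rng.normal(size=300)

mi = mutual_info_regression(X, y, random_state=0)
n_select = 2                               # fixed subset size, as in the text
selected = np.argsort(mi)[::-1][:n_select]
print(sorted(selected.tolist()))
```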
The regressor performances obtained when wrapper-based variable selection procedures were used are summarized in Table 8. It is possible to observe that Fault F-III was properly detected with all analyzed wrapper-based variable selection methods. On the other hand, Fault F-I was not detected, except when the Random Forest model was used, while the best detections of Fault F-II were achieved with the variable selection procedure based on forward feature selection (Lasso), followed by backward feature elimination (Lasso). As might be expected, high FDR (%) and R² values were obtained with the training and validation sets when the learning model in the wrapper method coincided with the regressor model (Random Forest). Another aspect that must be highlighted regards the general performance of the wrapper methods, which achieved higher R² values than the filter methods. Regressors trained with wrapper methods presented a better ability to correctly model new data (generalization), as observed in the regression scores of the validation sets. In addition, only the wrapper methods that used the Lasso learning model exceeded the reference performance in all fault detection scenarios.
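The forward and backward wrapper schemes with a Lasso learning model can be sketched with scikit-learn's sequential selector. Data, sizes, and the regularization strength below are illustrative assumptions only:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 8))
y = 2.0 * X[:, 1] - 1.5 * X[:, 5] + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.01)
# Forward selection: start empty, greedily add the feature that most
# improves cross-validated performance of the Lasso model
forward = SequentialFeatureSelector(
    lasso, n_features_to_select=2, direction="forward", cv=3
).fit(X, y)
# Backward elimination: start with all features, greedily remove
backward = SequentialFeatureSelector(
    lasso, n_features_to_select=2, direction="backward", cv=3
).fit(X, y)

print(np.flatnonzero(forward.get_support()))
print(np.flatnonzero(backward.get_support()))
```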
Table 9 presents the regressor performance obtained with embedded variable selection procedures. On the whole, although Fault F-III was always properly identified, these regressors showed lower fault detection rates than described previously for the wrapper-based variable selection approaches. Besides, the selection procedures based on random forest schemes provided poorer models that were subject to overfitting. In general, the learning models that considered a variable selection step based on embedded methods did not show substantial improvements when compared to the reference performances.
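In the embedded approach, the fitted model's own attributes rank the variables; the sketch below (invented data) illustrates this with Lasso coefficients and random-forest importances:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 8))
y = 3.0 * X[:, 2] + X[:, 6] + 0.1 * rng.normal(size=300)

# Lasso: rank by absolute value of the fitted coefficients
lasso_rank = np.argsort(np.abs(Lasso(alpha=0.05).fit(X, y).coef_))[::-1]
# Random forest: rank by impurity-based feature importances
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
rf_rank = np.argsort(rf.feature_importances_)[::-1]

print(sorted(lasso_rank[:2].tolist()), sorted(rf_rank[:2].tolist()))
```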
Although variable selection methods based on causal relationships can be classified as filter methods, Table 10 shows the independent evaluation of the respective fault detection results obtained with these methods. As one can see, the causality-based approaches outperformed the other methods for most of the faults in terms of selecting the subset that produces the best regression accuracy. These approaches also led to the best R² values for the validation set, generating more general learning models and providing, on average, the highest FDR and lowest FAR values among all methods applied here. This better generalization capability proved to be fundamental in the analyzed context because the process is likely to be subject to dynamic changes during the operation time as a function of the variations of the plant operating conditions. In particular, the PCMCI procedure with a PCStable stage using partial correlation and an MCI stage using conditional mutual information metrics proved to be the most suitable procedure for the detection of Faults II and III, while the best Fault I detection performance was achieved using the PCMCI procedure with partial correlation metrics in both of its stages.
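The PCMCI procedure itself is implemented elsewhere (e.g., in the tigramite package); the fragment below is only a deliberately simplified, self-contained stand-in for the idea behind its conditioning stages. It ranks lagged candidates X_{t-1,j} by their partial correlation with the target y_t, conditioning on the target's own past y_{t-1}; all data and coefficients are invented:

```python
import numpy as np
from numpy.linalg import lstsq

def partial_corr(a, b, z):
    """Correlation between a and b after regressing both on the conditions z."""
    z = np.column_stack([np.ones(len(a)), z])
    ra = a - z @ lstsq(z, a, rcond=None)[0]
    rb = b - z @ lstsq(z, b, rcond=None)[0]
    return float(np.corrcoef(ra, rb)[0, 1])

rng = np.random.default_rng(4)
n = 500
x = rng.normal(size=(n, 3))
y = np.zeros(n)
for t in range(1, n):                     # y is driven by variable 0 at lag 1
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1, 0] + 0.1 * rng.normal()

target = y[2:]                            # y_t for t = 2 .. n-1
cond = y[1:-1]                            # y_{t-1}: condition on the own past
scores = [abs(partial_corr(target, x[1:-1, j], cond)) for j in range(3)]
print(int(np.argmax(scores)))             # expected to identify variable 0
```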
Figure 2 shows the predictions of Fault F-I obtained with PCMCI (partial correlation). For all analyzed regressors, it is possible to observe good R² values for the training and validation sets and a clear divergence between measured data and the respective predictions in the test set near the failure event.
Figure 3 presents the respective SPE index plot, where the regression residues in the training and validation sets remained below the control limit, except for some sporadic points, which were responsible for the observed FAR rates. This control limit was exceeded consistently during the reported fault event, proving the capacity of these models for fault detection. As one can see, the abnormality was detected before the fault event reported by the operation, which explains the poor FDR and the uniform FAR values obtained by all regressors, regardless of the variable selection algorithm.
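A minimal sketch of such SPE monitoring is given below. The percentile-based control limit and the synthetic residues (a step fault injected at sample 100 of the test set) are illustrative assumptions, not the exact recipe used in this work:

```python
import numpy as np

rng = np.random.default_rng(5)
resid_train = 0.1 * rng.normal(size=400)              # residues on normal data
resid_test = np.concatenate([0.1 * rng.normal(size=100),      # normal part
                             1.0 + 0.1 * rng.normal(size=100)])  # fault at t=100

spe_train = resid_train ** 2
limit = np.percentile(spe_train, 99)      # illustrative 99th-percentile limit

spe_test = resid_test ** 2
fdr = 100.0 * np.mean(spe_test[100:] > limit)   # alarms in the faulty stretch
far = 100.0 * np.mean(spe_test[:100] > limit)   # alarms on normal test data
print(fdr, far)
```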
Figure 4 and Figure 5 show, respectively, the dimensionless temperature predictions and the SPE index during the Fault II detection. In this case, the PCStable (partial correlation) with MCI (conditional mutual information) algorithm was used as the variable selection procedure. The fault was properly detected according to the reported event and the SPE behavior. On the other hand, the intermittent nature of this failure explains the poorer FDR values obtained.
Finally, the prediction results and the SPE index behavior in the Fault F-III detection scenario are presented in Figure 6 and Figure 7, respectively. As previously pointed out, this fault was detected appropriately, despite the oscillatory character of the predicted variable. Moreover, the event reported by the operation seems to have occurred before the actual manifestation of the failure; consequently, the maximum reachable FDR rate corresponds (approximately) to the value of 63% reported in Table 7, Table 8, Table 9 and Table 10.
An important aspect of the discussion about variable selection methods based on causality is the insertion of lagged variables in the analysis, which derives naturally from the discovery and reconstruction of lagged links. The inclusion of these time-shifted variables can allow for improved modelling of the dynamic behaviour of the process trajectories, while using the same detection model [56,57,58].
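Appending such time-shifted copies to the regression matrix can be sketched as follows (the helper name and lag choices are hypothetical):

```python
import numpy as np

def add_lags(X, lags):
    """Stack X with lagged copies X[t-l] for each l in lags, trimming the start."""
    max_lag = max(lags)
    blocks = [X[max_lag:]]                         # current values, aligned
    blocks += [X[max_lag - l : len(X) - l] for l in lags]   # lagged copies
    return np.hstack(blocks)

X = np.arange(12, dtype=float).reshape(6, 2)       # 6 samples, 2 variables
X_lagged = add_lags(X, lags=[1, 2])
print(X_lagged.shape)                              # (4, 6)
```

The first rows are trimmed because no lagged values exist for them, so the lagged matrix has `len(X) - max(lags)` samples.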
Mutual information, which was applied in the filter methods, is a metric that is similar to those used in the causal methods. However, this methodology determines the relationships between pairs of variables, neglecting the effect of the remaining variables on the pair. Therefore, conditional approaches are more appropriate, as they attempt to isolate the effects of the other variables during the discovery of causal connections. Basically, while one approach looks for (nonlinearly) correlated variables, the other looks for causal variables.
As previously highlighted, lagged conditional independence discovery procedures search for the causal connections of the predicted variable Y. Hence, the use of lagged variables seems natural to define the subset of selected variables.
4.2. Performance on Benchmark Case
As described previously, the size of the subset of selected variables was kept constant, as determined through PCA analysis, with 15 components being required to describe 99.5% of the cumulative variance. The complete PCA analysis is shown in Figure A5 in Appendix A.3.
Table 11 shows the regressor performances when the most prominent variable selection procedures of each class were applied. According to the FDR and FAR metrics, the detection of Fault IDV(1) was better when the PCMCI approach was employed, while Fault IDV(5) was correctly detected with similar performance by the PCMCI and l1-regularization (Lasso) methods. The R² values obtained in the test sets reflect the fact that these sets are composed mostly of non-faulty data.
The better performance of the causal methods for variable selection in this case study can be explained by the inclusion of lagged variables for model training, which, according to the literature [59,60], can exert a determining role in the detection of failures in the TEP process.
It is worth mentioning that the use of variable selection methods (except for the causal methods) did not lead to notable improvements in relation to the reference performance. Hence, the use of variable selection schemes in the TEP case study does not constitute a limiting step for the detection of the analyzed faults, as the process variables are more causally interconnected and the redundant variables do not interfere drastically with the performance of the models. However, the selection of variables allows working with less complex and computationally faster models. Moreover, it must be clear that the use of causal methods for the selection of relevant variables did allow improvement of the analyzed performance, being recommended for more demanding implementations.
4.3. Analysis of Selected Variables
The oil and gas fiscal metering process constitutes an interesting case study because it involves a large number of variables measured along the different sections of the process, making it difficult to define a priori the most relevant variables for the prediction of a particular variable of interest. Intuitively, it is expected that this subset will contain variables from the same plant section to which the prediction target variable belongs and will reflect phenomenological characteristics of the process. In this context, Figure A9, Figure A10, Figure A11 and Figure A12 in Appendix A.5 show the subsets selected by the most outstanding selection methods (by class) according to the previously reported results. These selections correspond to the training set used to detect Fault F-I, where the predicted variable corresponded to FIT-02B-A (gas flow rate in fiscal meter 2B in Section A of Figure 1). The process variables and respective tags are listed in Table A3 in Appendix A.6.
The rankings of relevant variables determined by the distinct variable selection methods show PDIT02B-A (differential pressure in fiscal meter 2B in Section A in Figure 1) as the most important measurement, which is consistent with the inherent physical principle of the fiscal meter measurement. However, it was the causal methods that included in their respective selected subsets the largest number of variables geographically adjacent to the monitored fiscal meter, representing the phenomenological nature of the process.
On the other hand, in systems of high dimensionality, the causal characterization methods are useful not only for fault diagnosis [61,62,63], but also for generating better models for fault detection, as already shown in this work. In addition, the causal networks reconstructed from time series [36] keep some causal properties that can be intuitively extracted from the respective process flow diagram (PFD).
Another representative performance metric is the mean absolute error (MAE). Figure 8 shows the MAE values obtained by the different regressors for Fault F-I in the validation set, considering all the variable selection methods studied here. As one can see, the MAE values were lower when the variable selection methods based on causality were used. It is important to note that better adjustments and performances could possibly be achieved if hyperparameter optimization stages were carried out during the training procedures. However, as the present work emphasized the study of the effect of the variable selection procedures, and not the effect of hyperparameters on the regression model performances during fault detection, optimization of hyperparameters was not pursued.
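For reference, the MAE metric is simply the average absolute deviation between measured and predicted values; the sketch below uses invented data:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([0.9, 1.1, 1.0, 0.8])   # illustrative measured values
y_pred = np.array([1.0, 1.0, 1.2, 0.9])   # illustrative model predictions
mae = mean_absolute_error(y_true, y_pred)
print(mae)   # ≈ 0.125
```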
Finally, Table 12 shows the CPU times demanded by each method during the selection of variables for the detection of Fault F-I. It can be observed that the causal methods were the slowest ones, given the more demanding computation of causal links. However, considering that the variable selection stage must be performed before the training stage, this computational demand should not constitute a limiting factor for eventual online applications.
4.4. Final Considerations
In general, all fault detection metrics showed improvements when any of the variable selection approaches studied in this work was applied. Moreover, these approaches reduced the dimensionality of the fault detection problem, allowing the building of simpler learning models, which is a desired attribute in online monitoring.
Variable selection methods based on causality led to better fault detection performance, since they included time-lagged variables intended to model the dynamic behavior of the process trajectories. Furthermore, as discussed in Section 4.3, the selected variable subsets kept causal associations with respect to the predicted variable, reflecting phenomenological characteristics of the process.
The obtained results showed that the wrapper-based methods prevail over the filter-based methods in terms of prediction accuracy, as similarly observed in the literature [3,6]. However, the causality methods can be classified as filter-based methods because the variable selection engine is independent of the regressor model. This independence explains the homogeneity in terms of fault detection metrics observed for the four learning models along the fault scenarios studied.
The fault detection scenarios corresponding to the real industrial case provided the opportunity to deal with issues rarely found in simulated or benchmark cases, such as high dimensionality, real noisy measurements, and divergences between the reported fault events and the actual manifestation of the failures.