Variable Selection for Fault Detection Based on Causal Discovery Methods: Analysis of an Actual Industrial Case

Variable selection constitutes an essential step to reduce dimensionality and improve the performance of fault detection and diagnosis in large-scale industrial processes. For this reason, in this paper, variable selection approaches based on causality are proposed and compared, in terms of model adjustment to the available data and fault detection performance, with several other filter-based, wrapper-based, and embedded-based variable selection methods. These approaches are applied in a simulated benchmark case and an actual oil and gas industrial case, considering four different learning models. The experimental results show that the obtained models presented better performance during the fault detection stage when variable selection procedures based on causality were used for the purpose of model building.


Introduction
In the last decade, industrial process monitoring strategies have constantly evolved due to the technological improvements of sensors, equipment, and instrumentation and, simultaneously, the increasing relevance of Industry 4.0 in the current manufacturing scenario [1]. As these complex industrial processes can produce large amounts of data, a large number of measured variables can be simultaneously monitored in modern plant-wide process monitoring platforms [2]. Hence, removing irrelevant and redundant variables constitutes an important data treatment step, simplifying data-driven models, improving the process monitoring performance, and avoiding overfitting. In particular, Blum and Langley reviewed different definitions used to describe variable relevance in the machine learning literature [3].
Variable selection methodologies constitute an important approach to reduce dimensionality in fault detection and diagnosis problems and have become more relevant because of the recent significant increase of data-driven methods in this research area [4]. Variable selection algorithms normally try to identify the subset of measured variables that leads to the best analytical performance, and are usually divided into three categories: filter-based, wrapper-based, and embedded-based methods [3,5].
Filter-based methods do not depend on the employed learning algorithm and are often applied as a preprocessing step in which the analyzed variables are ranked by relevance according to intrinsic properties of the data. This approach scores features in accordance with a statistical criterion, typically making use of the χ² statistic, t-statistic, Pearson correlation, Spearman correlation, Fisher criterion, and metrics derived from Information Theory. Some of these criteria have been reviewed by Ghosh et al. [6].
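For illustration only, a filter-based ranking of this kind can be sketched with scikit-learn; the dataset, the number of candidate variables, and the choice of k below are hypothetical and not taken from the present study.

```python
# Hypothetical sketch: filter-based selection with scikit-learn (data and k are illustrative).
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 30))                                     # 30 candidate process variables
y = np.sin(X[:, 0]) + 0.5 * X[:, 3] + 0.1 * rng.normal(size=500)  # target depends on columns 0 and 3

# Rank variables by mutual information with the target and keep the k best ones.
selector = SelectKBest(score_func=mutual_info_regression, k=5).fit(X, y)
print("selected columns:", np.flatnonzero(selector.get_support()))  # should include 0 and 3
```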
Wrapper-based methods explore the space of variable subsets and evaluate the performance of the models built with these subsets, consequently depending on the learning algorithm. These methods usually follow one of two approaches: a sequential approach, in which selection starts from an empty set (or the full set) and features are added (or removed) until a specified regression performance is reached [7]; or a heuristic approach, in which the variable subsets are generated with the help of a stochastic algorithm [8,9].
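The sequential wrapper approach can be sketched, for instance, with scikit-learn's SequentialFeatureSelector (available in recent versions of the library); the wrapped estimator, data, and subset size below are illustrative assumptions.

```python
# Hypothetical sketch: sequential (forward) wrapper selection; estimator and sizes are illustrative.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 20))
y = X[:, 1] - 2.0 * X[:, 7] + 0.1 * rng.normal(size=400)

# Forward selection: start from the empty set and add, at each step, the variable that most
# improves the cross-validated score of the wrapped regressor, until 5 variables are selected.
sfs = SequentialFeatureSelector(Ridge(alpha=1.0), n_features_to_select=5,
                                direction="forward", cv=5).fit(X, y)
print("selected columns:", np.flatnonzero(sfs.get_support()))
```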
Embedded-based methods combine the learning model with an optimization problem, allowing variable selection and model building to be performed simultaneously. Unlike wrapper-based methods, embedded methods incorporate knowledge about the specific structure of the regression function into the variable selection engine. Two distinct families of embedded methods are usually defined: regularization methods [10,11] and tree-based methods [12].
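A minimal sketch of the two embedded families, assuming scikit-learn's SelectFromModel with a Lasso regressor (regularization family) and a random forest (tree-based family); all data and parameter values are illustrative.

```python
# Hypothetical sketch: embedded selection via L1 regularization (Lasso) and via tree importances.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 20))
y = 3.0 * X[:, 2] + X[:, 9] + 0.1 * rng.normal(size=400)

# Regularization family: Lasso drives irrelevant coefficients to zero during training.
lasso_sel = SelectFromModel(Lasso(alpha=0.05), max_features=5).fit(X, y)
print("Lasso keeps:", np.flatnonzero(lasso_sel.get_support()))

# Tree-based family: variables ranked by impurity-based feature importances.
rf_sel = SelectFromModel(RandomForestRegressor(n_estimators=100, random_state=0),
                         max_features=5).fit(X, y)
print("Random forest keeps:", np.flatnonzero(rf_sel.get_support()))
```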
From the point of view of computational cost, filter-based methods are more attractive because of their inherent simplicity, as the variable ranking can be established through simple score computations for each variable. Nevertheless, the variable subset found by these methods may not correspond to the subset that jointly maximizes the classifier/regressor performance, since variable relevance is evaluated independently of the model structure and of the remaining process variables [13].
On the other hand, wrapper-based methods are more computationally intensive, but prevail over filter-based methods in terms of prediction accuracy since they take into account the classifier-regressor performance during the variable selection step [3,6]. One possible drawback of using wrapper-based methods is that the classifiers-regressors are prone to overfitting [14].
Finally, embedded-based methods try to compensate for the drawbacks discussed previously by incorporating the variable selection procedure as a part of the training process. However, application of these methods can be very intricate and limited to some specific learning models [5].
For all previously described strategies, an important aspect in the variable selection procedure is the criterion that defines the relevance of a single variable or subset of variables. Several criteria have been investigated and can be grouped into one or more of the following categories: distance, information, dependence, consistency and classifier error [15].
Mutual information (MI) is a measure of statistical dependence that can be used to evaluate the relationship between random variables, including nonlinear relationships, being invariant under transformations of the feature space [16]. Distinct variable selection algorithms based on MI have been reviewed by Vergara and Estévez [17], covering both filter-based methods [18,19] and wrapper-based methods [20,21]. In particular, Huang et al. [22] proposed a method in which MI is initially used for variable ranking, while in a second step variable selection is guided by the maximization of an information criterion. Relevant issues associated with the dimension of the selected variable subset and the mathematical connection between mutual information and regression performance metrics have been discussed in the literature [23,24]. Other metrics derived from MI, including the joint mutual information (JMI) [25], the conditional mutual information (CMI) [26-28], and the dynamic mutual information (DMI) [29], have also been studied. In particular, CMI, Granger metrics [30], and Transfer Entropy (TE) [31] can be used as causality measures and can be calculated from observed time series to characterize inner cause-effect relationships.
Based on the previous paragraphs, the present manuscript discusses and proposes the use of variable selection approaches based on time-lagged causality algorithms developed and applied for causal network reconstruction [32]. The relevance of each variable is defined in accordance with its causal strength with respect to the predicted (monitored) variable. Consequently, the proposed methodology allows isolating the partial effect of each variable of a set over the predicted variable, quantifying the amount of information shared conditionally with the remaining variables. The causality quantification algorithms can adopt linear metrics (partial correlation) or nonlinear metrics (conditional mutual information). To validate the respective performance, the results obtained with the causality procedures are compared with results obtained with several other feature selection methods mentioned previously. The variable selection methodologies are applied in two scenarios:
1. a benchmark case, where the procedures are used to evaluate some simulated faults of the Tennessee-Eastman process;
2. a real industrial case, where the procedures are applied to actual industrial measurement datasets extracted from an oil and gas processing plant, with the objective of detecting sensor faults reported by the operators.
The paper is structured as follows: in Section 2, information-theoretic and causality preliminaries are discussed. In Section 3, the case studies and their respective fault scenarios are presented, and the research methodology is discussed. In Section 4, we present and discuss several variable selection approaches applied to fault detection for both the real (see Section 4.1) and the simulated scenarios (see Section 4.2). This leads to the recommendation of variable selection methods based on causality approaches, as discussed in Section 4.3. Finally, in Section 5, we conclude the paper and discuss future research.

Mutual Information and Entropy
The strategy for variable selection includes the identification of the input variables that contain the highest amount of information in relation to the output. Hence, entropy and mutual information are suitable measures in this context [23].
Entropy is a measure of the uncertainty of a random variable. Considering $X_j$ a discrete random variable, its entropy can be defined as [33]:
$$ H(X_j) = -\sum_{x_j} p(x_j)\,\log p(x_j), \qquad (1) $$
where $x_j$ denotes a possible value of the random variable $X_j$ and $p(x_j)$ its probability mass function. For the case of two discrete random variables, i.e., $X_j$ and $Y_j$, the joint entropy of $X_j$ and $Y_j$ can be defined as follows [25]:
$$ H(X_j, Y_j) = -\sum_{x_j}\sum_{y_j} p(x_j, y_j)\,\log p(x_j, y_j), \qquad (2) $$
where $p(x_j, y_j)$ denotes the joint probability function of $X_j$ and $Y_j$. Given that the value of another random variable $Y_j$ is known, the remaining uncertainty about the outcome of $X_j$ can be expressed by the conditional entropy [34]:
$$ H(X_j \mid Y_j) = -\sum_{x_j}\sum_{y_j} p(x_j, y_j)\,\log p(x_j \mid y_j), \qquad (3) $$
where $p(x_j \mid y_j)$ denotes the conditional probability function of $X_j$ given $Y_j$. The amount of information that one variable provides about another one can be quantified by the mutual information (MI) [33]:
$$ I(X_j; Y_j) = \sum_{x_j}\sum_{y_j} p(x_j, y_j)\,\log\frac{p(x_j, y_j)}{p(x_j)\,p(y_j)}. \qquad (4) $$
Additionally, the MI and the entropy can be related as follows [33]:
$$ I(X_j; Y_j) = H(X_j) - H(X_j \mid Y_j) = H(Y_j) - H(Y_j \mid X_j) = H(X_j) + H(Y_j) - H(X_j, Y_j). \qquad (5) $$
Mutual information can be interpreted as an independence or a correlation measure, being always non-negative, and equal to zero if and only if $X_j$ and $Y_j$ are independent [17].
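As a small numerical illustration (values chosen arbitrarily), the following snippet computes the entropies and the mutual information of a discrete joint distribution and verifies that the definition of Equation (4) agrees with the entropy identity of Equation (5).

```python
import numpy as np

def H(p):
    """Shannon entropy (in bits) of a probability table, Equations (1)-(2)."""
    p = np.asarray(p, dtype=float).ravel()
    return -np.sum(p[p > 0] * np.log2(p[p > 0]))

# Illustrative joint distribution p(x, y) for two binary variables.
p_xy = np.array([[0.30, 0.10],
                 [0.10, 0.50]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

# Mutual information from the definition (Equation (4)) ...
I_def = np.sum(p_xy * np.log2(p_xy / np.outer(p_x, p_y)))
# ... and from the entropy identity (Equation (5)); the two values coincide (~0.256 bits).
I_ent = H(p_x) + H(p_y) - H(p_xy)
print(I_def, I_ent)
```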

Conditional Mutual Information
The conditional mutual information (CMI) can be given by [26]:
$$ I(X_j; Y_j \mid Z_i) = \sum_{x_j}\sum_{y_j}\sum_{z_i} p(x_j, y_j, z_i)\,\log\frac{p(x_j, y_j \mid z_i)}{p(x_j \mid z_i)\,p(y_j \mid z_i)}, \qquad (6) $$
and measures the conditional dependence between $X_j$ and $Y_j$ given $Z_i$. The CMI can be interpreted as the MI between $X_j$ and $Y_j$ that is not contained in a third variable $Z_i$, and can be expressed in entropy terms as follows [26]:
$$ I(X_j; Y_j \mid Z_i) = H(X_j, Z_i) + H(Y_j, Z_i) - H(Z_i) - H(X_j, Y_j, Z_i). \qquad (7) $$
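A minimal sketch of a CMI estimate based on the entropy identity of Equation (7), using simple histogram binning of synthetic data; the binning choice is an illustrative assumption, and k-nearest-neighbour estimators are usually preferred in practice.

```python
# Hedged sketch: histogram-based CMI estimate using the entropy identity of Equation (7).
import numpy as np

def entropy_discrete(*cols, bins=8):
    """Joint entropy (in nats) of the binned columns."""
    counts, _ = np.histogramdd(np.column_stack(cols), bins=bins)
    p = counts.ravel() / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(3)
z = rng.normal(size=5000)
x = z + 0.3 * rng.normal(size=5000)
y = z + 0.3 * rng.normal(size=5000)     # x and y are associated only through z

def cmi(x, y, z):
    # I(X; Y | Z) = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z)
    return (entropy_discrete(x, z) + entropy_discrete(y, z)
            - entropy_discrete(z) - entropy_discrete(x, y, z))

print("I(X;Y)   ≈", entropy_discrete(x) + entropy_discrete(y) - entropy_discrete(x, y))
print("I(X;Y|Z) ≈", cmi(x, y, z))       # much smaller than I(X;Y): the dependence is explained by Z
```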

Conditional Independence and Causality
An important task consists of quantifying the information flow in multivariate systems. This quantification should address the following tasks [35]: (1) quantification of linear and nonlinear associations and (2) characterization of the directionality of the information flow propagation (causal interactions). These causal interactions can be visualized as links in an interaction network map.
According to Runge [36], a pair of variables (nodes) $X^i_{t-\tau}$ and $X^j_t$ is connected by a direct causal link if the variables are not conditionally independent given the past of the whole multivariate system (process):
$$ X^i_{t-\tau} \;\not\!\perp\!\!\!\perp\; X^j_t \ \Big|\ \mathbf{X}^-_t \setminus \{X^i_{t-\tau}\}. \qquad (8) $$
Here it is assumed that the multivariate system $\mathbf{X}$ contains $N$ variables, $\mathbf{X}_t = (X^1_t, X^2_t, \ldots, X^N_t)$, and the past of the entire system is denoted as $\mathbf{X}^-_t = (\mathbf{X}_{t-1}, \mathbf{X}_{t-2}, \ldots)$. When $i = j$, this measure represents an autodependency at lag $\tau$. Moreover, the set of parents of a variable (node) $X^j_t$ can be defined as [36]:
$$ \mathcal{P}(X^j_t) = \big\{ X^i_{t-\tau} : X^i_{t-\tau} \to X^j_t,\ \tau > 0 \big\}. \qquad (9) $$
The parents of all variables (subprocesses) in $\mathbf{X}$ and the contemporaneous links comprise the time series graph [32].
Characterization of causal links (Equation (8)) can be performed with different linear or nonlinear independence measures. In particular, MI constitutes an important metric to measure linear and nonlinear associations between variables, but not the direction of dependence. The Granger causality [30] and, in a more general context, the Transfer Entropy (TE) [31] can provide practical means to satisfy these tasks [36].
Transfer entropy can be expressed in CMI terms as $TE_{X^i \to X^j} = I(X^j_t;\, X^{i,-}_t \mid X^{j,-}_t)$, where $X^{i,-}_t = (X^i_{t-1}, X^i_{t-2}, \ldots)$ denotes the past of $X^i$. This formulation measures the aggregated influence of $X^i$ over all past lags, but leads to problems when high-dimensional probability density functions (PDFs) must be estimated [37]. Lag-specific variants of TE (relative information transfer and momentary information transfer) have been introduced [32,35] to avoid the computation of high-dimensional PDFs; a lag-specific measure of the information transferred from $X^i_{t-\tau}$ to $X^j_t$, conditioned on the remaining parents of $X^j_t$, can be written as
$$ I_{X^i \to X^j}(\tau) = I\big(X^i_{t-\tau};\ X^j_t \ \big|\ \mathcal{P}(X^j_t)\setminus\{X^i_{t-\tau}\}\big). \qquad (10) $$
An analogous form of Equation (10) can be obtained by substituting the CMI (nonlinear independence measure) by the linear partial correlation (linear independence measure) [32].

Approaches
In the present work, causal link characterization algorithms are used in the variable selection context. Some of these algorithms are briefly described in the following sections.

PC-Stable Algorithm
The PC algorithm [38] is a well-known causal link characterization algorithm, widely used to reconstruct causal relationships that can be represented by a Directed Acyclic Graph (DAG). The algorithm consists of an iterative procedure that estimates which pairs of variables (at different time lags) are conditionally independent at some significance level. The lagged links, computed according to Equation (10), provide the strength and orientation of the causal links. In the present work, a robust modification of the PC algorithm, called PC-stable [39], is used.
In particular, the PC algorithm evaluates and removes links from the DAG and updates the network dynamically. Therefore, the resulting network depends on the order in which the conditional independence tests are performed. The PC-stable algorithm, on the other hand, prevents link deletion from affecting the conditioning sets Z of the other variables. A schematic example [40] of the application of the PC and PC-stable algorithms is presented in Appendix A.1 to introduce the main aspects and differences of the algorithms.
Algorithm 1 summarises the procedures of the PC-stable method as applied in the present work. A detailed description of this algorithm can be found in the original references [39,41,42].

Algorithm 1: PC-stable algorithm
Input: dataset with the set of variables X, maximum time lag τ_max, and significance level α
Output: list of parents of the target variable X^j_t, sorted by dependence strength, P(X^j_t)

Assume initially that all lagged variables X^i_{t−τ} (i = 1, ..., N; τ = 1, ..., τ_max) are connected to X^j_t;
for each candidate X^i_{t−τ} do
    if the unconditional dependence between X^i_{t−τ} and X^j_t is significant (p-value < α) then
        add X^i_{t−τ} to the preliminary parent list P(X^j_t);
    end
end
Sort P(X^j_t) in descending order of dependence strength; let the condition dimension d = 0;
while d is smaller than the number of remaining parents do
    for each parent X^i_{t−τ} in P(X^j_t) do
        for each conditioning set S ⊆ P(X^j_t) \ {X^i_{t−τ}} with |S| = d (strongest parents first) do
            test the conditional dependence between X^i_{t−τ} and X^j_t given S;
            if the link is not significant (p-value > α) then
                mark X^i_{t−τ} as non-significant and break the inner loop;
            end
        end
    end
    remove the non-significant parents from P(X^j_t);
    sort P(X^j_t) in descending order according to the conditional dependence metric;
    let d = d + 1;
end
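In practice, the PC-stable condition-selection stage can be run, for example, with the tigramite package; the sketch below assumes tigramite's 4.x-style import paths (they differ in later versions), and the synthetic data, tau_max, and pc_alpha values are illustrative only.

```python
# Hedged sketch using the tigramite package (import paths vary across versions; 4.x style shown).
import numpy as np
from tigramite import data_processing as pp
from tigramite.pcmci import PCMCI
from tigramite.independence_tests import ParCorr   # linear partial correlation test

rng = np.random.default_rng(4)
data = rng.normal(size=(1000, 6))                   # (time steps, variables), e.g. normal-operation data
data[2:, 0] += 0.6 * data[:-2, 3]                   # variable 0 is driven by variable 3 at lag 2

dataframe = pp.DataFrame(data, var_names=[f"X{i}" for i in range(6)])
pcmci = PCMCI(dataframe=dataframe, cond_ind_test=ParCorr())

# PC-stable condition-selection stage: returns, for each variable j, a list of
# lagged parents (i, -tau) sorted by dependence strength.
all_parents = pcmci.run_pc_stable(tau_max=5, pc_alpha=0.05)
print("parents of X0:", all_parents[0])             # expected to contain (3, -2)
```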

PCMCI Algorithm
Another causal link characterization algorithm is the PCMCI algorithm [36], which aims to circumvent some limitations of PC-stable related to the optimal selection of conditioning sets, improving the accuracy of the independence strength estimation. Briefly, the PCMCI algorithm comprises two stages:

1. Estimate the parents $\mathcal{P}(X^j_t)$ for every variable $X^j_t \in \mathbf{X}$ using the PC-stable algorithm.
2. Using the estimated sets of parents, perform a novel conditional independence test, the momentary conditional independence (MCI) test, for every variable pair $(X^i_{t-\tau}, X^j_t)$:
$$ \mathrm{MCI}: \quad I\big(X^i_{t-\tau};\ X^j_t \ \big|\ \mathcal{P}(X^j_t)\setminus\{X^i_{t-\tau}\},\ \mathcal{P}(X^i_{t-\tau})\big). \qquad (11) $$
Theoretical descriptions and practical applications of the PCMCI algorithm have been thoroughly discussed elsewhere [36,41].
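A hedged sketch of the full two-stage procedure with tigramite, again assuming 4.x-style imports; here the significant lagged drivers of a single monitored variable are extracted from the returned p-value matrix, which is the use made of the algorithm in the variable selection context. Data and thresholds are illustrative.

```python
# Hedged sketch: PCMCI with a linear (ParCorr) test; the nonlinear CMIknn test from the same
# module could be used instead for the conditional-mutual-information variant.
import numpy as np
from tigramite import data_processing as pp
from tigramite.pcmci import PCMCI
from tigramite.independence_tests import ParCorr

rng = np.random.default_rng(5)
data = rng.normal(size=(1000, 6))
data[1:, 2] += 0.7 * data[:-1, 5]                   # variable 2 is driven by variable 5 at lag 1

dataframe = pp.DataFrame(data, var_names=[f"X{i}" for i in range(6)])
pcmci = PCMCI(dataframe=dataframe, cond_ind_test=ParCorr())
results = pcmci.run_pcmci(tau_max=5, pc_alpha=0.05)  # stage 1 (PC-stable) + stage 2 (MCI)

# Keep, for the monitored variable j = 2, the lagged drivers whose MCI p-value is significant.
j, alpha = 2, 0.01
significant = [(i, -tau)
               for i in range(6) for tau in range(1, 6)
               if results["p_matrix"][i, j, tau] < alpha]
print("selected lagged variables for X2:", significant)   # expected to contain (5, -1)
```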

Benchmark Case: Tennessee-Eastman Process
The Tennessee-Eastman Process (TEP) [43] is a widely used benchmark model, which serves as the industrial basis to assess the capability of fault detection and diagnosis methods. In the TEP process, there are five major units: reactor, condenser, compressor, separator and stripper. The process provides two products from four reactants. Also, an inert and a by-product are present in the process streams, leading to a total of 8 components denoted as A, B, C, D, E, F, G and H. TEP covers 22 process variables and 12 manipulated variables, resulting in 34 measured variables. Typically, measurements related to the mole fraction of components in the reactor feed, purge gas and product streams are not considered because their characteristic sampling intervals are too long [44]. A schematic diagram of the TEP process and the complete list of variables are presented in Figure A3 and Appendix A.2.
The benchmark presents 20 faults, which were originally defined by Downs and Vogel [43], and an additional valve fault further introduced in Chiang et al. [44]. The dataset used in the present work was generated with the original FORTRAN code available at http://brahms.scs.uiuc.edu (accessed 13 March 2018).
In the present work, 2 out of the 21 available faults were considered to validate the proposed variable selection methods. Table 1 summarizes the analyzed TEP faults.

Real Industrial Case: Oil and Gas Fiscal Metering Station
The industrial data used in the present work were acquired with the help of online sensors of an onshore metering station located in a Petrobras field.
Briefly, the industrial process is composed of three fiscal metering stations (two gas fiscal metering stations and one oil fiscal metering station). Figure 1 briefly describes the oil and gas fiscal metering process and enumerates its respective sections. A more detailed description of this process, the observed faults, and their respective phenomenological and economic consequences during the custody transfer process has been presented elsewhere [45]. The dataset used in the present work comprises measurements of 112 process variables, collected at a sampling period of 1 min during three consecutive years; Table 2 summarizes these process variables. Several types of faults were identified in the fiscal sensors of the process plant, most of them related to faults of temperature and flow sensors. Table 3 shows the faults studied in the present work and the respective dataset sizes used for training, validation, and testing of the employed regressor models. In particular, detections of Faults F-II and F-III were performed with the same training data.

Methodology
The fault detection approach used in the present work is discrepancy-based: the fault pattern is recognized using a residual measure calculated as the difference between the process sensor readings and the expected values obtained from the predictions of a model that represents the process under fault-free conditions. In the present work, model building is based on unsupervised learning techniques (regression models trained only on normal operation data), since the small number of recorded failures for each sensor and the constant process evolution make the use of supervised techniques (classification models) unfeasible. The purpose of these techniques is to model, based on the input and output datasets (X and Y, respectively), the dynamic behavior of the process under normal (non-fault) conditions. Process condition monitoring is based on the calculation of a residual metric; in the present case, the Squared Prediction Error (SPE) was used, a fault detection index that allows the identification of abnormal conditions along the control chart.
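A minimal sketch of this discrepancy-based logic, assuming a ridge regressor as the fault-free model and a control limit taken as the 99th percentile of the training SPE (one possible heuristic, not necessarily the one adopted in the paper); data and fault magnitude are synthetic.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(6)
w = rng.normal(size=20)

# Normal-operation (fault-free) data used to train the reference model.
X_train = rng.normal(size=(2000, 20))
y_train = X_train @ w + 0.1 * rng.normal(size=2000)

# Test data containing a simulated sensor fault (bias) from sample 300 onwards.
X_test = rng.normal(size=(500, 20))
y_test = X_test @ w + 0.1 * rng.normal(size=500)
y_test[300:] += 2.0                                    # faulty sensor reading

model = Ridge(alpha=1.0).fit(X_train, y_train)

# Squared prediction error (SPE) residual and control limit from the training residuals.
spe_train = (y_train - model.predict(X_train)) ** 2
limit = np.percentile(spe_train, 99)

spe_test = (y_test - model.predict(X_test)) ** 2
alarms = spe_test > limit
print(f"FAR = {alarms[:300].mean():.3f}, FDR = {alarms[300:].mean():.3f}")
```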
Here, the fault detection models were configured considering a single output Y, corresponding to the monitored (fault-affected) variable (see the monitored variables in Tables 1 and 3). The input set X, in turn, was generated by applying the variable selection procedures (listed in Table 4) to the original complete dataset. Moreover, the number of variables in the input set X was previously established as the number of principal components needed to describe 95% of the cumulative variance of the respective training set, as illustrated in the sketch below. To evaluate the fault detection performance in the reported case studies, the machine learning regressors described in Table 5 were considered and evaluated for each input subset X generated by the variable selection methods. The architectures and respective hyperparameters were defined according to heuristics reported in the literature, also summarized in Table 5. Thus, the efficiency of each variable selection algorithm was directly associated with the fault detection performance of models trained with the same hyperparameter values, allowing the effect of variable selection to be observed with little influence of the hyperparameter values.
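The subset size can be estimated, for instance, as sketched below with scikit-learn's PCA; the synthetic data are illustrative, while the 95% threshold reproduces the criterion described above.

```python
# Hedged sketch: size of the selected subset = number of principal components explaining
# 95% of the cumulative variance of the (standardized) training data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
latent = rng.normal(size=(3000, 8))
X = latent @ rng.normal(size=(8, 112)) + 0.05 * rng.normal(size=(3000, 112))  # 112 correlated variables

pca = PCA().fit(StandardScaler().fit_transform(X))
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_selected = int(np.searchsorted(cumvar, 0.95) + 1)
print("number of variables to select:", n_selected)
```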

Table 5. Heuristics for setting hyperparameter values during unsupervised model regressions (a configuration sketch following these heuristics is given after the table).

Regressor / Hyperparameter heuristics
Canonical correlation analysis (CCA) Number of components: the number of principal components needed to describe 95% of cumulative correlation.

Ridge regression (RR)
Regularization strength (α): A cross validation procedure was required to determine the value of this parameter.

Multilayer perceptron regressor (MLPR)
Number of hidden layers: Regression problems that require two or more hidden layers are rarely encountered; network architectures with one hidden layer can approximate any function that contains a continuous mapping from one finite space to another [46]. Number of neurons in the hidden layer: specify as many hidden nodes as the number of dimensions (principal components) needed to capture 70-90% of the variance of the input dataset [47-49]. Activation function: ReLU is a general activation function widely used in regression problems [50,51].

Random forest regressor (RF)
Number of estimators: the use of a very large number of trees in the random forest often does not lead to significant performance gains; previous benchmark evaluations suggest a range between 64 and 128 trees in a forest [52,53].
Number of features to consider when looking for the best split: on average, empirical results have shown good performance with sqrt(number of features) and (number of features) for classification and regression problems, respectively [54]. Bootstrap: to improve forecast robustness, the use of bootstrap sampling is recommended when building the trees [55].
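A possible scikit-learn configuration following these heuristics is sketched below; the component and neuron counts are hypothetical placeholders, and, with a single monitored output, scikit-learn's CCA admits at most one canonical component.

```python
# Hedged sketch of how the four regressors of Table 5 could be instantiated with scikit-learn.
from sklearn.cross_decomposition import CCA
from sklearn.linear_model import RidgeCV
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor

n_hidden = 15          # e.g., components capturing 70-90% of the input variance (assumed value)

regressors = {
    # CCA: with a single monitored output, scikit-learn admits at most one canonical component;
    # the cumulative-correlation heuristic of Table 5 applies to the multivariate case.
    "CCA": CCA(n_components=1),
    # Ridge regression: regularization strength selected by cross-validation.
    "RR": RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0], cv=5),
    # MLP: a single ReLU hidden layer sized by the variance heuristic.
    "MLPR": MLPRegressor(hidden_layer_sizes=(n_hidden,), activation="relu",
                         max_iter=1000, random_state=0),
    # Random forest: 64-128 trees, all features considered at each split, bootstrap sampling.
    "RF": RandomForestRegressor(n_estimators=100, max_features=1.0,
                                bootstrap=True, random_state=0),
}
```

Each of these regressors would then be trained on the selected input subset X and the monitored output Y under normal operating conditions.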
Overall, we considered 12 variable selection methods: three filter-based, four wrapper-based, two embedded-based, and three causality-based ones. Each variable selection method was applied in five fault detection scenarios (three fault cases of the actual industrial process and two fault cases of TEP) using four different machine learning models. Furthermore, reference fault detection scenarios for the actual industrial process and TEP were obtained considering the original complete dataset (i.e., without applying any variable selection procedure).

Results
In this section, the effectiveness of the proposed causal link characterization approaches is evaluated in the variable selection context. Other variable selection algorithms were also considered in order to compare and discuss their effects on the fault detection cases. Results and discussions regarding the real industrial and benchmark datasets are presented in Sections 4.1 and 4.2, respectively.

Performance on Real Industrial Case
In the present work, the subsets of selected variables were assumed to have a fixed size. An estimate of the number of variables to be selected by the variable selection procedures can be obtained through principal component analysis (PCA). Thus, considering a cumulative explained variance of 95.0%, the number of required principal components corresponded to a total of 20 variables. The complete analysis is shown in Figure A4 in Appendix A.3.
The performance of the analyzed variable selection approaches is characterized in terms of the following performance metrics: fault detection rate (FDR%), false alarm rate (FAR%), and regression score R². To establish a reference point for all the studied faults, the learning models were also trained without any variable/feature selection procedure. Table 6 shows the respective results, which are considered the reference performance values for comparison with the performance of the models trained with variable selection methods. The regressor predictions obtained with these models for Faults F-I, F-II, and F-III are presented in Figures A6-A8 in Appendix A.4.

Table 7 shows the performance of the regressors when filter-based variable selection methods were used. In general, the regressors were able to detect Fault F-III but unable to detect Fault F-I. On the other hand, Fault F-II led to the highest detection rates (FDR%) when the variable selection method was based on mutual information. As one can see, the learning models that used variable selection procedures based on linear correlation (Pearson and Spearman) were more likely to present overfitting, as the R² values for the validation set were negative. Lower R² values for the Fault F-I validation set were expected, however, because this set was much larger than the test set and, chronologically, the most distant from the fault event, incorporating dynamic behaviors that possibly had not been captured in the training set. As might also be expected, poor R² values were obtained in the test sets because of the presence of many faulty data. Considering the average performance of the four regressors, the highest FDR values and lowest FAR values were achieved when the mutual-information-based variable selection method was used. This can possibly be explained by the ability of the mutual information metric to capture nonlinear associations among the variables, which the Pearson and Spearman correlations are unable to detect.
When compared to the reference performance, the methods based on linear correlations (Pearson and Spearman) led to worse results in the three faults, while the method based on mutual information was better in the three cases.
The regressor performance obtained when wrapper-based variable selection procedures were used is summarized in Table 8. It is possible to observe that Fault F-III was properly detected with all analyzed wrapper-based variable selection methods. On the other hand, Fault F-I was not detected, except when the Random Forest model was used, while the best detections of Fault F-II were achieved with the variable selection procedure based on forward feature selection (Lasso), followed by backward feature elimination (Lasso). As might be expected, high FDR and R² values were obtained for the training and validation sets when the learning model used in the wrapper method coincided with the regressor model (Random Forest). Another aspect that must be highlighted regards the general performance of the wrapper methods, which achieved higher R² values than the filter methods. Regressors trained with wrapper methods showed a better ability to correctly model new data (generalization), as observed in the regression scores of the validation sets. In addition, only the wrapper methods that used the Lasso learning model exceeded the reference performance in all fault detection scenarios.

Table 9 presents the regressor performance obtained with embedded-based variable selection procedures. On the whole, although Fault F-III was always properly identified, these regressors showed lower fault detection rates than those described previously for the wrapper-based variable selection approaches. Besides, the selection procedures based on random forest schemes provided poorer models that were subject to overfitting. In general, the learning models that considered a variable selection step based on embedded methods did not show substantial improvements when compared to the reference performances.

Although the variable selection methods based on causal relationships can be classified as filter methods, Table 10 shows the fault detection results obtained with these methods evaluated independently. As one can see, the causality-based approaches outperformed the other methods for most of the faults in terms of selecting the subset that produces the best regression accuracy. These approaches also led to the best R² values for the validation set, generating learning models with better generalization and providing, on average, the highest FDR and lowest FAR values among all methods applied here. This better generalization capability proved to be fundamental in the analyzed context, because the process is likely to be subject to dynamic changes during the operation time as a function of variations in the plant operating conditions. In particular, the PCMCI procedure with the PC-stable stage using partial correlation and the MCI stage using conditional mutual information proved to be the most suitable procedure for the detection of Faults F-II and F-III, while the best Fault F-I detection performance was achieved using the PCMCI procedure with partial correlation metrics in both stages.

Figure 2 shows the predictions for Fault F-I obtained with PCMCI (partial correlation). For all analyzed regressors, it is possible to observe good R² values for the training and validation sets and a clear divergence between the measured data and the respective predictions in the test set near the failure event.
Figure 3 presents the respective SPE index plot, where the regression residues in the training and validation sets remained below the control limit, except for some sporadic points that were responsible for the observed FAR rates. This control limit was consistently exceeded in the reported fault event, confirming the capacity of these models for fault detection. As one can see, the abnormality was detected before the fault event reported by the operation, which explains the poor FDR values and the uniform FAR values obtained by all regressors, regardless of the variable selection algorithm.

Finally, the prediction results and SPE index behavior in the Fault F-III detection scenario are presented in Figures 6 and 7, respectively. As previously pointed out, this fault was detected appropriately, despite the oscillatory character of the predicted variable. Moreover, the event reported by the operation seems to have occurred before the actual manifestation of the failure; consequently, the maximum reachable FDR corresponds (approximately) to the value of 63% reported in Tables 7-10.

An important aspect of the discussion about variable selection methods based on causality is the insertion of lagged variables in the analysis, which derives naturally from the discovery and reconstruction of lagged links. The inclusion of these time-shifted variables can allow improved modelling of the dynamic behaviour of the process trajectories while using the same detection model [56-58].
Mutual information, which was applied in the filter methods, is a metric similar to those used in the causal methods. However, that methodology determines the relationships between pairs of variables, neglecting the effect of the remaining variables on the pair. The conditional approaches are therefore more appropriate, as they attempt to isolate the effects of the individual variables during the discovery of causal connections. Basically, while one approach looks for (nonlinearly) correlated variables, the other looks for causal variables.
As previously highlighted, the lagged conditional independence discovery procedures search for the causal connections of the predicted variable Y. Hence, the use of lagged variables appears naturally when defining the subset of selected variables, as illustrated in the sketch below.
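A minimal sketch of how such a lagged input matrix might be assembled from a list of selected (variable, lag) parents; the helper function name and the parent list are hypothetical.

```python
# Hedged sketch: building a time-lagged input matrix from selected (variable, -lag) parents,
# as returned by the causal discovery stage; names and the parent list are illustrative.
import numpy as np

def build_lagged_inputs(data, parents, target_col, tau_max):
    """Return (X, y): lagged columns of `data` listed in `parents`, aligned with the target."""
    T = data.shape[0]
    rows = range(tau_max, T)                                   # drop the first tau_max samples
    X = np.column_stack([data[[t + lag for t in rows], var]    # lag is negative (e.g. -2)
                         for var, lag in parents])
    y = data[list(rows), target_col]
    return X, y

rng = np.random.default_rng(8)
data = rng.normal(size=(1000, 6))
parents = [(3, -2), (5, -1), (0, -1)]                          # e.g. parents of variable 0 found by PCMCI
X, y = build_lagged_inputs(data, parents, target_col=0, tau_max=5)
print(X.shape, y.shape)                                        # (995, 3) (995,)
```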

Performance on Benchmark Case
As described previously, the size of the selected subset was kept constant, as determined through PCA analysis, with 15 components being required to describe 99.5% of the cumulative variance. The complete PCA analysis is shown in Figure A5 in Appendix A.3. Table 11 shows the regressor performance obtained when applying the most prominent variable selection procedure of each class. According to the FDR and FAR metrics, the detection of Fault IDV(1) was better when the PCMCI approach was employed, while Fault IDV(5) was correctly detected with similar performance by the PCMCI and l1-regularization (Lasso) methods. The R² values obtained in the test sets reflect the fact that these sets are composed mostly of non-faulty data.
The better performance of the causal methods for variable selection in this case study can be explained by the inclusion of lagged variables for model training, which according to the literature [59,60], can exert a determining role in the detection of failures in the TEP process.
It is worth mentioning that the use of the variable selection methods (except the causal methods) did not lead to notable improvements in relation to the reference performance. Hence, variable selection does not constitute a limiting step for the detection of the analyzed faults in the TEP case study, as the process variables are more causally interconnected and the redundant variables do not interfere drastically with the performance of the models. However, the selection of variables allows working with less complex and computationally faster models. Moreover, it must be noted that the use of causal methods for the selection of relevant variables did improve the analyzed performance, being recommended for more demanding implementations.

Analysis of Selected Variables
The oil and gas fiscal metering process constitutes an interesting case study because it involves a large number of variables measured along the different sections of the process, making it difficult to define a priori the most relevant variables for the prediction of a particular variable of interest. Intuitively, it is expected that this subset will contain variables from the same plant section to which the prediction target variable belongs and that it will reflect phenomenological characteristics of the process. In this context, Figures A9-A12 in Appendix A.5 show the subsets selected by the most outstanding selection methods (by class) according to the previously reported results. These selections correspond to the training set used to detect Fault F-I, where the predicted variable corresponded to FIT-02B-A (gas flow rate in fiscal meter 2B in Section A of Figure 1). The process variables and respective tags are listed in Table A3 in Appendix A.6.
The rankings of relevant variables determined by the distinct variable selection methods show PDIT02B-A (differential pressure in fiscal meter 2B in Section A of Figure 1) as the most important measurement, which is consistent with the physical principle of the fiscal meter measurement. However, the causal methods were the ones that included in their selected subsets the largest number of variables physically adjacent to the monitored fiscal meter, better representing the phenomenological nature of the process.
On the other hand, in high-dimensional systems, the causal characterization methods are useful not only for fault diagnosis [61-63], but also for generating better models for fault detection, as shown in this work. In addition, the causal networks reconstructed from the time series [36] keep some causal properties that can be intuitively extracted from the respective process flow diagram (PFD).
Another representative performance metric is the mean absolute error (MAE). Figure 8 shows the MAE obtained by the different regressors for Fault F-I in the validation set, considering all the variable selection methods studied here. As one can see, the MAE values were lower when the variable selection methods based on causality were used. It is important to note that better adjustments and performance could possibly be achieved if hyperparameter optimization stages were carried out during the training procedures. However, as the present work emphasized the effect of the variable selection procedures, and not the effect of the hyperparameters on the regression model performance during fault detection, optimization of the hyperparameters was not pursued. Finally, Table 12 shows the CPU times demanded by each method during the selection of variables for the detection of Fault F-I. It can be observed that the causal methods were the slowest ones, given the more involved computation of the causal links. However, considering that the variable selection stage must be performed before the training stage, this computational demand does not constitute a limiting factor for eventual online applications.

Final Considerations
In general, all fault detection metrics showed improvements when any of the variable selection approaches studied in this work was applied. Moreover, these approaches reduced the dimensionality of the fault detection problem, allowing simpler learning models to be built, which is a desired attribute in online monitoring.
Variable selection methods based on causality led to better fault detection performance, since the included time-lagged variables help to model the dynamic behavior of the process trajectories. Furthermore, as discussed in Section 4.3, the selected variable subsets kept causal associations with respect to the predicted variable, reflecting phenomenological characteristics of the process.
The obtained results showed that the wrapper-based methods prevail over the filter-based methods in terms of prediction accuracy, as similarly observed in the literature [3,6]. The causality methods, however, can be classified as filter-based methods, because the variable selection engine is independent of the regressor model. This independence explains the homogeneity of the fault detection metrics observed for the four learning models across the studied fault scenarios.
The fault detection scenarios corresponding to the real industrial case provided the opportunity to work with issues rarely found in simulated or benchmark cases, such as high dimensionality, real noisy measurements, and divergences between the reported fault events and the actual manifestation of the failures.

Conclusions
In the present paper, variable selection methods based on causality were implemented and analyzed, and the obtained performances were compared with those of several other filter-based, wrapper-based, and embedded-based variable selection methods. Two case studies were presented, corresponding to a simulated benchmark (the Tennessee-Eastman process) and an actual industrial case (a fiscal gas metering station). As shown through many examples, all learning algorithms considered in the present work provided better regression and fault detection performances when variable selection procedures based on causality were used. In particular, the variable selection approaches based on causality establish the causal connections of the predicted variable, also allowing the determination of the respective lagged conditional dependences. Hence, the subset of selected variables reflects phenomenological characteristics of the process, as became evident in the industrial case study. Although the variable selection methods based on causality were more computationally demanding, the use of these methods in monitoring scenarios that involve a large number of variables is highly recommended, especially because variable selection can be performed as a data pre-processing stage and thus does not compromise the characteristic computation time of the final application.
Let us also discuss some directions for future research. In the present work, we proposed the use of causal discovery approaches in variable selection methods aimed at assisting process fault detection. As discussed above, these causal methods are based on the estimation of lagged conditional dependences of the measured variables. In high-dimensional systems, estimating these dependences involves the computation of complicated probability density functions, which can lead to inaccurate or spurious estimations of the causal relationships [37]. Future work should include the use of rule-based frameworks derived from expert knowledge [64], such as equipment adjacency in the process plant, aiming at the incorporation of physical and operational restrictions into the estimation of the causal relationships. Finally, it would be useful to propose other benchmark cases in order to test these causal methods in feature/variable selection problems with high-redundancy datasets, aiming to evaluate the robustness of the approach in the presence of correlated measurements.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; nor in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript:

MI    Mutual information
JMI   Joint mutual information
CMI   Conditional mutual information
DMI   Dynamic mutual information
TE    Transfer entropy
MCI   Momentary conditional independence
DAG   Directed acyclic graph
PDF   Probability density function
TEP   Tennessee-Eastman Process
PCA   Principal component analysis
SPE   Squared prediction error
FDR   Fault detection rate
FAR   False alarm rate
MAE   Mean absolute error
CCA   Canonical correlation analysis
RR    Ridge regression
MLPR  Multilayer perceptron regressor
RF    Random forest regressor
PFD   Process flow diagram

Appendix A.1. Schematic Example of the PC and PC-Stable Algorithms

Consider a Directed Acyclic Graph (DAG) G consisting of a set of nodes or vertices V (circles) and a set of edges (lines) E ⊆ V × V. A set of variables and its causal interactions can be represented, respectively, by the nodes and edges of the DAG, where the direction and strength of these interactions are obtained as described in Section 2.3.
The PC algorithms are based on conditional independence evaluations that establish the existence of a link (edge) between the variables (nodes) X and Y if, given a set of conditions Z, X and Y are not independent conditionally on Z.

An intuitive approach to construct the complete network map (DAG) consists of the exhaustive search of all possible conditions Z to determine whether X and Y are conditionally connected. However, this is a computationally inefficient method, making the PC and PC-stable algorithms interesting approaches to address causal link characterization problems in high-dimensional systems.
The PC algorithm starts from a fully connected network. For each edge, the algorithm tests whether the connected pair of variables, X and Y, is independent conditionally on a subset Z of the neighbours/parents of X and Y, and removes or retains the respective edge based on the result. The conditional independence tests (here denoted MCI(·, ·|·)) are applied by levels, according to the size d of the conditioning set Z. At the first level (d = 0), all pairs of nodes (variables) are tested conditioning on the empty set Z. The algorithm may remove some of the edges (links) and only tests the remaining edges at the next level (d = 1). The size of the conditioning set, d, is progressively increased until d is greater than the size of the adjacency sets of the tested nodes. Figure A1 shows an example of the PC algorithm applied to a hypothetical dataset with four nodes, A, B, C, and D [40]. As one can see, three edges remain after the first level of tests (i.e., Z = ∅). At the next level (d = 1), each remaining edge is tested conditioning on each neighbour/parent of the tested variables. For example, for the edge A−B there are, at most, two tests: conditioning on C and conditioning on D.

In particular, if the test MCI(A, B|C) returns conditional independence, the edge is removed from the graph and the algorithm continues testing the other edges.

Conditional independence tests are likely to produce inaccurate values in high-dimensional systems. Furthermore, in the PC algorithm, removing or retaining an edge changes the conditioning sets Z of the other nodes, since the network graph is updated dynamically. Therefore, the resulting network graph depends on the order in which the conditional independence tests are performed [40]. For example, in Figure A1, when the test MCI(A, B|C) returns conditional independence and, consequently, the edge A−B is removed, the set of neighbours/parents of A is also updated to adj(A) = {C, D}; therefore, when testing the edge A−C, the conditioning set no longer contains B. If the test MCI(A, B|C) wrongly removes the edge A−B, the procedure misses the test MCI(A, C|B), which might have removed the edge A−C. On the other hand, if the procedure tests MCI(A, C|B) first and removes the edge A−C, it ends up with a different network.

Figure A1. Example of the application of the PC algorithm to a hypothetical dataset [40].
Figure A2. Example of the application of the PC-stable algorithm to a hypothetical dataset [40].

Colombo and Maathuis [39] proposed the PC-stable algorithm to obtain a stable output skeleton that does not depend on how the variables are ordered in the input dataset. In this method, the neighbour/parent (adjacent) sets of all nodes are evaluated and kept unchanged at each particular level, preventing edge deletion from affecting the conditioning sets of the other nodes. Figure A2 shows the respective network updates at each level when the PC-stable algorithm is applied.

Appendix A.2. Tennessee Eastman Process

Figure A3. A schematic diagram of TEP.

Appendix A.6. Process Variables

The variables of the oil and gas metering process and their respective tags are listed in Table A3.