1. Introduction
In a complex large-scale process system, components and variables are interconnected through material flows and information flows. Once a fault occurs, it may easily propagate among units and cause negative impacts in broader areas, which may lead to serious consequences and compromise process safety. Therefore, it is important to detect and locate the root causes of faults as early as possible. Causality inference is a process to infer Cause-Effect relations between variables, typically in complex systems, and it is commonly used for root cause analysis in large-scale process industries. A variety of causality analysis techniques have been developed and shown to be effective for root cause diagnosis [
1].
Existing techniques for causality inference can be generally divided into two types, namely, process knowledge-based methods and data-driven methods [
1]. The former obtains connectivity and causality from prior knowledge, such as process topology and first-principle models, and converts the results into computer-accessible formats, such as the adjacency matrix [
2] and signed directed graph [
3]. The latter captures Cause-Effect relations from sufficient process data; commonly used techniques include cross-correlation analysis (CCA) [
4], Granger causality analysis (GCA) [
5], transfer entropy (TE) [
6], and Bayesian networks (BN) [
7,
8]. References [
9,
10,
11] compared the strengths and weaknesses of these techniques and also proposed suitable situations for their applications. In any case, to achieve better performance in root cause diagnosis, especially when abnormal situations are associated with unknown faults or multiple faults, integrated methods that combine process data analysis with process knowledge extraction were proposed in [
12] and demonstrated to be quite effective.
Among the data-driven causality inference methods, transfer entropy (TE) provides an information-theoretic method for causality measurement that is suitable for both linear and nonlinear processes. TE was first proposed by Schreiber as a measure of information transfer [
6]. As for whether TE measures causal relationships, there exists controversy. References, such as [
13,
14], discussed the distinctions between information transfer and causal effects. According to [
14], information flow is a primary tool to establish the presence of causal relations; where this is not possible, the complete transfer entropy is an alternative inference technique. As TE can effectively distinguish driving and responding elements and detect asymmetry in the interaction of subsystems, it has been widely studied and used for causality inference.
Reference [
15] utilized TE to infer causal relations for the identification of the propagation direction of disturbances. In [
16], kernel principal component regression and transfer entropy were combined to conduct root cause diagnosis. In addition, some variants or improvements have been proposed to extend TE. For instance, in order to distinguish the direct or indirect causal relations, partial transfer entropy [
17] and direct transfer entropy [
18] were developed. Reference [
19] proposed the transfer zero-entropy for causality analysis based on the zero-entropy and zero-information without assuming a probability space. Additionally, symbolic transfer entropy [
20] and trend transfer entropy [
21] extended the TE to symbols or trends of time series instead of original continuous values. The multiple-unit symbolic dynamics and transfer entropy were used to analyze the dynamic causal relationships in longitudinal data [
22]. A symbolic dynamic-based normalized transfer entropy (SDNTE) was proposed for the root cause fault diagnosis of multivariate nonlinear processes [
23].
In the field of alarm root cause analysis, TE was adapted to analyze Cause-Effect relations among binary-valued alarm variables [
24]; moreover, a Bayesian network based on active dynamic transfer entropy (ADTE) was proposed to establish an accurate alarm propagation network during an alarm flood [
25]. For oscillation diagnosis, a workflow using TE was proposed to provide a robust procedure for accurately identifying the oscillation propagation path [
26]. In addition, TE and Granger causality were tested on an industrial case study of a plant-wide oscillation, and how to choose between the two methods in actual industrial applications was explained [
11].
As shown by the extensive studies above, transfer entropy has become a prevalent and effective way of capturing Cause-Effect relations in complex systems. However, a major problem with TE lies in its high computational complexity, which prevents its application in many real systems, especially for real-time tasks such as online root cause diagnosis. According to [
15,
18,
27], the computational complexity of TE is mainly due to the estimation of probability density functions and the calculation of the transfer entropy in a high-dimensional embedding space. Motivated by the above problem, this paper proposes an improved method for causality inference based on transfer entropy and information granulation. The calculation of transfer entropy is improved with a new framework that integrates information granulation as a critical preceding step; moreover, a window-length determination method is proposed based on delay estimation, so as to conduct appropriate data compression using information granulation. The effectiveness of the proposed method is demonstrated by both a numerical example and an industrial case with a two-tank simulation model. As shown by the results, the proposed method can reduce the computational complexity significantly while retaining a strong capability for accurate causality detection.
The advantages of the proposed method lie in two aspects: (1) Compared to Cross-Correlation [
4] and Granger Causality [
5], which work only for linear causal relations, the proposed method inherits the advantage of TE in capturing non-linear causal relations and, thus, can be applied to broader fields. (2) Compared to traditional TE methods [
14,
15,
16,
17,
18], the proposed method has much higher computational efficiency on account of the discretization and information granulation as preprocessing steps, and thus, it can be used for real-time tasks (e.g., online root cause diagnosis) that are sensitive to calculation time.
The rest of this paper is organized as follows.
Section 2 presents the preliminaries of TE and analyzes the computational complexity problem.
Section 3 proposes the improved calculation of TE.
Section 4 provides case studies to demonstrate the effectiveness of the proposed method, followed by concluding remarks in
Section 5.
2. Preliminaries on Transfer Entropy
Measures for quantifying dependency for bivariate or multivariate time series include the correlation coefficient, cross-correlation, and mutual information [
28]. Mutual information quantifies the dependency from the joint probability density function of two random variables. It measures the reduction of uncertainty of a random variable based on the knowledge of a second variable, but cannot measure its directionality or causality. The information theory measure of transfer entropy proposed in [
6] takes the concept of mutual information a step further. Transfer entropy is an asymmetric measurement method based on information theory. By calculating the conditional probability function and designing a reasonable directionality measure, the causal topology is constructed to facilitate root cause diagnosis and propagation path identification.
Based on the concept of information theory, the measure of transfer entropy proposed by Schreiber [
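6] extracts the amount of information transferred from variable $x$ to $y$ as follows:
$$T_{x \to y} = \sum p\left(y_{i+h}, \mathbf{y}_i^{(k)}, \mathbf{x}_i^{(l)}\right) \log \frac{p\left(y_{i+h} \mid \mathbf{y}_i^{(k)}, \mathbf{x}_i^{(l)}\right)}{p\left(y_{i+h} \mid \mathbf{y}_i^{(k)}\right)} \tag{1}$$
where $p(\cdot)$ indicates the joint or conditional probability density function (PDF); $k$, $l$ are the orders of variables $y$, $x$; $h$ is the prediction horizon; $\mathbf{y}_i^{(k)} = [y_i, y_{i-\tau}, \ldots, y_{i-(k-1)\tau}]$ and $\mathbf{x}_i^{(l)} = [x_i, x_{i-\tau}, \ldots, x_{i-(l-1)\tau}]$; $\tau$ is the sampling period.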
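As an illustration of the transfer entropy definition, the following is a minimal sketch for the first-order case ($k = l = 1$). It swaps in a histogram (plug-in) estimator on equal-width bins instead of the kernel density estimator used in this paper, purely to keep the example short and dependency-free. The function names, the bin count `n_bins`, and the use of base-2 logarithms are assumptions made for this sketch.

```python
import math
from collections import Counter


def discretize(series, n_bins):
    """Map each value to an equal-width bin index in [0, n_bins - 1]."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant series
    return [min(int((v - lo) / width), n_bins - 1) for v in series]


def transfer_entropy(x, y, h=1, n_bins=4):
    """Plug-in estimate of first-order TE from x to y, in bits.

    ASSUMPTION: histogram estimator with k = l = 1, not the KDE of the paper.
    """
    xs, ys = discretize(x, n_bins), discretize(y, n_bins)
    # Each triple is (future of y, past of y, past of x).
    triples = [(ys[i + h], ys[i], xs[i]) for i in range(len(ys) - h)]
    n = len(triples)
    c_joint = Counter(triples)
    c_past_xy = Counter((yp, xp) for _, yp, xp in triples)
    c_fut_past = Counter((yf, yp) for yf, yp, _ in triples)
    c_past_y = Counter(yp for _, yp, _ in triples)

    te = 0.0
    for (yf, yp, xp), count in c_joint.items():
        p_joint = count / n
        p_cond_full = count / c_past_xy[(yp, xp)]          # p(y_fut | y_past, x_past)
        p_cond_self = c_fut_past[(yf, yp)] / c_past_y[yp]  # p(y_fut | y_past)
        te += p_joint * math.log2(p_cond_full / p_cond_self)
    return te
```

For a pair of series in which $y$ is driven by the past of $x$, calling `transfer_entropy(x, y)` and `transfer_entropy(y, x)` should expose the asymmetry that the measure is designed to capture.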
In order to remove the indirect causality caused by the intermediate variables or the false causality caused by the common variables, the direct transfer entropy (DTE) proposed in [
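18] can be calculated, i.e.,
$$D_{x \to y} = \sum p\left(y_{i+h}, \mathbf{y}_i^{(k)}, \mathbf{z}_i^{(m)}, \mathbf{x}_i^{(l)}\right) \log \frac{p\left(y_{i+h} \mid \mathbf{y}_i^{(k)}, \mathbf{z}_i^{(m)}, \mathbf{x}_i^{(l)}\right)}{p\left(y_{i+h} \mid \mathbf{y}_i^{(k)}, \mathbf{z}_i^{(m)}\right)} \tag{2}$$
where $z$ represents the intermediate variable, $m$ denotes the order of the intermediate variable $z$, and $\mathbf{z}_i^{(m)} = [z_i, z_{i-\tau}, \ldots, z_{i-(m-1)\tau}]$.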
Transfer entropy is effective in measuring the causality for both linear and nonlinear processes. A major problem hindering the application of TE lies in its high computational complexity, which is mainly attributable to the estimation of probability density functions (PDFs) and the calculation of TE in a high-dimensional embedding space. In this study, the required data type is a continuous-valued time series, and thus, PDF estimation is a mandatory step. There are many methods for PDF estimation, such as plug-in estimators, kernel density estimators, and nearest-neighbor-based estimators. The run time of estimating TE may vary depending on the estimator chosen. The improvement of PDF estimators is not investigated here; as presented in many related studies [
15,
29], the commonly used kernel density estimator is exploited for PDF estimation. The focus of this paper is to investigate the calculation of TE in a high-dimensional embedding space, and to put forward a corresponding solution to reduce the computational complexity.
The total computational complexities for TE and DTE are
and
, respectively [
18], where
N is the sample size. Clearly, the computational complexity is mainly determined by two factors, namely, the sample size and the order. In view of this, improving the efficiency of TE requires addressing two problems: (1) how to reduce the sample size processed by TE, and (2) how to reduce the orders of the cause and effect variables. The key is to preserve the accuracy of causality inference while addressing these two problems. To this end, this work improves the transfer entropy with a new framework that integrates information granulation as a critical preceding step, which conducts data compression and, thus, uses information granules in the TE calculation. The details of the proposed method are presented in the next section.
3. The Proposed Method
This section presents the improved TE based on information granulation. Specifically, this section provides the framework of the proposed method, the data abstraction via information granulation, the calculation of information granulation-based TE, and the determination of the granulation window size.
3.1. The Framework
Given a pair of time series
and
, the objective is to infer their causal relation using TE. As discussed in
Section 2, to improve the efficiency of TE, the effective solution is to reduce the sample size processed by TE and to reduce the orders of cause-and-effect variables in the calculation of TE. Accordingly, this work proposes the following framework, which integrates TE and information granulation for causality inference. A diagram is shown in
Figure 1 to present the framework of the proposed method.
First, to reduce the sample size, the time series should be compressed into a shorter sequence consisting of representative values in consecutive time windows. However, the window length is a critical parameter influencing the final analysis result. If the time series is compressed too much with a large window length, the computation is reduced, but useful information might be lost, which may lead to erroneous conclusions in causality inference.
Second, to reduce the orders of the cause-and-effect variables, only the first-order TE needs to be used, where the orders of both the cause and effect variables are one. However, applying the first-order TE requires that the delay between the two variables be 1; otherwise, it might give wrong causal relations. Therefore, properly compressing the data in the previous step is critical.
The information granule obtained after granulation reduces the scale of the original data, and the amplitude also changes to a certain extent. It has been learned from previous work that this may lead to a biased conclusion. For example, Refs. [
30,
31,
32] discussed the influence of sampling rate and time scale on causal inference through the test of data such as EEG signals, and explained that this might change the causal relations. In addition, the impact of data filtering and amplitude changes on causal analysis was investigated in [
29,
33,
34,
35]. An unreasonable sampling rate and a changed series will lead to false causality. Motivated by the investigation in these previous studies, a systematic method of information granulation with delay estimation is proposed for data processing. The comparison in the case study in
Section 4 demonstrates the rationality of the method, i.e., given a proper estimated window size, the proposed method will ensure the correctness of the detected causality.
3.2. Data Abstraction via Information Granulation
The information granulation of time series is the basis for compressing the scale of time series data and using the compressed data for subsequent time series analysis, interpretation, and modeling. The information granulation of time series specifically includes two main steps (shown in
Figure 2):
Discretization: Given a time series , K non-overlapping subsequences are obtained by discretization. The data in each subsequence can be accurately described by a simple model;
Information granulation for each subsequence: The information granulation operation is performed on subsequence (where , and w indicate the window length), so as to form a time-related information granule that represents the data characteristics of this subsequence.
After the above two steps, the original time series is converted into the corresponding granular time series , where is the kth information granule.
In the past, various IG methods were proposed, such as the fuzzy set-based IG [
36,
37], clustering-based IG [
36] and intelligent optimization-based IG [
38,
39]. Among them, the amount of data contained in clustering-based IG is limited, and information loss is large [
36]; the intelligent optimization-based IG is computationally time-consuming, which conflicts with the goal of reducing computational complexity in this study. Therefore, the fuzzy set-based IG is adopted since it makes use of more effective data [
36] and has a fast calculation speed.
Zadeh [
40] gave a general definition of fuzzy information granules. It is represented, using fuzzy sets, as:
$$g \triangleq (x \text{ is } G) \text{ is } \lambda \tag{3}$$
where $x$ is a variable in the universe $U$; $G$ is a convex fuzzy set of $U$, described by a membership function $\mu_G$; $\lambda$ is the probability. The core issue of the information granulation method based on fuzzy sets is to determine a membership function $\mu_G$. The representation of information granules produced by the fuzzy set-based method is closely related to $\mu_G$. The triangular membership function is given as:
$$\mu_G(x; a, c, b) = \begin{cases} 0, & x < a \\ \dfrac{x - a}{c - a}, & a \le x \le c \\ \dfrac{b - x}{b - c}, & c < x \le b \\ 0, & x > b \end{cases} \tag{4}$$
where $a$, $c$, and $b$ are the parameters of the triangular membership function.
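The triangular membership function can be sketched in a few lines. This is an illustrative implementation under the usual convention that $a \le c \le b$, with $a$ and $b$ bounding the support and $c$ marking the core; the function name and the handling of degenerate cases ($a = c$ or $c = b$) are assumptions of this sketch.

```python
def triangular_membership(x, a, c, b):
    """Membership grade of x in a triangular fuzzy set with support [a, b] and core c."""
    if x < a or x > b:
        return 0.0
    if x == c:
        return 1.0  # also covers degenerate sets where a == c or c == b
    if x < c:
        return (x - a) / (c - a)  # rising edge
    return (b - x) / (b - c)      # falling edge
```

The grade rises linearly from 0 at $a$ to 1 at the core $c$, then falls linearly back to 0 at $b$.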
The information granulation method based on fuzzy sets developed in [
41] is employed. A good granulation process should satisfy two requirements: (i) the raw data are fully expressed by information granules; (ii) information granules should become specific enough. To meet these requirements, a function
with respect to the membership function
is constructed to describe the performance of the granulation process, i.e.,
where
, and maximizing
can meet the requirement (i);
, and minimizing
can meet the requirement (ii). Apparently, in light of the aforementioned requirements,
has to be maximized.
Then, the fuzzy information granules can be expressed as
, where
and
are the supports, and
is the core. By calculating the three parameters
a,
c, and
b of the triangle membership function in Equation (
4), the corresponding
are obtained. The core of the information granule is calculated by:
which is the median of subsequence. According to [
41], taking into account the triangular membership function, when
is maximum,
and
can be directly calculated by:
where
denotes the largest integer not exceeding
;
represents the
j-th sample in the
i-th subsequence
. In addition, when
w is an even number,
; otherwise,
. Through the above calculation, the granular time series is obtained for subsequent analysis.
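The two granulation steps (discretization into non-overlapping windows, then one triangular granule per window) can be sketched as follows. The core is the window median, as stated above. The bounds are computed with a commonly used closed form, mirroring the mean of the lower and upper halves of the sorted window about the core; this is an assumption of the sketch and may differ in detail from the expressions of [41], in particular in how even and odd window lengths are treated. The function name `granulate` is hypothetical.

```python
import statistics


def granulate(series, w):
    """Return one (a, c, b) triangular granule per complete window of length w."""
    if w < 1:
        raise ValueError("window length must be at least 1")
    granules = []
    # Step 1: discretization into non-overlapping windows (incomplete tail dropped).
    for start in range(0, len(series) - w + 1, w):
        window = sorted(series[start:start + w])
        # Step 2: granulation of the window.
        c = statistics.median(window)  # core: median of the subsequence
        half = w // 2
        if half == 0:  # w == 1: the granule collapses to a single point
            granules.append((c, c, c))
            continue
        # ASSUMPTION: bounds mirror the mean of each half about the core.
        a = 2 * statistics.mean(window[:half]) - c
        b = 2 * statistics.mean(window[-half:]) - c
        granules.append((a, c, b))
    return granules
```

A series of length $N$ is thus reduced to roughly $N / w$ granules, each carrying three values, which is the source of the data compression exploited in the subsequent TE calculation.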
Remark 1. Processing the original data by granulation yields granules with a greatly reduced data length. The advantage of granulation as a preprocessing step is that it can not only reduce the size of the data, but also suppress noise effectively. However, granulation may reduce the amplitude resolution, alter the value of the TE estimates, and even change the direction of the detected causal relation. This may happen when the granular time series does not retain the dynamics and the variational trend of the original data. The key lies in the selection of a proper window length in discretization. Only if the window length in discretization is set properly can the granular time series retain the dynamic characteristics of the original data and keep the main variational trend. If the window length is too small, the dynamics are retained but the data compression is not effective. By contrast, if the window length is too large, the granular data may lose the dynamics and lead to erroneous conclusions in causality inference. To achieve maximum data compression and also retain the dynamics, this work proposes taking the delay between two time series as the window size. After compressing the data via granulation, the dynamics and the variational trend are retained in the one-sample history. The causal relation reflected by such a one-sample history can be measured by the first-order TE. Thus, such a preprocessing approach will not influence the TE estimates too much, and can guarantee that the detected causal relation is consistent with the one detected from the original data while making the calculation much faster. The discussion on the window length determination is presented in Section 3.4. The validity of the approach is verified by extensive simulations. Further, in the case studies, the proposed method was compared with the traditional TE, and the causal relations were found to be consistent.
3.3. Calculation of the Information Granulation-Based Transfer Entropy
Given the granular time series and , is the ith information granule. To avoid the problems caused by information loss during information granulation, the calculation of TE exploits all three items of the information granule and takes the average as the final TE result, which is intended to ensure the reliability of the result.
Proper information granulation can offset the effect of delay and reduce the delay between two variables to only one sample. Therefore, only the first-order situation needs to be considered when calculating TE here. That is,
and
. It should be noticed that delay embedding is usually used in causality inference so as to include the relevant past of the time series in the estimate of TE. References [
42,
43] provided systematic methods for finding appropriate embedding lengths. This is helpful for getting more accurate estimates of TE for causal relations reflected by more than one sample history. However, in this study, the information granulation needs to ensure that the data are compressed as much as possible, while the dynamics are still retained in the granular time series. Accordingly, the multi-sample history is compressed to a one-sample history, such that the delay between the granular time series is 1. Thus, the calculation of TE only needs to consider the first order, rather than high orders. Therefore, the formula for information granulation-based TE, from
x to
y, is given by:
where
or
represent the
ith sample in the
jth dimension of the information granule, which is obtained by information granulation for
y or
x. The kernel density estimator is applied in this paper to estimate the PDFs. The three dimensions are used to calculate TE using Equation (
8), and the average of the three results is taken as the final result.
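The averaging over the three dimensions of the granules can be sketched as below. The sketch assumes that each granule is stored as a tuple `(a, c, b)`, that the $j$th dimension of $x$ is paired with the $j$th dimension of $y$, and that a first-order TE estimator is supplied by the caller as `te_fn(x, y)`; all names are hypothetical.

```python
def ig_transfer_entropy(granules_x, granules_y, te_fn):
    """Average a first-order TE estimate over the three granule dimensions.

    granules_x, granules_y: equally long lists of (a, c, b) tuples.
    te_fn: callable taking two sequences and returning a TE estimate (assumed).
    """
    if len(granules_x) != len(granules_y):
        raise ValueError("granular time series must have the same length")
    estimates = []
    for j in range(3):  # lower bound, core, upper bound
        x_j = [g[j] for g in granules_x]
        y_j = [g[j] for g in granules_y]
        estimates.append(te_fn(x_j, y_j))
    return sum(estimates) / 3.0
```

Passing the estimator in as an argument keeps the averaging step independent of the PDF estimation method, so a kernel density based estimator can be substituted without changing this function.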
In order to remove the indirect causality caused by intermediate variables or the false causality caused by common variables, the DTE can be calculated. In the causal network detected by TE, some causal relations could be indirect, arising through the influence of intermediate or confounding variables. For instance, given a pair of variables x and y holding a causal relation, if there is a third variable z forming a triangle network (i.e., z is the intermediate or confounding variable holding causal relations with both x and y), it is necessary to detect whether the causal relation between x and y is direct, or indirect through a pathway via z. As a result, the calculation of DTE can simplify the causal network and yield more accurate results.
The granular time series of an intermediate variable
z is obtained and denoted by
. Analogous to DTE [
18], the information granulation-based DTE is defined as:
where
represents the
ith sample in the
jth dimension of the information granule of
z. The three dimensions are used to calculate DTE using Equation (
9), and the average of the three results is taken as the final result.
To determine whether a causal relation holds, the obtained TE needs to be compared with a threshold. An effective method to determine the threshold is the Monte Carlo method based on surrogate data. IG-based TEs are calculated using surrogate data that are generated randomly [
44], and then their mean and standard deviations are obtained to acquire the threshold
[
15,
45]. By comparing
with the threshold
, the causal relation between
x and
y is determined. If
, it indicates that there is a causal relation from
x to
y; otherwise, it suggests no causality from
x to
y. Analogously, IG-based DTEs are calculated using surrogate data, and then their mean and standard deviations are obtained to acquire the threshold
. If
, there is a direct causal relationship from
x to
y based on
z; otherwise, there is no direct causality from
x to
y.
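The surrogate-based threshold can be sketched as follows. Two choices here are assumptions of the sketch rather than details taken from the source: surrogates are generated by randomly shuffling the candidate cause series, which destroys its temporal relation to the effect series, and the threshold is taken as the mean plus `n_sigma` standard deviations of the surrogate TE values, with `n_sigma` left as a tunable parameter. The estimator `te_fn` is supplied by the caller.

```python
import random
import statistics


def surrogate_threshold(x, y, te_fn, n_surrogates=50, n_sigma=3.0, seed=0):
    """Monte Carlo threshold from TE values computed on shuffled surrogates.

    ASSUMPTION: shuffling as the surrogate generator and mean + n_sigma * std
    as the threshold form; adjust both to match the method actually adopted.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(n_surrogates):
        x_surrogate = list(x)
        rng.shuffle(x_surrogate)  # break the temporal link from x to y
        values.append(te_fn(x_surrogate, y))
    return statistics.mean(values) + n_sigma * statistics.pstdev(values)


def has_causal_relation(x, y, te_fn, **kwargs):
    """True if the TE from x to y exceeds its surrogate-based threshold."""
    return te_fn(x, y) > surrogate_threshold(x, y, te_fn, **kwargs)
```

The same procedure applies to the direct transfer entropy by passing an estimator that also conditions on the intermediate variable.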
Here, the computational complexities of TE and IG-based TE are compared. According to
Section 2, for traditional TE, the computational complexity is
. As for the IG-based TE, the computational complexity is
, and
w denotes the window length. Thus, it can be seen that the proposed method can greatly reduce the computational complexity for the TE calculation.
3.4. Determination of the Window Length by Delay Estimation
When performing information granulation on original data, there is a key parameter that needs to be discussed, namely, the window length w during discretization. If the window length w is too large or too small, the TE calculation result will be affected. Specifically, a large window length in information granulation can reduce the computational complexity, but may also lead to information loss and, thus, compromise the accuracy of causality detection. Therefore, a reasonable choice of window length is essential for correct causality analysis.
As discussed in
Section 3.3, the first-order TE is used. Thus, the window length should be set to offset the delay between two time series in the original data, such that the delay between two compressed time series after information granulation is 1. Accordingly, this paper proposes determining the window length through delay estimation. It should be noticed that determining the window length by the delay between two time series
x and
y is based on an assumption that the history of the target time series
y should be no more than the delay between
x and
y. Otherwise, if the assumption is violated, the relevant history of
y might not be fully included in the estimate of the first-order TE and, thus, it may falsely estimate the TE, as indicated in [
46,
47].
Here, the system identification toolbox in MATLAB is used to estimate the delay between data [
11]. The main procedures are as follows:
The original data of two variables x and y are normalized by z-score, i.e., $\tilde{x}_i = (x_i - \mu_x)/\sigma_x$, where $\tilde{x}_i$ denotes the normalized sample of x; $\mu_x$ and $\sigma_x$ denote the mean and standard deviation of x, respectively;
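Given the normalized data, the estimation is conducted based on a comparison of ARX models with a range of time delays, i.e., $y(t) + a_1 y(t-1) + \cdots + a_{n_a} y(t-n_a) = b_1 x(t-d) + \cdots + b_{n_b} x(t-d-n_b+1) + e(t)$, where $d$ denotes the delay between x and y; $a_1, \ldots, a_{n_a}$ and $b_1, \ldots, b_{n_b}$ are the coefficients of the model; $e(t)$ is white noise.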
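The two steps above can be illustrated with a simplified, dependency-free sketch. It is not the MATLAB System Identification Toolbox routine: it fits only a first-order ARX model, $y(t) = a\,y(t-1) + b\,x(t-d) + e(t)$, by least squares for each candidate delay $d$ and returns the delay with the smallest mean squared residual. The model order, the function names, and the search range `max_delay` are assumptions of this sketch.

```python
import math
import statistics


def zscore(values):
    """Normalize a series to zero mean and unit standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


def arx_residual(x, y, delay, start):
    """Mean squared residual of y[t] = a*y[t-1] + b*x[t-delay], least-squares fit."""
    rows = [(y[t - 1], x[t - delay], y[t]) for t in range(start, len(y))]
    s_pp = sum(p * p for p, _, _ in rows)
    s_qq = sum(q * q for _, q, _ in rows)
    s_pq = sum(p * q for p, q, _ in rows)
    s_pr = sum(p * r for p, _, r in rows)
    s_qr = sum(q * r for _, q, r in rows)
    det = s_pp * s_qq - s_pq ** 2
    if abs(det) < 1e-12:
        return math.inf  # regressors are collinear; this delay cannot be assessed
    a = (s_pr * s_qq - s_qr * s_pq) / det  # solve the 2x2 normal equations
    b = (s_pp * s_qr - s_pq * s_pr) / det
    return sum((r - a * p - b * q) ** 2 for p, q, r in rows) / len(rows)


def estimate_delay(x, y, max_delay):
    """Return the delay in 1..max_delay whose ARX fit has the smallest residual."""
    zx, zy = zscore(x), zscore(y)
    start = max(1, max_delay)  # same samples for every candidate, for a fair comparison
    return min(range(1, max_delay + 1),
               key=lambda d: arx_residual(zx, zy, d, start))
```

The returned delay would then be used directly as the window length `w` of the information granulation.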
Through the above process, the delay between the two variables is obtained. Then, the window length in information granulation is assigned with the value of time delay so as to make the delay between granular time series be 1, such that the first-order TE in
Section 3.3 is applicable. Next, an example is presented to illustrate the determination of window length through delay estimation.
Example 1. Given the relation between two nonlinearly correlated continuous random variables x and y as , where , the sampling time t is , , , and . The simulation data of 3000 samples under stationary period are collected.
Using the above method, the time lag between x and y was obtained as 2, which was consistent with the actual value. Then, the data were abstracted through information granulation by taking the time lag as the window length. The IG-based TEs were calculated with the data in three dimensions of granular time series.
To test how the window size of information granulation influences the TE calculation, a series of simulations were conducted by changing the value of the window size. Figure 3 presents the trends of IG-based TEs changing with the window length. It can be seen that the maximum TEs for the three dimensions of the granular time series are found at the point where the window length is equal to the delay, as indicated by the highlighted solid circles. This supports the idea that the window length of information granulation can be determined from the time delay between variables. According to the result shown in Figure 3, the proposed IG-based TE correctly detected the causal relation when the window length was set to the time delay. By contrast, the causal strength changed and erroneous conclusions were obtained if an inappropriate window length was used, as demonstrated by the very small calculated values of TE for other window lengths in Figure 3. Thus, the validity of the proposed approach was verified. The correct causal relations can be obtained as long as a proper window length is used in discretization.
After compressing the data via granulation that takes the time delay as the discretization window length, the causal relation is reflected by a one-sample history and can be measured by the first-order TE rather than a higher-order TE. To demonstrate this, the values of TE with different orders were calculated for the same granular time series. The results are shown in Table 1. It can be seen that the value of TE stays high and does not change much as the orders increase. Thus, the first-order TE is sufficient to measure the causal relation given properly compressed data.
5. Conclusions
This paper proposes an improved method for causality inference based on transfer entropy and information granulation. Motivated by the problems accounting for the high computational complexity, a new framework is designed to integrate information granulation as a critical preceding step to compress data, such that the abstracted representative features are obtained and used in the TE calculation. The accuracy of the result is mainly affected by the window length in information granulation. Thus, a window-length determination method is proposed based on delay estimation. Both a numerical case and an industrial case are presented to demonstrate the efficacy of the proposed method. According to the results, the proposed method is capable of detecting the causal relations correctly and promptly. In the numerical and industrial case studies, the proposed method uses only % and % of the calculation time of the traditional TE, respectively. Compared to the original TE, the proposed method shows significantly better computational efficiency, making it more appropriate for real-time applications in root cause analysis.
It should also be noticed that properly compressing the time series via granulation is critical to the correct estimate of the first-order TE, and this relies on the determination of the window length, which is set as the time delay between two time series. This paper assumes that the history of the target time series
y should be no more than the delay between
x and
y, such that the data can be properly compressed and both the histories of
x and
y can be fully included in the estimate of the first-order TE. However, it is possible that the relevant history of the target time series
y is much larger than the delay in real cases. According to the literature [
46], failing to include the relevant history of the target time series can lead to a spurious overestimation of the TE. This problem deserves in-depth investigation and can be considered in future work to find a better solution for obtaining a more accurate estimate of transfer entropy.