Normalized Multivariate Time Series Causality Analysis and Causal Graph Reconstruction

Causality analysis is an important problem lying at the heart of science, and is of particular importance in data science and machine learning. An endeavor during the past 16 years viewing causality as a real physical notion so as to formulate it from first principles, however, seems to have gone unnoticed. This study introduces to the community this line of work, with a long-due generalization of the information flow-based bivariate time series causal inference to multivariate series, based on the recent advance in theoretical development. The resulting formula is transparent, and can be implemented as a computationally very efficient algorithm for application. It can be normalized and tested for statistical significance. Different from the previous work along this line where only information flows are estimated, here an algorithm is also implemented to quantify the influence of a unit to itself. While this forms a challenge in some causal inferences, here it comes naturally, and hence the identification of self-loops in a causal graph is fulfilled automatically as the causalities along edges are inferred. To demonstrate the power of the approach, presented here are two applications in extreme situations. The first is a network of multivariate processes buried in heavy noises (with the noise-to-signal ratio exceeding 100), and the second a network with nearly synchronized chaotic oscillators. In both graphs, confounding processes exist. While it seems to be a challenge to reconstruct from given series these causal graphs, an easy application of the algorithm immediately reveals the desideratum. Particularly, the confounding processes have been accurately differentiated. Considering the surge of interest in the community, this study is very timely.


Introduction
Recent years have seen a surge of interest in causality analysis. The main thrust is the recognition of its increasing importance in machine learning and artificial intelligence, a milestone being the connection of the principle of independent causal mechanisms to semi-supervised learning by Schölkopf et al. [1]. Different methods have been proposed for inferring the causality from data, in addition to the classical ones such as Granger causality testing. While traditionally causal inference has been categorized as a subject in statistics, and now also a subject in computer science, it merits mentioning that, during recent decades, contributions from different disciplines have augmented the subject and significant advances have been made ever since. Early efforts since Clive Granger and Judea Pearl (cf. [2] include, for example, Spirtes and Glymour (1991) [3], Schreiber (2000) [4], Paluš et al. (2001) [5], and Liang and Kleeman (2005) [6]. Recently, due to the rush in artificial intelligence, publications have been growing rapidly, among which are Zhang and Spirtes (2008) [7], Maathuis et al. (2009) [8], Pompe and Runge (2011) [9], Janzing et al. (2012) [10], Sugihara et al. (2012) [11], Schölkopf et al. (2012) [1], Sun and Bollt (2014) [12], Peters et al. (2017) [13], to name but a few; see [13] and [14] for more references.
Although causality has long been investigated ever since Granger [15], thanks to the systematic works by Pearl (e.g., [2]) and others, its "mathematization is a relatively recent development," said Peters, Janzing and Schökopf (2017) [13]. On the other hand, Liang (2016) [16] argued that it is actually "a real physical notion that can be derived ab initio." Despite the current rush, this latter line of work, which starts some 16 years ago, seems to have gone almost unnoticed. It can be traced back to a discovery of two-dimensional (2D) information flow in 2005 by Liang and Kleeman [6]. With the later efforts of, e.g., Liang (2008) [17] and Liang (2014) [18], a very easy method for bivariate time series causality analysis has been established, validated, and applied successfully to real world problems in different disciplines. More details can be found below in Section 2. Recently, the whole formalism has been put on a rigorous footing [16]; explicit formulas for multidimensional information flow have been obtained in a closed form with both deterministic and stochastic systems.
The multivariate time series causality analysis, however, has not been established since Liang (2016)'s comprehensive work [16]. Considering the enormous interest in this field, we are henceforth intented to fill the gap. The purpose of this study is hence two-fold: (1) Implement Liang (2016)'s theory into the long-due multivariate time series causality analysis; (2) along with the implementation present a brief introduction of this line of work.
The remaining of the paper is organized as follows. In Section 2, we first establish the framework, and then take a stroll through the theory of information flow and the information flow-based bivariate time series causality analysis. Section 3 presents an estimate of the information flow rates among multivariate time series, and their significance tests. These information flows can be normalized to reveal the impact of the role in question (Section 4). In order to test the power of the method, in Section 5, it is applied to infer the causal graphs with two extreme processes, one being a network with heavy noise (noise-to-signal ratio exceeding 100), another being a network of almost synchronized chaotic oscillators. Section 6 closes the paper with a brief summary of the study.

Directed Graph, Uncertainty Propagation, and Causality
In this framework, causal inference is based on information flow (rather than the other way around), which has been recognized as a real physical notion that can be put on a rigorous footing (see Liang, 2016). Consider a graph (V, E), where V and E are the sets of vertexes and edges, and the structural causal model on the graph, (C, P N ), where C is a collection of d structural assignments X i = F i (PA(X i ), i ), i = 1, . . . , d, PA(X i ) ⊆ {X \ i } = X 1 , . . . , X d }\{X i } indicating the parents or direct causes of X i , and P N being a joint distribution over the noise variables [2]. The basic idea is that this can be recast within the framework of dynamical systems, and that the causal inference problem can be carried forth to that between the coordinates in a dynamical system. This is how Liang and Kleeman (2005) [6] originally conceptualized the problem. Recently, it has also been realized by R/ oysland (2012) [19], Mooij et al. (2013) [20] and Mogensen et al. (2018) [21].
In physics there is a notion called information flow, which can be readily cast within the dynamical system framework. As Shannon entropy (simply "entropy" hereafter) is by interpretation "self information", it is natural to measure it with the propagation of entropy or uncertainty, from one component to another. (Other entropies may provide alternative choices. Particularly, a generalized permutation entropy is referred to [22].) In this light, we have the following definition: Definition 1. In a dynamical system (Ω, Φ t ) on the d-dimensional phase space Ω, where Φ t may be a continuous-time flow (t ∈ R + ) or discrete-time mapping t ∈ Z + ), the information flow from a component/coordinate X j to another component/coordinate X i , written T j→i , is defined as the contribution of entropy (uncertainty) from X j per unit time (t ∈ R + ) or per step (t ∈ Z + ) in increasing the marginal entropy of X i .
With information flow, causality can be defined, and, moreover, quantitatively defined: Definition 2. X j is causal to X i iff T j→i = 0. The magnitude of the causality from X j to X i is measured by |T j→i |.
By evaluating the information flow within a dynamical system, the underlying causal graph is henceforth determined.
For this study, we consider only the continuous flow case. The vector field that forms the structural assignments is hence differentiable. Further, we assume a Wiener process for the noise (white noise). Note that some of these assumptions can be easily relaxed, and the generalization is straightforward. However, that is outside the scope of this study.

A Brief Stroll through the Theory and Recent Advances
This line of work begins with Liang and Kleeman (2005) [6] within the framework of 2D deterministic systems. Originally, it is based on a heuristic argument, but later on it is rigorized. Its generalization to multidimensional and stochastic systems, however, has not been fulfilled until the recent theoretical work by Liang (2016) [16]. The following is just a brief review.
We begin by stating an observational fact about causality: If the evolution of an event, say, X 1 , is independent of another one, X 2 , then the information flow from X 2 to X 1 is zero.
Since it is the only quantitatively stated fact about causality, all previous empirical/halfempirical causality formalisms have attempted to verify it in applications. For this reason, it has been referred to as the principle of nil causality (e.g., [16]). We will soon see below that, within the information flow framework, this principle turns out to be a proven theorem. Consider a d-dimensional continuous-time stochastic system for X = (X 1 , . . . , X d ) where F = (F 1 , . . . , F d ) may be arbitrary nonlinear functions of X and t, W is a vector of standard Wiener processes, and B = (b ij ) is the matrix of perturbation amplitudes, which may also be any functions of X and t. Assume that F and B are both differentiable with respect to X and t. We then have the following theorem [16]: Theorem 1. For the system (1), the rate of information flowing from X j to X i (in nats per unit time) is where dx \ i\ j signifies dx 1 . . . dx i−1 dx i+1 . . . dx j−1 dx j+1 . . . dx n , E stands for mathematical expectation, g ii = ∑ n k=1 b ik b ik , ρ i = ρ i (x i ) is the marginal probability density function (pdf) of X i , ρ j|i is the pdf of X j conditioned on X i , and ρ \ j = R ρ(x)dx j .
For discrete-time mappings, the information flow is in a more complicated form; see [16].
This is the early result of Liang (2008) [17] on which the bivariate causality analysis is based; see Theorem 5 below.
There is a nice property for the above information flow: Theorem 2. If in (1) neither F 1 nor g 11 depends on X 2 , then T 2→1 = 0.
Note this is precisely the principle of nil causality. Remarkably, here it appears as a proven theorem, while the classical ansatz-like formalisms attempt to verify it in applications.
Moreover, it has been established that [23]: This is a very important result, as we will see soon in the causal graph reconstruction. On the other hand, this tells that the obtained information flow should be an intrinsic property in physical world.
For linear systems, the information flow can be greatly simplified.

Theorem 4.
In (1), if F(X) = f + AX, and B is a constant matrix, then where a ij is the (i, j) th entry of A, and σ ij the population covariance between X i and X j .
Observe that, if σ ij = 0, then T j→i = 0; but if T j→i = 0, σ ij does not necessarily vanish. Contrapositively, this means that correlation does not mean causation. We hence have the following corollary: Corollary 2. In the linear sense, causation implies correlation, but correlation does not imply causation.
This explicit mathematical expression hence provides a solution to the long-standing debate ever since George Berkeley (1710) [24] over correlation versus causation. Note, however, this is for linear systems only. For nonlinear systems, the existence of such a relation, and, if existing, how it is like, are yet to be explored. Nonetheless, as proved in [25], this relation indeed holds for some counter-examples in terms of normalized information flow (see Section 4 below).
In the case with only two time series (no dynamical system is given), we have the following result [18]: Theorem 5. Given two time series X 1 and X 2 , under the assumption of a linear model with additive noise, the maximum-likelihood estimator (mle) of (3) iŝ where C ij is the sample covariance between X i and X j , and C i,dj is the sample covariance between X i and a series derived from X j using the Euler forward differencing scheme:Ẋ j,n = (X j,n+k − X j,n )/(k∆t), with k ≥ 1 some integer.
Equation (5) is rather concise in form; it only involves the common statistics, i.e., sample covariances. In other words, a combination of some sample covariances will give a quantitative measure of the causality between the time series. This makes causality analysis, which otherwise would be complicated with the classical empirical/half-empirical methods, very easy. Nonetheless, note that Equation (5) cannot replace (3); it is just the mle of the latter. A statistical significance test must be performed before a causal inference is made based on the computedT 2→1 . For details, refer to [18].
The above formalism has been validated with many benchmark systems such as baker transformation, Hénon map, Kaplan-Yorke map, Rössler system (see [16]), to name a few. Particularly, the concise Equation (5) has been validated with problems where traditional approaches fail. An example is the mysterious anticipatory system problem discovered by Hahs and Pethel [26], which with (5) is successfully fixed in an easy way.
The formalism has been successfully applied to the studies of many real world problems, among them are the El Niño-Indian Ocean Dipole relation [18], global climate change [27], soil moisture-precipitation interaction [28], glaciology [29], and neuroscience problems [30], to name a few. Here, we particularly want to mention the study by Stips et al. [27], who, through examining with (5) the causality between the CO 2 index and the surface air temperature, identified a reversing causal relation with time scale. They found, during the past century, indeed CO 2 emission drives the recent global warming; the causal relation is one-way, i.e., from CO 2 to global mean atmosphere temperature. Moreover, they were able to find how the causality is distributed over the globe, thanks to the quantitative nature of (5). However, on a time scale of 1000 years or over, the causality is completely reversed; that is to say, on a paleoclimate scale, it is global warming that drives the CO 2 concentration to rise.

Information Flow among Time Series and Algorithm for Multivariate Causal Inference
We now estimate (2), given observations of the d components, in order to arrive at a practically easy-to-use formula for causal inference. As mentioned in Section 1, this has not been done yet; the available estimator (5) is for (3). Here, we only consider time series, but it can be easily extended to other forms of data. We further assume the series are stationary and equi-distanced. Without loss of generality, it suffices to examine T 2→1 .
As in the bivariate case considered in [18], we estimate the linear version (4). We hence examine a linear stochastic differential equation where f is a constant vector, and A = (a ij ) and B = (b ij ) are constant matrices. Initially, if X obeys a Gaussian distribution, then it is a Gaussian for ever, i.e., X ∼ N (µ, Σ), with µ = (µ 1 , . . . , µ d ) T and Σ = (σ ij ) being the mean vector and covariance matrix, respectively. Hence, X 1 ∼ N (µ 1 , σ 11 ).
The above results need to be estimated if what we are given are just d time series. That is to say, what we know is a single realization of some unknown system, which, if known, can produce infinitely many realizations. We use maximum-likelihood estimation (e.g., [31]) to achieve the goal. The procedure follows that of [18], which for easy reference, we briefly summarize here. As established before, a further assumption that B is diagonal, i.e., b ij = 0, for i = j, and hence g 11 = b 2 11 , will greatly simplify the problem, while in practice, this is quite reasonable.
Suppose that the series are equal-distanced with a time stepsize ∆t, and let N be the sample size. Consider an interval [n∆t, (n + 1)∆t], and let the transition pdf be ρ(X n+1 |X n ; θ), where θ stands for the vector of parameters to be estimated. So, the log likelihood is As N is usually large, the term ρ(X 1 ) can be dropped without causing much error. The transition pdf is, with the Euler-Bernstein approximation (see [18]), where F = f + AX. This results in a log likelihood functional a ij X j,n ), i = 1, 2, . . . , d andẊ i = {Ẋ i,n } is the Euler forward differencing approximation of dX i dt : with k ≥ 1. Usually, k = 1 should be used to ensure accuracy, but in some cases of deterministic chaos and the sampling is at the highest resolution, one needs to choose k = 2.
Maximizing N , it is easy to find that the maximizer (f 1 ,â 11 , . . .â 1d ) satisfies the following algebraic equation: where the overline signifies sample mean. After some algebraic manipulations as that in [18], this yields the maximum-likelihood estimators (mle): where are the sample covariances, ∆ ij the cofactors of the matrix C = (C ij ), and By (4), this yields an estimator of the information flow from X 2 to X 1 : where C j,d1 is the sample covariance between X j and the derived seriesẊ 1 as computed by (7). When d = 2, it is easy to show that this is reduced to (5), the 2D estimator as obtained in [18]. Information flow concerns the influence from one element to another element, i.e., the causal relation between different elements. A relation can also contain two identical elements; this corresponds to a self-loop in a graph. Historically, before establishing the information flow from, say X 2 , to another component, say X 1 , the contribution of the change of marginal entropy of X 1 by itself is first established. This contribution, denoted by dH * 1 /dt, proves to be E( ∂F 1 ∂x 1 (cf. [16]). As we can see from above, besides the estimator of information flow, in this study, we actually have also estimated dH * 1 /dt, i.e., the influence of a component (here X 1 ) on itself.
Theorem 6. Under a linear assumption, the maximum-likelihood estimator of dH * 1 /dt is Proof. Since dH * 1 /dt = E( ∂F 1 ∂x 1 ), which is a 11 in this case. The mle hence follows. This supplies information not seen in previous causality analysis along this line. As will be clear soon, this helps identify self loops in a causal graph.
From the above, an algorithm for causal inference hence can be implemented, as shown in Algorithm 1.

Algorithm 1: Quantitative causal inference
Input : d time series Output : a DG G = (V, E), and IFs along edges initialize G such that all vertexes are isolated; set a significance level α; for each (i, j) ∈ V × V do computeT i→j by (14); ifT i→j is significant at level α then add i → j to G; recordT i→j ; end end return G, together with the IFsT i→j

Normalization of the Causality among Multivariate Time Series
In many problems, just an assertion whether a causality exists is not enough; we need to know how important it is. This raises an issue of normalization. The normalization of information flow is by no means as trivial as it seemingly looks. Quite different from the case of covariance vs. correlation coefficient, no such relation as Cauchy-Schwartz inequality exists. Liang [32] listed some difficulties in the problem, and so far this is still an area of research. Hereafter, we follow [32] to propose the normalizer for (14).
The basic idea is that the normalizer for T 2→1 should be related to dH 1 /dt, as the former is by derivation a part of the contribution to the latter. However, dH 1 /dt itself cannot be the normalizer, since many terms tend to cancel; sometimes dH 1 /dt may even completely vanish, just as in the Hénon map case. We now write out the estimator of dH 1 /dt and see how the problem can be fixed.
By [16], the time rate of change of the marginal entropy of X 1 is In this linear case, The first term is dH * 1 /dt, i.e., the contribution from itself, and the last term is the effect of noise, written dH noise 1 /dt. The remaining parts are the information flows to X 1 , just as expected. We may hence propose a normalizer as follows: Hence, the normalized information flow from X 2 to X 1 is: Clearly, τ 2→1 lies on [−1, 1]. So, when |τ 2→1 | is 100%, X 2 has the maximal impact on X 1 .
Note that dH dt noise 1 = g 11 /(2σ 11 ), where g 11 = ∑ d j=1 b 2 1j is always positive. That is to say, noise always contributes to increase the marginal entropy of X 1 , agreeing with our common sense. Obviously, this term is related to the noise-to-signal ratio.

A Noisy Causal Network from Autoregressive Processes
Consider the series generated from a d-dimensional vector autoregressive (VAR) process: where X = (X 1 , . . . , X d ) T , A = (a ij ) d×d , e = (e 1 , . . . , e d ) T , and B is a diagonal matrix with diagonal entries b ii , i = 1, . . . , d. Here, the errors e i ∼ N(0, 1) are independent, and b i are the amplitudes of stochastic perturbation. Let The formed network is as shown in Figure 1a. So, by design, we have two directed cycles (X 1 ,X 2 ,X 3 ) and (X 4 ,X 5 ). The former is of length 3, while the latter are parallel edges. These cycles are driven by a common cause or confounder X 6 . Since no diagonal entries of A is 1, all nodes are self loops (trivial cycles of length 1). The resulting autocorrelation is believed to be a challenge in causal inferences for some techniques. This and the confoundingness of X 6 , have been two major issues for many causal inference methods. First, consider the case b ii = 1. Accordingly, six series of 10,000 steps are generated (randomly initialized).
By computation, the information flow rates are (only absolute values are shown), if arranged in a matrix form such that the (i, j) th entry indicates |T i→j |, then the absolute information flow rates are ) This is precisely the same as designed. So, the causal graph is accurately reconstructed. Also, by (15) dH * 1 /dt , . . . , dH * 6 /dt can be computed. They are 1.00 ± 0.01, 1.01 ± 0.01, 1.01 ± 0.01, 0.30 ± 0.01, 1.00 ± 0.01, 1.49 ± 0.02, where the errors at a 90% confidence level are shown. So, here all the nodes are self loops (trivial cycles of length 1).
It should be particularly pointed out that the confoundingness of X 6 does not make an issue here. As shown in Figure 1, there is no significant information flow between X 2 and X 5 ; in other words, they are not directly causal to each other. Nor are X 3 and X 4 . This is actually not a surprise; it is a corollary of the principle of nil causality, as proved before (see Theorem 2). Considering the difficulty of this problem, the performance of this concise Formula (14) is remarkable. The above information flows can be normalized to understand the impact of one unit on another. For example, |τ 6→2 | = 13.2%, |τ 6→5 = 12.5%. For another example, in the cycle (X 4 , X 5 ), the relative information flows are τ 4→5 = 2.4%, τ 5→4 = 8.8%, in contrast to the almost identical absolute information flows. This is understandable: though T 5→4 is comparable to T 4→5 , the parts contributing to dH 5 /dt are different from that to dH 4 /dt, and thus they may have different weights. Now, consider an extreme case when the signals are buried within heavy noise. Let, b ii = 100, and repeat the above steps. The results are, remarkably, almost the same. So, the Formula (14) is very robust in the presence of noise.
If the time series is short, the performance is still satisfactory. For example, if it has only 500 data points, the above case with heavy noise (b ii = 100) results in the following matrix of information flow rates: So, the significant (at the 90% level) information flows are still those as shown in bold face. Note we do not mean to compete with the classical method(s) in this application. Granger causality testing, for example, works well here. Nonetheless, the simplicity of the Formula (14) and the algorithm has greatly increased the performance of computation. On MATLAB, (14) is by test more than 100 times faster than the embedded matlab function gctest.

A Network of Nearly Synchronized Chaotic Series
Now, consider the following causal graph made of Rössler oscillators X, Y and Z, where X is a confounder. A Rössler oscillator has three components, so the system actually has a dimension of 9.XỸ We use for this purpose the coupled system investigated by Paluš et al. [33]. The 9 series are generated through the following Rössler systems Clearly, the first is the driving or "master" system, while the latter two are slaves which are not directly connected. We hence use them to define X, Y and Z. This system is exactly the same as the one studied in [33], except for the addition of another subsystem, Z. The parameters are also chosen to be the same as theirs, ω 1 = 1.015 and ω 2 = 0.985, but with an additional one, ω 3 = 0.95. As can be seen, X is coupled with Y and Z through the first component, and the coupling is one-way, i.e., from X to Y and from X to Z. The coupling parameter is left open for tuning.
The above equations are differenciated and the system is solved using the secondorder Runge-Kutta scheme with a time stepsize ∆t = 0.001. Initialized with random numbers, the state is integrated forward for N = 50,000 steps (t = 50). Discard the initial 10,000 steps and form the 9 time series with 40,000 data points.
The oscillators are highly chaotic. As increases, the three oscillators gradually become in pace. They become almost synchronized after > 0.15. Shown in Figure 2d is an episode of the synchronization for = 0.25.
We now apply (14) to compute the information flows among X, Y, and Z. Since this is a deterministic chaos problem, choose k = 2 in (7) and (14). Following [33], the series {x 1 (n)}, {y 1 (n)}, and {z 1 (n)} are used to represent the three oscillators. Shown in Figure 2a-c are dependencies of the computed information flows on the coupling strength . Clearly, among the six information flows, only T X→Y and T X→Z are significant, indicating (1) that the causal relation between X and Y is unidirectionally from X to Y, (2) that the causality between X and Z is also one-way, i.e., from X to Z, and, mostly importantly (3) that no direct causality exists between Y and Z, although they are highly correlated (c.f. Figure 2d). So, here the confoundingness is not at all an issue.
After exceeds 0.15, the systems begin to synchronize (see [33]), and it is impossible to infer the causal relation using traditional methods. This is understandable, as the series gradually approach toward one series. Here, however, even with > 0.15, i.e., even after the series are almost synchronized, in this framework, the inference still performs remarkably well, as clearly seen in Figure 2a-c. This attests to the power of the information flow-based causal inference technique, which is concise and very easy to implement. : red y 1 : blue z 1 : black , the initial conditions are randomly chosen, some of which may happen to make a highly singular matrix and hence cause large errors. In that case, simply re-run the program.)

Conclusions
Recent years have seen a surge of interest in causality analysis. This study introduced a line of work starting some 16 years ago which has gone almost unnoticed, and implemented the state-of-the-art theory [16] into an easy-to-use algorithm. Particularly, this study extended the bivariate time series analysis of [18] to the long-due multivariate time series causal inference.
In a multivariate stochastic system, the information flow from one component to another proves to be (2). When only time series are available, it can be estimated using (14) under a linear assumption. Ideally if it is not zero, then there exists causality between the components, but practically statistical significance needs to be tested. These have been easily implemented as an algorithm for use.
More than just finding the information flows, hence the causalities, among the units (as in [18]), we have also estimated the influence of a unit to itself. This results in autocorrelation, which becomes an issue in some causal inferences. The consequence is that, in a causal graph, those nodes which are self loops (cycles of length 1) can be easily identified. Also different from previous studies, in a unified treatment, the role of noise has been quantified along with the causality analysis. This quantity has an easy physical interpretation, namely, the ratio of noise to signal. Besides, the obtained causalities can be normalized to measure the importance of the respective parental nodes.
The above very concise and transparent formulas have been applied to examine two problems in extreme situations: (1) a network of multivariate processes with heavy noise (stochastic perturbation amplitude 100 times the signal amplitude); (2) a network with nearly synchronized oscillators. Besides, confounding processes exist in both causal graphs. Case (1) is made of vector autoregressive processes. By applying the algorithm, the causal graph is accurately recovered in a very easy and efficient way. In particular, the confounding processes have been accurately clarified.
Note Granger causality testing works well in case (1). Nonetheless, the simplicity of the Formula (14) allows for an increase of performance by at least two orders.
In case (2), the network is formed with three chaotic Rössler oscillators. When the coupling coefficient exceeds a threshold, synchronization occurs. However, even with the almost completely synchronized time series, the information flow approach still performs remarkably well, with the causalities accurately inferred, and the causal graph accurately reconstructed. In particular, the one-way causalities between the master-slave systems have been recovered. Moreover, it is accurately shown that the two highly correlated, almost identical series due to the confounder are not causally linked.
It should be mentioned that, in arriving at the concise formula for causal inference, an assumption of linearity has been invoked. For some nonlinear problems, the inference may not be precisely as expected. For example, in Figure 2a,b, the red dashed lines are supposed to be zero, but here they are not. However, qualitatively, the inference is still good, as the one-way causality is clearly seen. Such success has already been evidenced in the bivariate case of [18], where a highly nonlinear problem defying classical approaches is examined. Nonetheless, the power of the information flow-based causality analysis will not be fully realized until the linear assumption is relaxed. To generalize to the fully nonlinear case is hence the goal of future work.