Abstract
Causal inference is perhaps one of the most fundamental concepts in science, beginning originally in the works of the ancient philosophers, continuing through today, and woven strongly into current work by statisticians, machine learning experts, and scientists from many other fields. This paper takes the perspective of information flow, which includes the Nobel-prize-winning work on Granger causality and the recently highly popular transfer entropy, both of which are probabilistic in nature. Our main contribution is to develop analysis tools that allow a geometric interpretation of information flow as a causal inference indicated by positive transfer entropy. We describe the effective dimensionality of an underlying manifold, as projected into the outcome space, as a summary of information flow. Contrasting the probabilistic and geometric perspectives, we introduce a new measure of causal inference based on the fractal correlation dimension conditionally applied to competing explanations of future forecasts, which we write GeoCy→x. This avoids some of the boundedness issues that we show exist for the transfer entropy, Ty→x. We highlight our discussion with data developed from synthetic models of successively more complex nature, including the Hénon map, and finally a real physiological example relating breathing and heart rate function.
1. Introduction
Causation inference is perhaps one of the most fundamental concepts in science, underlying questions such as “what are the causes of changes in observed variables?”. Identifying, indeed even defining, the causal variables of a given observed variable is not an easy task, and these questions date back to the Greeks [1,2]. This includes important contributions from more recent luminaries such as Russell [3], and from philosophy, mathematics, probability, information theory, and computer science. We have written that [4], “a basic question when defining the concept of information flow is to contrast versions of reality for a dynamical system. Either a subcomponent is closed or alternatively there is an outside influence due to another component”. Clive Granger’s Nobel-prize-winning work [5] leading to Granger causality (see also Wiener [6]) formulates causal inference as a concept of quality of forecasts. That is, we ask: does system X provide sufficient information regarding forecasts of future states of system X, or are forecasts improved with observations from system Y? We declare that X is not closed, as it is receiving influence (or information) from system Y, when data from Y improve forecasts of X. Such a reduction-of-uncertainty perspective of causal inference is not identical to the interventionists’ concept of allowing perturbations and experiments to decide what changes indicate influences. This data-oriented philosophy of causal inference is especially appropriate when (1) the system is a dynamical system of some form producing data streams in time, and (2) a score of influence may be needed. In particular, contrasting forecasts is the defining concept underlying Granger causality (G-causality), and it is closely related to the concept of information flow as defined by transfer entropy [7,8], which can be proved to be a nonlinear version of Granger’s otherwise linear (ARMA) test [9]. In this spirit, we find methods such as the Convergent Cross-Mapping (CCM) method [10], and causation entropy (CSE) [11] to disambiguate direct versus indirect influences [11,12,13,14,15,16,17,18]. On the other hand, closely related to information flow are concepts of counterfactuals: “what would happen if...” [19], which are foundational questions for another school leading to the highly successful Pearl “Do-Calculus” built on a specialized variation of Bayesian analysis [20]. These are especially relevant for nondynamical questions (inputs and outputs occur once across populations); a typical question of the sort “why did I get fat?” may be premised on inferring probabilities after removing influences of saturated fats and chocolates. However, with concepts of counterfactual analysis in mind, one may argue that Granger causality is less descriptive of causation inference and rather more descriptive of information flow. In fact, there is a link between the two notions for so-called “settable” systems under a conditional form of exogeneity [21,22].
This paper focuses on the information flow perspective of causation as it relates to G-causality. The role of this paper is to highlight connections between the probabilistic aspects of information flow, such as Granger causality and transfer entropy, and a less often discussed geometric picture that may underlie the information flow. To this purpose, we develop here both analytical and data-driven concepts to serve in bridging what have otherwise been separate philosophies. Figure 1 illustrates the two nodes that we tackle here: causal inference and geometry. In the diagram, the equations that are most central in serving to bridge the main concepts are highlighted, and the main role of this paper could then be described as building these bridges.
Figure 1.
Summary of the paper and relationship of causation and geometry.
When data are derived from a stochastic or deterministic dynamical system, one should also be able to understand the connections between variables in geometric terms. The traditional narrative of information flow compares stochastic processes in probabilistic terms. The role of this paper, however, is to offer a unifying description that interprets geometric formulations of causation together with the traditional statistical or information-theoretic interpretations. Thus, we aim to provide a bridge between concepts of causality as information flow and the underlying geometry, since geometry is perhaps a natural setting in which to describe a dynamical system.
Our work herein comes in two parts. First, we analyze connections between information flow, as measured by transfer entropy, and geometric quantities that describe the orientation of underlying functions of a corresponding dynamical system. In the course of this analysis, we have needed to develop a new “asymmetric transfer operator” (asymmetric Frobenius–Perron operator) evolving ensemble densities of initial conditions between spaces whose dimensionalities do not match. With this, we proceed to give a new exact formula for transfer entropy, and from there we are able to relate this Kullback–Leibler-divergence-based measure directly to other, more geometrically relevant divergences, specifically the total variation and Hellinger divergences, by Pinsker’s inequality. This leads to a succinct upper bound of the transfer entropy by quantities related to a more geometric description of the underlying dynamical system. In the second part of this work, we present numerical interpretations of transfer entropy in the setting of a succession of simple dynamical systems with specifically designed underlying densities, and eventually we include a heart rate versus breathing rate data set. Then, we present a new measure, in the spirit of G-causality, that is more directly motivated by geometry. This measure, GeoCy→x, is developed in terms of the classical fractal-dimension concept of correlation dimension.
In summary, the main theme of this work is to provide connections between probabilistic and geometric interpretations of causal inference. The main connections and corresponding sections of this paper are summarized as a dichotomy, Geometry and Causation (information flow structure), as described in Figure 1. Our contributions in this paper are as follows:
- In traditional methods, causality is estimated in probabilistic terms. In this study, we present an analytical and data-driven approach to identifying causality by geometric methods, and thus also a unifying perspective.
- We show that a derivative (if it exists) of the underlying function of the time series has a close relationship to the transfer entropy (Section 2.3).
- We provide a new tool, called GeoCy→x, to identify causality in geometric terms (Section 3).
- The correlation dimension can be used as a measure of the dynamics of a dynamical system. We will show that this measure can be used to identify causality (Section 3).
Part I: Analysis of Connections between Probabilistic Methods and Geometric Interpretations
2. The Problem Setup
For now, we assume that x and y are real-valued scalars, but the multivariate scenario will be discussed subsequently. We use a shorthand notation, x := xn and x′ := xn+1, for any particular time n, where the prime (′) notation denotes “next iterate”. Likewise, let z = (x, y) denote the composite variable, and z′ = (x′, y′) its future composite state. Consider the simplest of cases, where there are two coupled dynamical systems written as discrete time maps,

x′ = f1(x, y),    (1)

y′ = f2(x, y).    (2)
The definition of transfer entropy [7,8,23], measuring the influence of coupling from variables y onto the future of the variables x, denoted Ty→x, is given by:

Ty→x = ∫ p(x′, x, y) log [ p(x′ | x, y) / p(x′ | x) ] dx′ dx dy.    (3)
This hinges on the contrast between two alternative versions of the possible origins of x′ and is premised on deciding one of the following two cases: either

x′ = f1(x)    (4)

or

x′ = f1(x, y)    (5)

is descriptive of the actual function f1. The transfer entropy Ty→x is defined to decide this question by comparing the deviation from a proposed Markov property,

p(x′ | x) = p(x′ | x, y).
The Kullback–Leibler divergence used here contrasts these two possible explanations of the process generating x′. Since Ty→x may be written in terms of mutual information, its units are as for any entropy, bits per time step. Notice that we have overloaded the notation in writing f1(x) and f1(x, y). Our practice will be to rely on the arguments to distinguish otherwise different functions (likewise distinguishing cases of p(x′ | x) versus p(x′ | x, y)).
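To make the contrast concrete, the following is a minimal numerical sketch of Ty→x using a simple binned (histogram) estimator; the coupled toy map, the bin count, and all variable names are illustrative assumptions and not taken from the paper (the later numerical sections use the KSG estimator [32] instead).

```python
import numpy as np

def transfer_entropy(x_next, x_now, y_now, bins=16):
    """Binned estimate of T_{y->x} = sum p(x',x,y) log2[ p(x'|x,y) / p(x'|x) ]."""
    joint, _ = np.histogramdd((x_next, x_now, y_now), bins=bins)
    p_xpxy = joint / joint.sum()             # p(x', x, y)
    p_xy = p_xpxy.sum(axis=0)                # p(x, y)
    p_xpx = p_xpxy.sum(axis=2)               # p(x', x)
    p_x = p_xpx.sum(axis=0)                  # p(x)
    te = 0.0
    for i, j, k in zip(*np.nonzero(p_xpxy)):
        te += p_xpxy[i, j, k] * np.log2(
            (p_xpxy[i, j, k] / p_xy[j, k]) / (p_xpx[i, j] / p_x[j]))
    return te                                # bits per time step

# toy coupled map x_{n+1} = 0.5 x_n + b y_n + small noise, with y_n i.i.d. uniform
rng = np.random.default_rng(0)
n, b = 50000, 0.4
y = rng.uniform(size=n)
x = np.zeros(n)
for t in range(n - 1):
    x[t + 1] = 0.5 * x[t] + b * y[t] + 0.01 * rng.standard_normal()
print(transfer_entropy(x[1:], x[:-1], y[:-1]))   # clearly positive; near 0 if b = 0
```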
Consider that the coupling structure between variables may be characterized by the directed graph illustrated in Figure 2.
Figure 2.
A directed graph presentation of the coupling structure questions corresponding to Equations (1) and (2).
In one time step, without loss of generality, we may decide on Equation (4), the role of y on x′, based on Ty→x, exclusively in terms of the details of the argument structure of f1. This is separate from the reverse question, Tx→y, as to whether y′ = f2(y) or y′ = f2(x, y). In geometric terms, assuming f1 is differentiable, it is clear that, unless the partial derivative ∂f1/∂y is zero everywhere, then the y argument in f1(x, y) is relevant. This is not a necessary condition for positive transfer entropy, which is a probabilistic statement, and ∂f1/∂y ≠ 0 almost everywhere is sufficient.
2.1. In Geometric Terms
Consider the manifold of points given as the graph of x′ = f1(x, y) over (x, y), which we label M. In the following, we assume that f1 is continuously differentiable. Our primary assertion here is that the geometric aspect of this set, projected into Ω2 = X × X′, distinguishes the information flow structure. Refer to Figure 3 for notation. Let the level set for a given fixed y be defined,

Ly = {(x, x′) : x′ = f1(x, y)}.
Figure 3.
Ω2 = X × X′ manifold and Ly level set for (a) x′ = f1(x) = −0.005x² + 100, (b) x′ = f1(x, y) = −0.005x² + 0.01y² + 50. The dimension of the projected set of (x, x′) depends on the causality as just described. Compare to Figure 4 and Equation (27).
When these level sets are distinct, then the question of the relevance of y to the outcome of x′ is clear:
- If Ly1 = Ly2 for all y1, y2, then f1(x, y1) = f1(x, y2) for all x.
Notice that, if the y argument is not relevant as described above, then x′ = f1(x) better describes the associations, but, if we nonetheless insist on writing f1(x, y), then ∂f1/∂y = 0 for all (x, y). The converse is interesting to state explicitly,
- If Ly1 ≠ Ly2 for some y1, y2, then f1(x, y1) ≠ f1(x, y2) for some x, and then f1(x) is not a sufficient description of what should really be written f1(x, y). We have assumed differentiability of f1 throughout.
2.2. In Probabilistic Terms
Considering the evolution of x as a stochastic process [8,24], we may write a probability density function in terms of all those variables that may be relevant, p(x′, x, y). Contrasting the role of the various input variables requires us to develop a new singular transfer operator between domains that do not necessarily have the same number of variables. Notice that the definition of transfer entropy (Equation (3)) seems to rely on the absolute continuity of the joint probability density p(x′, x, y). However, that joint distribution of (x′, x, y) is generally not absolutely continuous, noticing that its support is the graph of f1, a measure-zero subset of the product space. Therefore, the corresponding conditional entropy is not well defined as a differential entropy, and hence there is a problem with transfer entropy. We expand upon this important detail in the upcoming subsection. To guarantee existence, we interpret these quantities by convolution to smooth the problem. Adding an “artificial noise” with standard deviation parameter ε allows definition of the conditional entropy as the singular limit ε approaches zero, and likewise the transfer entropy follows.
The probability density function of the sum of two independent continuous random variables (U + Z) can be obtained by convolution, pU+Z = pU ∗ pZ. Random noise (Z, with mean zero and variance ε²) added to the original observable variables regularizes the problem, and we are interested in the singular limit ε → 0. We assume that Z is independent of X and Y. In experimental data from practical problems, we argue that some noise, perhaps even if small, is always present. Additionally, noise is commonly assumed to be uniform or normally distributed in practical applications. Therefore, for simplicity of the discussion, we mostly focus on those two distributions. With this concept, the transfer entropy can now be calculated by using the regularized variable X′ = f1(X, Y) + Z,
where now we assume that X, Y, and Z are independent random variables and that f1 is a component-wise monotonic (we will consider the monotonically increasing case for consistency of the explanations, but one can treat monotonically decreasing functions in a similar manner) continuous function of X and Y.
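As a small illustration of this regularization-by-convolution step, the sketch below smooths a singular (deterministic) conditional density with a uniform noise kernel on a grid; the particular map value, interval, and noise width are illustrative assumptions only.

```python
import numpy as np

# Density of X' = f1(x, Y) + Z for one fixed x, assuming (illustratively) that
# U = f1(x, Y) is uniform on [x0, x0 + c] and Z is uniform on [-eps/2, eps/2].
x0, c, eps, dx = 0.2, 0.5, 0.05, 1e-3
grid = np.arange(-0.5, 1.5, dx)

p_u = np.where((grid >= x0) & (grid <= x0 + c), 1.0 / c, 0.0)   # density of U
zz = np.arange(-eps / 2, eps / 2 + dx, dx)                      # symmetric noise grid
p_z = np.full(zz.shape, 1.0 / eps)                              # density of Z

p_xprime = np.convolve(p_u, p_z, mode="same") * dx              # density of U + Z
print(np.sum(p_xprime) * dx)   # ~1.0: the smoothed conditional density is proper
```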
Relative Entropy for a Function of Random Variables
Calculation of the transfer entropy depends on conditional probabilities; hence, we will first focus on the conditional probability p(x′ | x, y). Since, for any particular values (x, y), the function value f1(x, y) is fixed, we conclude that X′ = f1(x, y) + Z is just a shifted copy of Z. We see that

p(x′ | x, y) = pZ(x′ − f1(x, y)),

where pZ is the probability density function of Z.
Note that the random variable X′ is a function of X, Y, and Z. To write p(x′ | x), let U = f1(x, Y) for the given x. Therefore, convolution of the densities of U and Z gives the density function for X′ given x (see Section 4.1 for examples). Notice that a given value of the random variable X, say X = x, is a parameter in U. Therefore, we will denote U = Ux. We will first focus on the probability density function of U, pU, using the Frobenius–Perron operator,
In the multivariate setting, the formula is extended similarly, interpreting the derivative as the Jacobian matrix, and the absolute value is interpreted as the absolute value of its determinant. Denote and ; and the vector such that for . Then, the absolute value of the determinant of the Jacobian matrix is given by: . As an aside, note that J is lower triangular with diagonal entries for . The probability density function of U is given by
where S is the support set of the random variable .
Since the random variable X′ (for a given X = x) can be written as a sum of U and Z, we find its probability density function by convolution as follows:

p(x′ | x) = ∫ pU(u) pZ(x′ − u) du.    (11)
Now, the conditional differential entropy h(X′ | X, Y) is expressed in terms of these probability densities. It is useful that translation does not change the differential entropy, so that h(f1(x, y) + Z) = h(Z). In addition, Z is independent of (X, Y), so h(X′ | X, Y) = h(Z). Now, we define

h(X′ | X, Y) := lim_{ε→0} h(f1(X, Y) + Z | X, Y),    (12)
if this limit exists.
We consider two scenarios: (1) Z is a uniform random variable, or (2) Z is a Gaussian random variable. If Z is uniform on an interval of length ε, then its differential entropy is h(Z) = log ε. If, specifically, Z is Gaussian with zero mean and standard deviation ε, then h(Z) = (1/2) log(2πeε²). Therefore, h(Z) → −∞ as ε → 0 in both cases. Therefore, h(X′ | X, Y) is not finite under this definition (Equation (12)) either. Thus, instead of calculating h(X′ | X, Y), we need to use a noisy version of the data, X′ = f1(X, Y) + Z. For that case,

hε(X′ | X, Y) = h(Z),

where Z is either the uniform distribution on an interval of length ε or a Gaussian distribution with zero mean and standard deviation ε.
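The divergence of the noise entropy in the singular limit can be checked directly; the snippet below evaluates the two standard closed forms (these are textbook identities, not formulas specific to this paper), assuming a uniform Z over an interval of length ε and a Gaussian Z with standard deviation ε.

```python
import numpy as np
from scipy.stats import norm

# h(Z) = log(eps) for uniform Z on an interval of length eps, and
# h(Z) = 0.5*log(2*pi*e*eps^2) for Gaussian Z with standard deviation eps.
for eps in (1e-1, 1e-3, 1e-6):
    h_uniform = np.log(eps)
    h_gauss = 0.5 * np.log(2 * np.pi * np.e * eps**2)
    print(eps, h_uniform, h_gauss, norm(scale=eps).entropy())   # last two agree
# both entropies diverge to -infinity as eps -> 0, the singular limit in the text
```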
Now, we focus on h(X′ | X). If X′ is just a function of X, then we can similarly show that, if X′ = f1(X) + Z, then hε(X′ | X) = h(Z).
In addition, notice that, if X′ = f1(X, Y), then h(X′ | X) will exist and in most cases will be finite. However, when we calculate the transfer entropy, we need to use the noisy version to avoid the issues in calculating h(X′ | X, Y). We will now consider the interesting case X′ = f1(X, Y) + Z and calculate h(X′ | X). We require p(x′ | x), and Equation (11) can be used to calculate this probability. Let us denote the resulting quantity by Q; then,
Notice that, if Q does not depend on x′, then ∫ p(x′ | x) Q dx′ = Q, because p(x′ | x) is a probability density function. Therefore, we can calculate h(X′ | X) in four steps. First, we calculate the density function for U (by using Equation (9) or (10)). Then, we calculate p(x′ | x) by using Equation (11). Next, we calculate the value of Q, and finally we calculate the value of h(X′ | X).
Thus, the transfer entropy from y to x follows in terms of comparing conditional entropies,

Ty→x = h(X′ | X) − h(X′ | X, Y).
This quantity is not well defined when X′ = f1(X, Y) exactly, and therefore we considered the noisy case X′ = f1(X, Y) + Z. This interpretation of transfer entropy depends on the parameter ε, as we define

Ty→x := lim_{ε→0} Ty→x(ε),
if this limit exists.
Note that

Ty→x(ε) = hε(X′ | X) − hε(X′ | X, Y) = hε(X′ | X) − h(Z).
Thus, we see that a finite quantity is ensured by the noise term. We can easily find an upper bound for the transfer entropy when Y is a random variable with finite support (with all the other assumptions mentioned earlier), and suppose X′ = f1(X, Y) + Z. First, notice that the uniform distribution maximizes entropy amongst all continuous random variables with a given finite support. If f is a component-wise monotonically increasing continuous function, then the support of X′, conditioned on X = x, is an interval determined by f1(x, ymin) and f1(x, ymax) (together with the noise width), for all x. Here, ymin and ymax are the minimum and maximum values of Y. Then, it follows that
where xmax denotes the maximum x value. We see that an interesting upper bound for the transfer entropy follows:
2.3. Relating Transfer Entropy to a Geometric Bound
Noting that transfer entropy and other variations of the G-causality concept are expressed in terms of conditional probabilities, we recall that

p(x′ | x) = p(x′, x) / p(x)  and  p(x′ | x, y) = p(x′, x, y) / p(x, y).
Again, we continue to overload the notation on the probability density functions p, with the details of the arguments distinguishing to which of these functions we refer.
Now, consider the change-of-random-variable formulas that map between probability density functions under smooth transformations. In the case that x′ = f1(x) (in the special case that f1 is one-to-one), then

p(x′) = pX(f1−1(x′)) / |f1′(f1−1(x′))|.
In the more general case, not assuming one-to-one-ness, we get the usual Frobenius–Perron operator,

p(x′) = ∫ δ(x′ − f1(x)) pX(x) dx = Σ_{x ∈ f1−1(x′)} pX(x) / |f1′(x)|,
in terms of a summation over all pre-images of x′. Notice also that the middle form is written as a marginalization across x of all those x that lead to x′. This Frobenius–Perron operator, as usual, maps densities of ensembles of initial conditions under the action of the map x′ = f1(x).
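As a quick sanity check of this change-of-variables picture, the sketch below pushes a uniform ensemble through an illustrative one-dimensional map (f1(x) = x², chosen here only for convenience) and compares a histogram of the images against the Frobenius–Perron image density.

```python
import numpy as np

# For x' = f1(x) = x^2 with X ~ U[0, 1], the FP image density is p(x') = 1/(2*sqrt(x')).
rng = np.random.default_rng(5)
x = rng.uniform(size=1_000_000)
xp = x**2

counts, edges = np.histogram(xp, bins=50, range=(0.0, 1.0))
width = edges[1] - edges[0]
empirical = counts / (len(xp) * width)             # empirical density of x'
centers = 0.5 * (edges[:-1] + edges[1:])
analytic = 1.0 / (2.0 * np.sqrt(centers))          # FP image density at bin centers
print(np.max(np.abs(empirical - analytic)[1:]))    # small, away from the singular first bin
```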
Comparing to the expression

p(x′) = ∫ p(x′ | x) pX(x) dx,
we assert the interpretation that

p(x′ | x) = δ(x′ − f1(x)),

where δ(·) is the Dirac delta function. In the language of Bayesian uncertainty propagation, p(x′ | x) describes the likelihood function, if interpreting the future state x′ as data and the past state x as parameters, in a standard Bayes description, p(x | x′) ∝ p(x′ | x) p(x). As usual for any likelihood function, while it is a probability distribution over the data argument, it may not necessarily be so with respect to the parameter argument.
Now, consider the case where f1(x, y) is indeed nontrivially a function with respect to not just x, but also with respect to y. Then, we require the following asymmetric space transfer operator, which we name here an asymmetric Frobenius–Perron operator, for smooth transformations between spaces of dissimilar dimensionality:
Theorem 1
(Asymmetric Space Transfer Operator). If , for , given bounded open domain , and range , and , and the Jacobian matrices, , and are not both rank deficient at the same time, then taking the initial density , the following serves as a transfer operator mapping asymmetrically defined densities
The proof of this is in Appendix A. Note also that, by similar argumentation, one can formulate the asymmetric Frobenius–Perron type operator between sets of dissimilar dimensionality in an integral form.
Corollary 1
(Asymmetric Transfer Operator, Kernel Integral Form). Under the same hypothesis as Theorem 1, we may alternatively write the integral kernel form of the expression,
This is in terms of a line integration along the level set, f1−1(x′). See Figure 4.
In Figure 4, we have shown a typical scenario where a level set is a curve (or it may well be a union of disjoint curves), whereas, in a typical FP-operator between sets of the same dimensionality, the integration is generally over pre-images that are usually either singletons or unions of such points, f1−1(x′).
Contrasting the standard and asymmetric forms of transfer operators as described above, in the next section we will compute and bound estimates for the transfer entropy. However, it should already be apparent that, if p(x′ | x, y) = p(x′ | x) in probability with respect to (x, y), then Ty→x = 0.
Comparison to other statistical divergences reveals geometric relevance: Information flow is quite naturally defined by the KL-divergence, in that it comes in the units of entropy, e.g., bits per second. However, the well-known Pinsker’s inequality [25] allows us to more easily relate the transfer entropy to a quantity that has a geometric relevance using the total variation, even if this is only by an inequality estimate.
Recall that Pinsker’s inequality [25] relates random variables with probability distributions p and q over the same support to the total variation and the KL-divergence as follows:

δ(P, Q) ≤ √( (1/2) DKL(P‖Q) ),
written in terms of probability measures P, Q. The total variation distance between probability measures is the maximal absolute difference of probabilities of possible events,

δ(P, Q) = sup_A |P(A) − Q(A)|,
but it is well known to be related to 1/2 of the L1-distance in the case of a common dominating measure μ, with p = dP/dμ and q = dQ/dμ. In this work, we only need absolute continuity with respect to Lebesgue measure, dP = p(x) dx and dQ = q(x) dx; then,

δ(P, Q) = (1/2) ∫ |p(x) − q(x)| dx,
here with respect to Lebesgue measure. In addition, we write ; therefore,
Thus, with the Pinsker inequality, we can bound the transfer entropy from below by inserting the definition Equation (3) into the above:
The assumption that the two distributions correspond to a common dominating measure requires that we interpret p(x′ | x, y) as a distribution averaged across the same (x, y) as p(x′ | x). (Recall, by definition [26], that μ is a common dominating measure of P and Q if the corresponding densities may be written as Radon–Nikodym derivatives with respect to μ.) For the sake of simplification, we interpret transfer entropy relative to a uniform initial density for both entropies of Equation (16). With this assumption, we interpret
In the special case that there is very little information flow, we would expect that , and , almost every ; then, a power series expansion in small b gives
which serves approximately as the TV-lower bound for the transfer entropy, where we have used bracket notation to denote an average across the domain. Notice that, therefore, the bound tends to zero as b → 0. While Pinsker’s inequality cannot guarantee that Ty→x → 0, since it provides only a lower bound on the transfer entropy, it is clearly suggestive. In summary, comparing inequality Equation (35) to the approximation (36) suggests that, for small coupling b, for a.e. (x, y), Ty→x → 0 as b → 0.
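A direct numerical check of the Pinsker relationship used above is easy to set up; in the sketch below, the two densities are illustrative Gaussians standing in for p(x′ | x, y) and p(x′ | x), not quantities computed from any model in the paper.

```python
import numpy as np

# Pinsker's inequality on a grid: TV(P, Q) <= sqrt( D_KL(P || Q) / 2 ), KL in nats.
grid = np.linspace(-5, 5, 2001)
dx = grid[1] - grid[0]

def gauss(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

p = gauss(grid, 0.0, 1.0)     # stand-in for p(x' | x, y)
q = gauss(grid, 0.3, 1.0)     # stand-in for p(x' | x)

kl = np.sum(p * np.log(p / q)) * dx       # Kullback-Leibler divergence
tv = 0.5 * np.sum(np.abs(p - q)) * dx     # total variation distance
print(tv, np.sqrt(kl / 2))                # TV sits below the Pinsker bound
```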
Now, we change to a more computational direction of this story of interpreting information flow in geometric terms. With the strong connection between the geometry of the problem and information flow concepts, such as entropy, described in the following section, it is natural to turn to studying the dimensionality of the outcome spaces, as we will now develop.
Part II: Numerics and Examples of Geometric Interpretations
Now, we will explore numerical estimation aspects of transfer entropy for causation inference in relationship to geometry as described theoretically in the previous section, and we will compare this numerical approach to geometric aspects.
3. Geometry of Information Flow
As the theory of the sections above suggests, there is a strong relationship between the information flow (causality as measured by transfer entropy) and the geometry, encoded for example in the estimates leading to Equation (36). The effective dimensionality of the underlying manifold as projected into the outcome space is a key factor in identifying the causal inference between chosen variables. Indeed, any question of causality is in fact observer dependent. To this point, suppose x′ only depends on x and y, where x′ = f1(x, y). We noticed (Section 2) that y is irrelevant to x′ precisely when ∂f1/∂y vanishes. Now, notice that the data set {(x, x′)} is the projection of the graph of f1 into the (x, x′) plane. Therefore, in the case that the domain of (x, y) is two-dimensional, the projected set {(x, x′)} would be a one-dimensional manifold if and only if ∂f1/∂y = 0. See Figure 3. With these assumptions,
Likewise, for more general dimensionality of the initial data (x, y), the story of the information flow between variables is in part a story of how the image manifold is projected. Therefore, our discussion will focus on estimating the dimensionality in order to identify the nature of the underlying manifold. Then, we will focus on identifying causality by estimating the dimension of the manifold, or, even more generally, of the resulting set if it is not a manifold but perhaps even a fractal. Finally, this naturally leads us to introduce a new geometric measure for characterizing the causation, which we will identify as GeoCy→x.
3.1. Relating the Information Flow as Geometric Orientation of Data
For a given time series {xn}, consider the data sets {(xn, xn+1)} and {(xn, yn, xn+1)}, and contrast their dimensionalities in order to identify whether x′ = f1(x) or x′ = f1(x, y). Thus, mimicking the premise of Granger causality, or likewise of transfer entropy, by contrasting these two versions of the explanations of x′, in terms of either (x, y) or x alone, we decide the causal inference, but this time by using only the geometric interpretation. First, we recall how fractal dimensionality evolves under transformations [27].
Theorem 2
([27]). Let A be a bounded Borel subset of . Consider the function such that for some . The correlation dimension , if and only if there exists a function such that with .
The idea of the arguments in the complete proof, found in Sauer et al. [27], is as follows. Let A be a bounded Borel subset of and with . Then, , where is the correlation dimension [28]. Note that . Therefore, , with if and only if .
Now, we can describe this dimensional statement in terms of our information flow causality discussion, to develop an alternative measure of inference between variables. Let A = {(xn, xn+1)} and B = {(xn, yn, xn+1)}. We assert that there is a causal inference from y to x if the dimensionality of the outcome data exceeds what x alone can explain (Theorem 1). In this paper, we focus on a time series {xn} which might also depend on a time series {yn}, and we will consider the geometric causation from y to x. We will denote geometric causation by GeoCy→x and assume that the data sets are bounded Borel sets. The correlation dimension is used to estimate the dimensionality. First, we identify the causality using the dimensionality of the data sets A and B. Say, for example, that their correlation dimensions differ; then, clearly, we would enumerate a correlation dimension causal inference from y to x (Theorem 1).
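Since the rest of this section leans on estimated correlation dimensions, here is a minimal Grassberger–Procaccia-style estimator used for illustration in the sketches that follow; the scale range, number of scales, and test sets are arbitrary choices, and a production estimator would need more care with scaling regions and sample size.

```python
import numpy as np
from scipy.spatial.distance import pdist

def correlation_dimension(points, n_scales=15):
    """Grassberger-Procaccia estimate: slope of log C(r) vs log r, where C(r)
    is the fraction of point pairs within distance r."""
    pts = np.asarray(points, dtype=float)
    if pts.ndim == 1:
        pts = pts[:, None]                    # a scalar series as points in R^1
    d = pdist(pts)
    r = np.logspace(np.log10(np.percentile(d, 1)),
                    np.log10(np.percentile(d, 25)), n_scales)
    C = np.array([np.mean(d < ri) for ri in r])
    slope, _ = np.polyfit(np.log(r), np.log(C), 1)
    return slope

# sanity check on sets of known dimension embedded in the plane
rng = np.random.default_rng(1)
t = rng.uniform(size=4000)
line = np.column_stack([t, 2.0 * t])          # a 1D set (a line segment)
square = rng.uniform(size=(4000, 2))          # a 2D set (a filled square)
print(correlation_dimension(line))            # close to 1
print(correlation_dimension(square))          # close to 2
```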
3.2. Measure Causality by Correlation Dimension
As we have been discussing, the information flow of a dynamical system can be described geometrically by studying the sets (perhaps they are manifolds) {(xn, xn+1)} and {(xn, yn, xn+1)}. As we noticed in the last section, comparing the dimensions of these sets can be interpreted as descriptive of information flow. Whether the dimensionality is estimated from data or by a convenient fractal measure such as the correlation dimension (D2), there is an interpretation of information flow when contrasting the dimensionality of one set versus the other, in a spirit reminiscent of what is done with transfer entropy. However, these details are geometrically more to the point.
Here, we define GeoCy→x (geometric information flow) by means of a conditional correlation dimension.
Definition 1
(Conditional Correlation Dimensional Geometric Information Flow). Let be the manifold of data set and let be the data set . Suppose that the are bounded Borel sets. The quantity
is defined as the “Conditional Correlation Dimensional Geometric Information Flow”. Here, D2 is the usual correlation dimension of the given set [29,30,31].
Definition 2
(Correlation Dimensional Geometric Information Flow). Let {xn}, {yn} be two time series. The correlation dimensional geometric information flow from y to x, as measured by the correlation dimension and denoted by GeoCy→x, is given by
A key observation is to notice that, if x′ is a function of (x, y), then the correlation dimension of the data {(x, y, x′)} equals that of {(x, y)}; otherwise, it is larger (Theorem 1). If X is not influenced by y, then the correlation dimension of {(x, x′)} equals that of {x}, and therefore GeoCy→x = 0. In addition, notice that GeoCy→x is bounded above by the dimension of the y data; for example, if y is a scalar, then GeoCy→x ∈ [0, 1]. Since we assume that the influence of any time series other than {yn} on {xn} is relatively small, we can conclude that the dimension gap is attributable to y. Additionally, the dimension gap in the {(x, x′)} data scores how much additional (other than X) information is needed to describe the x′ variable. Similarly, the dimension gap in the {(x, y, x′)} data describes how much additional (other than (x, y)) information is needed to define x′. However, when the number of data points N → ∞, the value is not negative (it is at least equal to the dimension of the X data). Thus, theoretically, GeoCy→x identifies a causality in the geometric sense we have been describing.
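Because the exact Equations (37) and (38) are not reproduced in this extraction, the following sketch implements one plausible reading of the construction, namely the extra correlation dimension of the outcome data {(xn, xn+1)} beyond that of {xn} alone; it reuses the correlation_dimension helper sketched above, and the synthetic test values are illustrative, not the paper's.

```python
import numpy as np

def geo_c(x_now, x_next):
    """Dimension-gap score (an assumed reading of GeoC, not the paper's verbatim
    formula): correlation dimension of {(x_n, x_{n+1})} minus that of {x_n}."""
    d_joint = correlation_dimension(np.column_stack([x_now, x_next]))
    d_x = correlation_dimension(np.asarray(x_now, dtype=float))
    return max(d_joint - d_x, 0.0)

# synthetic test: x' = x + b*y with (x, y) sampled uniformly on the unit square
rng = np.random.default_rng(2)
x = rng.uniform(size=4000)
y = rng.uniform(size=4000)
for b in (0.0, 0.5):
    print(b, geo_c(x, x + b * y))   # near 0 when b = 0, near 1 when b != 0
```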
4. Results and Discussion
Now, we present specific examples to contrast the transfer entropy with our proposed geometric measure, to further highlight the role of geometry in such questions. Table 1 provides a summary of our numerical results. We use synthetic examples with known underlying dynamics to understand the accuracy of our approach. Calculating the transfer entropy has theoretical and numerical issues for these chosen examples, while our geometric approach accurately identifies the causation. We use the correlation dimension of the data because the data might be fractal. Using a Hénon map example, we demonstrate that fractal data will not affect our calculations. Furthermore, we use a real-world application that has a positive transfer entropy to explain our data-driven geometric method. Details of these examples can be found in the following subsections.
Table 1.
Summary of the results. Here, we test our new approach on synthetic and real-world application data.
4.1. Transfer Entropy
In this section, we will focus on analytical results and numerical estimators for the conditional entropy and transfer entropy for specific examples (see Figure 5 and Figure 6). As we discussed in previous sections, starting with Section 2.2, computing the transfer entropy for X′ = f(X, Y) has technical difficulties due to the singularity of the quantity p(x′ | x, y). First, we will consider the calculation of the conditional entropy h(X′ | X) for X′ = f(X, Y), and then we will discuss the calculation for noisy data. In the following examples, we assumed that X and Y are independent random variables. A summary of the calculations for a few examples is listed in Table 2.
Figure 5.
Conditional entropy h(X′|X). Note that these numerical estimates for the conditional entropy by the KSG method [32], converge (as N → ∞) to the analytic solutions (see Table 2).
Figure 6.
Numerical results and analytical results for the transfer entropy Ty→x(ε; b) for the problem X′ = X + bY + ε. Transfer entropy vs. ε is shown in (a) for a fixed b value. (b) and (c) show the behavior of the transfer entropy over b values with fixed ε values. Notice that convergence of the numerical solution is slow when ε is small.
Table 2.
Conditional entropy h(X′ | X) for the specific parametric examples listed, under the assumption that X and Y are independent and uniformly distributed.
We will discuss the transfer entropy with noisy data because making the conditional entropy well defined requires absolute continuity of the probability density function p(x′ | x, y). Consider, for example, the problem form X′ = g(X) + bY + C, where X and Y are uniformly distributed independent random variables over the interval [0, 1] (the same analysis can be extended to any finite interval), with b being a constant and g a function of the random variable X. We will also consider C to be a random variable, which is distributed uniformly on [0, ε]. Note that it follows that h(X′ | X, Y) = h(C) = log ε. To calculate h(X′ | X), we need to find the conditional probability p(x′ | x), and we observe that X′ | X = x is g(x) + W, where W = bY + C. Therefore,
and
By the definition of transfer entropy, we can show that
and hence the transfer entropy for these data is given by
Therefore, when b = 0, the transfer entropy Ty→x = 0. In addition, notice that the transfer entropy grows as ε → 0. Therefore, convergence of the numerical estimates is slow when ε is small (see Figure 6).
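For a rough numerical counterpart of this discussion (the analytic expression, Equation (42), is not reproduced here), the sketch below reuses the binned transfer_entropy estimator sketched in Section 2 on samples of the assumed form X′ = X + bY + Z with uniform noise of width ε; the sample size, bin count, and parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n, eps = 200000, 0.1
x = rng.uniform(size=n)               # X ~ U[0, 1]
y = rng.uniform(size=n)               # Y ~ U[0, 1]
z = rng.uniform(0, eps, size=n)       # regularizing noise of width eps
for b in (0.0, 0.05, 0.5):
    print(b, transfer_entropy(x + b * y + z, x, y))
# roughly 0 at b = 0 (up to estimator bias) and increasing with b/eps,
# with slow convergence when eps is small, consistent with Figure 6
```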
4.2. Geometric Information Flow
Now, we focus on quantifying the geometric information flow by comparing the dimensionalities of the outcome spaces. We will contrast this with the transfer entropy computations for a few examples of the form x′ = f1(x, y).
To illustrate the idea of geometric information flow, let us first consider a simple example, x′ = by. If b = 0, we have the case that x′ does not depend on y, and, when b ≠ 0, we have the case that it does. Therefore, the dimensionality of the data set {(x, x′)} will change with the parameter b (see Figure 7). When the number of data points N → ∞ and b ≠ 0, then GeoCy→x → 1. Generally, this measure of causality depends on the value of b, but also on the initial density of initial conditions.
Figure 7.
Manifold of the data (x′, x) with x′ = by, where y is uniformly distributed in the interval [0, 1]. Notice that, when (a) b = 0, we have a 1D manifold, and (b) when b ≠ 0, we have a 2D manifold, in the (x′, x) plane.
In this example, we contrast theoretical solutions with the numerically estimated solutions (Figure 8). Theoretically, we expect the numerical estimates to converge to the true values as N → ∞. In addition, the transfer entropy for noisy data can be calculated by Equation (42).
Figure 8.
Geometric information flow vs. Transfer entropy for X′ = bY data.
4.3. Synthetic Data: x′ = x + by with (x, y) Uniformly Distributed on the Unit Square
The initial density of points in the domain plays an important role in how the specific information flow values are computed, depending on the measure used. To illustrate this point, consider the example of a unit square, [0, 1]², that is uniformly sampled and mapped by

x′ = x + by.    (43)
This fits our basic premise that the data embed in a 2D manifold, by ansatz of Equations (1) and (43), assuming for this example that each of x and y is a scalar. As the number of data points grows, N → ∞, we can see that GeoCy→x → 1 for b ≠ 0, because the data {(x, x′)} lie on a 2D manifold iff b ≠ 0 (the numerical estimation can be seen in Figure 9b). On the other hand, the conditional entropy is not defined, becoming unbounded when defined by noisy data. Thus, it follows that the transfer entropy shares this same property. In other words, the boundedness of the transfer entropy depends strongly on the conditional data structure, while our geometric information flow measure does not suffer from this issue. Figure 9c demonstrates this observation with the estimated transfer entropy and analytically computed values for noisy data. The slow convergence can be observed (Equation (42), Figure 6).
Figure 9.
(a) shows the geometric information flow and (b) the transfer entropy for the x′ = x + by data. The figures show the changes with the parameter b. Notice that the transfer entropy behaves similarly to the geometric information flow of the data.
4.4. Synthetic Data: Nonlinear Cases
Now, consider the Hénon map,

x′ = 1 − ax² + y,   y′ = bx,    (44)
as a special case of a general quadratic relationship, x′ = f1(x, y), for discussing how x′ may depend on x and y. Again, we do not worry here whether y′ may or may not depend on x and/or y when deciding dependencies for x′. We will discuss two cases, depending on how the data are distributed. For the first case, assume (x, y) is uniformly distributed in the square [−1.5, 1.5]². The second and dynamically more realistic case will assume that (x, y) lies on the invariant set (the strange attractor) of the Hénon map. The geometric information flow is shown for both cases in Figure 10. We numerically estimate the transfer entropy for both cases, with the values reported in Table 3. (However, recall that the transfer entropy in the first case might not be finite analytically, and the numerical estimation is slow—see Table 3.)
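To connect the two cases with the dimension-gap sketch from Section 3, the snippet below assumes the classical Hénon parameters a = 1.4, b = 0.3 (an assumption; the parameter values of Equation (44) are not reproduced in this extraction) and reuses the geo_c helper; the resulting scores are only rough counterparts of the values reported in Figure 10 and Table 3.

```python
import numpy as np

a, b = 1.4, 0.3                     # classical Henon parameters (assumed)
rng = np.random.default_rng(4)

# Case 1: (x, y) uniformly distributed on [-1.5, 1.5]^2, resampled independently
xu = rng.uniform(-1.5, 1.5, size=4000)
yu = rng.uniform(-1.5, 1.5, size=4000)
print(geo_c(xu, 1.0 - a * xu**2 + yu))      # close to 1 (cf. the reported 0.9)

# Case 2: (x, y) on the invariant set; iterate the map and discard a transient
x, y = 0.1, 0.1
orbit = []
for _ in range(5000):
    x, y = 1.0 - a * x**2 + y, b * x
    orbit.append((x, y))
xa, ya = np.array(orbit[1000:]).T
print(geo_c(xa, 1.0 - a * xa**2 + ya))      # markedly smaller (cf. the reported 0.2712)
```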
Figure 10.
Consider the Hénon map, Equation (44), within the domain [−1.5, 1.5]² and on the invariant set of the Hénon map. (a) The uniform distribution case (green) as well as the natural invariant measure of the attractor (blue) are shown for the (x, y, x′) data in both cases; (b) when (x, y) ∈ [−1.5, 1.5]², notice that GeoCy→x = 0.9, and (c) if (x, y) is in the invariant set of the Hénon map, then GeoCy→x = 0.2712.
Table 3.
Hénon Map Results. Contrasting geometric information flow versus transfer entropy in two different cases, 1st relative to uniform distribution of initial conditions (reset each time) and 2nd relative to the natural invariant measure (more realistic).
4.5. Application Data
Now, moving beyond benchmarking with synthetic data, we will contrast the two measures of information flow on a real-world experimental data set. Consider heart rate (xn) vs. breathing rate (yn) data (Figure 11), as published in [33,34], consisting of 5000 samples. The correlation dimension of the data {xn} is smaller than that of the data {(xn, xn+1)}. Therefore, x′ depends not only on x, but also on an extra variable (Theorem 2). In addition, the correlation dimensions of the data {(xn, yn)} and {(xn, yn, xn+1)} are computed, respectively. We conclude that x′ depends on extra variable(s) other than (x, y) (Theorem 2), and the correlation dimensional geometric information flow, GeoCy→x, is computed by Equations (38) and (37). Therefore, this suggests the conclusion that there is a causal inference from breathing rate to heart rate. Since breathing rate and heart rate share the same units, the quantity measured by geometric information flow can be described without normalizing. The transfer entropy, as estimated by the KSG method [32], is interestingly relatively close to the GeoCy→x value. In summary, both measures for causality (GeoCy→x and Ty→x) are either zero or positive together. It follows that there exists a causal inference (see Table 4).
Figure 11.
Results for the heart rate (xn) (a,c) vs. breathing rate (yn) (b,d) data. The top row shows scatter plots of the data, and the second row represents the dimension of the data.
Table 4.
Heart rate vs. breathing rate data—contrasting the geometric information flow versus the transfer entropy from breathing rate to heart rate.
5. Conclusions
We have developed here a geometric interpretation of information flow as a causal inference, as usually measured by a positive transfer entropy, Ty→x. Our interpretation relates the dimensionality of an underlying manifold, as projected into the outcome space, to a summary of the information flow. Furthermore, the analysis behind our interpretation involves the standard Pinsker inequality, which relates entropy to total variation, and, through this approach, we can interpret the production of information flow in terms of the details of the derivatives describing the relative orientation of the manifolds of inputs and outputs (under certain simple assumptions).
A geometric description of causality allows for new and efficient computational methods for causality inference. Furthermore, this geometric perspective provides a different view of the problem and facilitates a richer understanding that complements the probabilistic descriptions. Causal inference is woven strongly throughout many fields, and the use of transfer entropy has been a popular black-box tool for this endeavor. Our method can be used to reveal more details of the underlying geometry of the data set and to provide a clear view of the causal inference. In addition, one can use a hybrid of this geometric aspect and other existing methods in applications.
We provided a theoretical explanation (Part I: mathematical proof of the geometric view of the problem) and numerical evidence (Part II: a data-driven approach to the mathematical framework) of a geometric view of causal inference. Our experiments are based on synthetic (toy) problems and practical data. In the case of synthetic data, the underlying dynamics of the data and the actual solution to the problem are known. For each of these toy problems, we consider many cases by varying a few parameters. Our newly designed geometric approach successfully captures these cases. One major problem may arise if the data describe a chaotic attractor. We show theoretically (Theorem 2) and experimentally (by the Hénon map example, a toy problem in which we also know the actual causality) that the correlation dimension serves to overcome this issue. Furthermore, we present a practical example based on heart rate vs. breathing rate variability, which was already shown to have positive transfer entropy, and here we relate this to show positive geometric causality.
Furthermore, we have pointed out that the transfer entropy has analytic convergence issues when the future data (x′) are exactly a function of the current input data (x and y), as opposed to the more general noisy case. Therefore, referring to how the geometry of the data can be used to identify the causation in time series data, we developed a new causality measurement based on a fractal-dimension comparison of inputs and outputs. Specifically, the correlation dimension is a useful and efficient way to define what we call the correlation dimensional geometric information flow, GeoCy→x. The GeoCy→x measure offers a strongly geometric, interpretable result as a global picture of the information flow. We demonstrated the natural benefits of GeoCy→x versus Ty→x in several synthetic examples where we can specifically control the geometric details, and then with a physiological example using heart and breathing data.
Author Contributions
Conceptualization, Sudam Surasinghe and E.M.B.; Data curation, Sudam Surasinghe and E.M.B.; Formal analysis, S.S. and E.M.B.; Funding acquisition, E.M.B.; Methodology, S.S. and E.M.B.; Project administration, E.M.B.; Resources, E.M.B.; Software, S.S. and E.M.B.; Supervision, E.M.B.; Validation, S.S. and E.M.B.; Visualization, S.S. and E.M.B.; Writing—original draft, S.S. and E.M.B.; Writing—review and editing, S.S. and E.M.B. All authors have read and agreed to the published version of the manuscript.
Funding
Erik M. Bollt gratefully acknowledges funding from the Army Research Office W911NF16-1-0081 (Samuel Stanton) as well as from DARPA.
Acknowledgments
We would like to thank Ioannis Kevrekidis for his generous time, feedback, and interest regarding this project.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A. On the Asymmetric Spaces Transfer Operators
In this section, we prove Theorem 1, concerning a transfer operator for smooth transformations between sets of perhaps dissimilar dimensionality. In general, the marginal probability density can be found by integrating (or summing, in the case of a discrete random variable) to marginalize the joint probability densities. When x′ = f1(x, y) exactly, the joint density is non-zero only at points on the graph of f1. Therefore, and notice that (by Bayes’ theorem). Hence, we only need to show the following claims. We will discuss this in two cases. First, we consider x′ = f1(x), and then we consider the more general case x′ = f1(x, y). In higher dimensions, we can consider similar scenarios of input and output variables, and correspondingly the trapezoidal bounding regions would need to be specified in which we can analytically control the variables.
Proposition A1
(Claim 1). Let be a random variable with probability density function . Suppose the relevant densities are Radon–Nikodym derivatives (of induced measures with respect to some base measure μ) which are bounded above and bounded away from zero. In addition, let for some function . Then,
where
Proof.
Let and . Since is a Radon–Nikodym derivative with bounded above and bounded away from zero, where m is the infimum of the Radon–Nikodym derivative. Similarly where M is the supremum of the Radon–Nikodym derivative. In addition, for . Therefore, when . Hence, and . Therefore, □
Proposition A2
(Claim 2). Let be random variables with joint probability density function . Suppose and are Radon–Nikodym derivatives (of induced measures with respect to some base measure μ) which are bounded above and bounded away from zero. In addition, let for some function . Then,
where
Proof.
Let and . Since is a Radon–Nikodym derivative with bounded above and bounded away from zero, where m is the infimum of the Radon–Nikodym derivative. Similarly, where M is the supremum of the Radon–Nikodym derivative. In addition, for . Therefore, when . Hence, and . Therefore, . □
If f only depends on x, then the partial derivative of f with respect to y is equal to zero, which leads to the same result as Claim 1.
References
- Williams, C.J.F. “Aristotle’s Physics, Books I and II”, Translated with Introduction and Notes by W. Charlton. Mind 1973, 82, 617. [Google Scholar] [CrossRef]
- Falcon, A. Aristotle on Causality. In The Stanford Encyclopedia of Philosophy, Spring 2019 ed.; Zalta, E.N., Ed.; Metaphysics Research Lab, Stanford University: Stanford, CA, USA, 2019. [Google Scholar]
- Russell, B. I.—On the Notion of Cause. Proc. Aristot. Soc. 1913, 13, 1–26. [Google Scholar] [CrossRef]
- Bollt, E.M. Open or closed? Information flow decided by transfer operators and forecastability quality metric. Chaos Interdiscip. J. Nonlinear Sci. 2018, 28, 075309. [Google Scholar] [CrossRef] [PubMed]
- Hendry, D.F. The Nobel Memorial Prize for Clive W. J. Granger. Scand. J. Econ. 2004, 106, 187–213. [Google Scholar] [CrossRef]
- Wiener, N. The theory of prediction. In Mathematics for the Engineer; McGraw-Hill: New York, NY, USA, 1956. [Google Scholar]
- Schreiber, T. Measuring Information Transfer. Phys. Rev. Lett. 2000, 85, 461–464. [Google Scholar] [CrossRef] [PubMed]
- Bollt, E.; Santitissadeekorn, N. Applied and Computational Measurable Dynamics; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2013. [Google Scholar] [CrossRef]
- Barnett, L.; Barrett, A.B.; Seth, A.K. Granger Causality and Transfer Entropy Are Equivalent for Gaussian Variables. Phys. Rev. Lett. 2009, 103, 238701. [Google Scholar] [CrossRef]
- Sugihara, G.; May, R.; Ye, H.; Hsieh, C.h.; Deyle, E.; Fogarty, M.; Munch, S. Detecting Causality in Complex Ecosystems. Science 2012, 338, 496–500. [Google Scholar] [CrossRef] [PubMed]
- Sun, J.; Bollt, E.M. Causation entropy identifies indirect influences, dominance of neighbors and anticipatory couplings. Phys. D Nonlinear Phenom. 2014, 267, 49–57. [Google Scholar] [CrossRef]
- Sun, J.; Taylor, D.; Bollt, E. Causal Network Inference by Optimal Causation Entropy. SIAM J. Appl. Dyn. Syst. 2015, 14, 73–106. [Google Scholar] [CrossRef]
- Bollt, E.M.; Sun, J.; Runge, J. Introduction to Focus Issue: Causation inference and information flow in dynamical systems: Theory and applications. Chaos Interdiscip. J. Nonlinear Sci. 2018, 28, 075201. [Google Scholar] [CrossRef]
- Runge, J.; Bathiany, S.; Bollt, E.; Camps-Valls, G.; Coumou, D.; Deyle, E.; Glymour, C.; Kretschmer, M.; Mahecha, M.D.; Muñoz-Marí, J.; et al. Inferring causation from time series in Earth system sciences. Nat. Commun. 2019, 10, 1–13. [Google Scholar] [CrossRef] [PubMed]
- Lord, W.M.; Sun, J.; Ouellette, N.T.; Bollt, E.M. Inference of causal information flow in collective animal behavior. IEEE Trans. Mol. Biol. Multi-Scale Commun. 2016, 2, 107–116. [Google Scholar] [CrossRef]
- Kim, P.; Rogers, J.; Sun, J.; Bollt, E. Causation entropy identifies sparsity structure for parameter estimation of dynamic systems. J. Comput. Nonlinear Dyn. 2017, 12, 011008. [Google Scholar] [CrossRef]
- AlMomani, A.A.R.; Sun, J.; Bollt, E. How Entropic Regression Beats the Outliers Problem in Nonlinear System Identification. arXiv 2019, arXiv:1905.08061. [Google Scholar] [CrossRef] [PubMed]
- Sudu Ambegedara, A.; Sun, J.; Janoyan, K.; Bollt, E. Information-theoretical noninvasive damage detection in bridge structures. Chaos Interdiscip. J. Nonlinear Sci. 2016, 26, 116312. [Google Scholar] [CrossRef]
- Hall, N. Two Concepts of Causation. In Causation and Counterfactuals; Collins, J., Hall, N., Paul, L., Eds.; MIT Press: Cambridge, MA, USA, 2004; pp. 225–276. [Google Scholar]
- Pearl, J. Bayesianism and Causality, or, Why I Am Only a Half-Bayesian. In Foundations of Bayesianism; Corfield, D., Williamson, J., Eds.; Kluwer Academic Publishers: Dordrecht, The Netherlands, 2001; pp. 19–36. [Google Scholar]
- White, H.; Chalak, K.; Lu, X. Linking Granger Causality and the Pearl Causal Model with Settable Systems. JMLR Workshop Conf. Proc. 2011, 12, 1–29. [Google Scholar]
- White, H.; Chalak, K. Settable Systems: An Extension of Pearl’s Causal Model with Optimization, Equilibrium, and Learning. J. Mach. Learn. Res. 2009, 10, 1759–1799. [Google Scholar]
- Bollt, E. Synchronization as a process of sharing and transferring information. Int. J. Bifurc. Chaos 2012, 22. [Google Scholar] [CrossRef]
- Lasota, A.; Mackey, M. Chaos, Fractals, and Noise: Stochastic Aspects of Dynamics; Springer: New York, NY, USA, 2013. [Google Scholar]
- Pinsker, M.S. Information and information stability of random variables and processes. Dokl. Akad. Nauk SSSR 1960, 133, 28–30. [Google Scholar]
- Boucheron, S.; Lugosi, G.; Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence; Oxford University Press: Oxford, UK, 2013. [Google Scholar]
- Sauer, T.; Yorke, J.A.; Casdagli, M. Embedology. J. Stat. Phys. 1991, 65, 579–616. [Google Scholar] [CrossRef]
- Sauer, T.; Yorke, J.A. Are the dimensions of a set and its image equal under typical smooth functions? Ergod. Theory Dyn. Syst. 1997, 17, 941–956. [Google Scholar] [CrossRef]
- Grassberger, P.; Procaccia, I. Measuring the strangeness of strange attractors. Phys. D Nonlinear Phenom. 1983, 9, 189–208. [Google Scholar] [CrossRef]
- Grassberger, P.; Procaccia, I. Characterization of Strange Attractors. Phys. Rev. Lett. 1983, 50, 346–349. [Google Scholar] [CrossRef]
- Grassberger, P. Generalized dimensions of strange attractors. Phys. Lett. A 1983, 97, 227–230. [Google Scholar] [CrossRef]
- Kraskov, A.; Stögbauer, H.; Grassberger, P. Estimating mutual information. Phys. Rev. E 2004, 69, 066138. [Google Scholar] [CrossRef]
- Rigney, D.; Goldberger, A.; Ocasio, W.; Ichimaru, Y.; Moody, G.; Mark, R. Multi-channel physiological data: Description and analysis. In Time Series Prediction: Forecasting the Future and Understanding the Past; Addison-Wesley: Boston, MA, USA, 1993; pp. 105–129. [Google Scholar]
- Ichimaru, Y.; Moody, G. Development of the polysomnographic database on CD-ROM. Psychiatry Clin. Neurosci. 1999, 53, 175–177. [Google Scholar] [CrossRef] [PubMed]
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).