On Geometry of Information Flow for Causal Inference

Causal inference is perhaps one of the most fundamental concepts in science, beginning originally in the works of some of the ancient philosophers, continuing through today, and woven strongly into current work by statisticians, machine learning experts, and scientists from many other fields. This paper takes the perspective of information flow, which includes the Nobel prize winning work on Granger causality and the recently highly popular transfer entropy, both of which are probabilistic in nature. Our main contribution is to develop analysis tools that allow a geometric interpretation of information flow as a causal inference, as indicated by positive transfer entropy. We describe how the effective dimensionality of an underlying manifold, as projected into the outcome space, summarizes information flow. Contrasting the probabilistic and geometric perspectives, we introduce a new measure of causal inference based on the fractal correlation dimension, conditionally applied to competing explanations of future forecasts, which we write GeoC_{y→x}. This avoids some of the boundedness issues that we show exist for the transfer entropy, T_{y→x}. We highlight our discussion with data developed from synthetic models of successively more complex nature, including the Hénon map example, and finally a real physiological example relating breathing and heart rate function.


Introduction
Causation inference is perhaps one of the most fundamental concepts in science, underlying questions such as "what are the causes of changes in observed variables." Identifying, indeed even defining, the causal variables of a given observed variable is not an easy task, and these questions date back to the Greeks [1,2]. This includes important contributions from more recent luminaries such as Russell [3], and from philosophy, mathematics, probability, information theory, and computer science. We have written that, [4], "a basic question when defining the concept of information flow is to contrast versions of reality for a dynamical system. Either a subcomponent is closed or alternatively there is an outside influence due to another component." Clive Granger's Nobel prize [5] winning work leading to Granger causality (see also Wiener [6]) formulates causal inference as a concept of quality of forecasts. That is, we ask: does system X provide sufficient information regarding forecasts of future states of system X, or are forecasts improved with observations from system Y? We declare that X is not closed, as it is receiving influence (or information) from system Y, when data from Y improves forecasts of X. Such a reduction-of-uncertainty perspective of causal inference is not identical to the interventionists' concept of allowing perturbations and experiments to decide what changes indicate influences. This data-oriented philosophy of causal inference is especially appropriate when (1) the system is a dynamical system of some form producing data streams in time, and (2) a score of influence may be needed. In particular, contrasting forecasts is the defining concept underlying Granger causality (G-causality), and it is closely related to the concept of information flow as defined by transfer entropy [7,8], which can be proved to be a nonlinear version of Granger's otherwise linear (ARMA) test [9]. In this spirit we find methods such as the Convergent Cross-Mapping method (CCM) [10], and
causation entropy (CSE) [11] to disambiguate direct versus indirect influences [11–18]. On the other hand, closely related to information flow are concepts of counterfactuals: "what would happen if..." [19], which are foundational questions for another school, leading to the highly successful J. Pearl "Do-Calculus" built on a specialized variation of Bayesian analysis [20]. These are especially relevant for nondynamical questions (inputs and outputs occur once across populations), such as a typical question of the sort "why did I get fat," which may be premised on inferring probabilities upon removing influences of saturated fats and chocolates. However, with concepts of counterfactual analysis in mind, one may argue that Granger is less descriptive of causation inference, and rather more descriptive of information flow. In fact, there is a link between the two notions for so-called "settable" systems under a conditional form of exogeneity [21,22]. This paper focuses on the information flow perspective, which is causation as it relates to G-causality. The role of this paper is to highlight connections between the probabilistic aspects of information flow, such as Granger causality and transfer entropy, and a less often discussed geometric picture that may underlie the information flow. To this purpose, here we develop both analysis and data-driven concepts to serve in bridging what have otherwise been separate philosophies.

Figure 1 illustrates the two nodes that we tackle here: causal inference and geometry. In the diagram, the equations that are most central in serving to bridge the main concepts are highlighted, and the main role of this paper could then be described as building these bridges.
When data is derived from a stochastic or deterministic dynamical system, one should also be able to understand the connections between variables in geometric terms. The traditional narrative of information flow compares stochastic processes in probabilistic terms. The role of this paper, however, is to offer a unifying description that interprets geometric formulations of causation together with traditional statistical or information-theoretic interpretations. Thus we will try to provide a bridge from concepts of causality as information flow to the underlying geometry, since geometry is perhaps a natural place to describe a dynamical system.
Our work herein comes in two parts. First, we analyze connections between information flow, via transfer entropy, and geometric quantities that describe the orientation of underlying functions of a corresponding dynamical system. In the course of this analysis, we have needed to develop a new "asymmetric transfer operator" (asymmetric Frobenius-Perron operator) evolving ensemble densities of initial conditions between spaces whose dimensionalities do not match. With this, we proceed to give a new exact formula for transfer entropy, and from there we are able to relate this Kullback-Leibler divergence based measure directly to other, more geometrically relevant divergences, specifically the total variation divergence and the Hellinger divergence, by Pinsker's inequality. This leads to a succinct upper bound on the transfer entropy by quantities related to a more geometric description of the underlying dynamical system. In the second part of this work, we present numerical interpretations of the transfer entropy T_{y→x} in the setting of a succession of simple dynamical systems with specifically designed underlying densities, and eventually we include a heart rate versus breathing rate data set. Then we present a new measure in the spirit of G-causality that is more directly motivated by geometry.
This measure, GeoC_{y→x}, is developed in terms of the classical fractal dimension concept of correlation dimension.
In summary, the main theme of this work is to provide connections between probabilistic and geometric interpretations of causal inference. The main connections and the corresponding sections of this paper are summarized as a dichotomy, geometry and causation (information flow structure), as described in Fig. 1.

The Problem Setup
Consider the simplest of cases, where there are two coupled dynamical systems written as discrete time maps,

x_{n+1} = f_1(x_n, y_n),  y_{n+1} = f_2(x_n, y_n).

For now, we assume that x, y are real valued scalars, but the multivariate scenario will be discussed subsequently. We use a shorthand notation, x := x_n, x′ := x_{n+1}, for any particular time n, where the ′ notation denotes "next iterate." Likewise, let z = (x, y) denote the composite variable, and z′ its future composite state.
The definition of transfer entropy [7,8,23], measuring the influence of coupling from the variables y onto the future of the variables x, denoted x′, is given by:

T_{y→x} = ∫ p(x′, x, y) log [ p(x′|x, y) / p(x′|x) ] dx′ dx dy. (3)

This hinges on the contrast between two alternative versions of the possible origins of x′, and is premised on deciding one of the following two cases: either

x′ = f_1(x)  or  x′ = f_1(x, y) (4)

is descriptive of the actual function f_1. The definition of T_{y→x} is designed to decide this question by comparing the deviation from a proposed Markov property, p(x′|x) =? p(x′|x, y). The Kullback-Leibler divergence used here contrasts these two possible explanations of the process generating x′. Since D_KL may be written in terms of mutual information, the units are those of any entropy, bits per time step. Notice that we have overloaded the notation, writing p(x′|x) and p(x′|x, y); our practice will be to rely on the arguments to distinguish functions that are otherwise different (and likewise below). Consider that the coupling structure between variables may be caricatured by the directed graph illustrated in Fig. 2. In one time step, without loss of generality, we may decide Eq. (4), the role of y on x′, based on T_{y→x} > 0, exclusively in terms of the details of the argument structure of f_1. This is separate from the reverse question of f_2, as to whether T_{x→y} > 0. In geometric terms, assuming f_1 ∈ C¹(Ω₁), it is clear that unless the partial derivative ∂f_1/∂y is zero everywhere, the y argument in f_1(x, y) is relevant. This is not a necessary condition for T_{y→x} > 0, which is a probabilistic statement; ∂f_1/∂y ≠ 0 almost everywhere is sufficient.
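To make the contrast in Eq. (3) concrete, the following is a minimal plug-in (histogram) sketch of estimating T_{y→x} from time series. The coupled toy map, the number of bins, and the sample size are illustrative assumptions of ours, not the paper's experimental setup; more careful estimators (e.g., KSG, used later in the paper) behave better in practice.

```python
import numpy as np

def transfer_entropy(x, y, bins=8):
    """Plug-in estimate of T_{y->x} = sum p(x',x,y) log2[ p(x'|x,y) / p(x'|x) ]."""
    trip = np.column_stack([x[1:], x[:-1], y[:-1]])   # (x', x, y) aligned in time
    pxyz, _ = np.histogramdd(trip, bins=bins)
    pxyz /= pxyz.sum()
    pxx = pxyz.sum(axis=2)    # p(x', x)
    pxy = pxyz.sum(axis=0)    # p(x, y)
    px = pxx.sum(axis=0)      # p(x)
    te = 0.0
    for i in range(bins):
        for j in range(bins):
            for k in range(bins):
                p = pxyz[i, j, k]
                if p > 0:
                    # p(x'|x,y) / p(x'|x) = [p / pxy] / [pxx / px]
                    te += p * np.log2(p * px[j] / (pxy[j, k] * pxx[i, j]))
    return te

# toy coupled system (our choice): x' = 0.5 x + 0.5 y + small noise, y i.i.d. uniform
rng = np.random.default_rng(0)
n = 20000
y = rng.uniform(size=n)
x = np.zeros(n)
for t in range(n - 1):
    x[t + 1] = 0.5 * x[t] + 0.5 * y[t] + 0.01 * rng.standard_normal()

te_coupled = transfer_entropy(x, y)                # substantially positive
te_null = transfer_entropy(x, rng.permutation(y))  # near zero, up to estimator bias
```

Shuffling y destroys the time alignment, so the second call serves as a crude significance baseline for the first.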

In Geometric Terms
Consider a manifold of points, (x, y, x′) ∈ X × Y × X′, as the graph over Ω₁, which we label M₂. In the following we assume f_1 ∈ C¹(Ω₁), Ω₁ ⊂ X × Y. Our primary assertion here is that the geometric aspects of the set (x, y, x′), projected into (x, x′), distinguish the information flow structure. Refer to Fig. 3 for notation. Let the level set for a given fixed y be defined by

L_y = {(x, x′) : x′ = f_1(x, y), x ∈ X}.

The dimension of the projected set of (x, x′) depends on the causality as just described. Compare to Fig. 4 and Eq. (27). When these level sets are distinct, the question of the relevance of y to the outcome of x′ is clear. Notice that if the y argument is not relevant as described above, then x′ = f_1(x) better describes the associations, but if we nonetheless insist on writing x′ = f_1(x, y), then ∂f_1/∂y = 0 for all (x, y) ∈ Ω₁. The converse is interesting to state explicitly:

• If L_y ≠ L_ỹ for some y, ỹ, then ∂f_1/∂y ≠ 0 for some (x, y) ∈ Ω₁, and then x′ = f_1(x) is not a sufficient description of what should really be written x′ = f_1(x, y).

We have assumed f_1 ∈ C¹(Ω₁) throughout.

In Probabilistic Terms
Considering the evolution of x as a stochastic process [8,24], we may write a probability density function in terms of all those variables that may be relevant, p(x, y, x′). Contrasting the roles of the various input variables requires us to develop a new singular transfer operator between domains that do not necessarily have the same number of variables. Notice that the definition of transfer entropy (Eq. 3) seems to rely on the absolute continuity of the joint probability density p(x, y, x′).
However, the joint distribution of p(x, y, f(x, y)) is generally not absolutely continuous, since its support is a measure-zero subset of the product space; consequently h(X′|X, Y) is not well defined as a differential entropy, and hence there is a problem with transfer entropy. We expand upon this important detail in the upcoming subsection. To guarantee existence, we interpret these quantities by convolution, to smooth the problem. Adding an "artificial noise" with standard deviation parameter ε allows definition of the conditional entropy in the singular limit as ε approaches zero, and likewise the transfer entropy follows.
The probability density function of the sum of two independent continuous random variables (U, Z) can be obtained by convolution, p_{U+Z} = p_U ∗ p_Z. Random noise Z (with mean E(Z) = 0 and variance V(Z) = Cε²) added to the original observable variables regularizes the problem, and we are interested in the singular limit ε → 0. We assume that Z is independent of X, Y. In experimental data from practical problems, we argue that some noise, perhaps small, is always present. With this concept, the transfer entropy can now be calculated by using h(X′|X, Y) and h(X′|X) when X′ = f(X, Y) + Z, where now we assume that X, Y, Z ∈ R are independent random variables, and that f : Ω_x × Ω_y → R is a component-wise monotonic continuous function of X, Y, with Ω_x, Ω_y ⊆ R (we will consider the monotonically increasing case for consistency of the explanations, but monotonically decreasing functions can be handled in a similar manner).
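The convolution identity p_{U+Z} = p_U ∗ p_Z can be illustrated numerically. In this sketch the choices U = f(x, Y) = x + Y² with x = 0.3 fixed and Y ~ U(0, 1), along with the grid and sample sizes, are arbitrary stand-ins of ours; the point is only that the histogram of U + Z matches the discrete convolution of the histogram of U with the box density of Z.

```python
import numpy as np

rng = np.random.default_rng(1)
eps = 0.05
n = 200_000

# toy stand-in for U = f(x, Y): x = 0.3 fixed, Y ~ U(0,1); Z ~ U(-eps/2, eps/2)
u = 0.3 + rng.uniform(size=n) ** 2
z = rng.uniform(-eps / 2, eps / 2, size=n)
s = u + z                                  # samples of X'|x = U + Z

# on a grid, the density of U + Z is the convolution p_U * p_Z (independence)
h = 0.005                                  # bin width; eps / h = 10 bins span the noise
edges = np.arange(0.2, 1.4 + h, h)
p_u, _ = np.histogram(u, bins=edges, density=True)
p_s, _ = np.histogram(s, bins=edges, density=True)
k = int(round(eps / h))
p_conv = np.convolve(p_u, np.ones(k) / k, mode="same")   # box kernel of Z, area 1

err = np.mean(np.abs(p_conv - p_s)[20:-20])              # compare away from grid edges
```

Since U and Z are independent, their variances should also add, which gives a second quick check on the samples.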

Relative Entropy for a Function of Random Variables
Calculation of transfer entropy depends on the conditional probability, hence we will first focus on conditional probability. Since for any particular values x, y the function value f(x, y) is fixed, we conclude that X′|x, y is just a linear function of Z. We see that

p(x′|x, y) = p_Z(x′ − f(x, y)),

where p_Z is the probability density function of Z.
Note that the random variable X′|x is a function of (Y, Z). To write it as U + Z, let U = f(x, Y). Therefore the convolution of the densities of U and Z gives the density function p(x′|x). Notice that a given value of the random variable, say X = a, is a parameter in U; therefore, we will denote U = f(Y; a). We first focus on the probability density function of U, p_U(u), using the Frobenius-Perron operator,

p_U(u) = Σ_{y ∈ f^{-1}(u; a)} p_Y(y) / |f′(y; a)|. (9)

In the multivariate setting, the formula is extended similarly, interpreting the derivative as the Jacobian matrix and the absolute value as the absolute value of its determinant. Denote the completed transformation g(y; a) = (f(y; a), y₂, …, y_n). Then the absolute value of the determinant of the Jacobian matrix is |J_g(y)| = |∂g₁(y; a)/∂y₁|; as an aside, note that J_g is triangular apart from its first row, with diagonal entries d_ii = 1 for i > 1. The probability density function of U is then obtained by marginalizing the transformed density,

p_U(u) = ∫_S p_Y(g^{-1}(v)) |J_g(g^{-1}(v))|^{-1} dv₂ ⋯ dv_n |_{v₁ = u}, (10)

where S is the support set of the random variable V = g(Y).
Since the random variable X′|x can be written as the sum of U and Z, we find its probability density function by convolution:

p(x′|x) = (p_U ∗ p_Z)(x′) = ∫ p_U(u) p_Z(x′ − u) du. (11)

Now the conditional differential entropy h(X′|X, Y) is in terms of these probability densities. It is useful that translation does not change differential entropy, so h_ε(f(X, Y) + Z | x, y) = h(Z). We consider two scenarios: (1) Z is a uniform random variable, or (2) Z is a Gaussian random variable:

h(Z) = ln ε, Z ∼ U(−ε/2, ε/2);  h(Z) = ½ ln(2πeε²), Z ∼ N(0, ε²),

where U(−ε/2, ε/2) is the uniform distribution on the interval [−ε/2, ε/2], and N(0, ε²) is a Gaussian distribution with zero mean and standard deviation ε. Now we focus on h(X′|X). If X′ is just a function of X, so that X′ = f(X) + Z, then we can similarly show that h_ε(X′|X) = h(Z), since X′|x = f(x) + Z is again a translate of Z. Notice also that if X′ = f(X, Y), then h(X′|X) exists, and in most cases it is finite; but when we calculate T_{y→x} we need to use the noisy version, to avoid the issues in calculating h(X′|X, Y). We now consider the interesting case X′ = f(X, Y) + Z and calculate h(X′|X). We require p_{X′|X}, and Eq. (11) can be used to calculate this probability. Let us denote I = p(x′|x) and Q(x) = −∫ I ln I dx′ (where Q is then averaged against the probability density function p_X). Therefore, we can calculate h_ε(X′|X) in four steps: first, calculate the density function of U = f(x, Y) (by using Eq. (9) or (10)); then, calculate I = p_{X′|X} by using Eq. (11); next, calculate the value of Q; and finally, calculate h_ε(X′|X) = ∫ p_X(x) Q(x) dx.
Thus the transfer entropy from y to x follows by comparing conditional entropies,

T_{y→x} = h(X′|X) − h(X′|X, Y). (16)

This quantity is not well defined when X′ = f(X, Y), and therefore we considered the case X′ = f(X, Y) + Z. This interpretation of transfer entropy depends on the parameter ε, as we define

T_{y→x}(ε) := h_ε(X′|X) − h_ε(X′|X, Y) = h_ε(X′|X) − h(Z).

Note that h_ε(X′|X, Y) = h(Z); thus we see that a finite quantity is ensured by the noise term. We can easily find an upper bound for the transfer entropy when X′ = f(X, Y) + Z is a random variable with finite support (with all the other assumptions mentioned earlier) and Z ∼ U(−ε/2, ε/2). First, notice that the uniform distribution maximizes entropy amongst all distributions of continuous random variables with a given finite support. If f is a component-wise monotonically increasing continuous function, then the support of X′|x is [f(x, y_min) − ε/2, f(x, y_max) + ε/2], where y_min and y_max are the minimum and maximum values of Y. Then it follows that

h_ε(X′|X) ≤ ln( f(x_max, y_max) − f(x_max, y_min) + ε ),

where x_max is the maximum x value. We see that an interesting upper bound for the transfer entropy follows:

T_{y→x}(ε) ≤ ln( ( f(x_max, y_max) − f(x_max, y_min) + ε ) / ε ).
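The two closed-form noise entropies used above, h(Z) = ln ε for uniform noise and ½ ln(2πeε²) for Gaussian noise, can be checked by direct quadrature of −∫ p ln p. This is a small sanity-check sketch; the value of ε and the grid sizes are arbitrary choices.

```python
import numpy as np

def diff_entropy(p, dx):
    """Riemann-sum differential entropy -∫ p ln p (in nats) on a uniform grid."""
    q = np.where(p > 0, p * np.log(p), 0.0)
    return -q.sum() * dx

eps = 0.2

# Z ~ U(-eps/2, eps/2): closed form h(Z) = ln(eps)
xu = np.linspace(-eps / 2, eps / 2, 100_001)
h_uniform = diff_entropy(np.full_like(xu, 1.0 / eps), xu[1] - xu[0])

# Z ~ N(0, eps^2): closed form h(Z) = 0.5 * ln(2*pi*e*eps^2)
xg = np.linspace(-8 * eps, 8 * eps, 100_001)   # +/- 8 sigma truncation is negligible
pg = np.exp(-(xg / eps) ** 2 / 2) / (eps * np.sqrt(2 * np.pi))
h_gauss = diff_entropy(pg, xg[1] - xg[0])
```

Both values are negative for small ε, illustrating why h_ε(X′|X, Y) = h(Z) diverges to −∞ in the singular limit ε → 0.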

Relating Transfer Entropy to a Geometric Bound
Noting that transfer entropy and other variations of the G-causality concept are expressed in terms of conditional probabilities, we recall that ρ(x′|x, y) ρ(x, y) = ρ(x, y, x′).
Again we continue to overload the notation on the functions ρ, the details of the arguments distinguishing to which of these functions we refer. Now consider the change-of-variables formulas that map between probability density functions under smooth transformations. In the case that f_1 is one-to-one, ρ(x′) = ρ(f_1^{-1}(x′)) / |f_1′(f_1^{-1}(x′))|. In the more general case, not assuming one-one-ness, we get the usual Frobenius-Perron operator,

ρ(x′) = ∫ δ(x′ − f_1(x)) ρ(x) dx = Σ_{x : f_1(x) = x′} ρ(x) / |f_1′(x)|,

in terms of a summation over all pre-images of x′. Notice also that the middle form is written as a marginalization across x of all those x that lead to x′. This Frobenius-Perron operator, as usual, maps densities of ensembles of initial conditions under the action of the map f_1.
Comparing to the expression ρ(x′|x) ρ(x) = ρ(x, x′), we assert the interpretation that ρ(x′|x) := δ(x′ − f_1(x)), where δ is the Dirac delta function. In the language of Bayesian uncertainty propagation, p(x′|x) describes the likelihood function, interpreting the future state x′ as data and the past state x as parameters in a standard Bayes description, p(data|parameter) × p(parameter). As usual for any likelihood function, while it is a probability distribution over the data argument, it may not necessarily be so with respect to the parameter argument. Now consider the case where x′ is indeed nontrivially a function not just of x, but also of y. Then we require the following asymmetric-space transfer operator, which we name here an asymmetric Frobenius-Perron operator for smooth transformations between spaces of dissimilar dimensionality.

Theorem 1 (Asymmetric Transfer Operator). Given the domain (x, y) ∈ Ω₁ ⊂ R^d × R^k and range x′ ∈ U ⊂ R^d, with f_1 ∈ C¹(Ω₁) and the Jacobian matrices ∂f_1/∂x (x, y) and ∂f_1/∂y (x, y) not both rank deficient at the same time, then taking an initial density ρ(x, y) ∈ L¹(Ω₁), the following serves as a transfer operator mapping asymmetrically defined densities:

P[ρ](x′) = Σ_branches ∫ ρ(x, ψ(x; x′)) / |∂f_1/∂y (x, ψ(x; x′))| dx,

where, locally where ∂f_1/∂y ≠ 0, y = ψ(x; x′) solves x′ = f_1(x, y), and the sum runs over solution branches. The proof of this is in Appendix A.1. Note also that by similar argumentation, one can formulate the asymmetric Frobenius-Perron type operator between sets of dissimilar dimensionality in an integral form.
Corollary 1 (Asymmetric Transfer Operator, Kernel Integral Form). Under the same hypothesis as Theorem 1, we may alternatively write the integral kernel form of the expression,

P[ρ](x′) = ∫_{Ω₁} δ(x′ − f_1(x, y)) ρ(x, y) dx dy = ∫_{L_{x′}} ρ(x, y) / ‖∇f_1(x, y)‖ ds. (27)

This is in terms of a line integration along the level set

L_{x′} := {(x, y) ∈ Ω₁ : f_1(x, y) = x′}. (29)

See Fig. 4.
In Fig. 4, we have shown a typical scenario where a level set is a curve (or it may well be a union of disjoint curves), whereas in a typical Frobenius-Perron operator between sets of the same dimensionality, the integration is over pre-images that are usually either singletons or unions of such points. The asymmetric transfer operator, Eq. (27), is written in terms of integration over the level set, L_{x′}, of x′ = f_1(x, y) associated with a fixed value x′, Eq. (29).
Contrasting the standard and asymmetric forms of transfer operators as described above, in the next section we will compute and bound estimates for the transfer entropy. However, it should already be apparent that if ∂f_1/∂y = 0 almost everywhere, then p(x′|x, y) = p(x′|x) and the information flow vanishes. Contrast with other statistical divergences reveals the geometric relevance. Information flow is quite naturally defined by the KL-divergence, in that it comes in the units of entropy, e.g., bits per second.
However, the well known Pinsker's inequality [25] allows us to more easily relate the transfer entropy to a quantity that has a geometric relevance using the total variation, even if this is only by an inequality estimate.
Recall that Pinsker's inequality [25] relates random variables with probability distributions p and q over the same support, written as probability measures P, Q, to the total variation and the KL-divergence as follows:

d_TV(P, Q) ≤ √( ½ D_KL(P‖Q) ).

The total variation distance between probability measures is a maximal absolute difference of possible events, but it is well known to be related to 1/2 of the L¹-distance in the case of a common dominating measure, p(x)dμ = dP, q(x)dμ = dQ. In this work, we only need absolute continuity with respect to Lebesgue measure, p(x)dx = dP(x), q(x)dx = dQ(x); then

d_TV(P, Q) = ½ ∫ |p(x) − q(x)| dx,

here with respect to Lebesgue measure. Also, we write D_KL(P‖Q) = ∫ p(x) log (p(x)/q(x)) dx; therefore, with Pinsker's inequality, we can bound the transfer entropy from below by inserting the definition Eq. (3) into the above:

T_{y→x} ≥ 2 d_TV( p(x′|x, y), p(x′|x) )². (35)

The assumption that the two distributions correspond to a common dominating measure requires that we interpret p(x′|x) as a distribution averaged across the same ρ(x, y) as p(x′|x, y). (Recall, by definition [26], that λ is a common dominating measure of P and Q if p(x) = dP/dλ and q(x) = dQ/dλ describe corresponding densities.) For the sake of simplification, we interpret transfer entropy relative to a uniform initial density ρ(x, y) for both entropies of Eq. (16). In the special case that there is very little information flow, we would expect that |∂f_1/∂y| = b |∂f_1/∂x|, for small b > 0, a.e. x, y; then a power series expansion in small b gives

d_TV( p(x′|x, y), p(x′|x) ) = O(b), (36)

which serves approximately as the TV lower bound for transfer entropy, where we have used the notation ⟨•⟩ to denote an average across the domain. Notice therefore that d_TV(p(x′|x, y), p(x′|x)) ↓ as b ↓. While Pinsker's inequality cannot guarantee that therefore T_{y→x} ↓, since it only bounds the transfer entropy from below (equivalently, bounds TV from above), it is clearly suggestive. In summary, comparing inequality Eq. (35) to the approximation (36) suggests that the transfer entropy decreases as b decreases.

Now we change to a more computational direction of this story of interpreting information flow in geometric terms. With the strong connection described above between geometric concepts and information flow concepts, such as entropy, it is natural to turn to studying the dimensionality of the outcome spaces, as we will now develop.
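Pinsker's inequality itself is easy to verify numerically for discrete distributions on a common support; the random 32-point distributions below are arbitrary stand-ins for the conditional densities p(x′|x, y) and p(x′|x).

```python
import numpy as np

def kl_div(p, q):
    """D_KL(P||Q) in nats for discrete distributions on a common support."""
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def tv_dist(p, q):
    """Total variation distance = half the L1 distance."""
    return 0.5 * float(np.sum(np.abs(p - q)))

# random discrete stand-ins on a shared 32-point support
rng = np.random.default_rng(2)
p = rng.random(32); p /= p.sum()
q = rng.random(32); q /= q.sum()

d_tv = tv_dist(p, q)
d_kl = kl_div(p, q)
# Pinsker: d_TV <= sqrt(D_KL / 2), equivalently D_KL >= 2 * d_TV^2
```

Note the one-sided nature of the bound: a small d_tv does not force a small d_kl, which is exactly the caveat discussed above.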

Part II: Numerics and Examples of Geometric Interpretations
Now we will explore numerical estimation aspects of transfer entropy for causation inference in relationship to geometry as described theoretically in the previous section, and we will compare this numerical approach to geometric aspects.

Geometry of Information Flow
As the theory in the sections above suggests, there is a strong relationship between the information flow (causality as measured by transfer entropy) and the geometry, encoded for example in the estimates leading to Eq. (36). The effective dimensionality of the underlying manifold, as projected into the outcome space, is a key factor in identifying the causal inference between chosen variables. Indeed, any question of causality is in fact observer dependent. To this point, suppose x′ depends only on x, y, with x′ = f(x, y), where f ∈ C¹(Ω₁). We noticed (Section 2) that the level-set structure of f determines the dimensionality of the projected set (x, x′). Therefore, in the case that Ω₁ is two dimensional, (x, x′) would be a one-dimensional manifold if and only if ∂f/∂y = 0 for all (x, y) ∈ Ω₁. See Fig. 3. With these assumptions,

T_{y→x} = 0 ⟺ the (x, x′) data lie on a one-dimensional manifold.
Likewise, for more general dimensionality of the initial Ω₁, the story of the information flow between variables is in part a story of how the image manifold is projected. Therefore, our discussion will focus on estimating the dimensionality in order to identify the nature of the underlying manifold. Then, we will focus on identifying causality by estimating the dimension of the manifold, or more generally of the resulting set if it is not a manifold but perhaps even a fractal. Finally, this naturally leads us to introduce a new geometric measure for characterizing the causation, which we will identify as GeoC_{y→x}.

Relating Information Flow to the Geometric Orientation of Data
For a given time series x := x_n ∈ R^{d₁}, y := y_n ∈ R^{d₂}, consider x′ := x_{n+1}, and contrast the dimensionalities of (x, y, x′) versus (x, x′) to identify whether x′ = f(x) or x′ = f(x, y). Thus, mimicking the premise of Granger causality, or likewise of transfer entropy, we decide the causal inference by contrasting these two versions of the explanations of x′, in terms of either (x, y) or x alone, but this time using only the geometric interpretation. First we recall how fractal dimensionality evolves under transformations [27].
The idea of the arguments in the complete proof found in Sauer et al. [27] is as follows. Let A be a bounded Borel subset of R^{d₁} and f : A → R^m a Lipschitz map; then D₂(f(A)) ≤ D₂(A), where D₂ is the correlation dimension [28]. Note that D₂(A) ≤ d₁. Now, we can describe this dimensional statement in terms of our information flow causality discussion, to develop an alternative measure of inference between variables. Let (x, x′) ∈ Ω₂ ⊂ R^{2d₁} and (x, y, x′) ∈ Ω₃ ⊂ R^{2d₁+d₂}. We assert that there is a causal inference from y to x if dim(Ω₂) > d₁ and d₁ < dim(Ω₃) ≤ d₁ + d₂ (Theorem 1). In this paper we focus on a time series x_n ∈ R which might also depend on a time series y_n ∈ R, and we consider the geometric causation from y to x for d₁ = d₂ = 1. We will denote geometric causation by GeoC_{y→x} and assume that A, B are Borel subsets of R. The correlation dimension is used to estimate the dimensionality. First, we identify the causality using the dimensionality of (x, x′) and (x, y, x′). Say, for example, that (x, x′) ∈ Ω₂ ⊂ R² and (x, y, x′) ∈ Ω₃ ⊂ R³; then clearly we would enumerate a correlation dimension causal inference from y to x if dim(Ω₂) > 1 and 1 < dim(Ω₃) ≤ 2 (Theorem 1).

Measure Causality by Correlation Dimension
As we have been discussing, the information flow of a dynamical system can be described geometrically by studying the sets (perhaps they are manifolds) X × X′ and X × Y × X′. As we noticed in the last section, comparing the dimensions of these sets can be interpreted as descriptive of information flow. Whether the dimensionality is estimated from data or by a convenient fractal measure such as the correlation dimension, D₂(·), there is an interpretation of information flow when contrasting X × X′ versus X × Y × X′, in a spirit reminiscent of what is done with transfer entropy. However, these details are geometrically more to the point.

Definition 1 (Conditional Correlation Dimensional Geometric Information Flow). Let M be the manifold of the data set (X₁, X₂, …, Xₙ, X′), and let Ω₁ be the data set (X₁, X₂, …, Xₙ). Suppose that M, Ω₁ are bounded Borel sets. The quantity

GeoC(X′|X₁, X₂, …, Xₙ) := D₂(M) − D₂(Ω₁) (37)

is defined as the "Conditional Correlation Dimensional Geometric Information Flow." Here, D₂(·) is the usual correlation dimension of the given set [29–31].
Definition 2 (Correlation Dimensional Geometric Information Flow). Let x := x_n, y := y_n ∈ R be two time series. The correlation dimensional geometric information flow from y to x, as measured by the correlation dimension and denoted by GeoC_{y→x}, is given by

GeoC_{y→x} := GeoC(X′|X) − GeoC(X′|X, Y). (38)

A key observation is that if X′ = f(X), then GeoC(X′|X) = 0, and therefore GeoC_{y→x} = 0. Also, notice that GeoC_{y→x} ≤ D₂(X), where X = {x_n | n = 1, 2, …}; for example, if x_n ∈ R then GeoC_{y→x} ≤ 1. Since we assume that the influence of any time series z_n ≠ x_n, y_n on x_n is relatively small, we can conclude that GeoC_{y→x} ≥ 0, and if X′ is a function of (X, Y) alone, then GeoC(X′|X, Y) = 0. Additionally, the dimension GeoC(X′|X) in the (X, X′) data scores how much additional information (other than X) is needed to describe X′. Similarly, the dimension GeoC(X′|X, Y) in the (X, Y, X′) data describes how much additional information (other than X, Y) is needed to define X′. As the number of data points N → ∞, the value GeoC_{y→x} is nonnegative (and at most equal to the dimension of the X data). Thus, theoretically, GeoC identifies a causality in the geometric sense we have been describing.
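Definitions 1 and 2 can be sketched directly with a Grassberger-Procaccia estimate of D₂. In this illustration the data are i.i.d. samples rather than a time series, and the maps x′ = 0.5x + 0.5y² and x′ = 0.9x, the sample size, and the radius range are all arbitrary test choices of ours, meant only to show GeoC_{y→x} large when y matters and near zero when it does not.

```python
import numpy as np

def corr_dim(data, rs):
    """Grassberger-Procaccia correlation dimension D2: slope of
    log C(r) vs log r, where C(r) = fraction of point pairs within r."""
    d = np.sqrt(((data[:, None, :] - data[None, :, :]) ** 2).sum(-1))
    pd = d[np.triu_indices(len(data), k=1)]
    c = np.array([np.mean(pd < r) for r in rs])
    return np.polyfit(np.log(rs), np.log(c), 1)[0]

def geoc(x, y, xp, rs):
    """GeoC_{y->x} = [D2(x,x') - D2(x)] - [D2(x,y,x') - D2(x,y)]  (Defs. 1-2)."""
    col = lambda *a: np.column_stack(a)
    g_x = corr_dim(col(x, xp), rs) - corr_dim(col(x), rs)
    g_xy = corr_dim(col(x, y, xp), rs) - corr_dim(col(x, y), rs)
    return g_x - g_xy

rng = np.random.default_rng(3)
n = 1200
x, y = rng.uniform(size=n), rng.uniform(size=n)
rs = np.logspace(-1.4, -0.6, 8)            # radii for the scaling-law fit

geoc_open = geoc(x, y, 0.5 * x + 0.5 * y ** 2, rs)   # x' = f(x, y): flow from y
geoc_closed = geoc(x, y, 0.9 * x, rs)                # x' = f(x): no flow from y
```

In the first case the (x, x′) points fill a two-dimensional region while (x, y, x′) remains a surface, so the estimate is near 1; in the second, (x, x′) is a curve and the estimate is near 0.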

Results and Discussion
Now, we present specific examples to contrast the transfer entropy with our proposed geometric measure to further highlight the role of geometry in such questions.

Transfer Entropy
In this section we will focus on analytical results and numerical estimators for the conditional entropy and the transfer entropy of specific examples. As we discussed in the previous sections, starting with Section 2.2, computing the transfer entropy for X′ = f(X, Y) has technical difficulties due to the singularity of the quantity h(X′|X, Y). First, we will consider the calculation of h(X′|X) for X′ = f(X, Y), and then we will discuss the calculation for noisy data. In the following examples we assume that X, Y are independent, uniformly distributed random variables. A summary of the calculations for a few examples is listed in Table 1.
We discuss the transfer entropy with noisy data because making h(X′|X, Y) well defined requires absolute continuity of the probability density function p(x, y, x′). Consider, for example, the problem X′ = g(X) + bY + C, where X, Y are uniformly distributed independent random variables over the interval [1,2] (the same analysis extends to any finite interval), b is a constant, and g is a function of the random variable X. We also take C to be a random variable distributed uniformly on [−ε/2, ε/2]. Note that it follows that h(X′|X, Y) = ln ε. To calculate h(X′|X), we need the conditional probability p(X′|x), and we observe that X′|x = U + C, where U = g(x) + bY. The resulting conditional density is piecewise linear (trapezoidal), with ramps of width ε at the edges of its support, and it concentrates as ε → 0; therefore, convergence of the numerical estimates is slow when ε > 0 is small (see Fig. 6). Note that the numerical estimates of the conditional entropy by the KSG method [32] converge (as N → ∞) to the analytic solutions (see Table 1).
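As a numerical check on the closed form h(X′|X) = ln b + ε/(2b) for this example (Eq. (41)), one can integrate the trapezoidal conditional density directly. The values b, ε, and the choice g(x) = 0 below are arbitrary test settings; the density is flat at 1/b with linear ramps of width ε at either edge of the support.

```python
import numpy as np

b, eps, g = 1.0, 0.1, 0.0   # test choices; U = g(x) + bY, C ~ U(-eps/2, eps/2)
x = np.linspace(g - eps / 2, g + b + eps / 2, 200_001)
dx = x[1] - x[0]

# trapezoidal density of X'|x = U + C: ramps of width eps around a flat top 1/b
p = np.where(x < g + eps / 2, (x - g + eps / 2) / (b * eps),
    np.where(x <= g + b - eps / 2, 1.0 / b,
             (g + b + eps / 2 - x) / (b * eps)))

mass = p.sum() * dx                      # should be ~1
q = np.where(p > 0, p * np.log(p), 0.0)
h_cond = -q.sum() * dx                   # should be ~ ln(b) + eps/(2b)
te = h_cond - np.log(eps)                # T_{y->x}(eps; b) = h(X'|X) - ln(eps)
```

The last line makes the divergence visible: as ε → 0 with b fixed, −ln ε grows without bound, consistent with the boundedness issue discussed in the text.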

Geometric Information Flow
Now we focus on quantifying the geometric information flow by comparing dimensionalities of the outcome spaces. We will contrast this with the transfer entropy computations for a few examples of the form X′ = g(X) + bY + C.
To illustrate the idea of geometric information flow, let us first consider a simple example, a special case of a general quadratic relationship, x′ = ax + by² + c, to discuss how x′ may depend on (x, y) ∈ Ω₁. Again, we do not worry here whether y′ may or may not depend on x and/or y when deciding dependencies for x′. We will discuss two cases, depending on how the (x, y) ∈ Ω₁ data is distributed. For the first case, assume (x, y) is uniformly distributed in the square [−1.5, 1.5]².
The second, and dynamically more realistic, case will assume that (x, y) lies on the invariant set (the strange attractor) of the Hénon map. The geometric information flow is shown for both cases in Fig. 10.
We numerically estimate the transfer entropy for both cases, which gives T_{y→x} = 2.4116 and 0.7942, respectively (but recall that in the first case the transfer entropy might not be finite analytically, and the numerical estimation is slow); see Table 2.
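The Hénon case can be sketched along the same dimensional lines. Here we use the standard Hénon parameters a = 1.4, b = 0.3 and a short orbit, rather than the specific quadratic coupling of the example above, so the dimension estimates are rough and illustrative only: for the Hénon map, x_{n+1} = 1 − ax_n² + y_n depends on both coordinates, so the (x, x′) data do not collapse onto a one-dimensional curve.

```python
import numpy as np

def corr_dim(data, rs):
    """Grassberger-Procaccia slope of log C(r) versus log r."""
    d = np.sqrt(((data[:, None, :] - data[None, :, :]) ** 2).sum(-1))
    pd = d[np.triu_indices(len(data), k=1)]
    c = np.array([np.mean(pd < r) for r in rs])
    return np.polyfit(np.log(rs), np.log(c), 1)[0]

# Henon map with standard parameters; burn-in discards the transient
a, b = 1.4, 0.3
n, burn = 2000, 200
x = np.empty(n + burn + 1); y = np.empty(n + burn + 1)
x[0], y[0] = 0.1, 0.1
for t in range(n + burn):
    x[t + 1] = 1.0 - a * x[t] ** 2 + y[t]
    y[t + 1] = b * x[t]
xs, ys, xp = x[burn:-1], y[burn:-1], x[burn + 1:]

rs = np.logspace(-1.5, -0.5, 8)
d2_x = corr_dim(xs[:, None], rs)                       # scalar series: ~1
d2_xxp = corr_dim(np.column_stack([xs, xp]), rs)       # fractal, > 1: y matters
d2_xy = corr_dim(np.column_stack([xs, ys]), rs)        # attractor dimension
d2_xyxp = corr_dim(np.column_stack([xs, ys, xp]), rs)  # graph over attractor
geoc_hat = (d2_xxp - d2_x) - (d2_xyxp - d2_xy)         # rough GeoC_{y->x}
```

Since (x, y, x′) is a smooth graph over the attractor, d2_xyxp stays close to d2_xy, while d2_xxp exceeds d2_x; the excess is the geometric signature of the flow from y to x.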

Experimental Data
Now, moving beyond bench-marking with synthetic data, we will contrast the two measures of information flow in a real world experimental data set.Consider heart rate (x n ) vs breathing rate (y n ) data (Fig. 11) as published in [33,34], consisting of 5000 samples.Correlation dimension of the data X is D 2 (X) = 1.00, and D 2 (X, X 0 ) = 1.8319 > D 2 (X).Therefore, X 0 = X n+1 depends not only x but also on an extra variable (Thm.2).Also correlation dimension of the data (X, Y) and (X, Y, X 0 ) is computed We conclude that X 0 depends on extra variable(s) other that (x, y) (Thm.2) and the correlation dimension geometric information flow, GeoC y!x = 0.0427, is computed by Eqs. ( 38)-(37).Therefore, this suggests the conclusion that there is a causal inference from breathing rate to heart rate.Since breathing rate and heart rate share the same units, the quantity measured by geometric information flow can be described without normalizing.
Transfer entropy estimated by the KSG method [32] with parameter k = 30 gives Ty→x = 0.0485, interestingly relatively close to the GeoC value. In summary, the two measures of causality (GeoC, T) agree in being zero or positive together; here both are positive, so a causal inference follows.
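For concreteness, the arithmetic behind GeoCy→x can be sketched as follows. This is a minimal reading of Eqs. (37)-(38) as differences of conditional correlation dimensions; the (X, Y)-based dimensions in the test usage below are hypothetical placeholders, not the values from Table 3:

```python
def cond_dim(d2_joint, d2_base):
    """Conditional correlation dimension, D2(X'|Z) = D2(Z, X') - D2(Z)."""
    return d2_joint - d2_base

def geo_c(d2_x, d2_x_xp, d2_xy, d2_xy_xp):
    """GeoC_{y->x}: how much the conditional dimension of X' drops when Y
    joins the conditioning set (our reading of Eqs. (37)-(38))."""
    return cond_dim(d2_x_xp, d2_x) - cond_dim(d2_xy_xp, d2_xy)

# Reported for the heart-rate data: D2(X) = 1.00 and D2(X, X') = 1.8319,
# so D2(X'|X) = 0.8319 > 0, i.e., X' is not a function of X alone.
print(cond_dim(1.8319, 1.00))
```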

Conclusion
We have developed here a geometric interpretation of information flow as a causal inference, as usually measured by a positive transfer entropy Ty→x. Our interpretation relates the dimensionality of an underlying manifold, as projected into the outcome space, to the information flow it summarizes. Further, the analysis behind our interpretation involves the standard Pinsker's inequality, which bounds entropy in terms of total variation; through this method we can interpret the production of information flow in terms of the derivatives describing the relative orientation of the manifolds of inputs and outputs (under certain simple assumptions).
Further, we have pointed out that transfer entropy has analytic convergence issues when the future data X′ is exactly a function of the current input data (X, Y), versus the more general case where the (X, Y, X′) data fills a higher-dimensional set. Therefore, referring to how the geometry of the data can be used to identify causation in time-series data, we developed a new causality measure based on a fractal-dimension comparison of inputs and outputs. Specifically, the correlation dimension is a useful and efficient way to define what we call the correlation dimensional geometric information flow, GeoCy→x. GeoCy→x offers a strongly geometric, interpretable result as a global picture of the information flow. We demonstrated the natural benefits of GeoCy→x versus Ty→x in several synthetic examples where we can specifically control the geometric details, and then in a physiological example using heart and breathing data.

Figure 1. Summary of the paper and the relationship of causation and geometry.
for b > 0, for a.e. x, y; then Ty→x ↓ as b ↓.

The conditional density of X′ given X is the trapezoid

ρ(x′|x) = (x′ + ε/2 − g(x)) / (bε),   g(x) − ε/2 ≤ x′ ≤ g(x) + ε/2;
ρ(x′|x) = 1/b,                        g(x) + ε/2 ≤ x′ ≤ b + g(x) − ε/2;
ρ(x′|x) = (−x′ + ε/2 + g(x) + b) / (bε),   b + g(x) − ε/2 ≤ x′ ≤ b + g(x) + ε/2.

From the definition of transfer entropy we can show that

h(X′|X) = ln b + ε/(2b),   (41)

and hence the transfer entropy of this data is given by Ty→x(ε; b) = ln b + ε/(2b) − ln ε. When b = 0, the transfer entropy is Ty→x = ln ε − ln ε = 0. Also notice that Ty→x(ε; b) → ∞ as ε → 0 for fixed b > 0.
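The closed form is easy to evaluate directly. The following sketch assumes the expression Ty→x(ε; b) = ln b + ε/(2b) − ln ε from above, with the b = 0 case set to zero as in the text, and makes the divergence as ε → 0 explicit:

```python
import math

def transfer_entropy(eps, b):
    """Analytic transfer entropy for X' = g(X) + b*Y plus eps-wide uniform
    noise: T = h(X'|X) - h(X'|X,Y) = ln b + eps/(2b) - ln eps (0 < eps < b)."""
    if b == 0:
        return 0.0  # no information flows from Y when b = 0
    return math.log(b) + eps / (2.0 * b) - math.log(eps)

print(transfer_entropy(0.01, 1.0))   # about 4.61
print(transfer_entropy(1e-6, 1.0))   # about 13.82; grows without bound as eps -> 0
```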

Figure 6. Numerical and analytical results for the transfer entropy Ty→x(ε; b). (a) Transfer entropy vs ε for fixed b = 1. (b) and (c) show the behavior of the transfer entropy vs b for the fixed values ε = 0.01 and ε = 10⁻⁶, respectively. Notice that convergence of the numerical estimate is slow when ε is small.

x′ = ax + by + c. If b = 0, we have x′ = f(x), and when b ≠ 0 we have the x′ = f(x, y) case. Therefore, the dimensionality of the data set (x′, x) changes with the parameter b. When the number of data points N → ∞ and b ≠ 0, then GeoCy→x → 1. Generally this measure of causality depends on the value of b, but also on the density of initial conditions. In this example we contrast theoretical solutions with numerically estimated solutions, Fig. 7. Theoretically we expect, as N → ∞,

Ty→x = 0 if b = 0, and Ty→x = ∞ if b ≠ 0.

Also, the transfer entropy for noisy data can be calculated by Eq. (42).
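The dimensionality jump driving GeoCy→x here can be checked numerically with a crude Grassberger-Procaccia estimate of D₂. In this sketch the radii, sample size, and parameter choices a = 0.5, c = 0 are arbitrary illustrative assumptions, and a two-radius slope is a rough stand-in for a proper scaling fit; it shows (x, x′) filling a 1-D set when b = 0 and a 2-D set when b ≠ 0:

```python
import numpy as np

def corr_dim(data, r1, r2):
    """Crude Grassberger-Procaccia correlation dimension: slope of the
    correlation sum C(r) ~ r^D2 between the two radii r1 < r2."""
    diff = data[:, None, :] - data[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    d = dist[np.triu_indices(len(data), k=1)]   # unique pairwise distances
    c1, c2 = (d < r1).mean(), (d < r2).mean()
    return (np.log(c2) - np.log(c1)) / (np.log(r2) - np.log(r1))

rng = np.random.default_rng(0)
x, y = rng.uniform(0, 1, 1500), rng.uniform(0, 1, 1500)

# x' = a*x + b*y + c with a = 0.5, c = 0
d2_line = corr_dim(np.column_stack([x, 0.5 * x]), 0.05, 0.2)             # b = 0
d2_plane = corr_dim(np.column_stack([x, 0.5 * x + 1.0 * y]), 0.05, 0.2)  # b = 1
print(round(d2_line, 2), round(d2_plane, 2))  # roughly 1 and roughly 2
```

Edge effects bias the estimates slightly below the true integer dimensions, which is why a careful fit over many radii is preferred in practice.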

Figure 7. Geometric information flow vs transfer entropy (numerical results): (a) GeoCy→x, (b) Ty→x.

Figure 8. Manifold of the data (x′, x) with x′ = by and y uniformly distributed in the interval [0, 1]. Notice that when (a) b = 0 we have a 1-D manifold, and when (b) b ≠ 0 we have a 2-D manifold, in the (x′, x) plane.

4.3. Example: X′ = aX + bY with a ≠ 0

The density of initial points in the domain plays an important role in how the specific information flow values are computed, depending on the measure used. To illustrate this point, consider the unit square, [0, 1]², uniformly sampled and mapped by

X′ = aX + bY, with a ≠ 0. (43)

This fits our basic premise that the (x, y, x′) data embeds in a 2-D manifold, by the ansatz of Eqs. (1), (43), assuming for this example that each of x, y, and x′ is a scalar. As the number of data points grows, N → ∞, we can see that

GeoCy→x = 0 if b = 0, and GeoCy→x = 1 if b ≠ 0,

because the (X, X′) data lies on a 2-D manifold iff b ≠ 0 (a numerical estimate is shown in Fig. 9(b)). On the other hand, the conditional entropy h(X′|X, Y) is not defined, becoming unbounded when defined by noisy data; thus the transfer entropy shares this same property. In other words, boundedness of the transfer entropy depends strongly on the X′|X, Y conditional data structure, while our geometric information flow measure depends strongly on the X′|X conditional data structure. Figure 9 demonstrates this observation with estimated transfer entropy and analytically computed values for noisy data. The slow convergence can be observed, Eq. (42), Fig. 6.

Figure 9. (a) The geometric information flow GeoCy→x and (b) the transfer entropy Ty→x for x′ = x + by data, as they change with the parameter b. Notice that the transfer entropy has behavior similar to the geometric information flow of the data.

Figure 10. Consider the Hénon map, Eq. (44), within the domain [−1.5, 1.5]² and on the invariant set of the Hénon map. (a) The uniform distribution case (green) as well as the natural invariant measure on the attractor (blue) are shown for the (x, y, x′) data of both cases. (b) When (x, y) ∈ [−1.5, 1.5]², notice that GeoCy→x = 0.9, and (c) if (x, y) is in the invariant set of the Hénon map, then GeoCy→x = 0.2712.

Figure 11. Results for the heart rate (x_n) vs breathing rate (y_n) data. The top row is the scatter plot of the data, and the second row represents the dimension of the data.

Table 2. Hénon map results. Contrasting geometric information flow versus transfer entropy in two different cases: first relative to a uniform distribution of initial conditions (reset each time), and second relative to the natural invariant measure (more realistic).

Table 3. Heart rate vs breathing rate data. Contrasting geometric information flow versus transfer entropy, from breathing rate to heart rate.