A Recipe for the Estimation of Information Flow in a Dynamical System

: Information-theoretic quantities, such as entropy and mutual information (MI), can be used to quantify the amount of information needed to describe a dataset or the information shared between two datasets. In the case of a dynamical system, the behavior of the relevant variables can be tightly coupled, such that information about one variable at a given instance in time may provide information about other variables at later instances in time. This is often viewed as a flow of information, and tracking such a flow can reveal relationships among the system variables. Since the MI is a symmetric quantity; an asymmetric quantity, called Transfer Entropy (TE), has been proposed to estimate the directionality of the coupling. However, accurate estimation of entropy-based measures is notoriously difficult. Every method has its own free tuning parameter(s) and there is no consensus on an optimal way of estimating the TE from a dataset. We propose a new methodology to estimate TE and apply a set of methods together as an accuracy cross-check to provide a reliable mathematical tool for any given data set. We demonstrate both the variability in TE estimation across techniques as well as the benefits of the proposed methodology to reliably estimate the directionality of coupling among variables.


Introduction
Complex dynamical systems consisting of nonlinearly coupled subsystems can be found in many application areas ranging from biomedicine [1] to engineering [2,3].Teasing apart the subsystems and identifying and characterizing their interactions from observations of the system's behavior can be extremely difficult depending on the magnitude and nature of the coupling and the number of variables involved.In fact, the identification of a subsystem can be an ill-posed problem since the definition of strong or weak coupling is necessarily subjective.
The direction of the coupling between two variables is often thought of in terms of one variable driving another so that the values of one variable at a given time influence the future values of the other.This is a simplistic view based in part on our predilection for linear or "intuitively understandable" systems.In nonlinear systems, there may be mutual coupling across a range of temporal and spatial scales so that it is impossible to describe one variable as driving another without specifying the temporal and spatial scale to be considered.
Even in situations where one can unambiguously describe one variable as driving another, inferring the actual nature of the coupling between two variables from data can still be misleading since co-varying variables could reflect either a situation involving coupling where one variable drives another with a time delay or a situation where both variables are driven by an unknown third variable each with different time delays.While co-relation (we use the term co-relation to describe the situation where there is a relationship between the dynamics of the two variables; this is to be distinguished from correlation, which technically refers only to a second-order statistical relationship) cannot imply causality [4], one cannot have causality without co-relation.Thus co-relation can serve as a useful index for a potential causal interaction.
However, if past values of one variable enable one to predict future values of another variable, then this can be extremely useful despite the fact that the relationship may not be strictly causal.The majority of tests to identify and quantify co-relation depend on statistical tests that quantify the amount of information that one variable provides about another.The most common of these are based on linear techniques, which rely exclusively on second-order statistics, such as correlation analysis and Principal Component Analysis (PCA), which is called Empirical Orthogonal Functions (EOFs) in geophysical studies [5].However, these techniques are insensitive to higher-order nonlinear interactions, which can dominate the behavior of a complex coupled dynamical system.In addition, such linear methods are generally applied by normalizing the data, which implies that they do not depend on scaling effects.
Information-theoretic techniques rely on directly estimating the amount of information contained in a dataset and, as such, rely not only on second-order statistics, but also on statistics of higher orders [6].Perhaps most familiar is the Mutual Information (MI), which quantifies the amount of information that one variable provides about another variable.Thus MI can quantify the degree to which two variables co-relate.However, since it is a symmetric measure MI cannot distinguish potential directionality, or causality, of the coupling between variables [7].
The problem of finding a measure that is sensitive to the directionality of the flow of information has been widely explored.Granger Causality [8] was introduced to quantify directional coupling between variables.However, it is based on second-order statistics, and as such, it focuses on correlation, which constrains its relevance to linear systems.For this reason, generalizations to quantify nonlinear interactions between bi-variate time-series have been studied [9].Schreiber proposed an information-theoretic measure called Transfer Entropy (TE) [7], which can be used to detect the directionality of the flow of information.Transfer Entropy, along with other information-based approaches, is included in the survey paper by Hlavackova-Schindler et al. [10] and differentiation between the information transfer and causal effects are discussed by Lizier and Propenko [11].Kleeman presented both TE and time-lagged MI as applied to ensemble weather prediction [12].In [13], Liang explored the information flow in dynamical systems that can be modeled by equations obtained by the underlying physical concepts.In such cases, the information flow has been analyzed by the evolution of the joint probability distributions using the Liouville equations and by the Fokker-Planck equations, in the cases of the deterministic and stochastic systems, respectively [13].
TE has been applied in many areas of science and engineering, such as neuroscience [1,14], structural engineering [2,3], complex dynamical systems [15,16] and environmental engineering [17,18].In each of these cases, different approaches were used to estimate TE from the respective datasets.TE essentially quantifies the degree to which past information from one variable provides information about future values of the other variable based solely on the data without assuming any model regarding the dynamical relation of the variables or the subsystems.In this sense TE is a non-parametric method.The dependency of the current sample of a time series on its past values is formulated by k th and l th order Markov processes in Schreiber [7] to emphasize the fact that the current sample depends only on its k past values and the other process's past l values.There also exist parametric approaches where the spatio-temporal evolution of the dynamical system is explicitly modeled [15,16].However, in many applications it is precisely this model that we would like to infer from the data.For this reason, we will focus on non-parametric methods.
Kaiser and Schreiber [19], Knuth et al. [20], and Ruddell and Kumar [17,18] have expressed the TE as a sum of Shannon entropies [21].In [17,18], individual entropy terms were estimated from the data using histograms with bin numbers chosen using a graphical method.However, as we discuss in Appendix A1, TE estimates are sensitive to the number of bins used to form the histogram.Unfortunately, it is not clear how to optimally select the number of bins in order to optimize the TE estimate.
In the literature, various techniques have been proposed to efficiently estimate information-theoretic quantities, such as the entropy and MI.Knuth [22] proposed a Bayesian approach, implemented in Matlab and Python and known as the Knuth method, to estimate the probability distributions using a piecewise constant model incorporating the optimal bin-width estimated from data.Wolpert and Wolf [23] provided a successful Bayesian approach to estimate the mean and the variance of entropy from data.Nemenman et al. [24] utilized a mixture of Dirichlet distributions-based prior in their Bayesian Nemenman, Shafee, and Bialek (NSB) entropy estimator.In another study, Kaiser and Schreiber [19] give different expressions for TE as a summation and subtraction of various (conditional/marginal/joint) Shannon entropies and MI terms.However, it has been pointed out that summation and subtraction of information-theoretic quantities can result in large biases [25,26].Prichard and Theiler [25] discuss the "bias correction" formula proposed by Grassberger [27] and conclude that it is better to estimate MI utilizing a "correlation integral" method by performing a kernel density estimation (KDE) of the underlying probability density functions (pdfs).KDE tends to produce a smoother pdf estimate from data points as compared to its histogram counterpart.In this method, a preselected distribution of values around each data point is averaged to obtain an overall, smoother pdf in the data range.This preselected distribution of values within a certain range, which is known as a "kernel", can be thought of as a window with a bandwidth [28].Commonly-used examples of kernels include "Epanechnikov", "Rectangular", and "Gaussian" kernels.Prichard and Theiler showed that pdf models obtained by KDE can be utilized to estimate entropy [25] and other information theoretic quantities, such as the generalized entropy and the Time Lagged Mutual Information (TLMI), using the correlation integral and its approximation through the correlation sums [7].In [25], Prichard and Theiler demonstrated that the utilization of correlation integrals corresponds to using a kernel that is far from optimal, also known as the "naïve estimator" described in [28].It is also shown that the relationship between the correlation integral and information theoretic statistics allows defining "local" versions of many information theoretical quantities.Based on these concepts, Prichard and Theiler demonstrated the interactions among the components of a three-dimensional chaotic Lorenz model with a fractal nature [25].The predictability of the dynamical systems, including the same Lorenz model have been explored by Kleeman in [29,30], where a practical approach for estimating entropy was developed for dynamical systems with non-integral information dimension.
In the estimation of information-theoretical quantities, the KDE approach requires estimation of an appropriate radius (aka bandwidth or rectangle kernel width) for the estimation of the correlation integral.In general cases, this can be accomplished by the Garassberger-Procaccia algorithm, as in [31][32][33].In order to compute the TE from data using a KDE of the pdf, Sabesan and colleagues proposed a methodology to explore an appropriate region of radius values to be utilized in the estimation of the correlation sum [14].
The TE can be expressed as the difference between two relevant MI terms [19], which can be computed by several efficient MI estimation techniques using variable bin-width histograms.Fraser and Swinney [34] and Darbellay and Vajda [35] proposed adaptive partitioning of the observation space to estimate histograms with variable bin-widths thereby increasing the accuracy of MI estimation.However, problems can arise due to the subtraction of the two MI terms as described in [19] and explained in [25,26].
Another adaptive and more data efficient method was developed by Kraskov et al. [36] where MI estimations are based on k-nearest neighbor distances.This technique utilizes the estimation of smooth probability densities from the distances between each data sample point and its k-th nearest neighbor and as well as bias correction to estimate MI.It has been demonstrated [36] that no fine tuning of specific parameters is necessary unlike the case of the adaptive partitioning method of Darbellay and Vajda [35] and the efficiency of the method has been shown for Gaussian and three other non-Gaussian distributed data sets.Herrero et al. extended this technique to TE in [37] and this has been utilized in many applications where TE is estimated [38][39][40] due to its advantages.
We note that a majority of the proposed approaches to estimate TE rely on its specific parameter(s) that have to be selected prior to applying the procedure.However, there are no clear prescriptions available for picking these ad hoc parameter values, which may differ according to the specific application.Our main contribution is to synthesize three established techniques to be used together to perform TE estimation.With this composite approach, if one of the techniques does not agree with the others in terms of the direction of information flow between the variables, we can conclude that method-specific parameter values have been poorly chosen.Here, we propose using three methods to validate the conclusions drawn about the directions of the information flow between the variables, as we generally do not possess a priori facts about any physical phenomenon we explore.
In this paper, we propose an approach that employs efficient use of histogram based methods, adaptive partitioning technique of Darbellay and Vajda, and KDE based TE estimations, where fine tuning of parameters is required.We propose a Bayesian approach to estimate the width of the bins in a fixed bin-width histogram method to estimate the probability distributions from data.
In the rest of the paper, we focus on the demonstration of synthesizing three established techniques to be used together to perform TE estimation.As the TE estimation based on the k-th nearest neighbor approach of Kraskov et al. [36] is demonstrated to be robust to parameter settings, it does not require fine tunings to select parameter values.Thus it has been left for future exploration, as our main goal is to develop a strategy for the selection of parameters in the case of non-robust methods.
The paper is organized as follows.In Section 2, background material is presented on the three TE methods utilized.In Section 3, the performance of each method is demonstrated by applying it to both a linearly coupled autoregressive (AR) model and the Lorenz system equations [41] in both the chaotic and sub-chaotic regimes.The latter represents a simplified model of atmospheric circulation in a convection cell that exhibits attributes of non-linear coupling, including sensitive dependence on model parameter values that can lead to either periodic or chaotic variations.Finally conclusions are drawn in Section 4.

Estimation of Information-Theoretic Quantities from Data
The Shannon entropy: can be used to quantify the amount of information needed to describe a dataset [21].It can be thought of as the average uncertainty for finding the system at a particular state "x" out of a possible set of states "X", where p(x) denotes the probability of that state.Another fundamental information-theoretic quantity is the mutual information (MI), which is used to quantify the information shared between two datasets.Given two datasets denoted by X and Y, the MI can be written as: This is a special case of a measure called the Kullback-Leibler divergence, which in a more general form is given by: which is a non-symmetric measure of the difference between two different probability distributions p(x) and q(x).We can see that in Equation ( 2), the MI represents the divergence between the joint distribution p(x,y) of variables x and y and the product p(x)p(y) of the two marginal distributions.The MI is a symmetric quantity and can be rewritten as a sum and difference of Shannon entropies by: where is the joint Shannon entropy [21,42].
To define the transfer entropy (TE), we assume that there are two Markov processes such that the future value of each process either depends only on its past samples or on both its past samples and the past samples of the other process.Thus, the TE is defined as the ratio of the conditional distribution of one variable depending on the past samples of both processes versus the conditional distribution of that variable depending only on its own past values [7].Thus the asymmetry of TE results in a differentiation of the two directions of information flow.This is demonstrated by the difference between Equation (5a), which defines the transfer entropy in the direction from X to Y and Equation (5b), which defines the transfer entropy in the direction from Y to X: where } are past states, and X and Y are k th and l th order Markov processes, respectively, such that X depends on the k previous values and Y depends on the l previous values.In the literature, k and l are also known as the embedding dimensions [33].As an example, Equation (5b) describes the degree to which information about Y allows one to predict future values of X.Thus, the TE can be used as a measure to quantify the amount of information flow from the subsystem Y to the subsystem X. TE, as a conditional mutual information, can detect synergies between Y and ( ) in addition to removing redundancies [43,44].In the following sections, we briefly introduce three methods used in the literature to estimate the quantities in Equation ( 5) from data.

Fixed Bin-Width Histogram Approaches
To estimate the quantities in Equation ( 5), conditional distributions are generally expressed in terms of their joint counterparts as in: In this sense, the TE estimation problem can be cast as a problem of density estimation from data.One of the most straightforward approaches to density estimation is based on histogram models [28,45].However, histograms come with a free parameter-the number of bins.Unfortunately, the estimation of entropy-based quantities varies dramatically as the number of bins is varied.Numerous methods to identify the number of bins that optimally describes the density of a data set have been published [45,46].However, most of these techniques assume that the underlying density is Gaussian.In this paper, we rely on a generalization of a method introduced by Knuth [20,22], which we refer to as the Generalized Knuth H X,Y ( ) method.In this method, each of N observed data points is placed into one of M fixed-width bins, where the number of bins is selected utilizing a Bayesian paradigm.If the volume and the bin probabilities of each multivariate bin are denoted by V and for the i th bin, respectively, then the likelihood of the data is given by the following multinomial distribution: where = [ , , … , ] denote the N observed data points, , , … , denote the number of data samples in each bin and = [ , , … , ] denote the bin probabilities.Given M bins and the normalization condition that the integral of the probability density equals unity, we are left with M-1 bin probabilities, denoted by , , … , .The normalization condition requires that = (1 − ∑ ) [22].The non-informative prior [20] is chosen to represent the bin probabilities: which is a Dirichlet prior conjugate to the multinomial likelihood function and Γ denotes the Gamma function [56].The non-informative uniform prior models a priori belief regarding the number of bins where C denotes the maximum number of bins considered: The posterior distribution of the bin probabilities and the bin numbers are given by Bayes theorem, which is written here as a proportionality where the Bayesian evidence, ( ), is the implicit proportionality constant: Since the goal is to obtain the optimal number of constant-width bins one can marginalize over each of the bin probabilities resulting in the posterior of the bin number, which can be logarithmically written as follows [22]: where K is a constant.To find the optimal number of bins, the mode of the posterior distribution in Equation ( 11) is estimated as follows: In Appendix II, we present the performance of entropy estimation based on the selection of the Dirichlet exponent, chosen as 0.5 in Equation (8).Below, we generalize this exponent of the Dirichlet prior to relax the constraint as follows: In the literature, the prior in Equation ( 13) has been utilized to estimate the discrete entropy given by Equation (1), where the number of bins are assumed to be known, whereas here, we try to approximate a continuous pdf, thus the entropy, using a piecewise-constant model, where the number of bins is not known.In these publications, the main concern is to estimate Equation (1) as efficiently as possible for a small number of data samples.Different estimators have been named.For example, the assignment of results in the Krichevsky-Trofimov estimator and the assignment of = 1 results in the Schurman-Grassberger estimator [23,24].Here, we aim to approximate the continuous-valued differential entropy of a variable shown by using finite-precision data: Using the same prior for the number of bins in Equation ( 9) and the procedures given by Equation (10) through Equation ( 12), the marginal posterior distribution of the bin numbers under the general Dirichlet prior Equation ( 13) is given by: Again, the point estimate for the optimal bin number can be found by identifying the mode of the above equation, that is, = max {log ( | )} where ( | ) is obtained from Equation (15).
After the estimation of the optimal number of bins, the most important step is the accurate calculation of TE from the data.In [19], the TE is expressed as a summation of Shannon entropy terms: = where  ) , where x denotes a particular value of the variable X and boldface is utilized to represent vectors.If k = l = 1 is selected, Equation ( 16) takes the following simplified form [20]: In the best scenario, the above TE estimation requires the three-dimensional joint Shannon entropy, whereas its general expression in Equation ( 16) needs a k+l+1-dimensional entropy estimation.
According to our tests, if we use the prior in Equation ( 13), when = 10 , posterior pdf estimates are biased significantly (see Appendix III), especially in high-dimensional problems.Thus, we aim to overcome this problem by using the generalized prior in Equation ( 13) for the Dirichlet prior with β values around 0.1.Using the generalized prior Equation (13), after selecting the number of bins by Equation (15), the mean value of the posterior bin height probability can be estimated by [22]: As the prior is Dirichlet and the likelihood function is multinomial-distributed, the posterior distribution of bin heights is Dirichlet-distributed with the mean given in Equation ( 18) above [22,47].

⋅
This allows us to sample from the Dirichlet posterior distribution of the bin heights to estimate the joint and marginal pdf's in the TE equations and then estimate their Shannon entropies and their uncertainties, too.The schematic in Figure 1 illustrates this procedure for estimating the entropies and their associated uncertainties.
Figure 1.This schematic illustrates the procedure for estimating entropy as well as the uncertainty from data.First the number of bins is selected using the mode of the logarithm of the Dirichlet posterior in Equation (15).The Dirichlet posterior is then sampled resulting in multiple estimates of the pdf.The entropy of each pdf is estimated and the mean and standard deviation computed and reported.
As previously described, this method is known to produce biases, especially as higher dimensions are considered.There are a couple of reasons for this.As the pdf is modeled by a uniform distribution within a single bin, this corresponds to the maximum entropy for that bin.Additionally, Equation (18) tells us that, even if there is no data sample in a specific bin, an artificial amount β is added to the average bin probability.On the other hand, this addition mitigates the entropy underestimation encountered in the case of many empty bins, which is prevalent in higher dimensions.Moreover, the TE is estimated by the addition and subtraction of the marginal and joint Shannon entropies, as shown in Equation ( 16).Prichard and Theiler describe the artifacts originating from this summation procedure and advise using KDE methods instead [25].However, before considering the KDE method, we discuss an alternative histogram method that has been proposed to overcome some of the drawbacks of the fixed-bin-width histogram approaches.In addition to the conjugate pairs of multinomial likelihood and Dirichlet prior model, the research topic of exploring other models has always been interesting.In addition to this conjugate pair, optimal binning in the case of other models, such as that of [24] including a mixture of Dirichlets provides a challenging research for optimal binning of data with the goal of efficient pdf estimation from the data.

Adaptive Bin-Width Histogram Approaches
The fixed bin-width histogram approaches are not very effective for estimating information-theoretic quantities from data due to the inaccurate filling of the bins with zero sampling frequency.Instead of generating a model based on bins with equal width, one can design a model consisting of bins with varying widths, determined according to a statistical criterion.Fraser and Swinney [34] and Darbellay and Vajda [35] proposed the adaptive partitioning of the observation space into cells using the latter approach and estimated the mutual information (MI) directly.Here, we will focus on the method proposed by Darbellay and Vajda.This approach relies on iteratively partitioning the cells on the observation space, based on a chi-square statistical test to ensure conditional independence of the proposed partitioned cells from the rest of the cells.We explain the details of this method schematically on Figure 2. Here, observation space of (X, Y) is shown by the largest rectangle.The partitioning of the observation space is done as follows: 1. Initially, we start with the largest rectangle containing all data samples.2. Any cell containing less than two observations (data pairs) is not partitioned.The cell, which is partitioned into smaller blocks, is known as the parent cell; whereas each smaller block after partitioning is named as a child cell.3. Every cell containing at least two observations is partitioned by dividing each one of its edges into two halves.It means four new cells are generated (according to the independence test, which will be described below).4. In order to test whether we need to partition the upper cell (parent cell) into more cells (child cells), we rely on the Chi-Square test of independence, where the null hypothesis is phrased as follows: H0: Sample numbers N1, N2, N3, N4 in four child cells are similar (in other words, the sample distribution in the parent cell was uniform) The Chi-Square (χ 2 ) test statistic for a 5% significance level with 3 degrees of freedom is given as follows: If we happen to find that > 7.81, we decide that the numbers of samples in each child cell are not similar and therefore we continue partitioning.Otherwise, we conclude that the numbers are similar and partitioning is stopped at this level.The data samples in this cell are used in the MI estimation.In this method, the level of statistical significance can be chosen according to the design, thus raising as a parameter to be tuned according to the application.After the partitioning is completed, the MI is estimated as shown below: where N denotes the total number of data samples with showing the subset of these samples that fall into the i th cell, , after the partitioning process is completed.Above, , and , represent the numbers of observations having the same x and y coordinates as observations in the cell , respectively.The partitioning process is illustrated below using a similar discussion to that in [35].The observation space is first divided into four child cells, namely C1, C2, C3, C4, to maintain equiprobable distributions in each cell.This forms the first set of branches shown in Figure 2b.Then, according to the independence test, C1 is divided into four cells whereas C2, C3, C4 are retained to be included in the MI estimation and they are not divided into more child cells, forming the second layer of the partitioning tree shown in Figure 2b.Finally, the third child cell of partition C3 is divided into four child cells, namely C131, C132, C133 and C134.In the last step, each cell is utilized in the MI estimation formula given by Equation (20).
It should be noted that the partitioning is performed symbolically here for the sake of a better explanation without showing the actual data samples on the observation space, as done in [35].
As a result, finer resolution is used to describe larger MI regions and lower resolution is used for smaller MI regions [35].Having estimated the MI from the data efficiently, the TE can be calculated using the expressions from [19] denotes the MI between X and the joint process denoted by ( ) , ( ) [19].
Because the MI is estimated more efficiently by this method, the overall TE estimation becomes less biased compared to the previous methods.However, the subtraction operation involved in Equation ( 21) can still produce a significant bias in the TE calculations.To overcome problems related to the addition and subtraction of information-theoretic quantities, KDE estimation methods have been utilized in the literature to estimate MI and redundancies [25], and TE in [49].

Kernel Density Estimation Methods
Kernel Density Estimation (KDE) is utilized to produce a smoothed pdf estimation using the data samples, which stands in contrast to the histogram model which has sharp edges resulting from a uniform distribution within each bin.In this method, a preselected distribution of values around each data sample is summed to obtain an overall, smoother pdf in the data range.This preselected distribution of values within a certain range is known as a "kernel".Some of the most commonly used kernels are "Epanechnikov", "Rectangular" and "Gaussian" [28].Each kernel can be thought of as a window with a bandwidth or radius.Prichard and Theiler [25] showed that KDE can also be utilized to estimate entropy by the computation of the generalized correlation integral [7], which is approximated by the correlation sum.Even if a rectangular kernel is used, the resulting entropy estimation is more accurate where ( , ) denotes the normal distribution with mean μ and standard deviation σ and the constant c denotes the coupling coefficient.First, we generate the log versus log ,  Here, l = 1 is selected [14].It is known that the optimal radius lies in the linear region of these curves, where its logarithm is a point on the horizontal axis [14].Above, we notice that the range of the radius values corresponding to the linear section of each curve varies significantly.As the k value increases, the linear region for each curve moves right, toward higher ε values [33].With the increasing embedding dimensions, the embedding vectors include data, which are sampled with a lower frequency, i.e., undersampling, leading to an increase in ε to achieve the same correlation sum obtained with a smaller radius.For example, a set of radius values within the range of −3 ≤ log ≤ 0 provides the linear section of the log curve for k = 1, whereas these values are not within the range of radius values used in forming the log C -log curve for an embedding dimension of k = 10.Thus, selection of an embedding dimension k first and then a radius value from the corresponding linear region on the curve can help us search for the radius in a more constrained and efficient way.
As seen in Figure 3, we end up with different radius ranges to select, based on the determination of the embedding dimension, k.Sabesan et al. [14] provide an approach to select the radius (ε) and k together.
According to [14,52], the embedding dimension, k, can be selected by considering the first local minimum of the Time-Lagged MI (TLMI) of the destination signal, followed by the determination of a radius value.The radius is selected such that it falls into the linear region of the curve for the corresponding k value, given in Figure 3.The k value, corresponding to the first local minima of MI(k), provides us with the time-lag k, where the statistical dependency between the current sample xi and its k past value xi-k is small.TLMI is defined by the following equation for the AR signal given in Equation ( 24): Below, we provide an estimate of the MI(k) of the AR signal, xi, for different time lags ∈ [1, … ,50].The adaptive partitioning algorithm of Darbellay and Vajda [35] was utilized to estimate the MI.As the MI is not bounded from above, we normalize its values between 0 and 1 as recommended in the literature, using the following formula [53]: In Figure 4, we show the normalized MI for different lags after taking an ensemble of 10 members of the AR process x and utilizing an averaging to estimate MI.
Above, the first local minimum value of MI(k) is obtained at k = 10.Thus, we turn to Figure 3 to select a radius value on the linear region of the curve with the embedding dimension k = 10.This region can be described by the following values: 0.8 ≤ log ≤ 1.4 .Thus, we can choose k = 10 and log = 0.85 along with l = 1.If k = l = 1 is selected, the corresponding linear region on Figure 3 changes and a selection of log = −1 can be chosen, instead.Once the radius and the embedding dimensions are determined, TE is estimated by Equation ( 23) using the correlation sums.These estimates are illustrated in Figure 5A,B for k = l = 1 and k = l0, l = 1; respectively.
In the next section, we will elaborate on the performance of the three methods used in TE estimation and emphasize the goal of our approach, which is to use all three methods together to fine tune their specific parameters.24) using the KDE method.(A) Both TE YX (blue-solid) and TE XY (red-dash dot) are estimated using the KDE method and illustrated along with the analytical solution (black-dotted) for k = l = 1.As there is no coupling from X to Y, analytically = 0; (B) TE YX (blue-solid) and TE XY (red-dash dot) are estimated using the KDE method for k = 10, l = 1.

Experiments
In the preceding section, we described three different methods for estimating the TE from data, namely: the Generalized Knuth method, the adaptive bin-width histogram and the KDE method.We emphasized that we can compute different TE values by these three different methods, as the TE estimations depend on various factors, such as the value of the selected fixed bin-width, the bias resulting due to the subtraction and addition of various Shannon entropies, embedding dimensions and the value of the chosen KDE radius value.Due to this uncertainty in the TE estimations, we propose to use these three main techniques together to compute the TE values and to consistently identify the direction of relative information flows between two variables.With this approach, if one of the techniques does not agree with the others in terms of the direction of information flows between the variables, we determine that we need to fine tune the relevant parameters until all three methods agree with each other in the estimation of the NetTE direction between each pair of variables.The NetTE between two variables X and Y is defined to be the difference between and , which is defined as the difference of the TE magnitudes with opposite directions between X and Y: The NetTE allows us to compare the relative values of information flow in both directions and conclude which flow is larger than the other, giving a sense of main interaction direction between the two variables X and Y.
In order to use three methods together, we demonstrate our procedure on a synthetic dataset generated by a bivariate autoregressive model given by Equation (24).In Section 2.3.1, we have already described the KDE method using this autoregressive model example and we have explored different radius values in the KDE method by utilizing the Grassberger-Procaccia approach in conjunction with different selections of values.In Section 3.1, we continue demonstrating the results using the same bivariate autoregressive model.We focus on the analysis of the adaptive partitioning and the Generalized Knuth methods.First, we analyze the performance of the adaptive partitioning method at a preferred statistical significance level.Then, we propose to investigate different values to estimate the optimal fixed bin-width using Equation (15) in the Generalized Knuth method.
If an information flow direction consensus is not reached among the three methods, we try different values for the fine-tuning parameters until we get a consensus in the NetTE directions.
When each method has been fine-tuned to produce the same NetTE estimate, we conclude that the information flow direction has been correctly identified.
In Section 3.2, we apply our procedure to explore the information flow among the variables of the nonlinear dynamical system used by Lorenz to model an atmospheric convection cell.

Linearly-Coupled Bivariate Autoregressive Model
In this section, we apply the adaptive partitioning and the Generalized Knuth methods to estimate the TE among the processes defined by the same bivariate linearly-coupled autoregressive model (with variable coupling values) given by the equations in Equation (24).We demonstrate the performance of each TE estimation method using an ensemble of 10 members to average.The length of the synthetically generated processes is taken to be 1000 samples after eliminating the first 10,000 samples as the transient.For each method, TE estimations versus the value of coupling coefficients are shown in Figures 5-7 for both directions between processes X and Y.It should be noted that the process X is coupled to Y through coefficient c.Thus, there is no information flow from X to Y for this example, i.e., = 0 analytically.The analytical values of have been obtained using the equations in [19] for k = 1 and l = 1.The performance of the three methods have been compared for the case of k = 1 and l = 1.
Below, TE is estimated for both directions using coupling values ranging from c = 0.01 to c = 1 in Equation ( 24).The information flows are consistently estimated to be in the same direction for all three methods, i.e., ≥ .If we compare the magnitudes of these TE estimates, we observe that the biases between the analytic solution and the of the adaptive partitioning method, KDE and the Generalized Knuth method increase as the coefficient of the coupling in the autoregressive model increases to c = 1.
Above, we demonstrate the TE estimations using the KDE method with different embedding dimensions and different radius values.In Figure 5A, B, we observe that the directions of each TE can be estimated correctly, i.e., ≥ for the model given in Equation ( 24), demonstrating that we can obtain the same information flow directions, but with different bias values.
Below, results in Figure 5A are compared with the other two techniques for k = l = 1.
Figure 6.This figure illustrates TE estimation versus the coupling coefficient c in Equation ( 24) using the adaptive partitioning method.Both TE YX (blue-solid) and TE XY (red-dash dot) are estimated using the adaptive partitioning method and illustrated along with the analytical solution (black-dotted).As there is no coupling from X to Y, analytically = 0.A statistical significance level of 5% has been utilized in the χ 2 test Equation ( 19) for a decision of partitioning with k = l = 1.When the magnitudes of the estimates are compared in Figures 5A and 6, we observe bias both in and , whereas there is no bias in the estimate in the Generalized Knuth method using = 10 .On the other hand, the adaptive partitioning method provides the least bias for whereas KDE seems to produce larger bias for low coupling values and lower bias for high coupling values in Figure 5A, compared to the Generalized Knuth method with = 10 in Figure 7.For example, for = 1, we note from the three graphs that the estimated transfer entropies are ≅ 0.52, ≅ 0.43, ≅ 0.2, for the adaptive partitioning, the KDE with k = l = 1 and the Generalized Knuth method with = 10 , respectively.As the bias is the difference between the analytical value ( = 0.55 = = 1) and the estimates, it obtains its largest value in the case of the Generalized Knuth method with = 10 .On the other hand, we know that there is no information flow from the variable X to variable Y, i.e., = 0.This fact is reflected in Figure 7, but not in Figures 5A and 6 where TEXY is estimated to be non-zero, implying bias.As the same computation is also utilized to estimate (in the other direction), we choose to analyze the NetTE, which equals the difference between and , which is defined in Equation (27).Before comparing the NetTE obtained by each method, we present the performance of the proposed Generalized Knuth method for different β values.

Fine-Tuning the Generalized Knuth Method
In this sub-section, we investigate the effect of β on the TE estimation bias in the case of the Generalized Knuth method.The piecewise-constant model of the Generalized Knuth method approaches a pure likelihood-dependent model, which has almost a constant value as goes to zero in

TE YX
Equation (18).In this case, the mean posterior bin heights approach their frequencies in a bin, i.e., 〈 〉 = .
In this particular case, empty bins of the histogram cause large biases in entropy estimation, especially in higher dimensions as the data becomes sparser.This approach can only become unbiased asymptotically [54].However, as shown in Equation ( 18), the Dirichlet prior with exponent β artificially fills each bin by an amount, β, reducing the bias problem.In Appendix III, Figure A3 illustrates the effect of the free parameter β on the performance of the marginal and joint entropy estimates.We find that the entropy estimates fall within one to two standard deviations for ≅ 0.1.The performance degrades for much smaller and much larger β values.Figures 8 and 9 illustrate less bias in TEYX estimates for = 0.1 and = 0.5 unlike the case in shown in Figure 7 where we use = 10 .However, the bias increases for low coupling values in these two cases.To illustrate the net effect of the bias, we explore NetTE estimates of Equation ( 27) for these cases in Section 3.1.2.27) between each pair of variables in Equation (24).Estimations are performed using all three methods and considering different β values in the case of the Generalized Knuth method.
In the KDE (Figure 5A), Adaptive partitioning (Figure 6) and the Generalized Knuth method with = 0.1 and = 0.5, (Figures 8 and 9) a non-zero is observed.The NetTE between the variables X and Y of the bivariate auroregressive model in Equation ( 24) still behaves similarly giving a net information flow in the direction of the coupling from Y to X as expected.Thus, in this case we find that the NetTE behaves in the same way, even though the individual TE estimates of each method have different biases.Above, we observe that the NetTE estimate of the adaptive partitioning outperforms the Generalized Knuth method with = 0.1 and = 0.5 and KDE.The largest bias in NetTE is achieved by the Generalized Knuth method with = 10 .However, all methods agree that the information flow from Y to X is greater than that of X to Y, which is in agreement with the theoretical result obtained from Equation ( 24) using the equations in [19].In the literature, the bias in the estimation has been obtained using surrogates of TE's estimated by shuffling the data samples [38].These approaches will be explored in future work.

Lorenz System
In this section, the three methods of Section 2 are applied to a more challenging problem involving the detection of the direction of information flow among the three components of the Lorenz system, which is a simplified atmospheric circulation model that exhibits significant non-linear behavior.The Lorenz system is defined by a set of three coupled first-order differential equations [41]: where = 10, = 8 3, = 24 ( − ℎ ) = 28 ( ℎ ) ⁄ . These equations derive from a simple model of an atmospheric convection cell, where the variables x, y, and z denote the convective velocity, vertical temperature difference and the mean convective heat flow, respectively.These equations are used to generate a synthetic time series, which is then used to test our TE estimation procedure.In the literature, the estimation of the TE of two Lorenz systems with nonlinear couplings have found applications in neuroscience [14,39,55].Here, we explore the performance of our approach on a single Lorenz system which is not coupled to another one.Our goal is to estimate the interactions among the three variables of a single Lorenz system-not coupling from one system to another.
In our experiments, we tested the adaptive partitioning, KDE and Generalized Knuth methods in the case where the Rayleigh number, R = 28, which is well-known to result in chaotic dynamics and also for the sub-chaotic case where R = 24.For each variable, we generated 15,000 samples and used the last 5000 samples after the transient using a Runge-Kutta-based differential equation solver in MATLAB (ode45).Both in the chaotic and sub-chaotic cases, = 0.1 was used at the Generalized Knuth method and a 5% significance level was selected in the adaptive partitioning method.Embedding dimensions of k = l = 1 have been selected in these two methods.

TE XY
The embedding dimension values were implemented according to Section 2.3.2 at the KDE method: The log versus log , ; curves have been estimated for the chaotic and sub-chaotic cases.
In the chaotic case, the first minimum of TLMI was found to be at k = 17 and ε = occured in the middle of the radii range of the linear part of the curve.The value of l = 1 was selected for both the chaotic and sub-chaotic cases.The curves for different k values have been illustrated in Figure 11 for the analysis of the interaction between X and Y. Similar curves have been observed for the analysis of the interactions between the other pairs in the model.
In the sub-chaotic case, values around k = 15 have been observed to provide the first local minimum of TLMI(k).However, the NetTE direction consistency cannot be obtained with the other two techniques, namely, the adaptive partitioning and the Generalized Knuth method.Therefore, as we propose in our method, k value has been fine-tuned along with the radius until we obtain consistency of NetTE directions among the three methods.Selection of = 3, = 1, = has provided this consistency, where the NetTE directions are illustrated in Figure 15.We estimated TE for both directions for each pair of variables ( , ), ( , ), and ( , ) using each of the three methods described in Section 2. Similar to the MI normalization of Equation ( 26) recommended in [53], we adapt the normalization for the NetTE as follows: where denotes the normalized NetTE between variables X and Y, having values in the range of [0,1].In Figures 13 and 14, we illustrate the information flow between each pair of the Lorenz equation variables using both the un-normalized TE values obtained by the each of the three methods and the normalized NetTE estimates showing the net information flow between any pair of variables.( .)) = 0.31 and shows a net information flow from variable Y to Z. Thus, we conclude that variable Y affects variable Z.
In Figure 14, we illustrate the estimates of TE's between each variable of the Lorenz Equation (28) in sub-chaotic regime with R = 24.
Above, we demonstrated the concept of our method: If the directions of information flows are not consistent with the three methods, then we can explore new parameter values to provide consistency in the directions.Above, for the selected parameters, the Generalized Knuth method and the adaptive partitioning provided consistent NetTE directions between the pairs of variables in the chaotic case.However, in the sub-chaotic case, we needed to explore a new parameter set for the KDE method as the NetTE directions were different than the other two consistent methods.
Based on the fact that the directions of the NetTE estimations obtained using each of the three methods agree, we conclude that information flow direction between the pairs of the Lorenz equation variables are as shown in Figure 15.Note that these information flow directions are not only not obvious, but also not obviously obtainable, given the Lorenz system equations in Equation ( 28) despite the fact that these equations comprise a complete description of the system (sensitive dependence on initial conditions not withstanding).However, given the fact that this system of equations is derived from a well-understood physical system, one can evaluate these results based on the corresponding physics.In an atmospheric convection roll, it is known that both the velocity (X) and the heat flow (Z) are driven by the temperature difference (Y), and that it is the velocity (X) that mediates the heat flow (Z) in the system.This demonstrates that complex nonlinear relationships between different subsystems can be revealed by a TE analysis of the time series of the system variables.Furthermore, such an analysis reveals information about the system that is not readily accessible even with an analytic model, such as Equation ( 28), in hand.76% and 91% of the entropy estimates were within 1 or 2 standard deviations, respectively, of the true entropy value of 1.4189.However, this means that not every attempt at entropy estimation in this ensemble was successful.
We illustrate this with a specific data set that was found to lie outside of 76% percentile success rate.In this case, the optimal number of bins was estimated to be = 11 using Equation (12).In

Appendix 3
In Appendix 2, we estimated the entropy of a one-dimensional Gaussian variable using the Generalized Knuth method with the prior shown in Equations ( 8) and (9).We notice that, even in the one-dimensional case, some of the entropy estimates lie outside the confidence intervals.If we estimate the joint entropy of two variables or more, the quality of the estimation decreases further due to empty bins.To overcome this problem, we proposed a different prior Equation ( 13) and computed the percentages of the relevant entropy estimates falling into 1 and 2 standard deviations (sigma's) within a total of 100 data-sets sampled from the same two-dimensional Gaussian distribution given by 0 0 , 1 0.5 0.  8) above.It is also observed that in both cases, the confidence interval statistics are lower for the joint entropies, due to the increase of the dimensionality of the space.As a result of this analysis, we observe the largest percentage of getting an entropy estimate within its one-sigma and two-sigma intervals from the true values take place for = 0.1.Above, both joint and marginal Shannon entropies of 100 Gaussian-distributed data-sets are estimated using the Generalized Knuth method for the illustrated values.Subfigures denote the percentage of estimates within one-and two-standard deviations from their analytical values.

Figure 2 .
Figure 2. Illustration of the Adaptive Partitioning algorithm of Darbellay and Vajda (A) The observation space of two-dimensional data (X,Y) and its illustrative partitioning according to the independence test; (B) The corresponding tree showing the partitioning of each cell.
in Figure3for different k values and c = 1.

Figure 3 .
Figure 3. Exploration of the optimal radius for the KDE of a pdf using the Grassberger-Procaccia method.The figure illustrates the Correlation Sum, defined in Equation (22), estimated at different radius values represented by ε for the coupled AR model.

Figure 4 .Figure 5 .
Figure 4. Ensemble averaged and normalized Time-lagged MI(k).As described in the text, the first local minima of the MI leads to the condition = 10.

Figure 7 .
Figure 7.This figure illustrates TE estimation versus the coupling coefficient c in Equation (24) using the Generalized Knuth method.Both TE YX (blue solid) and TE XY (red-dash dot) are estimated for = 10 and illustrated along with the analytical solution (black dotted) where k = l = 1 is chosen.

Figure 8 .
Figure 8.This figure illustrates TE estimation versus the coupling coefficient c in Equation (24) using the Generalized Knuth method method for = 0.1, = = 1.These are illustrated along with the analytical solution.

Figure 9 .
Figure 9.This figure illustrates TE estimation versus the coupling coefficient c in Equation (24) using the generalized piecewise-constant method (Knuth method) for = 0.5, = = 1.These are illustrated along with the analytical solution.

Figure 10 .
Figure 10.This figure illustrates the NetTE difference, given by Equation (27) between each pair of variables in Equation(24).Estimations are performed using all three methods and considering different β values in the case of the Generalized Knuth method.

Figure 12
the selection of the appropriate region for , in the sub-chaotic case.

Figure 11 .
Figure 11.Exploration of the optimal radius for the KDE of a pdf using the Grassberger-Procaccia method.The figure illustrates the Correlation Sum (22) estimated at different radius values represented by ε for the Lorenz model in the chaotic regime (R = 28).

Figure 12 .
Figure 12.Exploration of the optimal radius for the KDE of a pdf using the Grassberger-Procaccia method.The figure illustrates the Correlation Sum (22) estimated at different radius values represented by ε for the Lorenz model in the sub-chaotic regime (R = 24).

Figure 13 .
Figure 13.The un-normalized TE estimates between the variables of the Lorenz equations defined in Equation (28) for the chaotic case (R = 28) along with the normalized NetTE direction and magnitudes.Estimations were obtained using (A) Kernel Density Estimate method with k = 17, l = 1, = ; (B) Generalized Knuth method method with = 0.1, = = 1; and (C) Adaptive Partitioning method with 5% significance level and k = l = 1.Solid arrows denote the information flow (or TE) from X to Y or Y to X. Dashed lines show the direction of the normalized NetTE estimates.

Figure 14 .
Figure 14.The un-normalized TE estimates between the variables of the Lorenz equations defined in Equation (28) for the sub-chaotic case (R = 24) along with the normalized NetTE direction and magnitudes.Estimations were obtained using: (A) Kernel Density Estimate method with = 3, = 1, = ; (B) Generalized Knuth method where = 0.1, = = 1; (C) Adaptive Partitioning method with 5% significance level and k = l =1.Solid arrows denote the information flow (or TE) from X to Y or Y to X. Dashed lines illustrate the direction of the normalized NetTE estimates.

Figure 15 .
Figure 15.Information flow directions among the variables of the Lorenz equations, where X, Y, Z denote the velocity, temperature difference and the heat flow, respectively, in the case of the atmospheric convection roll model.These are also the NetTE directions, showing the larger influence among the bi-directional flows.

Figure A. 2 .
Figure A.2.1a,b, we illustrate the resulting histogram model of the pdf and the non-normalized log posterior probability of the number of bins in the model given the data.

Figure A. 2 . 1 .
Figure A.2.1.(a) Histogram model of the pdf of the data set with error-bars on the bin heights; (b) The non-normalized log posterior probability of the number of bins in the model given the data.

Figure A. 2 . 2 .
Figure A.2.2.Entropy estimate of a data-set outside of 76% percentile success rate (for one of the data-sets in the remaining 24% of 100 trials).

Figure A. 2
Figure A.2.2 shows that the true value of the entropy does not fall into the one standard deviation interval of the mean estimate using = 11, implying that the required number of bins is different for an optimal pdf model and an optimal entropy estimation.It is seen that M = 19 is the smallest number 5 1 versus = [0.001,0.01,0.05,0.1,0.3,0.5,0.7,1] .Figure.A.3 illustrates these percentages as a function of different β values.Approximately 50% of the time, the marginal entropy estimate falls into the one-sigma interval for = 0.05, and 80% of the time within the two-sigma interval (compare the first and second columns of Figure A.3).As a comparison, the corresponding statistics are approximately 10% for the marginal entropies falling into the one-sigma and 30% for marginal entropies falling into the two-sigma confidence intervals when we use the Krichevsky-Trofinov Dirichlet prior ( = 0.5), as in Equation (

Figure A. 3 .
Figure A.3.Percentage performance (one-and two-standard deviation confidence intervals) of marginal and joint entropy estimates as a function of .