Contribution to Transfer Entropy Estimation via the k-Nearest-Neighbors Approach

This paper deals with the estimation of transfer entropy based on the k-nearest neighbors (k-NN) method. To this end, we first investigate the estimation of Shannon entropy involving a rectangular neighboring region, as suggested in the existing literature, and develop two kinds of entropy estimators. Then, applying the widely-used error cancellation approach to these entropy estimators, we propose two novel transfer entropy estimators, implying no extra computational cost compared to existing similar k-NN algorithms. Experimental simulations allow the comparison of the new estimators with the transfer entropy estimators available in free toolboxes, which correspond to two different extensions of the Kraskov–Stögbauer–Grassberger (KSG) mutual information estimator to transfer entropy estimation, and prove the effectiveness of these new estimators.


Introduction
Transfer entropy (TE) is an information-theoretic measure that quantifies the amount of time-directed information between two dynamical systems. Given the past time evolution of a dynamical system A, TE from another dynamical system B to the first system A is the amount of Shannon uncertainty reduction in the future time evolution of A when including the knowledge of the past evolution of B. After its introduction by Schreiber [1], TE has received special attention in various fields, such as neuroscience [2][3][4][5][6][7][8], physiology [9][10][11], climatology [12] and others, such as physical systems [13][14][15][16][17].
More precisely, let us suppose that we observe the output X_i ∈ R, i ∈ Z, of some sensor connected to A. Suppose the sequence X is an m-th order Markov process, i.e., considering the subsequences X^(k)_i = (X_{i−k+1}, X_{i−k+2}, ..., X_i), k > 0, the probability measure P_X (defined on measurable subsets of real sequences) attached to X fulfills the m-th order Markov hypothesis. Then, the past information X^(m)_i (before time instant i + 1) is sufficient for a prediction of X_{i+k}, k ≥ 1, and can be considered as an m-dimensional state vector at time i (note that, to know from X the hidden dynamical evolution of A, we need a one-to-one relation between X^(m)_i and the physical state of A at time i). For the sake of clarity, we introduce the following notation: (X^p_i, X^−_i, Y^−_i), i = 1, 2, ..., N, is an independent and identically distributed (IID) random sequence, each term following the same distribution as a random vector (X^p, X^−, Y^−) ∈ R^{1+m+n} whatever i (in X^p, X^−, Y^−, the upper indices "p" and "−" correspond to "predicted" and "past", respectively). This notation will substitute for the notation X_{i+1}, X^(m)_i, Y^(n)_i, i = 1, 2, ..., N, and we will denote by S_{X^p,X^−,Y^−}, S_{X^p,X^−}, S_{X^−,Y^−} and S_{X^−} the spaces in which (X^p, X^−, Y^−), (X^p, X^−), (X^−, Y^−) and X^− are respectively observed. Now, let us suppose that a causal influence exists from B on A and that an auxiliary random process Y_i ∈ R, i ∈ Z, recorded from a sensor connected to B, is such that, at each time i and for some n > 0, Y^(n)_i is an image (not necessarily one-to-one) of the physical state of B. The negation of this causal influence is expressed by Equation (2). If Equation (2) holds, it is said that there is an absence of information transfer from B to A.
Otherwise, the process X can no longer be considered strictly a Markov process. Let us suppose the joint process (X, Y) is Markovian, i.e., there exist a given pair (m', n'), a transition function f and an independent random sequence e_i, i ∈ Z, such that the pair (X_{i+1}, Y_{i+1}) is obtained from the preceding states through f and the innovation e_{i+1}, where the random variable e_{i+1} is independent of the past random sequence (X_j, Y_j, e_j), j ≤ i, whatever i. As the observation is obtained through a clearly non-injective function g, the triplet (X^(m)_i, Y^(n)_i, X_i), i ∈ Z, corresponds to a hidden Markov process, and it is well known that such an observation process is not generally Markovian.
The deviation from this assumption can be quantified using the Kullback pseudo-metric, leading to the general definition of TE at time i:

TE_{Y→X,i} = E[ log( dP_{X^p_i | X^−_i, Y^−_i} / dP_{X^p_i | X^−_i} (X^p_i) ) ],   (3)

where the ratio in Equation (3) corresponds to the Radon-Nikodym derivative [18,19] (i.e., the density) of the conditional measure dP_{X^p_i | X^−_i, Y^−_i} with respect to dP_{X^p_i | X^−_i}. Considering "log" as the natural logarithm, information is measured in natural units (nats). Now, given two observable scalar random time series X and Y with no a priori given model (as is generally the case), if we are interested in defining some causal influence from Y to X through TE analysis, we must specify the dimensions of the past information vectors X^− and Y^−, i.e., m and n. Additionally, even if we impose them, it is not evident that all of the coordinates in X^(m)_i and Y^(n)_i will be useful. To deal with this issue, variable selection procedures have been proposed in the literature, such as uniform and non-uniform embedding algorithms [20,21].
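Before any estimation, the samples of X^p, X^− and Y^− must be formed from the raw scalar series by time-delay embedding once m and n are chosen. The following minimal Python sketch illustrates this construction (the function name and array layout are ours, not from the paper):

```python
import numpy as np

def build_embeddings(x, y, m, n):
    """Build samples of (X^p, X^-, Y^-) from scalar series x and y.

    X^p_i = x[i]                      (the predicted value)
    X^-_i = (x[i-1], ..., x[i-m])     (m past values of x)
    Y^-_i = (y[i-1], ..., y[i-n])     (n past values of y)
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    start = max(m, n)                 # first index with a full past window
    idx = np.arange(start, len(x))
    xp = x[idx]
    xm = np.column_stack([x[idx - j] for j in range(1, m + 1)])
    yn = np.column_stack([y[idx - j] for j in range(1, n + 1)])
    return xp, xm, yn
```

Each row of the returned arrays is one occurrence (x^p_i, x^−_i, y^−_i) used by the estimators discussed below.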
If the joint probability measure P_{X^p,X^−,Y^−} admits a density, TE can be written as a difference of conditional entropies:

TE_{Y→X,i} = H(X^p | X^−) − H(X^p | X^−, Y^−),   (4)

or:

TE_{Y→X,i} = H(X^p, X^−) + H(X^−, Y^−) − H(X^p, X^−, Y^−) − H(X^−),   (5)

where H(U) denotes the Shannon differential entropy of a random vector U. Note that, if the processes Y and X are assumed to be jointly stationary, for any real function g : R^{m+n+1} → R, the expectation E[g(X_{i+1}, X^(m)_i, Y^(n)_i)] does not depend on i. Consequently, TE_{Y→X,i} does not depend on i (and so can be simply denoted by TE_{Y→X}), nor do any of the quantities defined in Equations (3) to (5). In theory, TE is never negative and is equal to zero if and only if Equation (2) holds.
According to Definition (3), TE is not symmetric, and it can be regarded as a conditional mutual information (CMI) [3,22] (sometimes also named partial mutual information (PMI) in the literature [23]). Recall that mutual information between two random vectors X and Y is defined by:

I(X; Y) = H(X) + H(Y) − H(X, Y),   (6)

and TE can also be written as:

TE_{Y→X} = I(X^p; Y^− | X^−).   (7)

Considering the estimation \hat{TE}_{Y→X} of TE_{Y→X} as a function defined on the set of observable occurrences (x_i, y_i), i = 1, ..., N, of a stationary sequence (X_i, Y_i), i = 1, ..., N, and Equation (5), a standard structure for the estimator is given by (see Appendix B):

\hat{TE}_{Y→X} = \hat{H}(U_1) + \hat{H}(U_2) − \hat{H}(U_3) − \hat{H}(U_4),   (8)

where U_1, U_2, U_3 and U_4 stand respectively for (X^−, Y^−), (X^p, X^−), (X^p, X^−, Y^−) and X^−. Here, for each n, \widehat{log(p_U(u_n))} is an estimated value of log(p_U(u_n)) computed as a function f_n(u_1, ..., u_N) of the observed sequence u_n, n = 1, ..., N. With the k-NN approach addressed in this study, f_n(u_1, ..., u_N) depends explicitly only on u_n and on its k nearest neighbors. Therefore, the calculation of \hat{H}(U) definitely depends on the chosen estimation functions f_n. Note that if, for N fixed, these functions correspond respectively to unbiased estimators of log(p_U(u_n)), then \hat{TE}_{Y→X} is also unbiased; otherwise, we can only expect that \hat{TE}_{Y→X} is asymptotically unbiased (for N large). This is so if the estimators of log(p_U(u_n)) are asymptotically unbiased. Now, the theoretical derivation and analysis of the most currently used estimators of H(U) generally suppose that u_1, ..., u_N are N independent occurrences of the random vector U, i.e., that u_1, ..., u_N is an occurrence of an independent and identically distributed (IID) sequence U_1, ..., U_N of random vectors (∀i = 1, ..., N: P_{U_i} = P_U). Although the IID hypothesis does not apply to our initial problem concerning the measure of TE on stationary random sequences (which are generally not IID), the new methods presented in this contribution are extended from existing ones assuming this hypothesis, without relaxing it. However, the experimental section will present results not only on IID observations, but also on non-IID stationary autoregressive (AR) processes, as our goal was to verify if some improvement can nonetheless be obtained for non-IID data, such as AR data.
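Since Equation (5) expresses TE as an algebraic sum of four differential entropies, it can be evaluated in closed form for jointly Gaussian vectors, where H(U) = ½ log((2πe)^d det Σ_U). The sketch below (the function names and the toy covariance are ours, for illustration only) shows this decomposition and checks that TE vanishes exactly when Y^− is independent of (X^p, X^−):

```python
import numpy as np

def gaussian_entropy(cov):
    # differential entropy of a Gaussian vector with covariance `cov`, in nats
    cov = np.atleast_2d(np.asarray(cov, float))
    d = cov.shape[0]
    sign, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)

def gaussian_te(cov, ip, im, iy):
    # TE = H(X^p, X^-) + H(X^-, Y^-) - H(X^p, X^-, Y^-) - H(X^-)  (Equation (5))
    # ip, im, iy: index lists of X^p, X^- and Y^- inside the joint covariance
    sub = lambda idx: cov[np.ix_(idx, idx)]
    return (gaussian_entropy(sub(ip + im)) + gaussian_entropy(sub(im + iy))
            - gaussian_entropy(sub(ip + im + iy)) - gaussian_entropy(sub(im)))
```

When the block of Y^− is uncorrelated with (X^p, X^−), the four terms cancel and the TE is zero, consistent with Equation (2).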
If we come back to mutual information (MI) defined by Equation (6) and compare it with Equation (5), it is obvious that estimating MI and TE share similarities. Hence, similarly to Equation (8) for TE, a basic estimation \hat{I}(X; Y) of I(X; Y) from a sequence (x_i, y_i), i = 1, ..., N, of N independent trials is:

\hat{I}(X; Y) = \hat{H}(X) + \hat{H}(Y) − \hat{H}(X, Y).   (9)

In what follows, when explaining the links among the existing methods and the proposed ones, we refer to Figure 1. In this diagram, a box identified by a number k in a circle is designated by Box k.
Improving the performance (in terms of bias and variance) of TE and MI estimators (obtained by choosing specific estimation functions \widehat{log(p(•))} in Equations (8) and (9), respectively) remains an issue when they are applied to short-length IID (or non-IID) sequences [3]. In this work, we particularly focused on bias reduction. For MI, the most widely-used estimator is the Kraskov-Stögbauer-Grassberger (KSG) estimator [24,31], which was later extended to estimate transfer entropy, resulting in the k-NN TE estimator [25][26][27][32][33][34][35] (adopted in the widely-used TRENTOOL open source toolbox, Version 3.0). Our contribution originated in the Kozachenko-Leonenko entropy estimator summarized in [24] and proposed beforehand in the literature to get an estimation \hat{H}(X) of the entropy H(X) of a continuously-distributed random vector X from a finite sequence of independent outcomes x_i, i = 1, ..., N. This estimator, as well as another entropy estimator proposed by Singh et al. in [36], are briefly described in Section 2.1, before we introduce, in Section 4, our two new TE estimators based on both of them. In Section 2.2, the Kraskov MI and standard TE estimators derived in the literature from the Kozachenko-Leonenko entropy estimator are summarized, and the passage from a square to a rectangular neighboring region to derive new entropy estimators is detailed in Section 3. Our methodology is depicted in Figure 1.
Figure 1. Concepts and methodology involved in k-nearest-neighbors transfer entropy (TE) estimation. Standard k-nearest-neighbors methods using the maximum norm for non-parametric probability density and entropy estimation introduce, around each data point, a minimal (hyper-)cube (Box 1) that includes the first k nearest neighbors, as is the case for two entropy estimators, namely the well-known Kozachenko-Leonenko estimator (Box 3) and the less commonly used Singh's estimator (Box 2). The former was used in [24] to measure mutual information (MI) between two signals X and Y by Kraskov et al., who propose an MI estimator (Kraskov-Stögbauer-Grassberger (KSG) MI Estimator 1, Box 11) obtained by summing three entropy estimators (two estimators for the marginal entropies and one for the joint entropy). The strategy was to constrain the three corresponding (hyper-)cubes, including the nearest neighbors, respectively in the spaces S_X, S_Y and S_{X,Y}, to have an identical edge length (the idea of projected distances, Box 14) for a better cancellation of the three corresponding biases. The same approach was used to derive the standard TE estimator [25][26][27][28][29] (Box 10), which has been implemented in the TRENTOOL toolbox, Version 3.0. In [24], Kraskov et al. also suggested, for MI estimation, replacing minimal (hyper-)cubes with smaller minimal (hyper-)rectangles equal to the product of two minimal (hyper-)cubes built separately in the subspaces S_X and S_Y (KSG MI Estimator 2, Box 12) to exploit the Kozachenko-Leonenko approach more efficiently. An extended algorithm for TE estimation based on minimal (hyper-)rectangles equal to products of (hyper-)cubes was then proposed in [27] (extended TE estimator, Box 9) and implemented in the JIDT toolbox [30]. Boxes 10 and 9 are marked as the "standard algorithm" and the "extended algorithm", respectively. The new idea extends the idea of the product of cubes (Box 13). It consists of proposing a different construction of the neighborhoods, which are no longer minimal (hyper-)cubes, nor products of (hyper-)cubes, but minimal (hyper-)rectangles (Box 4), with possibly a different length for each dimension, to get two novel entropy estimators (Boxes 5 and 6), respectively derived from Singh's entropy estimator and the Kozachenko-Leonenko entropy estimator. These two new entropy estimators lead respectively to two new TE estimators (Box 7 and Box 8) to be compared with the standard and extended TE estimators. Let us consider a sequence x_i, i = 1, ..., N, in R^{d_X} (in our context, this sequence corresponds to an outcome of an IID sequence X_1, ..., X_N, such that the common probability distribution is equal to that of a given random vector X). The set of the k nearest neighbors of x_i in this sequence (excluding x_i itself) and the distance between x_i and its k-th nearest neighbor are respectively denoted by χ^k_i and d_{x_i,k}.

Kozachenko-Leonenko Entropy Estimator
The Kozachenko-Leonenko entropy estimator is given by (Box 3 in Figure 1):

\hat{H}(X)_{KL} = −ψ(k) + ψ(N) + (1/N) Σ_{i=1}^{N} log v_i,   (10)

where v_i is the volume of the minimal (hyper-)cube centered on x_i containing the k nearest neighbors of x_i, and ψ(k) = Γ'(k)/Γ(k) denotes the digamma function. Note that, using Equation (10), entropy is measured in natural units (nats).
To come up with a concise presentation of this estimator, we give hereafter a summary of the different steps to obtain it, starting from [24]. First, let us consider the distance d_{x_i,k} between x_i and its k-th nearest neighbor (introduced above) as a realization of the random variable D_{x_i,k}, and let us denote by q_{x_i,k}(x), x ∈ R, the corresponding probability density function (conditioned by X_i = x_i). Secondly, let us consider the quantity h_{x_i}(ε) = ∫_{||u−x_i|| ≤ ε/2} dP_X(u). This is the probability mass of the (hyper-)ball with radius equal to ε/2 and centered on x_i. This probability mass is approximately equal to:

h_{x_i}(ε) ≈ p_X(x_i) c_d ε^d,   (11)

if the density function is approximately constant on the (hyper-)ball. The variable c_d is the volume of the d-dimensional (hyper-)ball of unit diameter in R^d (c_d = 1 with the maximum norm). Furthermore, it can be established (see [24] for details) that the expectation E[log h_{X_i}(D_{X_i,k})], where h_{X_i} is the random variable associated with h_{x_i} and D_{X_i,k} (which must not be confused with the notation D_{x_i}(χ^k_i) introduced previously) denotes the random distance to the k-th neighbor selected in the set of random vectors, satisfies:

E[log h_{X_i}(D_{X_i,k})] = ψ(k) − ψ(N).

Finally, by using the law of large numbers, when N is large, we get:

\hat{H}(X)_{KL} = −ψ(k) + ψ(N) + (1/N) Σ_{i=1}^{N} log v_i,   (14)

where v_i is the realization of the random (hyper-)volume. Moreover, as observed in [24], it is possible to make the number of neighbors k depend on i by substituting the mean (1/N) Σ_{i=1}^{N} ψ(k_i) for the constant ψ(k) in Equation (14), so that \hat{H}(X)_{KL} becomes:

\hat{H}(X)_{KL} = −(1/N) Σ_{i=1}^{N} ψ(k_i) + ψ(N) + (1/N) Σ_{i=1}^{N} log v_i.   (15)
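A minimal sketch of the Kozachenko-Leonenko estimator of Equation (10) under the maximum norm (brute-force neighbor search; the function names are ours). With the maximum norm, c_d = 1 and v_i = ε_i^d, where ε_i is twice the distance to the k-th neighbor:

```python
import numpy as np

def psi(n):
    # digamma at a positive integer n: psi(n) = -gamma + sum_{j=1}^{n-1} 1/j
    return -0.5772156649015329 + sum(1.0 / j for j in range(1, int(n)))

def kl_entropy(x, k=4):
    # Kozachenko-Leonenko estimator, maximum norm, result in nats
    x = np.asarray(x, float)
    if x.ndim == 1:
        x = x[:, None]
    n, d = x.shape
    # pairwise Chebyshev distances (brute force; fine for moderate n)
    dist = np.max(np.abs(x[:, None, :] - x[None, :, :]), axis=2)
    np.fill_diagonal(dist, np.inf)                # exclude the point itself
    eps = 2.0 * np.sort(dist, axis=1)[:, k - 1]   # edge length of the minimal cube
    return psi(n) - psi(k) + d * np.mean(np.log(eps))
```

For a standard Gaussian sample, the output approaches the true entropy ½ log(2πe) ≈ 1.4189 nats as N grows.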

Singh's Entropy Estimator
The question of k-NN entropy estimation is also discussed by Singh et al. in [36], where another estimator, denoted by \hat{H}(X)_S hereafter, is proposed (Box 2 in Figure 1):

\hat{H}(X)_S = (1/N) Σ_{i=1}^{N} log(N v_i) − ψ(k).   (16)

Using the approximation ψ(N) ≈ log(N) for large values of N, the estimator given by Equation (16) is close to that defined by Equation (10). This estimator was derived by Singh et al. in [36] through the four following steps:

(1) Introduce the classical entropy estimator structure:

\hat{H}(X) = (1/N) Σ_{i=1}^{N} T_i,   (17)

where:

T_i = log(N v_i / k) = −log(\hat{p}_X(x_i)), with \hat{p}_X(x_i) = k/(N v_i) the classical k-NN density estimate.   (18)

(2) Assuming that the random variables T_i, i = 1, ..., N, are identically distributed, so that E[\hat{H}(X)] = E(T_1) (note that E(T_1) depends on N, even if the notation does not make that explicit), compute the asymptotic value of E(T_1) (when N is large) by firstly computing its asymptotic cumulative probability distribution function and the corresponding probability density p_{T_1}, and finally computing the expectation.

(3) Write this expectation as E(T_1) = H(X) + B, where B is a constant, which is identified with the bias.
(4) Subtract this bias from \hat{H}(X) to get \hat{H}(X)_S = \hat{H}(X) − B and the formula given in Equation (16).
Note that the cancellation of the asymptotic bias does not imply that the bias obtained with a finite value of N is also exactly canceled. In Appendix C, we explain the origin of the bias of the entropy estimator given in Equation (17).
Observe also that, as for the Kozachenko-Leonenko estimator, it is possible to adapt Equation (16) if we want to consider a number of neighbors k_i depending on i. Equation (16) must then be replaced by:

\hat{H}(X)_S = (1/N) Σ_{i=1}^{N} [log(N v_i) − ψ(k_i)].   (19)
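Assuming the form of Equation (16), Singh's estimator differs from the Kozachenko-Leonenko one only through the log(N) term in place of ψ(N). A sketch under the maximum norm (brute-force neighbor search; names ours):

```python
import numpy as np

def psi(n):
    # digamma at a positive integer n
    return -0.5772156649015329 + sum(1.0 / j for j in range(1, int(n)))

def singh_entropy(x, k=4):
    # Singh's estimator: (1/N) sum_i log(N v_i) - psi(k), maximum norm, in nats
    x = np.asarray(x, float)
    if x.ndim == 1:
        x = x[:, None]
    n, d = x.shape
    dist = np.max(np.abs(x[:, None, :] - x[None, :, :]), axis=2)
    np.fill_diagonal(dist, np.inf)
    eps = 2.0 * np.sort(dist, axis=1)[:, k - 1]   # edge length of the minimal cube
    # log(N v_i) = log(N) + d log(eps_i) under the maximum norm
    return np.log(n) - psi(k) + d * np.mean(np.log(eps))
```

For large N, the difference with the Kozachenko-Leonenko value is log(N) − ψ(N) ≈ 1/(2N), which is negligible.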

Standard Transfer Entropy Estimator
Estimating the entropies separately in Equations (8) and (9) leads to individual bias values. Now, it is possible to cancel out (at least partially) the bias by considering the algebraic sums in Equations (8) and (9). To help in this cancellation, on the basis of the Kozachenko-Leonenko entropy estimator, Kraskov et al. proposed to retain the same (hyper-)ball radius for each of the different spaces, instead of using the same number k for both the joint space S_{X,Y} and the marginal spaces (S_X and S_Y) [24,37], leading to the following MI estimator (Box 11 in Figure 1):

\hat{I}(X; Y) = ψ(k) + ψ(N) − (1/N) Σ_{i=1}^{N} [ψ(n_{X,i} + 1) + ψ(n_{Y,i} + 1)],   (20)

where n_{X,i} and n_{Y,i} denote the numbers of points that fall strictly within the resulting distance in the lower-dimensional spaces S_X and S_Y, respectively. Applying the same strategy to estimate TE, the number of neighbors in the joint space S_{X^p,X^−,Y^−} is first fixed; then, for each i, the resulting distance d_{(x^p_i,x^−_i,y^−_i),k} is projected into the other three lower-dimensional spaces, leading to the standard TE estimator [25,27,28] (implementation available in the TRENTOOL toolbox, Version 3.0; Box 10 in Figure 1):

\hat{TE}_{Y→X,SA} = ψ(k) + (1/N) Σ_{i=1}^{N} [ψ(n_{X^−,i} + 1) − ψ(n_{(X^−,Y^−),i} + 1) − ψ(n_{(X^p,X^−),i} + 1)],   (21)

where n_{X^−,i}, n_{(X^−,Y^−),i} and n_{(X^p,X^−),i} denote the numbers of points that fall within the distance in the lower-dimensional spaces S_{X^−}, S_{X^−,Y^−} and S_{X^p,X^−}, respectively. This estimator is marked as the "standard algorithm" in the experimental part.
Note that a generalization of Equation (21) was proposed in [28] to extend this formula to the estimation of entropy combinations other than MI and TE.
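The projected-distance strategy of Equation (20) can be sketched as follows (brute-force distances, maximum norm; names ours). Note the strict inequality when counting n_{X,i} and n_{Y,i}: the k-th neighbor in the joint space sits exactly at the projected distance and must not be counted in the marginal spaces:

```python
import numpy as np

def psi(n):
    # digamma at a positive integer n
    return -0.5772156649015329 + sum(1.0 / j for j in range(1, int(n)))

def ksg_mi(x, y, k=4):
    # KSG Estimator 1: I = psi(k) + psi(N) - <psi(n_x + 1) + psi(n_y + 1)>
    x = np.asarray(x, float).reshape(len(x), -1)
    y = np.asarray(y, float).reshape(len(y), -1)
    n = len(x)
    dx = np.max(np.abs(x[:, None, :] - x[None, :, :]), axis=2)
    dy = np.max(np.abs(y[:, None, :] - y[None, :, :]), axis=2)
    dz = np.maximum(dx, dy)                 # max-norm distance in S_{X,Y}
    np.fill_diagonal(dz, np.inf)
    eps = np.sort(dz, axis=1)[:, k - 1]     # distance to the k-th joint neighbor
    np.fill_diagonal(dx, np.inf)
    np.fill_diagonal(dy, np.inf)
    nx = np.sum(dx < eps[:, None], axis=1)  # strictly inside the projected distance
    ny = np.sum(dy < eps[:, None], axis=1)
    return psi(k) + psi(n) - np.mean([psi(a + 1) + psi(b + 1)
                                      for a, b in zip(nx, ny)])
```

For bivariate Gaussians with correlation ρ, the output should approach the true MI, −½ log(1 − ρ²).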

From a Square to a Rectangular Neighboring Region for Entropy Estimation
In [24], to estimate MI, as illustrated in Figure 2, Kraskov et al. discussed two different techniques to build the neighboring region used to compute \hat{I}(X; Y): in the standard technique (square ABCD in Figure 2a,b), the region determined by the first k nearest neighbors is a (hyper-)cube and leads to Equation (20); in the second technique (rectangle A'B'C'D' in Figure 2a,b), the region determined by the first k nearest neighbors is a (hyper-)rectangle. Note that the TE estimator mentioned in the previous section (Equation (21)) is based on the first situation (square ABCD in Figure 2a or 2b). The second technique was introduced by Kraskov et al. to circumvent the fact that Equation (15) is not applied rigorously when obtaining the terms ψ(n_{X,i} + 1) or ψ(n_{Y,i} + 1) in Equation (20). As a matter of fact, for one of these terms, no point x_i (or y_i) falls exactly on the border of the (hyper-)cube around x_i (or y_i) obtained by the distance projection from the S_{X,Y} space. As clearly illustrated in Figure 2 (rectangle A'B'C'D' in Figure 2a,b), the second strategy prevents that issue, since the border of the (hyper-)cube (in this case, an interval of R) after projection from the S_{X,Y} space onto the S_X space (or S_Y space) contains one point. When the dimensions of S_X and S_Y are larger than one, this strategy leads to building a (hyper-)rectangle equal to the product of two (hyper-)cubes, one of them in S_X and the other one in S_Y. If the maximum distance to the k-th NN in S_{X,Y} is obtained in one of the directions of S_X, this maximum distance, after multiplication by two, fixes the size of the (hyper-)cube in S_X. To obtain the size of the second (hyper-)cube (in S_Y), the k neighbors in S_{X,Y} are first projected onto S_Y, and then the largest of the distances calculated from these projections fixes the size of this second (hyper-)cube.
In this two-dimensional example, k = 5. The origin of the Cartesian axes corresponds to the current point x_i. Only the five nearest neighbors of this point, i.e., the points in the set χ^k_i, are represented. The fifth nearest neighbor is symbolized by a star. The neighboring regions ABCD, obtained from the maximum norm around the center point, are squares with equal edge lengths ε_x = ε_y. Reducing one of the edge lengths, ε_x or ε_y, until one point falls onto the border (in the present case, in the vertical direction) leads to the minimum size rectangle A'B'C'D', where ε'_x ≠ ε'_y. Two cases must be considered: (a) the fifth neighbor is not localized on a corner of the rectangle, but on an edge, contrary to (b). This leads to obtaining either two points (respectively the star and the triangle in (a)) or only one point (the star in (b)) on the border of A'B'C'D'. Clearly, it is theoretically possible to have more than two points on the border of A'B'C'D', but the probability of such an occurrence is equal to zero when the probability distribution of the random points X_j is continuous.
In the remainder of this section, for an arbitrary dimension d, we propose to apply this strategy to estimate the entropy of a single multidimensional variable X observed in R^d. This leads to introducing a d-dimensional (hyper-)rectangle centered on x_i, having a minimal volume and including the set χ^k_i of neighbors. Hence, the rectangular neighborhood is built by adjusting its size separately in each direction of the space S_X. Using this strategy, we are sure that, in any of the d directions, there is at least one point on one of the two borders (and only one with probability one). Therefore, in this approach, the (hyper-)rectangle, denoted by D^{ε_1,...,ε_d}_{x_i}, where the sizes ε_1, ..., ε_d in the respective d directions are completely specified from the neighbors set χ^k_i, is substituted for the basic (hyper-)cube D_{x_i}(χ^k_i). It should be mentioned that the central symmetry of the (hyper-)rectangle around the center point allows for reducing the bias in the density estimation [38] (cf. Equation (11) or (18)). Note that, when k < d, there must exist neighbors positioned on some vertices or edges of the (hyper-)rectangle: with k < d, it is impossible that, for every direction, a distinct point falls exactly in the interior of a face (i.e., not on its border). For example, with k = 1 and d > 1, the first neighbor will be on a vertex, and the sizes of the edges of the reduced (hyper-)rectangle will be equal to twice the absolute values of its coordinates, whatever the direction.
Hereafter, we propose to extend the entropy estimators by Kozachenko-Leonenko and Singh using the above strategy before deriving the corresponding TE estimators and comparing their performance.

Extension of the Kozachenko-Leonenko Method
As indicated before, in [24], Kraskov et al. extended the Kozachenko-Leonenko estimator (Equations (10) and (15)) using the rectangular neighboring strategy to derive their MI estimator. Now, focusing on entropy estimation, after some mathematical developments (see Appendix D), we obtain another estimator of H(X), denoted by \hat{H}(X)_K (Box 6 in Figure 1):

\hat{H}(X)_K = −ψ(k) + (d − 1)/k + ψ(N) + (1/N) Σ_{i=1}^{N} log v_i,   (22)

where v_i is the volume of the minimum volume (hyper-)rectangle around the point x_i. Exploiting this entropy estimator, after substitution in Equation (8), we can derive a new estimation of TE.

Extension of Singh's Method
We propose in this section to extend Singh's entropy estimator by using a (hyper-)rectangular domain, as we did for the Kozachenko-Leonenko estimator extension introduced in the preceding section. Considering a d-dimensional random vector X ∈ R^d, continuously distributed according to a probability density function p_X, we aim at estimating the entropy H(X) from the observation of a p_X-distributed IID random sequence X_i, i = 1, ..., N. For any specific data point x_i and a fixed number k (1 ≤ k ≤ N), the minimum (hyper-)rectangle (rectangle A'B'C'D' in Figure 2) is fixed; we denote this region by D^{ε_1,...,ε_d}_{x_i} and its volume by v_i. Let us denote by ξ_i (1 ≤ ξ_i ≤ min(k, d)) the number of points on the border of the (hyper-)rectangle, which we consider as a realization of a random variable Ξ_i. In the situations described in Figure 2a,b, ξ_i = 2 and ξ_i = 1, respectively. According to [39] (Chapter 6, page 269), if D_{x_i}(χ^k_i) corresponds to a ball (for a given norm) of volume v_i, an unbiased estimator of p_X(x_i) is given by Equation (23). This implies that the classical estimator \hat{p}_X(x_i) = k/(N v_i) is biased and that, presumably, log(k/(N v_i)) is also a biased estimation of log(p_X(x_i)) for N large, as shown in [39]. Now, in the case where D_{x_i}(χ^k_i) is the minimal (i.e., with minimal (hyper-)volume) (hyper-)rectangle D^{ε_1,...,ε_d}_{x_i} including χ^k_i, more than one point can belong to the border, and a more general estimator \tilde{p}_X(x_i) of p_X(x_i) can a priori be considered:

\tilde{p}_X(x_i) = \tilde{k}_i / (N v_i),   (24)

where \tilde{k}_i is some given function of k and ξ_i. The corresponding estimation of H(X) is then:

\hat{H}(X) = (1/N) Σ_{i=1}^{N} t_i, with t_i = log(N v_i / \tilde{k}_i),   (25)

t_i being realizations of random variables T_i and \tilde{k}_i being realizations of random variables K̃_i. Our goal is to derive E[\hat{H}(X)] − H(X) = E(T_1) − H(X) for N large, to correct the asymptotic bias of \hat{H}(X), according to Steps (1) to (3) explained in Section 2.1.3. To this end, we must consider an asymptotic approximation of the conditional probability distribution P(T_1 ≤ r | X_1 = x_1, Ξ_1 = ξ_1) before computing the asymptotic difference between the expectation E(T_1) and the true entropy H(X).
Property 1. For N large, P(T_1 ≤ r | X_1 = x_1, Ξ_1 = ξ_1) admits the asymptotic approximation given in Equation (29). The Poisson approximation (when N → ∞ and v_r → 0) of the binomial distribution summed in Equation (29) leads to a parameter λ = (N − ξ_1 − 1) p_X(x_1) v_r. As N is large compared to ξ_1 + 1, we obtain from Equation (26): λ ≈ \tilde{k}_1 e^r p_X(x_1), and we get the corresponding approximation of the conditional distribution. From it, we can get the density function of T_1, denoted g_{T_1}(r), by differentiating P(T_1 ≤ r | X_1 = x_1, Ξ_1 = ξ_1). After some mathematical developments (see Appendix F), we obtain an explicit expression for g_{T_1}(r) and, consequently (see Appendix G for details), for E(T_1). Therefore, with the definition of differential entropy, H(X_1) = E[−log(p_X(X_1))], the difference E(T_1) − H(X_1) does not vanish: the estimator expressed by Equation (25) is asymptotically biased. Therefore, we consider a modified version, denoted by \hat{H}(X)_{NS}, obtained by subtracting an estimation of this bias (involving the terms log \tilde{k}_i and the law of large numbers), and we obtain, finally (Box 5 in Figure 1), the estimator given in Equation (35). In comparison with the development of Equation (22), we followed here the same methodology, except that we take into account (through a conditioning technique) the influence of the number of points on the border.
We observe that, after cancellation of the asymptotic bias, the choice of the function of k and ξ_i used to define \tilde{k}_i in Equation (24) does not have any influence on the final result. In this way, we obtain an expression for \hat{H}(X)_{NS} that simply takes into account the values ξ_i, which could a priori influence the entropy estimation.
Note that, as for the original Kozachenko-Leonenko (Equation (10)) and Singh (Equation (16)) entropy estimators, both new estimation functions (Equations (22) and (35)) hold for any value of k such that k ≪ N, and we do not have to choose a fixed k while estimating entropy in the lower-dimensional spaces. Therefore, under the framework proposed in [24], we built two different TE estimators using Equations (22) and (35), respectively.

Computation of the Border Points Number and of the (Hyper-)Rectangle Sizes
We explain more precisely hereafter how to determine the number of points ξ_i on the border. Let us denote by x^j_i ∈ R^d, j = 1, ..., k, the k nearest neighbors of x_i ∈ R^d, and let us consider the d × k array D_i such that, for any (p, j) ∈ {1, ..., d} × {1, ..., k}, D_i(p, j) = |x^j_i(p) − x_i(p)| is the distance (in R) between the p-th component x^j_i(p) of x^j_i and the p-th component x_i(p) of x_i. For each p, let us introduce J_i(p) ∈ {1, ..., k}, defined by D_i(p, J_i(p)) = max(D_i(p, 1), ..., D_i(p, k)), which is the value of the column index of D_i for which the distance D_i(p, j) is maximum in row number p. Now, if more than one index J_i(p) fulfills this equality, we arbitrarily select the lowest one, hence preventing the max(•) function from being multi-valued. The MATLAB implementation of the max function selects such a unique index value. Then, let us introduce the d × k Boolean array B_i defined by B_i(p, j) = 1 if j = J_i(p) and B_i(p, j) = 0 otherwise. Then: (1) The d sizes ε_p, p = 1, ..., d, of the (hyper-)rectangle D^{ε_1,...,ε_d}_{x_i} are equal respectively to ε_p = 2 D_i(p, J_i(p)), p = 1, ..., d.
(2) We can define ξ_i as the number of non-null column vectors in B_i. For example, if the k-th nearest neighbor x^k_i is such that ∀j ≠ k, ∀p = 1, ..., d: D_i(p, j) < D_i(p, k), i.e., when the k-th nearest neighbor is systematically the farthest from the central point x_i in each of the d directions, then all of the entries in the last column of B_i are equal to one, while all other entries are equal to zero: we have only one column including values different from zero and, so, only one point on the border (ξ_i = 1), which generalizes the case depicted in Figure 2b for d = 2.
N.B.: this determination of ξ_i may be incorrect when there exists a direction p such that the number of indices j for which D_i(p, j) reaches the maximal value is larger than one: the value of ξ_i obtained with our procedure can then be underestimated. However, we can argue that, theoretically, this case occurs with probability zero (because the observations are continuously distributed), and so it can a priori be discarded. Now, in practice, measurement quantification errors and round-off errors are unavoidable, and this probability will differ from zero (although remaining small when the aforesaid errors are small): theoretically distinct values D_i(p, j) on row p of D_i may be erroneously confounded after quantification and rounding. However, the max(•) function then selects only one value for J_i(p) on row p and, so, acts as an error-correcting procedure. The fact that the maximum distance in the p directions concerned can then be allocated to the wrong neighbor index has no consequence for the correct determination of ξ_i.
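The determination of the edge lengths ε_p and of ξ_i described above can be sketched as follows (the function name is ours); np.argmax returns the lowest index in case of ties, mirroring the MATLAB max behavior invoked in the text:

```python
import numpy as np

def rectangle_stats(x_i, neighbors):
    """Edge lengths, border-point count and volume of the minimal (hyper-)rectangle.

    x_i:       (d,) center point
    neighbors: (k, d) array of its k nearest neighbors
    """
    # D_i: d x k array of per-coordinate distances to the center
    D = np.abs(np.asarray(neighbors, float) - np.asarray(x_i, float)).T
    J = np.argmax(D, axis=1)                   # J_i(p): row-wise argmax, ties -> lowest index
    eps = 2.0 * D[np.arange(D.shape[0]), J]    # eps_p = 2 D_i(p, J_i(p))
    xi = len(set(J.tolist()))                  # xi_i: number of distinct border neighbors
    volume = float(np.prod(eps))               # v_i = eps_1 x ... x eps_d
    return eps, xi, volume
```

The first test below reproduces the Figure 2a-like situation (two border points), the second the Figure 2b-like situation (one neighbor farthest in every direction).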

New Estimators of Transfer Entropy
From an observed realization (x^p_i, x^−_i, y^−_i), i = 1, ..., N, and a number k of neighbors, the procedure can be summarized as follows (distances are computed with the maximum norm):

(1) similarly to the MILCA [31] and TRENTOOL [34] toolboxes, normalize, for each i, the vectors x^p_i, x^−_i and y^−_i;

(2) in the joint space S_{X^p,X^−,Y^−}, for each point, find its k nearest neighbors and construct the (hyper-)rectangle with sizes ε_1, ..., ε_d (d is the dimension of the vectors (x^p_i, x^−_i, y^−_i)), for which the (hyper-)volume is v_{(X^p,X^−,Y^−),i} = ε_1 × ... × ε_d and the border contains ξ_{(X^p,X^−,Y^−),i} points;

(3) for each point (x^p_i, x^−_i) in the subspace S_{X^p,X^−}, count the number k_{(X^p,X^−),i} of points falling within the distance d_{(x^p_i,x^−_i,y^−_i),k}, then find the smallest (hyper-)rectangle that contains all of these points, for which v_{(X^p,X^−),i} and ξ_{(X^p,X^−),i} are respectively the volume and the number of points on the border; repeat the same procedure in the subspaces S_{X^−,Y^−} and S_{X^−}.

From Equation (22) (modified for k not constant in S_{X^−}, S_{X^p,X^−} and S_{X^−,Y^−}), the final TE estimator can be written as (Box 8 in Figure 1):

\hat{TE}_{Y→X,p1} = (1/N) Σ_{i=1}^{N} [ (d_{X^p,X^−} − 1)/k_{(X^p,X^−),i} + (d_{X^−,Y^−} − 1)/k_{(X^−,Y^−),i} − (d_{X^p,X^−,Y^−} − 1)/k − (d_{X^−} − 1)/k_{X^−,i} ]
+ (1/N) Σ_{i=1}^{N} [ ψ(k) + ψ(k_{X^−,i}) − ψ(k_{(X^p,X^−),i}) − ψ(k_{(X^−,Y^−),i}) ]
+ (1/N) Σ_{i=1}^{N} log[ (v_{(X^p,X^−),i} v_{(X^−,Y^−),i}) / (v_{(X^p,X^−,Y^−),i} v_{X^−,i}) ],   (36)

where d_U denotes the dimension of the space S_U; with Equation (35), a second estimator \hat{TE}_{Y→X,p2} follows in the same way (Box 7 in Figure 1; Equation (37)). In Equations (36) and (37), the volumes v_{(X^p,X^−),i}, v_{(X^−,Y^−),i}, v_{(X^p,X^−,Y^−),i} and v_{X^−,i} are obtained by computing, for each of them, the product of the edge lengths of the corresponding (hyper-)rectangle, i.e., the product of d edge lengths, d being the dimension of the corresponding space. In a given subspace and for a given direction, the edge length is equal to twice the largest distance between the corresponding coordinate of the reference point (at the center) and the corresponding coordinates of the k nearest neighbors. Hence, a generic formula is v_U = Π_{j=1}^{dim(U)} ε_{U,j}, where U is one of the symbols (X^p, X^−), (X^−, Y^−), (X^p, X^−, Y^−) and X^−, and the ε_{U,j} are the edge lengths of the (hyper-)rectangle.
The new TE estimator TE_{Y→X}^{p1} (Box 8 in Figure 1) can be compared with the extension of TE_{Y→X}^{SA}, the TE estimator proposed in [27] (implemented in the JIDT toolbox [30]). This extension [27], included in Figure 1 (Box 9), is denoted here by TE_{Y→X}^{EA}. The main difference from our TE_{Y→X}^{p1} estimator is that our algorithm uses a different edge length for each sub-dimension within a variable, rather than one length for all sub-dimensions within the variable (which is the approach of the extended algorithm). We introduced this approach to make the tightest possible (hyper-)rectangle around the k nearest neighbors. TE_{Y→X}^{EA} is expressed by Equation (38). In the experimental part, this estimator is marked as the "extended algorithm". It differs from Equation (36) in two ways. Firstly, the first summation on the right-hand side of Equation (36) is absent. Secondly, compared with Equation (36), the numbers of neighbors k_{X^-,i}, k_{(X^p,X^-),i} and k_{(X^-,Y^-),i} included in the rectangular boxes, as explained in Section 3.1, are replaced respectively with l_{X^-,i}, l_{(X^p,X^-),i} and l_{(X^-,Y^-),i}, which are obtained differently. More precisely, Step (2) in the above algorithm becomes: (2') for each point (x_i^p, x_i^-) in the subspace S_{X^p,X^-}, l_{(X^p,X^-),i} is the number of points falling within a (hyper-)rectangle equal to the Cartesian product of two (hyper-)cubes, the first one in S_{X^p} and the second one in S_{X^-}, whose edge lengths are given by the respective maximum projected distances; denote by v_{(X^p,X^-),i} the volume of this (hyper-)rectangle, and repeat the same procedure in the subspaces S_{X^-,Y^-} and S_{X^-}. Note that the important difference between the construction of the neighborhoods used in TE_{Y→X}^{EA} and in TE_{Y→X}^{p1} is that, in the first case, the minimum neighborhood including the k neighbors is constrained to be a Cartesian product of (hyper-)cubes, whereas, in the second case, this neighborhood is a (hyper-)rectangle whose edge lengths can be completely different.
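The difference between the two neighborhood constructions can be made concrete by computing both volumes around the same reference point. In this hedged sketch (names and data are ours, not the toolboxes'), `blocks` lists which coordinates belong to each variable; the proposed estimator uses one edge per coordinate, while the extended algorithm uses one (hyper-)cube per variable, so its volume can only be larger or equal.

```python
def neighborhood_volumes(center, neighbors, blocks):
    """Compare the two neighborhood constructions around `center` given its
    retained `neighbors`.  `blocks` lists the coordinate indices of each
    variable, e.g. [[0], [1, 2]] for X^p in R and X^- in R^2.
    - v_rect: per-coordinate edges, 2 * max |p_j - c_j| (tight rectangle)
    - v_cubes: one cube per block, edge = largest per-coordinate edge in it"""
    d = len(center)
    edge = [2.0 * max(abs(p[j] - center[j]) for p in neighbors)
            for j in range(d)]
    v_rect = 1.0
    for e in edge:
        v_rect *= e
    v_cubes = 1.0
    for blk in blocks:
        side = max(edge[j] for j in blk)   # cube edge for this variable
        v_cubes *= side ** len(blk)
    return v_rect, v_cubes
```

Since each cube edge is the maximum of the rectangle edges it covers, `v_rect <= v_cubes` always holds, which is exactly why the rectangle is the tighter neighborhood.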

Experimental Results
In the experiments, we tested both Gaussian IID and Gaussian AR models to compare and validate the performance of the TE estimators proposed in the previous section. For a complete comparison, beyond the theoretical value of TE, we also computed the Granger causality index as a reference (as indicated previously, in the case of Gaussian signals, TE and the Granger causality index are equivalent up to a factor of two; see Appendix H). In each following figure, GCi/2 corresponds to the Granger causality index divided by two; TE estimated by the free TRENTOOL toolbox (corresponding to Equation (21)) is marked as the standard algorithm; that estimated by JIDT (corresponding to Equation (38)) is marked as the extended algorithm; TE_p1 is the TE estimator given by Equation (36); and TE_p2 is the TE estimator given by Equation (37). For all of the following results, the statistical means and the standard deviations of the different estimators were estimated by averaging over 200 trials.
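All estimators operate on samples (x_i^p, x_i^-, y_i^-) built from the raw time series by time-delay embedding, as defined in the Introduction. A minimal sketch of this reconstruction step (the function name is ours):

```python
def embed(x, y, m, n):
    """Build the samples (x_i^p, x_i^-, y_i^-) used in TE estimation:
    x_i^p = x[i+1] (value to predict), x_i^- = (x[i-m+1], ..., x[i]),
    y_i^- = (y[i-n+1], ..., y[i]).  Returns a list of (xp, xm, ym)."""
    start = max(m, n) - 1          # first index where both past vectors exist
    samples = []
    for i in range(start, len(x) - 1):
        xp = x[i + 1]
        xm = tuple(x[i - m + 1: i + 1])
        ym = tuple(y[i - n + 1: i + 1])
        samples.append((xp, xm, ym))
    return samples
```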

Gaussian IID Random Processes
The first model we tested, named Model 1, is built from three mutually independent processes Y, Z and W. Results are reported in Figure 3, where the dimensions d_Y and d_Z are identical. We observe that, for a low dimension and a sufficient number of neighbors (Figure 3a), all TE estimators approach the theoretical value (around 0.26) increasingly closely as the length of the signals grows, the best estimation being obtained by the two new estimators. Compared to Granger causality, these estimators display a greater bias but a lower variance. Due to the "curse of dimensionality", estimating TE accurately becomes much more difficult as the dimension increases (see Figure 3b). For a high dimension, all estimators reveal a non-negligible bias, even if the two new estimators still behave better than the two reference ones (the standard and extended algorithms).

Vectorial AR Models
In the second experiment, two AR models integrating either two or three signals were tested. The first vectorial AR model (named Model 2) was as follows, and the second vectorial AR model (named Model 3) was defined similarly with a third signal. For both models, e_x, e_y and e_z denote realizations of independent white Gaussian noises with zero mean and a variance of 0.1. As previously, we display in the following figures not only the theoretical value of TE, but also the Granger causality index for comparison. In this experiment, the prediction orders m and n were set equal to the corresponding regression orders of the AR models; for example, when estimating TE_{Y→X}, we set m = 2 and n = 2. For Figures 4 and 5, the number k of neighbors was fixed to eight, whereas, in Figure 6, this number was set to four and three (Figures 6a and 6b, respectively) to show the influence of this parameter. Figures 4 and 6 are related to Model 2, and Figure 5 is related to Model 3.

Conclusion

In this work, we first investigated the estimation of Shannon entropy based on the k-NN technique involving a rectangular neighboring region and introduced two different k-NN entropy estimators. We derived these new entropy estimators mathematically by extending the results and methodology developed in [24] and [36]. Given the new entropy estimators, two novel TE estimators were proposed, implying no extra computational cost compared to existing similar k-NN algorithms. To validate the performance of these estimators, we considered different simulated models and compared the new estimators with the two TE estimators available in the free TRENTOOL and JIDT toolboxes, respectively, which are extensions of two Kraskov-Stögbauer-Grassberger (KSG) MI estimators based respectively on (hyper-)cubic and (hyper-)rectangular neighborhoods.
Under the Gaussian assumption, experimental results showed the effectiveness of the new estimators for IID signals as well as for time-correlated AR signals, in comparison with the standard KSG-based estimator. This conclusion still holds when comparing the new algorithms with the extended KSG estimator. Globally, all TE estimators converge satisfactorily to the theoretical TE value, i.e., to half the value of the Granger causality index, while the newly proposed TE estimators showed lower bias for k sufficiently large (in comparison with the reference TE estimators), with comparable estimation variances.
As the variance remains relatively stable when the number of neighbors falls from eight to three, in this case the extended algorithm, which displays a slightly lower bias, may be preferred. However, one of the new TE estimators suffered from a noticeable error when the number of neighbors was small. Some experiments allowed us to verify that this issue already exists when estimating the entropy of a random vector: when the number of neighbors k falls below the dimension d, the bias increases drastically. More details on this phenomenon are given in Appendix I.
As expected, experiments with Model 1 showed that all of the TE estimators under examination suffered from the "curse of dimensionality", which makes it difficult to obtain an accurate estimation of TE with high-dimensional data. In this contribution, we do not present the preliminary results that we obtained when simulating a nonlinear version of Model 1, for which the three variables X_t, Y_t and Z_t were scalar and their joint law was non-Gaussian, because a random nonlinear transformation was used to compute X_t from Y_t and Z_t. For this model, we computed the theoretical TE (numerically, with good precision) and tuned the parameters to obtain a strong coupling between X_t and Z_t. The theoretical Granger causality index was equal to zero. We observed the same issue as that pointed out in [41], i.e., a very slow convergence of the estimator when the number of observations increases, and noticed that the four estimators TE^{SA}, TE^{EA}, TE^{p1} and TE^{p2} revealed very close performance. In this difficult case, our two methods do not outperform the existing ones. Probably, for this type of strong coupling, further improvement must be considered at the expense of an increased computational complexity, such as that proposed in [41].
This work is a first step in a more general context of connectivity investigation for neurophysiological activities obtained either from nonlinear physiological models or from clinical recordings. In this context, partial TE also has to be considered, and future work will address a comparison of the techniques presented in this contribution in terms of bias and variance.

We assume that the probability measure P_{X^p,X^-,Y^-} is absolutely continuous (with respect to the Lebesgue measure in R^{m+n+1}, denoted by µ_{m+n+1}), with the corresponding density. Then, we are sure that the two following conditional probability density functions exist, and Equation (3) yields Equation (44), which can be rewritten accordingly.

B. Basic Structure of TE Estimators
From Equation (8), assuming that X and Y are jointly strongly ergodic, we obtain Equation (50), in which the volume of the (hyper-)rectangle appears. Finally, by taking the experimental mean of the right-hand term in Equation (50), we obtain an estimation of the expectation E[log p_X(X)].

The domain D_{ε_1,...,ε_d}(x_1) has a (hyper-)volume constrained to the value v_r. We have two equivalences: the first one (the inclusion is a strict inclusion) is clearly implied by the construction of D_{ε_1,...,ε_d}(x_1), and the second one expresses the fact that the (hyper-)volume of D_{ε_1,...,ε_d}(x_1) is larger than v_r if and only if the normalized domain does not contain more than (k − ξ_1) points x_j (as ξ_1 of them are on the border of D_{ε_1,...,ε_d}(x_1), which is necessarily not included in it). These equivalences imply equalities between conditional probability values (Equation (53)). Only (N − 1 − ξ_1) events {X_j ∈ D_{ε_1,...,ε_d}(x_1)} are to be considered, because the variable X_1 and the ξ_1 variable(s) on the border of D_{ε_1,...,ε_d}(x_1) must be discarded. Moreover, these events are independent. Hence, the probability value in Equation (53) can be developed as a product. For a small v_r and a given x_1, we have P(X ∈ D_{ε_1,...,ε_d}(x_1)) ≈ p_X(x_1) v_r (note that the randomness of (ε_1, ..., ε_d) does not influence this approximation, as the (hyper-)volume of D_{ε_1,...,ε_d}(x_1) is imposed to be equal to v_r). Finally, we can write the resulting expression.

F. Derivation of Equation (32)

We take the derivative of P(T_1 ≤ r | X_1 = x_1, Ξ_1 = ξ_1) to obtain the conditional density function of T_1 (Equation (56)). Defining a(i) as in Equation (57), the telescoping sum gives:

Σ_i e^{−k̃_1 p_X(x_1) e^r} (a(i) − a(i + 1)) = −e^{−k̃_1 p_X(x_1) e^r} (a(0) − a(k − ξ_1 + 1)) = e^{−k̃_1 p_X(x_1) e^r} a(k − ξ_1 + 1) = k̃_1 p_X(x_1) e^r. (59)

H. Transfer Entropy and Granger Causality
TE can be considered as a measurement of the degree to which the history Y^- of the process Y disambiguates the future X^p of X beyond the degree to which its own history X^- disambiguates this future [22]. It is an information-theoretic implementation of Wiener's principle of observational causality. Hence, TE bears a natural relation to Granger causality. As is well known, Granger causality emphasizes the reduction of the mean square error of the linear prediction of X_i^p when adding Y_i^- to X_i^-, through the Granger causality index GC_{Y→X} = log( lpe(X_i^p | X_i^-) / lpe(X_i^p | X_i^-, Y_i^-) ), where lpe(X_i^p | U) is the error variance when predicting X_i^p linearly from U. TE is framed in terms of the reduction of the Shannon uncertainty (entropy) of the predictive probability distribution. When the probability distribution of (X_i^p, X_i^-, Y_i^-) is assumed to be Gaussian, TE and Granger causality are entirely equivalent, up to a factor of two [42]: TE_{Y→X} = GC_{Y→X}/2. Consequently, in the Gaussian case, TE can be easily computed from a second-order statistical characterization of (X_i^p, X_i^-, Y_i^-). This Gaussian assumption obviously holds when the processes Y and X are jointly normally distributed and, more particularly, when they correspond to a Gaussian autoregressive (AR) bivariate process. In [42], Barnett et al.
discussed the relation between these two causality measures, bridging information-theoretic and autoregressive approaches.
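The Gaussian equivalence TE_{Y→X} = GC_{Y→X}/2 can be checked numerically on a toy linear model. The model below (X driven by lagged Y with hypothetical coefficients α = 0.5, β = 0.8 and unit noise variances) is ours, not one of the paper's models; the prediction error variances are estimated by ordinary least squares via the normal equations.

```python
import math
import random

def lpe(y, preds):
    """Linear prediction error variance of y from the predictor columns in
    `preds`, via least squares on the normal equations (no intercept: the
    simulated signals are zero-mean)."""
    n, p = len(y), len(preds)
    G = [[sum(preds[a][t] * preds[b][t] for t in range(n)) for b in range(p)]
         for a in range(p)]
    r = [sum(preds[a][t] * y[t] for t in range(n)) for a in range(p)]
    if p == 1:
        w = [r[0] / G[0][0]]
    else:  # p == 2: solve the 2x2 system by Cramer's rule
        det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
        w = [(r[0] * G[1][1] - r[1] * G[0][1]) / det,
             (G[0][0] * r[1] - G[1][0] * r[0]) / det]
    resid = [y[t] - sum(w[a] * preds[a][t] for a in range(p)) for t in range(n)]
    return sum(e * e for e in resid) / n

# Hypothetical model: x_t = alpha*x_{t-1} + beta*y_{t-1} + e_t, y_t IID N(0,1)
random.seed(0)
N, alpha, beta = 20000, 0.5, 0.8
x, y = [0.0], [random.gauss(0.0, 1.0)]
for _ in range(N):
    x.append(alpha * x[-1] + beta * y[-1] + random.gauss(0.0, 1.0))
    y.append(random.gauss(0.0, 1.0))

xp, xm, ym = x[1:], x[:-1], y[:-1]      # X^p, X^-, Y^- with m = n = 1
lpe_r = lpe(xp, [xm])                   # restricted prediction (X^- only)
lpe_f = lpe(xp, [xm, ym])               # full prediction (X^- and Y^-)
te_hat = 0.5 * math.log(lpe_r / lpe_f)  # TE = GC / 2 in the Gaussian case
te_true = 0.5 * math.log(beta ** 2 + 1.0)  # restricted lpe = beta^2 + sigma_e^2
```

For this model, the restricted prediction error variance is β²σ_Y² + σ_e² = 1.64, so the theoretical TE is 0.5·log(1.64) ≈ 0.25, which the empirical estimate approaches as N grows.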

I. Comparison between Entropy Estimators
Here, the dimensions of the four vectors are d(X^-, Y^-) = m + n = 4, d(X^p, X^-) = 1 + m = 3, d(X^p, X^-, Y^-) = 1 + m + n = 5 and d(X^-) = m = 2, respectively. Note that, if we denote by X_M2 and Y_M2 the two components in Model 2, the general notation (X^p, X^-, Y^-) corresponds to (Y^p_M2, Y^-_M2, X^-_M2), because, in Figure 6b, the analyzed direction is X → Y and not the reverse. When considering the estimation of H(X^p, X^-, Y^-), we have d = 5 and k = 3, the latter being the imposed number of neighbors in the global space. Consequently, from the results shown in Figure 8, we can expect that, in Model 2, the quantity Ĥ(X^p, X^-, Y^-) will be drastically underestimated. For the other components Ĥ(X^-, Y^-), Ĥ(X^p, X^-) and Ĥ(X^-), the numbers of neighbors to consider are generally larger than three (as a consequence of Kraskov's technique, which introduces projected distances) and d ≤ 5, so we do not expect any underestimation of these terms. Therefore, globally, when summing the four entropy estimations, the resulting positive bias observed in Figure 6b is understandable.

[Figure caption fragment:] ..., where β = 0.5. "Curve 1" stands for the true value; "Curve 2", "Curve 3" and "Curve 4" correspond to the values of entropy obtained using Equations (10), (22) and (35), respectively.
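For reference, the Kozachenko-Leonenko estimator of Equation (10) can be sketched as follows. With the maximum norm, Ĥ = ψ(N) − ψ(k) + (d/N) Σ_i log ε_i, where ε_i is twice the distance from point i to its k-th nearest neighbor. The pure-Python brute-force implementation and the function names are ours, for illustration only.

```python
import math
import random

def digamma(n):
    # psi(n) for a positive integer n: -gamma + sum_{j=1}^{n-1} 1/j
    euler_gamma = 0.5772156649015329
    return -euler_gamma + sum(1.0 / j for j in range(1, n))

def kl_entropy(points, k):
    """Kozachenko-Leonenko entropy estimate (max norm) of an IID sample."""
    N, d = len(points), len(points[0])
    acc = 0.0
    for i, p in enumerate(points):
        dists = sorted(max(abs(a - b) for a, b in zip(p, q))
                       for j, q in enumerate(points) if j != i)
        acc += math.log(2.0 * dists[k - 1])   # eps_i = 2 * k-th NN distance
    return digamma(N) - digamma(k) + (d / N) * acc
```

As a sanity check, for a uniform sample on [0, 1] (whose differential entropy is 0), the estimate is close to zero for moderate N and k.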

2. Original k-Nearest-Neighbors Strategies

2.1. Kozachenko-Leonenko and Singh's Entropy Estimators for a Continuously-Distributed Random Vector

2.1.1. Notations

Each matrix C_U (U ∈ {Y, Z}) is a Toeplitz matrix with the first line equal to [1, α, ..., α^{d_U − 1}]. For the matrix C_Y, we chose α = 0.5, and for C_Z, α = 0.2. The standard deviation σ_W was set to 0.5. The vectors a and b were such that a = 0.1 × [1, 2, ..., d_Y] and b = 0.1 × [d_Z, d_Z − 1, ..., 1]. With this model, we aimed at estimating H(X|Y) − H(X|Y, Z) to test whether the knowledge of signals Y and Z could improve the prediction of X compared to the knowledge of Y alone.
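Under these parameter settings, the theoretical value targeted in Figure 3 can be computed in closed form: since Y, Z and W are independent and X is the linear combination a'Y + b'Z + W, the conditional variances are σ²(X|Y,Z) = σ_W² and σ²(X|Y) = b'C_Z b + σ_W², so H(X|Y) − H(X|Y,Z) = ½ log((b'C_Z b + σ_W²)/σ_W²). The derivation is standard Gaussian conditional entropy; the function names below are ours.

```python
import math

def toeplitz_first_line(alpha, d):
    # Symmetric Toeplitz matrix with first line [1, alpha, ..., alpha**(d-1)]
    return [[alpha ** abs(i - j) for j in range(d)] for i in range(d)]

def gaussian_te_z_to_x(d_z, alpha_z=0.2, sigma_w=0.5):
    """Theoretical H(X|Y) - H(X|Y,Z) for Model 1's stated parameters.
    Y, Z, W independent and X = a'Y + b'Z + W imply
    var(X|Y,Z) = sigma_w^2 and var(X|Y) = b'C_Z b + sigma_w^2."""
    C = toeplitz_first_line(alpha_z, d_z)
    b = [0.1 * (d_z - j) for j in range(d_z)]     # b = 0.1 * [d_Z, ..., 1]
    bCb = sum(b[i] * C[i][j] * b[j]
              for i in range(d_z) for j in range(d_z))
    return 0.5 * math.log((bCb + sigma_w ** 2) / sigma_w ** 2)
```

For d_Y = d_Z = 3, this gives approximately 0.26, matching the theoretical value quoted with Figure 3a.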

Figure 3 .
Figure 3. Information transfer from Z to X (Model 1) estimated for two different dimensions with k = 8. The figure displays the mean values and the standard deviations: (a) d_Y = d_Z = 3; (b) d_Y = d_Z = 8.

Figure 8
Figure 8 displays the values of entropy for a Gaussian d-dimensional vector as a function of the number of neighbors k, for d = 3 in Figure 8a and d = 8 in Figure 8b, obtained with different estimators. The theoretical entropy value is compared with its estimation from the Kozachenko-Leonenko reference estimator (Equation (10), red circles), its extension (Equation (22), black stars) and the extension of Singh's estimator (Equation (35), blue squares). It appears clearly that, for the extended Singh's estimator, the bias (true value minus estimated value) increases drastically when the number of neighbors decreases below a threshold slightly lower than the dimension d of the vector. This explains some apparently surprising results obtained with this estimator in the estimation of TE, as reported in Figure 6b, since TE estimation is a sum of four separate vector entropy estimations.