Article

Contribution to Transfer Entropy Estimation via the k-Nearest-Neighbors Approach

by Jie Zhu 1,2,3, Jean-Jacques Bellanger 1,2, Huazhong Shu 3,4 and Régine Le Bouquin Jeannès 1,2,3,*
1 Institut National de la Santé Et de la Recherche Médicale (INSERM), U 1099, Rennes F-35000, France
2 Université de Rennes 1, LTSI, Rennes F-35000, France
3 Centre de Recherche en Information Biomédicale sino-français (CRIBs), Rennes F-35000, France
4 Laboratory of Image Science and Technology (LIST), School of Computer Science and Engineering, Southeast University, Nanjing 210018, China
* Author to whom correspondence should be addressed.
Entropy 2015, 17(6), 4173-4201; https://doi.org/10.3390/e17064173
Submission received: 31 December 2014 / Accepted: 10 June 2015 / Published: 16 June 2015
(This article belongs to the Special Issue Transfer Entropy)

Abstract:
This paper deals with the estimation of transfer entropy based on the k-nearest neighbors (k-NN) method. To this end, we first investigate the estimation of Shannon entropy involving a rectangular neighboring region, as suggested in already existing literature, and develop two kinds of entropy estimators. Then, applying the widely-used error cancellation approach to these entropy estimators, we propose two novel transfer entropy estimators, implying no extra computational cost compared to existing similar k-NN algorithms. Experimental simulations compare the new estimators with the transfer entropy estimators available in free toolboxes, which correspond to two different extensions of the Kraskov–Stögbauer–Grassberger (KSG) mutual information estimator to transfer entropy estimation, and demonstrate the effectiveness of these new estimators.

1. Introduction

Transfer entropy (TE) is an information-theoretic measure which aims to quantify the amount of time-directed information between two dynamical systems. Given the past time evolution of a dynamical system A, TE from another dynamical system B to the first system A is the amount of Shannon uncertainty reduction in the future time evolution of A when including the knowledge of the past evolution of B. After its introduction by Schreiber [1], TE received special attention in various fields, such as neuroscience [2–8], physiology [9–11], climatology [12] and others, such as physical systems [13–17].
More precisely, let us suppose that we observe the output $X_i \in \mathbb{R}$, $i \in \mathbb{Z}$, of some sensor connected to A. If the sequence X is supposed to be an m-th order Markov process, i.e., if, considering the subsequences $X_i^{(k)} = (X_{i-k+1}, X_{i-k+2}, \ldots, X_i)$, $k > 0$, the probability measure $P_X$ (defined on measurable subsets of real sequences) attached to X fulfills the m-th order Markov hypothesis:
$$\forall i,\ \forall m' > m:\quad dP_{X_{i+1}\mid X_i^{(m')}}\!\left(x_{i+1}\,\middle|\,x_i^{(m')}\right) = dP_{X_{i+1}\mid X_i^{(m)}}\!\left(x_{i+1}\,\middle|\,x_i^{(m)}\right), \qquad \forall x_{i+1} \in \mathbb{R},\ \forall x_i^{(m')} \in \mathbb{R}^{m'}, \tag{1}$$
then the past information $X_i^{(m)}$ (before time instant i + 1) is sufficient for a prediction of $X_{i+k}$, k ≥ 1, and can be considered as an m-dimensional state vector at time i (note that, to know from X the hidden dynamical evolution of A, we need a one-to-one relation between $X_i^{(m)}$ and the physical state of A at time i). For the sake of clarity, we introduce the following notation: $(X_i^p, X_i^-, Y_i^-)$, i = 1, 2, …, N, is an independent and identically distributed (IID) random sequence, each term following the same distribution as a random vector $(X^p, X^-, Y^-) \in \mathbb{R}^{1+m+n}$ whatever i (in $X^p$, $X^-$, $Y^-$, the upper indices “p” and “−” correspond to “predicted” and “past”, respectively). This notation will substitute for the notation $(X_{i+1}, X_i^{(m)}, Y_i^{(n)})$, i = 1, 2, …, N, and we will denote by $S_{X^p,X^-,Y^-}$, $S_{X^p,X^-}$, $S_{X^-,Y^-}$ and $S_{X^-}$ the spaces in which $(X^p, X^-, Y^-)$, $(X^p, X^-)$, $(X^-, Y^-)$ and $X^-$ are respectively observed.
Now, let us suppose that a causal influence exists from B on A and that an auxiliary random process $Y_i \in \mathbb{R}$, $i \in \mathbb{Z}$, recorded from a sensor connected to B, is such that, at each time i and for some n > 0, $Y_i^- \equiv Y_i^{(n)}$ is an image (not necessarily one-to-one) of the physical state of B. The negation of this causal influence implies:
$$\exists m > 0,\ \exists n > 0:\ \forall i:\quad dP_{X_i^p \mid X_i^{(m)}}\!\left(x_i^p \mid x_i^{(m)}\right) = dP_{X_i^p \mid X_i^{(m)}, Y_i^{(n)}}\!\left(x_i^p \mid x_i^{(m)}, y_i^{(n)}\right). \tag{2}$$
If Equation (2) holds, it is said that there is an absence of information transfer from B to A. Otherwise, the process X can no longer be considered strictly Markovian. Let us suppose the joint process (X, Y) is Markovian, i.e., there exist a given pair (m′, n′), a transition function f and an independent random sequence $e_i$, $i \in \mathbb{Z}$, such that $[X_{i+1}, Y_{i+1}]^T = f\!\left(X_i^{(m')}, Y_i^{(n')}, e_{i+1}\right)$, where the random variable $e_{i+1}$ is independent of the past random sequence $(X_j, Y_j, e_j)$, j ≤ i, whatever i. As $X_i = g\!\left(X_i^{(m')}, Y_i^{(n')}\right)$, where g is clearly a non-injective function, the pair $\{(X_i^{(m')}, Y_i^{(n')}), X_i\}$, $i \in \mathbb{Z}$, corresponds to a hidden Markov process, and it is well known that this observation process is not generally Markovian.
The deviation from this assumption can be quantified using the Kullback pseudo-metric, leading to the general definition of TE at time i:
$$\mathrm{TE}_{Y \to X, i} = \int_{\mathbb{R}^{m+n+1}} \log\!\left[\frac{dP_{X_i^p \mid X_i^-, Y_i^-}\!\left(x_i^p \mid x_i^-, y_i^-\right)}{dP_{X_i^p \mid X_i^-}\!\left(x_i^p \mid x_i^-\right)}\right] dP_{X_i^p, X_i^-, Y_i^-}\!\left(x_i^p, x_i^-, y_i^-\right), \tag{3}$$
where the ratio in Equation (3) corresponds to the Radon–Nikodym derivative [18,19] (i.e., the density) of the conditional measure $dP_{X_i^p \mid X_i^-, Y_i^-}(\cdot \mid x_i^-, y_i^-)$ with respect to the conditional measure $dP_{X_i^p \mid X_i^-}(\cdot \mid x_i^-)$. Considering “log” as the natural logarithm, information is measured in natural units (nats). Now, given two observable scalar random time series X and Y with no a priori given model (as is generally the case), if we are interested in defining some causal influence from Y to X through TE analysis, we must specify the dimensions of the past information vectors $X^-$ and $Y^-$, i.e., m and n. Additionally, even if we impose them, it is not evident that all of the coordinates in $X_i^{(m)}$ and $Y_i^{(n)}$ will be useful. To deal with this issue, variable selection procedures have been proposed in the literature, such as uniform and non-uniform embedding algorithms [20,21].
If the joint probability measure $P_{X_i^p, X_i^-, Y_i^-}$ is absolutely continuous with respect to the Lebesgue measure $\mu_{m+n+1}$ in $\mathbb{R}^{1+m+n}$, then the joint probability density function (pdf) $p_{X_i^p, X_i^-, Y_i^-}(x_i^p, x_i^-, y_i^-)$, and also the pdf of each subset of $\{X_i^p, X_i^-, Y_i^-\}$, exist, and $\mathrm{TE}_{Y \to X, i}$ can then be written (see Appendix A):
$$\mathrm{TE}_{Y \to X, i} = -E\!\left[\log\!\left(p_{X_i^-, Y_i^-}(X_i^-, Y_i^-)\right)\right] - E\!\left[\log\!\left(p_{X_i^p, X_i^-}(X_i^p, X_i^-)\right)\right] + E\!\left[\log\!\left(p_{X_i^p, X_i^-, Y_i^-}(X_i^p, X_i^-, Y_i^-)\right)\right] + E\!\left[\log\!\left(p_{X_i^-}(X_i^-)\right)\right] \tag{4}$$
or:
$$\mathrm{TE}_{Y \to X, i} = H(X_i^-, Y_i^-) + H(X_i^p, X_i^-) - H(X_i^p, X_i^-, Y_i^-) - H(X_i^-), \tag{5}$$
where $H(U)$ denotes the Shannon differential entropy of a random vector U. Note that, if the processes Y and X are assumed to be jointly stationary, for any real function $g: \mathbb{R}^{m+n+1} \to \mathbb{R}$, the expectation $E\!\left[g\!\left(X_{i+1}, X_i^{(m)}, Y_i^{(n)}\right)\right]$ does not depend on i. Consequently, $\mathrm{TE}_{Y \to X, i}$ does not depend on i (and so can be simply denoted by $\mathrm{TE}_{Y \to X}$), nor do any of the quantities defined in Equations (3) to (5). In theory, TE is never negative and is equal to zero if and only if Equation (2) holds.
According to Definition (3), TE is not symmetric, and it can be regarded as a conditional mutual information (CMI) [3,22] (sometimes also named partial mutual information (PMI) in the literature [23]). Recall that mutual information between two random vectors X and Y is defined by:
$$I(X;Y) = H(X) + H(Y) - H(X,Y), \tag{6}$$
and TE can be also written as:
$$\mathrm{TE}_{Y \to X} = I(X^p; Y^- \mid X^-). \tag{7}$$
Considering the estimation $\widehat{\mathrm{TE}}_{Y \to X}$ of $\mathrm{TE}_{Y \to X}$ as a function defined on the set of observable occurrences $(x_i, y_i)$, i = 1, …, N, of a stationary sequence $(X_i, Y_i)$, i = 1, …, N, and Equation (5), a standard structure for the estimator is given by (see Appendix B):
$$\widehat{\mathrm{TE}}_{Y \to X} = \widehat{H}(X^-, Y^-) + \widehat{H}(X^p, X^-) - \widehat{H}(X^p, X^-, Y^-) - \widehat{H}(X^-) = -\frac{1}{N}\sum_{n=1}^{N}\widehat{\log\!\left(p_{U_1}(u_{1n})\right)} - \frac{1}{N}\sum_{n=1}^{N}\widehat{\log\!\left(p_{U_2}(u_{2n})\right)} + \frac{1}{N}\sum_{n=1}^{N}\widehat{\log\!\left(p_{U_3}(u_{3n})\right)} + \frac{1}{N}\sum_{n=1}^{N}\widehat{\log\!\left(p_{U_4}(u_{4n})\right)}, \tag{8}$$
where U1, U2, U3 and U4 stand respectively for (X, Y), (Xp, X), (Xp, X, Y) and X. Here, for each n, log ( p U ( u n ) ) ^ is an estimated value of log (pU (un)) computed as a function fn (u1, …, uN) of the observed sequence un, n = 1, …, N. With the k-NN approach addressed in this study, fn (u1, …, uN) depends explicitly only on un and on its k nearest neighbors. Therefore, the calculation of H ( U ) ^ definitely depends on the chosen estimation functions fn. Note that if, for N fixed, these functions correspond respectively to unbiased estimators of log (p (un)), then TE Y X ^ is also unbiased; otherwise, we can only expect that TE Y X ^ is asymptotically unbiased (for N large). This is so if the estimators of log (pU (un)) are asymptotically unbiased.
Now, the theoretical derivation and analysis of the most currently used estimators $\widehat{H}(U)(u_1, \ldots, u_N) = -\frac{1}{N}\sum_{n=1}^{N}\widehat{\log(p_U(u_n))}$ for the estimation of $H(U)$ generally suppose that $u_1, \ldots, u_N$ are N independent occurrences of the random vector U, i.e., $u_1, \ldots, u_N$ is an occurrence of an independent and identically distributed (IID) sequence $U_1, \ldots, U_N$ of random vectors ($\forall i = 1, \ldots, N: P_{U_i} = P_U$). Although the IID hypothesis does not apply to our initial problem concerning the measure of TE on stationary random sequences (that are generally not IID), the new methods presented in this contribution are extended from existing ones assuming this hypothesis, without relaxing it. However, the experimental section will present results not only on IID observations, but also on non-IID stationary autoregressive (AR) processes, as our goal was to verify if some improvement can be nonetheless obtained for non-IID data, such as AR data.
If we come back to mutual information (MI) defined by Equation (6) and compare it with Equation (5), it is obvious that estimating MI and estimating TE share similarities. Hence, similarly to Equation (8) for TE, a basic estimation $\widehat{I}(X;Y)$ of $I(X;Y)$ from a sequence $(x_i, y_i)$, i = 1, …, N, of N independent trials is:
$$\widehat{I}(X;Y) = -\frac{1}{N}\sum_{n=1}^{N}\widehat{\log\!\left(p_X(x_n)\right)} - \frac{1}{N}\sum_{n=1}^{N}\widehat{\log\!\left(p_Y(y_n)\right)} + \frac{1}{N}\sum_{n=1}^{N}\widehat{\log\!\left(p_{X,Y}(x_n, y_n)\right)}. \tag{9}$$
In what follows, when explaining the links among the existing methods and the proposed ones, we refer to Figure 1. In this diagram, a box identified by a number k in a circle is denoted by Box ⓚ.
Improving the performance (in terms of bias and variance) of the TE and MI estimators (obtained by choosing specific estimation functions $\widehat{\log(p(\cdot))}$ in Equations (8) and (9), respectively) remains an issue when they are applied to short-length IID (or non-IID) sequences [3]. In this work, we particularly focused on bias reduction. For MI, the most widely-used estimator is the Kraskov–Stögbauer–Grassberger (KSG) estimator [24,31], which was later extended to estimate transfer entropy, resulting in the k-NN TE estimator [25–27,32–35] (adopted in the widely-used TRENTOOL open source toolbox, Version 3.0). Our contribution originated in the Kozachenko–Leonenko entropy estimator summarized in [24] and proposed beforehand in the literature to get an estimation $\widehat{H}(X)$ of the entropy $H(X)$ of a continuously-distributed random vector X from a finite sequence of independent outcomes $x_i$, i = 1, …, N. This estimator, as well as another entropy estimator proposed by Singh et al. in [36], are briefly described in Section 2.1, before we introduce, in Section 4, our two new TE estimators based on both of them. In Section 2.2, the Kraskov MI and standard TE estimators derived in the literature from the Kozachenko–Leonenko entropy estimator are summarized, and the passage from a square to a rectangular neighboring region to derive new entropy estimations is detailed in Section 3. Our methodology is depicted in Figure 1.

2. Original k-Nearest-Neighbors Strategies

2.1. Kozachenko–Leonenko and Singh’s Entropy Estimators for a Continuously-Distributed Random Vector

2.1.1. Notations

Let us consider a sequence $x_i$, $i = 1, \ldots, N$, in $\mathbb{R}^{d_X}$ (in our context, this sequence corresponds to an outcome of an IID sequence $X_1, \ldots, X_N$, such that the common probability distribution is equal to that of a given random vector X). The set of the k nearest neighbors of $x_i$ in this sequence (except for $x_i$) and the distance between $x_i$ and its k-th nearest neighbor are respectively denoted by $\chi_i^k$ and $d_{x_i,k}$. We denote by $D_{x_i}(\chi_i^k) \subset \mathbb{R}^{d_X}$ a neighborhood of $x_i$ in $\mathbb{R}^{d_X}$, which is the image of $(x_i, \chi_i^k)$ by a set-valued map. For a given norm $\|\cdot\|$ on $\mathbb{R}^{d_X}$ (Euclidean norm, maximum norm, etc.), a standard construction $(x_i, \chi_i^k) \in (\mathbb{R}^{d_X})^{k+1} \mapsto D_{x_i}(\chi_i^k) \subset \mathbb{R}^{d_X}$ is the (hyper-)ball of radius equal to $d_{x_i,k}$, i.e., $D_{x_i}(\chi_i^k) = \{x : \|x - x_i\| \le d_{x_i,k}\}$. The (hyper-)volume (i.e., the Lebesgue measure) of $D_{x_i}(\chi_i^k)$ is then $\upsilon_i = \int_{D_{x_i}(\chi_i^k)} dx$ (where $dx \equiv d\mu_{d_X}(x)$).

2.1.2. Kozachenko–Leonenko Entropy Estimator

The Kozachenko–Leonenko entropy estimator is given by (Box ③ in Figure 1):
$$\widehat{H}(X)_{KL} = \psi(N) + \frac{1}{N}\sum_{i=1}^{N}\log(\upsilon_i) - \psi(k), \tag{10}$$
where $\upsilon_i$ is the volume of $D_{x_i}(\chi_i^k) = \{x : \|x - x_i\| \le d_{x_i,k}\}$ computed with the maximum norm and $\psi(k) = \Gamma'(k)/\Gamma(k)$ denotes the digamma function. Note that, using Equation (10), entropy is measured in natural units (nats).
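As a reference point for the estimators discussed below, a minimal numerical sketch of Equation (10) is given here, assuming an IID sample stored as an (N, d) NumPy array and using the maximum norm as in the text; the function name entropy_kl and the Gaussian sanity check are our own additions. Replacing $\psi(N)$ by $\log(N)$ in the return value gives Singh's estimator of Equation (16) introduced further below.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def entropy_kl(x, k=8):
    """Kozachenko-Leonenko entropy estimate of Equation (10), in nats.

    x : (N, d) array of IID samples; k : number of neighbors. With the maximum
    norm, the neighborhood of x_i is the hyper-cube of edge 2*d_{x_i,k}, whose
    volume is v_i = (2*d_{x_i,k})**d.
    """
    x = np.asarray(x, float).reshape(len(x), -1)
    n, d = x.shape
    # query k+1 neighbors because the nearest returned point is x_i itself
    dist, _ = cKDTree(x).query(x, k=k + 1, p=np.inf)
    log_vi = d * np.log(2.0 * dist[:, -1])
    return digamma(n) - digamma(k) + log_vi.mean()

# Sanity check against the closed form H = 0.5*log((2*pi*e)**d) for a
# standard bivariate Gaussian:
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(entropy_kl(rng.standard_normal((5000, 2))), np.log(2 * np.pi * np.e))
```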
To come up with a concise presentation of this estimator, we give hereafter a summary of the different steps to get it, starting from [24]. First, let us consider the distance $d_{x_i,k}$ between $x_i$ and its k-th nearest neighbor (introduced above) as a realization of the random variable $D_{x_i,k}$, and let us denote by $q_{x_i,k}(x)$, $x \in \mathbb{R}$, the corresponding probability density function (conditioned by $X_i = x_i$). Secondly, let us consider the quantity $h_{x_i}(\varepsilon) = \int_{\|u - x_i\| \le \varepsilon/2} dP_X(u)$. This is the probability mass of the (hyper-)ball with radius equal to $\varepsilon/2$ and centered on $x_i$. This probability mass is approximately equal to:
$$h_{x_i}(\varepsilon) \approx p_X(x_i)\int_{\|\xi\| \le \varepsilon/2} d\mu_d(\xi) = p_X(x_i)\, c_d\, \varepsilon^d, \tag{11}$$
if the density function is approximately constant on the (hyper-)ball. The variable $c_d$ is the volume of the unit-radius d-dimensional (hyper-)ball in $\mathbb{R}^d$ ($c_d = 1$ with the maximum norm). Furthermore, it can be established (see [24] for details) that the expectation $E\!\left[\log\!\left(h_{X_i}(D_{X_i,k})\right)\right]$, where $h_{X_i}$ is the random variable associated with $h_{x_i}$ and $D_{X_i,k}$ (which must not be confused with the notation $D_{x_i}(\chi_i^k)$ introduced previously) denotes the random distance between the k-th neighbor selected in the set of random vectors $\{X_j,\ 1 \le j \le N,\ j \ne i\}$ and the random point $X_i$, is equal to $\psi(k) - \psi(N)$ and does not depend on $p_X(\cdot)$. Equating it with $E\!\left[\log\!\left(p_X(X_i)\, c_d\, D_{X_i,k}^d\right)\right]$ leads to:
$$\psi(k) - \psi(N) \approx E\!\left[\log\!\left(p_X(X_i)\right)\right] + E\!\left[\log\!\left(c_d D_{X_i,k}^d\right)\right] = -H(X_i) + E\!\left[\log(V_i)\right] \tag{12}$$
and:
$$H(X_i) \approx \psi(N) - \psi(k) + E\!\left[\log\!\left(c_d D_{X_i,k}^d\right)\right]. \tag{13}$$
Finally, by using the law of large numbers, when N is large, we get:
$$H(X_i) \approx \psi(N) - \psi(k) + \frac{1}{N}\sum_{i=1}^{N}\log(\upsilon_i) = \widehat{H}(X)_{KL}, \tag{14}$$
where $\upsilon_i$ is the realization of the random (hyper-)volume $V_i = c_d D_{X_i,k}^d$.
Moreover, as observed in [24], it is possible to make the number of neighbors k depend on i by substituting the mean $\frac{1}{N}\sum_{i=1}^{N}\psi(k_i)$ for the constant $\psi(k)$ in Equation (14), so that $\widehat{H}(X)_{KL}$ becomes:
$$\widehat{H}(X)_{KL} = \psi(N) + \frac{1}{N}\sum_{i=1}^{N}\left(\log(\upsilon_i) - \psi(k_i)\right). \tag{15}$$

2.1.3. Singh’s Entropy Estimator

The question of k-NN entropy estimation is also discussed by Singh et al. in [36], where another estimator, denoted by H ( X ) S ^ hereafter, is proposed (Box ② in Figure 1):
$$\widehat{H}(X)_S = \log(N) + \frac{1}{N}\sum_{i=1}^{N}\log(\upsilon_i) - \psi(k). \tag{16}$$
Using the approximation ψ(N) ≈ log(N) for large values of N, the estimator given by Equation (16) is close to that defined by Equation (10). This estimator was derived by Singh et al. in [36] through the four following steps:
  • Introduce the classical entropy estimator structure:
    $$\widehat{H}(X) \triangleq -\frac{1}{N}\sum_{i=1}^{N}\log\widehat{p_X(X_i)} = \frac{1}{N}\sum_{i=1}^{N}T_i, \tag{17}$$
    where:
    $$\widehat{p_X(X_i)} \triangleq \frac{k}{N\upsilon_i}. \tag{18}$$
  • Assuming that the random variables $T_i$, i = 1, …, N, are identically distributed, so that $E[\widehat{H}(X)] = E(T_1)$ (note that $E(T_1)$ depends on N, even if the notation does not make that explicit), compute the asymptotic value of $E(T_1)$ (when N is large) by firstly computing its asymptotic cumulative probability distribution function and the corresponding probability density $p_{T_1}$, and, finally, compute the expectation $E(T_1) = \int t\, p_{T_1}(t)\, dt$.
  • It appears that $E(T_1) = E[\widehat{H}(X)] = H(X) + B$, where B is a constant, which is identified with the bias.
  • Subtract this bias from $\widehat{H}(X)$ to get $\widehat{H}(X)_S = \widehat{H}(X) - B$ and the formula given in Equation (16).
Note that the cancellation of the asymptotic bias does not imply that the bias obtained with a finite value of N is also exactly canceled. In Appendix C, we explain the origin of the bias for the entropy estimator given in Equation (17).
Observe also that, as for the Kozachenko–Leonenko estimator, it is possible to adapt Equation (16) if we want to consider a number of neighbors ki depending on i. Equation (16) must then be replaced by:
$$\widehat{H}(X)_S = \log(N) + \frac{1}{N}\sum_{i=1}^{N}\left(\log(\upsilon_i) - \psi(k_i)\right). \tag{19}$$

2.2. Standard Transfer Entropy Estimator

Estimating the entropies separately in Equations (8) and (9) leads to individual bias values. Now, it is possible to cancel out (at least partially) the bias by considering the algebraic sums in Equations (8) and (9). To help in this cancellation, on the basis of the Kozachenko–Leonenko entropy estimator, Kraskov et al. proposed to retain the same (hyper-)ball radius for each of the different spaces instead of using the same number k for both the joint space $S_{X,Y}$ and the marginal spaces $S_X$ and $S_Y$ [24,37], leading to the following MI estimator (Box ⑪ in Figure 1):
$$\widehat{I}_K = \psi(k) + \psi(N) - \frac{1}{N}\sum_{i=1}^{N}\left[\psi(n_{X,i} + 1) + \psi(n_{Y,i} + 1)\right], \tag{20}$$
where $n_{X,i}$ and $n_{Y,i}$ denote the numbers of points that fall strictly within the resulting distance in the lower-dimensional spaces $S_X$ and $S_Y$, respectively.
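For illustration, a minimal numerical sketch of Equation (20) is given below, assuming jointly IID samples stored as (N, d_X) and (N, d_Y) NumPy arrays and maximum-norm distances; the function name mi_ksg and the strict-inequality trick (shrinking the radius slightly) are our own choices, not part of the original algorithm description.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def mi_ksg(x, y, k=8):
    """Sketch of the KSG mutual information estimator of Equation (20), in nats."""
    x = np.asarray(x, float).reshape(len(x), -1)
    y = np.asarray(y, float).reshape(len(y), -1)
    n = len(x)
    joint = np.hstack([x, y])
    # distance to the k-th neighbor in the joint space (k + 1 because the
    # closest returned point is the query point itself)
    dist, _ = cKDTree(joint).query(joint, k=k + 1, p=np.inf)
    eps = dist[:, -1] * (1 - 1e-10)   # approximate the strict inequality "< eps_i"
    # n_X,i and n_Y,i: neighbors strictly within eps_i in each marginal space
    n_x = cKDTree(x).query_ball_point(x, eps, p=np.inf, return_length=True) - 1
    n_y = cKDTree(y).query_ball_point(y, eps, p=np.inf, return_length=True) - 1
    return digamma(k) + digamma(n) - np.mean(digamma(n_x + 1) + digamma(n_y + 1))
```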
Applying the same strategy to estimate TE, the number k of neighbors in the joint space $S_{X^p,X^-,Y^-}$ is first fixed; then, for each i, the resulting distance $\varepsilon_i \triangleq d_{(x_i^p, x_i^-, y_i^-),k}$ is projected into the other three lower-dimensional spaces, leading to the standard TE estimator [25,27,28] (implementation available in the TRENTOOL toolbox, Version 3.0, Box ⑩ in Figure 1):
$$\widehat{\mathrm{TE}}_{Y \to X}^{SA} = \psi(k) + \frac{1}{N}\sum_{i=1}^{N}\left[\psi(n_{X^-,i} + 1) - \psi(n_{(X^-,Y^-),i} + 1) - \psi(n_{(X^p,X^-),i} + 1)\right], \tag{21}$$
where $n_{X^-,i}$, $n_{(X^-,Y^-),i}$ and $n_{(X^p,X^-),i}$ denote the numbers of points that fall within the distance $\varepsilon_i$ from $x_i^-$, $(x_i^-, y_i^-)$ and $(x_i^p, x_i^-)$ in the lower-dimensional spaces $S_{X^-}$, $S_{X^-,Y^-}$ and $S_{X^p,X^-}$, respectively. This estimator is referred to as the “standard algorithm” in the experimental part.
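A compact sketch of Equation (21), under the same assumptions as the MI sketch above (NumPy arrays, maximum norm, strict counts approximated by a slightly shrunken radius), could read as follows; te_standard is our own name for this function.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def te_standard(x_pred, x_past, y_past, k=8):
    """Sketch of the standard k-NN TE estimator of Equation (21), in nats.

    x_pred : (N, 1) samples of X^p; x_past : (N, m) samples of X^-;
    y_past : (N, n) samples of Y^-.
    """
    def count_within(points, radii):
        tree = cKDTree(points)
        return tree.query_ball_point(points, radii, p=np.inf,
                                     return_length=True) - 1   # exclude the point itself

    x_pred, x_past, y_past = (np.asarray(a, float).reshape(len(a), -1)
                              for a in (x_pred, x_past, y_past))
    joint = np.hstack([x_pred, x_past, y_past])
    dist, _ = cKDTree(joint).query(joint, k=k + 1, p=np.inf)
    eps = dist[:, -1] * (1 - 1e-10)
    n_x   = count_within(x_past, eps)
    n_xy  = count_within(np.hstack([x_past, y_past]), eps)
    n_xpx = count_within(np.hstack([x_pred, x_past]), eps)
    return digamma(k) + np.mean(digamma(n_x + 1)
                                - digamma(n_xy + 1) - digamma(n_xpx + 1))
```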
Note that a generalization of Equation (21) was proposed in [28] to extend this formula to the estimation of entropy combinations other than MI and TE.

3. From a Square to a Rectangular Neighboring Region for Entropy Estimation

In [24], to estimate MI, as illustrated in Figure 2, Kraskov et al. discussed two different techniques to build the neighboring region to compute I ( X ; Y ) ^: in the standard technique (square ABCD in Figure 2a,b), the region determined by the first k nearest neighbors is a (hyper-)cube and leads to Equation (20), and in the second technique (rectangle ABCD′ in Figure 2a,b), the region determined by the first k nearest neighbors is a (hyper-)rectangle. Note that the TE estimator mentioned in the previous section (Equation (21)) is based on the first situation (square ABCD in Figure 2a or 2b). The introduction of the second technique by Kraskov et al. was to circumvent the fact that Equation (15) was not applied rigorously to obtain the terms ψ(nX,i+1) or ψ(nY,i+1) in Equation (20). As a matter of fact, for one of these terms, no point xi (or yi) falls exactly on the border of the (hyper-)cube D x i (or D y i) obtained by the distance projection from the S X , Y space. As clearly illustrated in Figure 2 (rectangle ABCD′ in Figure 2a,b), the second strategy prevents that issue, since the border of the (hyper-)cube (in this case, an interval of ℝ) after projection from S X , Y space to S X space (or S Y space) contains one point. When the dimensions of S X and S Y are larger than one, this strategy leads to building an (hyper-)rectangle equal to the product of two (hyper-)cubes, one of them in S X and the other one in S Y. If the maximum distance of the k-th NN in S X , Y is obtained in one of the directions in S X, this maximum distance, after multiplying by two, fixes the size of the (hyper-)cube in S X. To obtain the size of the second (hyper-)cube (in S Y), the k neighbors in S X , Y are first projected on S Y, and then, the largest of the distances calculated from these projections fixes the size of this second (hyper-)cube.
In the remainder of this section, for an arbitrary dimension d, we propose to apply this strategy to estimate the entropy of a single multidimensional variable X observed in $\mathbb{R}^d$. This leads to introducing a d-dimensional (hyper-)rectangle centered on $x_i$ having a minimal volume and including the set $\chi_i^k$ of neighbors. Hence, the rectangular neighborhood is built by adjusting its size separately in each direction of the space $S_X$. Using this strategy, we are sure that, in any of the d directions, there is at least one point on one of the two borders (and only one with probability one). Therefore, in this approach, the (hyper-)rectangle, denoted by $D_{x_i}^{\varepsilon_1, \ldots, \varepsilon_d}$, where the sizes $\varepsilon_1, \ldots, \varepsilon_d$ in the respective d directions are completely specified from the neighbors set $\chi_i^k$, is substituted for the basic (hyper-)cube $D_{x_i}(\chi_i^k) = \{x : \|x - x_i\| \le d_{x_i,k}\}$. It should be mentioned that the central symmetry of the (hyper-)rectangle around the center point allows for reducing the bias in the density estimation [38] (cf. Equation (11) or (18)). Note that, when k < d, there must exist neighbors positioned on some vertex or edges of the (hyper-)rectangle: with k < d, it is impossible that, in every direction, one point falls exactly in the interior of a face (i.e., not on its border). For example, with k = 1 and d > 1, the first neighbor will be on a vertex, and the sizes of the edges of the reduced (hyper-)rectangle will be equal to twice the absolute value of its coordinates relative to the center, whatever the direction.
Hereafter, we propose to extend the entropy estimators by Kozachenko–Leonenko and Singh using the above strategy before deriving the corresponding TE estimators and comparing their performance.

3.1. Extension of the Kozachenko–Leonenko Method

As indicated before, in [24], Kraskov et al. extended the Kozachenko–Leonenko estimator (Equations (10) and (15)) using the rectangular neighboring strategy to derive the MI estimator. Now, focusing on entropy estimation, after some mathematical developments (see Appendix D), we obtain another estimator of H ( X ), denoted by H ( X ) ^ K (Box ⑥ in Figure 1),
$$\widehat{H}(X)_K = \psi(N) + \frac{1}{N}\sum_{i=1}^{N}\log(\upsilon_i) - \psi(k) + \frac{d-1}{k}. \tag{22}$$
Here, $\upsilon_i$ is the volume of the minimum-volume (hyper-)rectangle around the point $x_i$ containing its k nearest neighbors. Exploiting this entropy estimator, after substitution in Equation (8), we can derive a new estimation of TE.

3.2. Extension of Singh’s Method

We propose in this section to extend Singh's entropy estimator by using a (hyper-)rectangular domain, as we did for the Kozachenko–Leonenko estimator extension introduced in the preceding section. Considering a d-dimensional random vector $X \in \mathbb{R}^d$ continuously distributed according to a probability density function $p_X$, we aim at estimating the entropy $H(X)$ from the observation of a $p_X$-distributed IID random sequence $X_i$, i = 1, …, N. For any specific data point $x_i$ and a fixed number k (1 ≤ k ≤ N), the minimum (hyper-)rectangle (rectangle A′B′C′D′ in Figure 2) is fixed, and we denote this region by $D_{x_i}^{\varepsilon_1, \ldots, \varepsilon_d}$ and its volume by $\upsilon_i$. Let us denote by $\xi_i$ (1 ≤ $\xi_i$ ≤ min(k, d)) the number of points on the border of the (hyper-)rectangle, which we consider as a realization of a random variable $\Xi_i$. In the situations described in Figure 2a,b, $\xi_i = 2$ and $\xi_i = 1$, respectively. According to [39] (Chapter 6, page 269), if $D_{x_i}(\chi_i^k)$ corresponds to a ball (for a given norm) of volume $\upsilon_i$, an unbiased estimator of $p_X(x_i)$ is given by:
$$\widehat{p_X(x_i)} = \frac{k-1}{N\upsilon_i}, \quad i = 1, 2, \ldots, N. \tag{23}$$
This implies that the classical estimator $\widehat{p_X(x_i)} = \frac{k}{N\upsilon_i}$ is biased and that, presumably, $-\log\!\left(\frac{k}{N\upsilon_i}\right)$ is also a biased estimation of $-\log(p_X(x_i))$ for N large, as shown in [39].
Now, in the case where $D_{x_i}(\chi_i^k)$ is the minimal (i.e., with minimal (hyper-)volume) (hyper-)rectangle $D_{x_i}^{\varepsilon_1, \ldots, \varepsilon_d}$ including $\chi_i^k$, more than one point can belong to the border, and a more general estimator $\widetilde{p_X(x_i)}$ of $p_X(x_i)$ can a priori be considered:
$$\widetilde{p_X(x_i)} = \frac{\tilde{k}_i}{N\upsilon_i}, \tag{24}$$
where $\tilde{k}_i$ is some given function of k and $\xi_i$. The corresponding estimation of $H(X)$ is then:
$$\widehat{H}(X) = -\frac{1}{N}\sum_{i=1}^{N}\log\!\left(\widetilde{p_X(x_i)}\right) = \frac{1}{N}\sum_{i=1}^{N}t_i, \tag{25}$$
with:
$$t_i = \log\!\left(\frac{N\upsilon_i}{\tilde{k}_i}\right), \quad i = 1, 2, \ldots, N, \tag{26}$$
ti being realizations of random variables Ti and k ˜ i being realizations of random variables K ˜ i. We have:
$$\forall i = 1, \ldots, N: \quad E\!\left[\widehat{H}(X)\right] = E(T_i) = E(T_1). \tag{27}$$
Our goal is to derive $E[\widehat{H}(X)] - H(X) = E(T_1) - H(X)$ for N large, in order to correct the asymptotic bias of $\widehat{H}(X)$, according to Steps (1) to (3) explained in Section 2.1.3. To this end, we must consider an asymptotic approximation of the conditional probability distribution $P(T_1 \le r \mid X_1 = x_1, \Xi_1 = \xi_1)$ before computing the asymptotic difference between the expectation $E[T_1] = E[E[T_1 \mid X_1, \Xi_1]]$ and the true entropy $H(X)$.
Let us consider the random Lebesgue measure $V_1$ of the random minimal (hyper-)rectangle $D_{x_1}^{\epsilon_1, \ldots, \epsilon_d}$ (where $(\epsilon_1, \ldots, \epsilon_d)$ denotes the random vector of which $(\varepsilon_1, \ldots, \varepsilon_d) \in \mathbb{R}^d$ is a realization) and the relation $T_1 = \log\!\left(\frac{N V_1}{\tilde{K}_1}\right)$. For any r > 0, we have:
$$P(T_1 > r \mid X_1 = x_1, \Xi_1 = \xi_1) = P\!\left(\log\!\left(\frac{N V_1}{\tilde{K}_1}\right) > r \,\middle|\, X_1 = x_1, \Xi_1 = \xi_1\right) = P(V_1 > \upsilon_r \mid X_1 = x_1, \Xi_1 = \xi_1), \tag{28}$$
where $\upsilon_r = \frac{e^r \tilde{k}_1}{N}$, since, conditionally to $\Xi_1 = \xi_1$, we have $\tilde{K}_1 = \tilde{k}_1$.
In Appendix E, we prove the following property.
Property 1. For N large,
$$P(T_1 > r \mid X_1 = x_1, \Xi_1 = \xi_1) \approx \sum_{i=0}^{k-\xi_1} \binom{N-\xi_1-1}{i} \left(p_X(x_1)\upsilon_r\right)^i \left(1 - p_X(x_1)\upsilon_r\right)^{N-\xi_1-1-i}. \tag{29}$$
The Poisson approximation (when $N \to \infty$ and $\upsilon_r \to 0$) of the binomial distribution summed in Equation (29) leads to a parameter $\lambda = (N - \xi_1 - 1)\, p_X(x_1)\, \upsilon_r$. As N is large compared to $\xi_1 + 1$, we obtain from Equation (26):
$$\lambda \approx \tilde{k}_1 e^r p_X(x_1), \tag{30}$$
and we get the approximation:
$$\lim_{N \to \infty} P(T_1 > r \mid X_1 = x_1, \Xi_1 = \xi_1) = \sum_{i=0}^{k-\xi_1} \frac{\left[\tilde{k}_1 e^r p_X(x_1)\right]^i}{i!}\, e^{-\tilde{k}_1 e^r p_X(x_1)}. \tag{31}$$
Since $P(T_1 \le r \mid X_1 = x_1, \Xi_1 = \xi_1) = 1 - P(T_1 > r \mid X_1 = x_1, \Xi_1 = \xi_1)$, we can get the conditional density function of $T_1$, denoted $g_{T_1}(r)$, by differentiating $P(T_1 \le r \mid X_1 = x_1, \Xi_1 = \xi_1)$. After some mathematical developments (see Appendix F), we obtain:
$$g_{T_1}(r) = P(T_1 \le r \mid X_1 = x_1, \Xi_1 = \xi_1)' = -P(T_1 > r \mid X_1 = x_1, \Xi_1 = \xi_1)' = \frac{\left[\tilde{k}_1 e^r p_X(x_1)\right]^{k-\xi_1+1}}{(k-\xi_1)!}\, e^{-\tilde{k}_1 e^r p_X(x_1)}, \quad r \in \mathbb{R}, \tag{32}$$
and consequently (see Appendix G for details),
$$\lim_{N \to \infty} E[T_1 \mid X_1 = x_1, \Xi_1 = \xi_1] = \int_{-\infty}^{\infty} r\, \frac{\left[\tilde{k}_1 e^r p_X(x_1)\right]^{k-\xi_1+1}}{(k-\xi_1)!}\, e^{-\tilde{k}_1 e^r p_X(x_1)}\, dr = \psi(k - \xi_1 + 1) - \log(\tilde{k}_1) - \log(p_X(x_1)). \tag{33}$$
Therefore, with the definition of differential entropy $H(X_1) = E[-\log(p_X(X_1))]$, we have:
$$\lim_{N \to \infty} E[T_1] = \lim_{N \to \infty} E\!\left[E[T_1 \mid X_1, \Xi_1]\right] = E\!\left[\psi(k - \Xi_1 + 1) - \log(\tilde{K}_1)\right] + H(X_1). \tag{34}$$
Thus, the estimator expressed by Equation (25) is asymptotically biased. Therefore, we consider a modified version, denoted by $\widehat{H}(X)_{NS}$, obtained by subtracting an estimation of the bias $E\!\left[\psi(k - \Xi_1 + 1) - \log(\tilde{K}_1)\right]$ given by the empirical mean $\frac{1}{N}\sum_{i=1}^{N}\psi(k - \xi_i + 1) - \frac{1}{N}\sum_{i=1}^{N}\log(\tilde{k}_i)$ (according to the law of large numbers), and we finally obtain (Box ⑤ in Figure 1):
$$\widehat{H}(X)_{NS} = \frac{1}{N}\sum_{i=1}^{N} t_i - \frac{1}{N}\sum_{i=1}^{N}\psi(k - \xi_i + 1) + \frac{1}{N}\sum_{i=1}^{N}\log(\tilde{k}_i) = \frac{1}{N}\sum_{i=1}^{N}\log\!\left(\frac{N\upsilon_i}{\tilde{k}_i}\right) - \frac{1}{N}\sum_{i=1}^{N}\psi(k - \xi_i + 1) + \frac{1}{N}\sum_{i=1}^{N}\log(\tilde{k}_i) = \log(N) + \frac{1}{N}\sum_{i=1}^{N}\log(\upsilon_i) - \frac{1}{N}\sum_{i=1}^{N}\psi(k - \xi_i + 1). \tag{35}$$
In comparison with the development of Equation (22), we followed here the same methodology, except that we take into account (through a conditioning technique) the influence of the number of points on the border.
We observe that, after cancellation of the asymptotic bias, the choice of the function of k and $\xi_i$ that defines $\tilde{k}_i$ in Equation (24) does not have any influence on the final result. In this way, we obtain an expression for $\widehat{H}(X)_{NS}$ which simply takes into account the values $\xi_i$ that could a priori influence the entropy estimation.
Note that, as for the original Kozachenko–Leonenko (Equation (10)) and Singh (Equation (16)) entropy estimators, both new estimation functions (Equations (22) and (35)) hold for any value of k such that k ≤ N, and we do not have to choose a fixed k while estimating entropy in lower-dimensional spaces. Therefore, under the framework proposed in [24], we built two different TE estimators using Equations (22) and (35), respectively.

3.3. Computation of the Border Points Number and of the (Hyper-)Rectangle Sizes

We explain more precisely hereafter how to determine the number of points $\xi_i$ on the border. Let us denote by $x_i^j \in \mathbb{R}^d$, j = 1, …, k, the k nearest neighbors of $x_i \in \mathbb{R}^d$, and let us consider the $d \times k$ array $D_i$ such that, for any $(p, j) \in \{1, \ldots, d\} \times \{1, \ldots, k\}$, $D_i(p, j) = |x_i^j(p) - x_i(p)|$ is the distance (in $\mathbb{R}$) between the p-th component $x_i^j(p)$ of $x_i^j$ and the p-th component $x_i(p)$ of $x_i$. For each p, let us introduce $J_i(p) \in \{1, \ldots, k\}$ defined by $D_i(p, J_i(p)) = \max(D_i(p, 1), \ldots, D_i(p, k))$, which is the value of the column index of $D_i$ for which the distance $D_i(p, j)$ is maximum in the row number p. Now, if there exists more than one index $J_i(p)$ that fulfills this equality, we arbitrarily select the lowest one, hence avoiding the max(·) function being multi-valued. The MATLAB implementation of the max function selects such a unique index value. Then, let us introduce the $d \times k$ Boolean array $B_i$ defined by $B_i(p, j) = 1$ if $j = J_i(p)$ and $B_i(p, j) = 0$ otherwise. Then:
  • The d sizes $\varepsilon_p$, p = 1, …, d, of the (hyper-)rectangle $D_{x_i}^{\varepsilon_1, \ldots, \varepsilon_d}$ centered on $x_i$ are equal respectively to $\varepsilon_p = 2 D_i(p, J_i(p))$, p = 1, …, d.
  • We can define $\xi_i$ as the number of non-null column vectors in $B_i$. For example, if the k-th nearest neighbor $x_i^k$ is such that $\forall j \ne k$, $\forall p = 1, \ldots, d$: $|x_i^j(p) - x_i(p)| < |x_i^k(p) - x_i(p)|$, i.e., when the k-th nearest neighbor is systematically the farthest from the central point $x_i$ in each of the d directions, then all of the entries in the last column of $B_i$ are equal to one, while all of the others are equal to zero: we have only one column including values different from zero and, so, only one point on the border ($\xi_i = 1$), which generalizes the case depicted in Figure 2b for d = 2.
N.B.: this determination of $\xi_i$ may be incorrect when there exists a direction p such that the number of indices j for which $D_i(p, j)$ reaches the maximal value is larger than one: the value of $\xi_i$ obtained with our procedure can then be underestimated. However, we can argue that, theoretically, this case occurs with a probability equal to zero (because the observations are continuously distributed) and, so, it can a priori be discarded. Now, in practice, measurement quantification errors and round-off errors are unavoidable, and this probability will differ from zero (although remaining small when the aforesaid errors are small): theoretically distinct values $D_i(p, j)$ on the row p of $D_i$ may be erroneously confounded after quantification and rounding. However, the max(·) function then selects only one value of $J_i(p)$ on row p and, so, acts as an error-correcting procedure. The fact that the maximum distance in the concerned p directions can then be allocated to the wrong neighbor index has no consequence for the correct determination of $\xi_i$.
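As an illustration, the following sketch (our own function names; it assumes the data are stored as an (N, d) NumPy array and that neighbors are searched with the maximum norm) computes the edge lengths $\varepsilon_p$ and the border-point count $\xi_i$ exactly as described above, and uses them to evaluate the rectangular-neighborhood entropy estimators of Equations (22) and (35).

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def rect_sizes_and_border(center, neighbors):
    """Edge lengths (eps_1, ..., eps_d) and border-point count xi for one point.

    center    : (d,) reference point x_i
    neighbors : (k, d) its k nearest neighbors (maximum norm)
    """
    D = np.abs(neighbors - center)   # per-neighbor (rows) / per-direction (columns) distances
    J = np.argmax(D, axis=0)         # J_i(p): neighbor achieving the maximum in direction p
                                     # (argmax keeps the lowest index on ties, like MATLAB's max)
    eps = 2.0 * D[J, np.arange(D.shape[1])]
    xi = len(np.unique(J))           # number of distinct neighbors lying on the border
    return eps, xi

def entropy_rectangular(x, k=8, variant="K"):
    """Entropy estimate of Equation (22) (variant="K") or Equation (35) (variant="NS"), in nats."""
    x = np.asarray(x, float).reshape(len(x), -1)
    n, d = x.shape
    _, idx = cKDTree(x).query(x, k=k + 1, p=np.inf)   # idx[:, 0] is the point itself
    log_v = np.empty(n)
    psi_term = np.empty(n)
    for i in range(n):
        eps, xi = rect_sizes_and_border(x[i], x[idx[i, 1:]])
        log_v[i] = np.sum(np.log(eps))                # log volume of the minimal hyper-rectangle
        psi_term[i] = digamma(k - xi + 1)
    if variant == "K":                                # Equation (22)
        return digamma(n) - digamma(k) + (d - 1) / k + log_v.mean()
    return np.log(n) + log_v.mean() - psi_term.mean()  # Equation (35)
```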

4. New Estimators of Transfer Entropy

From an observed realization $(x_i^p, x_i^-, y_i^-) \in S_{X^p,X^-,Y^-}$, i = 1, 2, …, N, of the IID random sequence $(X_i^p, X_i^-, Y_i^-)$, i = 1, 2, …, N, and a number k of neighbors, the procedure can be summarized as follows (all distances are computed with the maximum norm):
  • similarly to the MILCA [31] and TRENTOOL toolboxes [34], normalize, for each i, the vectors $x_i^p$, $x_i^-$ and $y_i^-$;
  • in the joint space $S_{X^p,X^-,Y^-}$, for each point $(x_i^p, x_i^-, y_i^-)$, calculate the distance $d_{(x_i^p, x_i^-, y_i^-),k}$ between $(x_i^p, x_i^-, y_i^-)$ and its k-th neighbor; then, construct the (hyper-)rectangle with sizes $\varepsilon_1, \ldots, \varepsilon_d$ (d is the dimension of the vectors $(x_i^p, x_i^-, y_i^-)$), for which the (hyper-)volume is $\upsilon_{(X^p,X^-,Y^-),i} \triangleq \varepsilon_1 \times \cdots \times \varepsilon_d$ and whose border contains $\xi_{(X^p,X^-,Y^-),i}$ points;
  • for each point $(x_i^p, x_i^-)$ in the subspace $S_{X^p,X^-}$, count the number $k_{(X^p,X^-),i}$ of points falling within the distance $d_{(x_i^p, x_i^-, y_i^-),k}$; then find the smallest (hyper-)rectangle that contains all of these points, of which $\upsilon_{(X^p,X^-),i}$ and $\xi_{(X^p,X^-),i}$ are respectively the volume and the number of points on the border; repeat the same procedure in the subspaces $S_{X^-,Y^-}$ and $S_{X^-}$.
From Equation (22) (modified to allow k to vary with i in the subspaces $S_{X^-}$, $S_{X^p,X^-}$ and $S_{X^-,Y^-}$), the final TE estimator can be written as (Box ⑧ in Figure 1):
$$\widehat{\mathrm{TE}}_{Y \to X}^{p1} = \frac{1}{N}\sum_{i=1}^{N}\log\frac{\upsilon_{(X^p,X^-),i}\,\upsilon_{(X^-,Y^-),i}}{\upsilon_{(X^p,X^-,Y^-),i}\,\upsilon_{X^-,i}} + \frac{1}{N}\sum_{i=1}^{N}\Bigg(\psi(k) + \psi(k_{X^-,i}) - \psi(k_{(X^p,X^-),i}) - \psi(k_{(X^-,Y^-),i}) + \frac{d_{X^p} + d_{X^-} - 1}{k_{(X^p,X^-),i}} + \frac{d_{X^-} + d_{Y^-} - 1}{k_{(X^-,Y^-),i}} - \frac{d_{X^p} + d_{X^-} + d_{Y^-} - 1}{k} - \frac{d_{X^-} - 1}{k_{X^-,i}}\Bigg), \tag{36}$$
where $d_{X^p} = \dim(S_{X^p})$, $d_{X^-} = \dim(S_{X^-})$ and $d_{Y^-} = \dim(S_{Y^-})$; with Equation (35), it yields (Box ⑦ in Figure 1):
$$\widehat{\mathrm{TE}}_{Y \to X}^{p2} = \frac{1}{N}\sum_{i=1}^{N}\log\frac{\upsilon_{(X^p,X^-),i}\,\upsilon_{(X^-,Y^-),i}}{\upsilon_{(X^p,X^-,Y^-),i}\,\upsilon_{X^-,i}} + \frac{1}{N}\sum_{i=1}^{N}\Big(\psi(k - \xi_{(X^p,X^-,Y^-),i} + 1) + \psi(k_{X^-,i} - \xi_{X^-,i} + 1) - \psi(k_{(X^p,X^-),i} - \xi_{(X^p,X^-),i} + 1) - \psi(k_{(X^-,Y^-),i} - \xi_{(X^-,Y^-),i} + 1)\Big). \tag{37}$$
In Equations (36) and (37), the volumes $\upsilon_{(X^p,X^-),i}$, $\upsilon_{(X^-,Y^-),i}$, $\upsilon_{(X^p,X^-,Y^-),i}$ and $\upsilon_{X^-,i}$ are obtained by computing, for each of them, the product of the edge lengths of the (hyper-)rectangle, i.e., the product of d edge lengths, d being respectively equal to $d_{X^p} + d_{X^-}$, $d_{X^-} + d_{Y^-}$, $d_{X^p} + d_{X^-} + d_{Y^-}$ and $d_{X^-}$. In a given subspace and for a given direction, the edge length is equal to twice the largest distance between the corresponding coordinate of the reference point (at the center) and each of the corresponding coordinates of the retained points. Hence, a generic formula is $\upsilon_U = \prod_{j=1}^{\dim(U)} \varepsilon_{Uj}$, where U is one of the symbols $(X^p, X^-)$, $(X^-, Y^-)$, $(X^p, X^-, Y^-)$ and $X^-$, and the $\varepsilon_{Uj}$ are the edge lengths of the (hyper-)rectangle.
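A self-contained sketch of the estimator of Equation (36) following the above procedure is given below (our own function name te_p1; inputs are assumed to be NumPy arrays already normalized as in Step (1), and points "falling within the distance" are counted with a non-strict inequality, which the text leaves unspecified).

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def te_p1(x_pred, x_past, y_past, k=8):
    """Sketch of the proposed estimator TE_p1 of Equation (36), in nats.

    x_pred : (N, d_Xp), x_past : (N, d_X-), y_past : (N, d_Y-) sample arrays.
    Maximum-norm distances; in every (sub)space the neighborhood is the
    smallest centered hyper-rectangle containing the retained points, so each
    sub-dimension gets its own edge length.
    """
    def log_rect_vol(points, center):
        # log volume of the smallest centered hyper-rectangle containing `points`
        return np.sum(np.log(2.0 * np.max(np.abs(points - center), axis=0)))

    x_pred, x_past, y_past = (np.asarray(a, float).reshape(len(a), -1)
                              for a in (x_pred, x_past, y_past))
    n = len(x_pred)
    d_xp, d_x, d_y = x_pred.shape[1], x_past.shape[1], y_past.shape[1]
    joint = np.hstack([x_pred, x_past, y_past])
    subspaces = {"x":   (x_past, d_x),
                 "xpx": (np.hstack([x_pred, x_past]), d_xp + d_x),
                 "xy":  (np.hstack([x_past, y_past]), d_x + d_y)}
    trees = {name: cKDTree(pts) for name, (pts, _) in subspaces.items()}

    dist, idx = cKDTree(joint).query(joint, k=k + 1, p=np.inf)
    eps = dist[:, -1]                      # distance to the k-th neighbor, joint space
    te = 0.0
    for i in range(n):
        # joint space: hyper-rectangle around the k nearest neighbors
        te += digamma(k) - (d_xp + d_x + d_y - 1) / k - log_rect_vol(joint[idx[i, 1:]], joint[i])
        for name, (pts, dim) in subspaces.items():
            nbrs = [j for j in trees[name].query_ball_point(pts[i], eps[i], p=np.inf) if j != i]
            k_u = len(nbrs)                # number of points within eps_i in this subspace
            lv = log_rect_vol(pts[nbrs], pts[i])
            sign = 1.0 if name == "x" else -1.0   # + for S_X-, - for S_Xp,X- and S_X-,Y-
            te += sign * (digamma(k_u) - (dim - 1) / k_u - lv)
    return te / n
```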
The new TE estimator TE Y X p 1 ^ (Box ⑧ in Figure 1) can be compared with the extension of TE Y X S A ^, the TE estimator proposed in [27] (implemented in the JIDT toolbox [30]). This extension [27], included in Figure 1 (Box ⑨), is denoted here by TE Y X E A ^. The main difference with our TE Y X p 1 ^ estimator is that our algorithm uses a different length for each sub-dimension within a variable, rather than one length for all sub-dimensions within the variable (which is the approach of the extended algorithm). We introduced this approach to make the tightest possible (hyper-)rectangle around the k nearest neighbors. TE Y X E A ^ is expressed as follows:
$$\widehat{\mathrm{TE}}_{Y \to X}^{EA} = \frac{1}{N}\sum_{i=1}^{N}\left(\psi(k) - \frac{2}{k} + \psi(l_{X^-,i}) - \psi(l_{(X^p,X^-),i}) + \frac{1}{l_{(X^p,X^-),i}} - \psi(l_{(X^-,Y^-),i}) + \frac{1}{l_{(X^-,Y^-),i}}\right). \tag{38}$$
In the experimental part, this estimator is marked as the “extended algorithm”. It differs from Equation (36) in two ways. Firstly, the first summation on the right-hand side of Equation (36) has no counterpart in Equation (38). Secondly, compared with Equation (36), the numbers of neighbors $k_{X^-,i}$, $k_{(X^p,X^-),i}$ and $k_{(X^-,Y^-),i}$ included in the rectangular boxes, as explained in Section 3.1, are replaced respectively with $l_{X^-,i}$, $l_{(X^p,X^-),i}$ and $l_{(X^-,Y^-),i}$, which are obtained differently. More precisely, Step (2) in the above algorithm becomes:
  • (2′) For each point $(x_i^p, x_i^-)$ in the subspace $S_{X^p,X^-}$, $l_{(X^p,X^-),i}$ is the number of points falling within a (hyper-)rectangle equal to the Cartesian product of two (hyper-)cubes, the first one in $S_{X^p}$ and the second one in $S_{X^-}$, whose edge lengths are equal, respectively, to $d_{x_i^p}^{\max} = 2 \times \max\{\|x_k^p - x_i^p\| : (x^p, x^-, y^-)_k \in \chi_{(x^p, x^-, y^-)_i}^k\}$ and $d_{x_i^-}^{\max} = 2 \times \max\{\|x_k^- - x_i^-\| : (x^p, x^-, y^-)_k \in \chi_{(x^p, x^-, y^-)_i}^k\}$, i.e., $l_{(X^p,X^-),i} = \mathrm{card}\{(x_j^p, x_j^-) : j \in \{1, \ldots, N\}\setminus\{i\},\ \|x_j^p - x_i^p\| \le d_{x_i^p}^{\max}/2,\ \|x_j^- - x_i^-\| \le d_{x_i^-}^{\max}/2\}$. Denote by $\upsilon_{(X^p,X^-),i}$ the volume of this (hyper-)rectangle. Repeat the same procedure in the subspaces $S_{X^-,Y^-}$ and $S_{X^-}$.
Note that the important difference between the construction of the neighborhoods used in $\widehat{\mathrm{TE}}_{Y \to X}^{EA}$ and in $\widehat{\mathrm{TE}}_{Y \to X}^{p1}$ is that, in the first case, the minimum neighborhood including the k neighbors is constrained to be a Cartesian product of (hyper-)cubes and, in the second case, this neighborhood is a (hyper-)rectangle whose edge lengths can be completely different.

5. Experimental Results

In the experiments, we tested both Gaussian IID and Gaussian AR models to compare and validate the performance of the TE estimators proposed in the previous section. For a complete comparison, beyond the theoretical value of TE, we also computed the Granger causality index as a reference (as indicated previously, in the case of Gaussian signals, TE and the Granger causality index are equivalent up to a factor of two; see Appendix H). In each of the following figures, GCi/2 corresponds to the Granger causality index divided by two; the TE estimated by the free TRENTOOL toolbox (corresponding to Equation (21)) is marked as the standard algorithm; that estimated by JIDT (corresponding to Equation (38)) is marked as the extended algorithm; TEp1 is the TE estimator given by Equation (36); and TEp2 is the TE estimator given by Equation (37). For all of the following results, the statistical means and the standard deviations of the different estimators have been estimated by averaging over 200 trials.

5.1. Gaussian IID Random Processes

The first model we tested, named Model 1, is formulated as follows:
$$X_t = \mathbf{a}\,Y_t + \mathbf{b}\,Z_t + W_t, \qquad W_t \in \mathbb{R},\quad Y_t \in \mathbb{R}^{d_Y},\quad Z_t \in \mathbb{R}^{d_Z}, \tag{39}$$
where $Y_t \sim \mathcal{N}(0, C_Y)$, $Z_t \sim \mathcal{N}(0, C_Z)$ and $W_t \sim \mathcal{N}(0, \sigma_W^2)$, the three processes Y, Z and W being mutually independent. The triplet $(X_t, Y_t, Z_t)$ corresponds to the triplet $(X_i^p, X_i^-, Y_i^-)$ introduced previously. $C_U$ is a Toeplitz matrix with the first line equal to $[1, \alpha, \ldots, \alpha^{d_U - 1}]$. For the matrix $C_Y$, we chose α = 0.5, and for $C_Z$, α = 0.2. The standard deviation $\sigma_W$ was set to 0.5. The vectors a and b were such that $\mathbf{a} = 0.1 \ast [1, 2, \ldots, d_Y]$ and $\mathbf{b} = 0.1 \ast [d_Z, d_Z - 1, \ldots, 1]$. With this model, we aimed at estimating $H(X \mid Y) - H(X \mid Y, Z)$ to test if the knowledge of signals Y and Z could improve the prediction of X compared to the knowledge of Y only.
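For reproducibility, a minimal sketch of Model 1 data generation is given below (the function name simulate_model1 and the default seed are ours); the resulting triplet can be fed directly to the TE sketches given above, with $(X^p, X^-, Y^-) = (X, Y, Z)$.

```python
import numpy as np
from scipy.linalg import toeplitz

def simulate_model1(n, d=3, seed=0):
    """Draw n IID samples from Model 1 (Equation (39)) with d_Y = d_Z = d.

    Parameter values follow the text: alpha = 0.5 for C_Y, 0.2 for C_Z,
    sigma_W = 0.5, a = 0.1*[1, ..., d], b = 0.1*[d, ..., 1].
    """
    rng = np.random.default_rng(seed)
    c_y = toeplitz(0.5 ** np.arange(d))        # first line [1, 0.5, ..., 0.5**(d-1)]
    c_z = toeplitz(0.2 ** np.arange(d))
    y = rng.multivariate_normal(np.zeros(d), c_y, size=n)
    z = rng.multivariate_normal(np.zeros(d), c_z, size=n)
    w = rng.normal(0.0, 0.5, size=n)
    a = 0.1 * np.arange(1, d + 1)
    b = 0.1 * np.arange(d, 0, -1)
    x = (y @ a + z @ b + w).reshape(-1, 1)
    return x, y, z

# Example (hypothetical usage with the te_p1 sketch above):
# x, y, z = simulate_model1(1024, d=3)
# print(te_p1(x, y, z, k=8))
```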
Results are reported in Figure 3, where the dimensions $d_Y$ and $d_Z$ are identical. We observe that, for a low dimension and a sufficient number of neighbors (Figure 3a), all TE estimators approach the theoretical value (around 0.26) more closely as the length of the signals increases, the best estimation being obtained by the two new estimators. Compared to Granger causality, these estimators display a greater bias, but a lower variance. Due to the “curse of dimensionality”, with an increasing dimension (see Figure 3b), it becomes much more difficult to obtain an accurate estimation of TE. For a high dimension, all estimators reveal a non-negligible bias, even if the two new estimators still behave better than the two reference ones (standard and extended algorithms).

5.2. Vectorial AR Models

In the second experiment, two AR models integrating either two or three signals have been tested. The first vectorial AR model (named Model 2) we tested was as follows:
$$\begin{cases} x_t = 0.45\sqrt{2}\,x_{t-1} - 0.9\,x_{t-2} - 0.6\,y_{t-2} + e_{x,t} \\ y_t = 0.6\,x_{t-2} - 0.175\sqrt{2}\,y_{t-1} + 0.55\sqrt{2}\,y_{t-2} + e_{y,t}. \end{cases} \tag{40}$$
The second vectorial AR model (named Model 3) was given by:
$$\begin{cases} x_t = 0.25\,x_{t-2} - 0.35\,y_{t-2} + 0.35\,z_{t-2} + e_{x,t} \\ y_t = 0.5\,x_{t-1} + 0.25\,y_{t-1} - 0.5\,z_{t-3} + e_{y,t} \\ z_t = 0.6\,x_{t-2} - 0.7\,y_{t-2} - 0.2\,z_{t-2} + e_{z,t}. \end{cases} \tag{41}$$
For both models, $e_x$, $e_y$ and $e_z$ denote realizations of independent white Gaussian noises with zero mean and a variance of 0.1. As previously, we display in the following figures not only the theoretical value of TE, but also the Granger causality index for comparison. In this experiment, the prediction orders m and n were set equal to the corresponding regression orders of the AR models. For example, when estimating $\mathrm{TE}_{Y \to X}$, we set m = 2, n = 2, and $(X_i^p, X_i^-, Y_i^-)$ corresponds to $(X_{i+1}, X_i^{(2)}, Y_i^{(2)})$.
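The sketch below illustrates how such data could be generated and embedded (function names simulate_model2 and embed, the burn-in length and the seed are our own choices; the signs in the recursion follow the reconstruction of Equation (40) given above). The embedded triplet can then be passed to any of the TE sketches, e.g., te_p1.

```python
import numpy as np

def simulate_model2(n, seed=0, burn=500):
    """Simulate Model 2 (Equation (40)); returns two scalar series of length n."""
    rng = np.random.default_rng(seed)
    ex = rng.normal(0.0, np.sqrt(0.1), n + burn)
    ey = rng.normal(0.0, np.sqrt(0.1), n + burn)
    x = np.zeros(n + burn)
    y = np.zeros(n + burn)
    for t in range(2, n + burn):
        x[t] = 0.45*np.sqrt(2)*x[t-1] - 0.9*x[t-2] - 0.6*y[t-2] + ex[t]
        y[t] = 0.6*x[t-2] - 0.175*np.sqrt(2)*y[t-1] + 0.55*np.sqrt(2)*y[t-2] + ey[t]
    return x[burn:], y[burn:]

def embed(x, y, m=2, n=2):
    """Build (X^p, X^-, Y^-) = (x_{i+1}, x_i^(m), y_i^(n)) from two scalar series."""
    p = max(m, n)
    x_pred = x[p:].reshape(-1, 1)
    x_past = np.column_stack([x[p - 1 - j: len(x) - 1 - j] for j in range(m)])
    y_past = np.column_stack([y[p - 1 - j: len(y) - 1 - j] for j in range(n)])
    return x_pred, x_past, y_past

# x, y = simulate_model2(1024)
# x_pred, x_past, y_past = embed(x, y, m=2, n=2)
# print(te_p1(x_pred, x_past, y_past, k=8))   # using the te_p1 sketch above
```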
For Figures 4 and 5, the number k of neighbors was fixed to eight, whereas, in Figure 6, this number was set to four and three (respectively Figures 6a,b) to show the influence of this parameter. Figures 4 and 6 are related to Model 2, and Figure 5 is related to Model 3.
As previously, for large values of k (cf. Figures 4 and 5), we observe that the four TE estimators converge towards the theoretical value, all the more closely as the signal length increases. As expected for such linear models, Granger causality outperforms the TE estimators, at the expense of a slightly larger variance. Contrary to Granger causality, the TE estimators are clearly more impacted by the signal length, even if their standard deviations remain lower. Here again, when comparing the different TE estimators, it appears that the two new estimators achieve improved behavior compared to the standard and extended algorithms for large k.
Within the scope of k-NN algorithms, the choice of k must be a tradeoff between estimation bias and variance. Globally, when the value of k decreases, the bias decreases for the standard and extended algorithms and for the new estimator TEp1. The second proposed estimator, TEp2, however, is much more sensitive to the number of neighbors (as can be seen when comparing Figures 4 and 6). As shown in Figures 3 and 5, the results obtained using TEp2 and TEp1 are quite comparable when the value of k is large (k = 8). Now, when the number of neighbors decreases, the second estimator we proposed, TEp2, is much less reliable than all of the other ones (Figure 6). Concerning the variance, it remains relatively stable when the number of neighbors falls from eight to three, and in this case, the extended algorithm, which displays a slightly lower bias, may be preferred.
When using k = 8, a possible interpretation of the lower bias obtained with our algorithms is that, once we look at a large enough number of nearest neighbors, the use of different lengths along the sub-dimensions of the (hyper-)rectangle has enough opportunity to make a difference in the results, whereas with k = 3, there is less opportunity.
To investigate the impact on the dispersion (standard deviation of the estimation error) of (i) the estimation method and (ii) the number of neighbors, we display in Figures 7a,b the boxplots of the absolute values of the centered estimation errors (AVCE) corresponding to the experiments reported in Figures 4a and 6b for a 1024-point signal length. These results show that neither the value of k nor the tested TE estimator dramatically influences the dispersions. More precisely, we used a hypothesis testing procedure (two-sample Kolmogorov–Smirnov goodness-of-fit test, KSTEST2 in MATLAB) to test whether two samples (each with 200 trials) of AVCE are drawn from the same underlying continuous population or not. The outcome H = 1 corresponds to rejecting the hypothesis of identical distributions (non-identical distributions), while H = 0 corresponds to not rejecting it. The significance level was set to 0.05.
  • Influence of the method:
    • Test between the standard algorithm and TEp1 in Figure 7a: H = 0, p-value = 0.69 → no influence
    • Test between the extended algorithm and TEp1 in Figure 7a: H = 0, p-value = 0.91 → no influence
    • Test between the standard algorithm and TEp1 in Figure 7b: H = 0, p-value = 0.081 → no influence
    • Test between the extended algorithm and TEp1 in Figure 7b: H = 1, p-value = 0.018 → influence exists.
  • Influence of the neighbors’ number k:
    • Test between k = 8 (Figure 7a) and k = 3 (Figure 7b) for the standard algorithm: H = 0, p-value = 0.97 → no influence
    • Test between k = 8 (Figure 7a) and k = 3 (Figure 7b) for TEp1: H = 0, p-value = 0.97 → no influence.
For these six tested cases, the only one where a difference between the distributions (and, so, between the dispersions) is detected is the comparison between the extended algorithm and TEp1 in Figure 7b.
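The same dispersion comparison can be reproduced outside MATLAB; the following sketch uses SciPy's two-sample Kolmogorov–Smirnov test on two hypothetical AVCE samples (the placeholder arrays stand for the 200 absolute centered errors of two estimators).

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
avce_a = np.abs(rng.normal(size=200))   # placeholder for one estimator's AVCE sample
avce_b = np.abs(rng.normal(size=200))   # placeholder for another estimator's AVCE sample
stat, p_value = ks_2samp(avce_a, avce_b)
h = int(p_value < 0.05)                 # H = 1: distributions differ at the 0.05 level
print(h, p_value)
```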

6. Discussion and Summary

In the computation of k-NN based estimators, the most time-consuming part is the procedure of nearest neighbor searching. Compared to Equations (10) and (16), Equations (22) and (35) involve supplementary information, such as the maximum distance among the first k nearest neighbors in each dimension and the number of points on the border. However, most currently used neighbor searching algorithms, such as the k-d tree (k-dimensional tree) and ATRIA (A TRiangle Inequality based Algorithm) [40], provide information not only on the k-th neighbor, but also on the first (k − 1) nearest neighbors. Therefore, in terms of computational cost, there is no significant difference among these TE estimators (Boxes ⑦, ⑧, ⑨ and ⑩ in Figure 1).
In this contribution, we discussed TE estimation based on k-NN techniques. The estimation of TE remains an important issue, especially in neuroscience, where getting large amounts of stationary data is problematic. The widely-used k-NN technique has proven to be a good choice for the estimation of information-theoretic measures. In this work, we first investigated the estimation of Shannon entropy based on the k-NN technique involving a rectangular neighboring region and introduced two different k-NN entropy estimators. We derived these new entropy estimators mathematically by extending the results and methodology developed in [24] and [36]. Given the new entropy estimators, two novel TE estimators have been proposed, implying no extra computational cost compared to existing similar k-NN algorithms. To validate the performance of these estimators, we considered different simulated models and compared the new estimators with the two TE estimators available in the free TRENTOOL and JIDT toolboxes, respectively, which are extensions of two Kraskov–Stögbauer–Grassberger (KSG) MI estimators based respectively on (hyper-)cubic and (hyper-)rectangular neighborhoods.
Under the Gaussian assumption, experimental results showed the effectiveness of the new estimators under the IID assumption, as well as for time-correlated AR signals, in comparison with the standard KSG algorithm estimator. This conclusion still holds when comparing the new algorithms with the extended KSG estimator. Globally, all TE estimators satisfactorily converge to the theoretical TE value, i.e., to half the value of the Granger causality index, while the newly proposed TE estimators showed a lower bias for k sufficiently large (in comparison with the reference TE estimators) with comparable estimation error variances.
As the variance remains relatively stable when the number of neighbors falls from eight to three, in this case, the extended algorithm, which displays a slightly lower bias, may be preferred.
Now, one of the new TE estimators suffered from noticeable error when the number of neighbors was small. Some experiments allowed us to verify that this issue already exists when estimating the entropy of a random vector: when the number of neighbors k falls below the dimension d, then the bias drastically increases. More details on this phenomenon are given in Appendix I.
As expected, experiments with Model 1 showed that all of the TE estimators under examination suffered from the “curse of dimensionality”, which makes it difficult to obtain an accurate estimation of TE with high-dimensional data. In this contribution, we do not present the preliminary results that we obtained when simulating a nonlinear version of Model 1, for which the three variables $X_t$, $Y_t$ and $Z_t$ were scalar and their joint law was non-Gaussian, because a random nonlinear transformation was used to compute $X_t$ from $Y_t$ and $Z_t$. For this model, we computed the theoretical TE (numerically, with good precision) and tuned the parameters to obtain a strong coupling between $X_t$ and $Z_t$. The theoretical Granger causality index was equal to zero. We observed the same issue as that pointed out in [41], i.e., a very slow convergence of the estimator when the number of observations increases, and noticed that the four estimators $\widehat{\mathrm{TE}}_{Y \to X}^{SA}$, $\widehat{\mathrm{TE}}_{Y \to X}^{EA}$, $\widehat{\mathrm{TE}}_{Y \to X}^{p1}$ and $\widehat{\mathrm{TE}}_{Y \to X}^{p2}$ revealed very close performance. In this difficult case, our two methods do not outperform the existing ones. Probably, for this type of strong coupling, further improvement must be considered at the expense of an increased computational complexity, such as that proposed in [41].
This work is a first step in a more general context of connectivity investigation for neurophysiological activities obtained either from nonlinear physiological models or from clinical recordings. In this context, partial TE also has to be considered, and future work will address a comparison of the techniques presented in this contribution in terms of bias and variance. Moreover, considering the practical importance of knowing the statistical distributions of the different TE estimators for independent channels, this point should also be addressed.

Author Contributions

All authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix

A. Mathematical Expression of Transfer Entropy for Continuous Probability Distributions

Here, we consider that the joint probability measure $P_{X_i^p, X_i^-, Y_i^-}$ is absolutely continuous (with respect to the Lebesgue measure in $\mathbb{R}^{m+n+1}$, denoted by $\mu_{m+n+1}$) with the corresponding density:
$$p_{X_i^p, X_i^-, Y_i^-}(x_i^p, x_i^-, y_i^-) = \frac{dP_{X_i^p, X_i^-, Y_i^-}(x_i^p, x_i^-, y_i^-)}{d\mu_{m+n+1}(x_i^p, x_i^-, y_i^-)}. \tag{42}$$
Then, we are sure that the two following conditional probability density functions exist:
$$p_{X_i^p \mid X_i^-}(x_i^p \mid x_i^-) = \frac{dP_{X_i^p \mid X_i^-}(x_i^p \mid x_i^-)}{d\mu_1(x_i^p)}, \qquad p_{X_i^p \mid X_i^-, Y_i^-}(x_i^p \mid x_i^-, y_i^-) = \frac{dP_{X_i^p \mid X_i^-, Y_i^-}(x_i^p \mid x_i^-, y_i^-)}{d\mu_1(x_i^p)}, \tag{43}$$
and Equation (3) yields:
$$\mathrm{TE}_{Y \to X, i} = \int_{\mathbb{R}^{m+n+1}} p_{X_i^p, X_i^-, Y_i^-}(x_i^p, x_i^-, y_i^-)\, \log\!\left[\frac{p_{X_i^p \mid X_i^-, Y_i^-}(x_i^p \mid x_i^-, y_i^-)}{p_{X_i^p \mid X_i^-}(x_i^p \mid x_i^-)}\right] dx_i^p\, dx_i^-\, dy_i^- = \int_{\mathbb{R}^{m+n+1}} p_{X_i^p, X_i^-, Y_i^-}(x_i^p, x_i^-, y_i^-)\, \log\!\left[\frac{p_{X_i^p, X_i^-, Y_i^-}(x_i^p, x_i^-, y_i^-)\, p_{X_i^-}(x_i^-)}{p_{X_i^-, Y_i^-}(x_i^-, y_i^-)\, p_{X_i^p, X_i^-}(x_i^p, x_i^-)}\right] dx_i^p\, dx_i^-\, dy_i^-. \tag{44}$$
Equation (44) can be rewritten:
$$\mathrm{TE}_{Y \to X, i} = -E\!\left[\log\!\left(p_{X_i^-, Y_i^-}(X_i^-, Y_i^-)\right)\right] - E\!\left[\log\!\left(p_{X_i^p, X_i^-}(X_i^p, X_i^-)\right)\right] + E\!\left[\log\!\left(p_{X_i^p, X_i^-, Y_i^-}(X_i^p, X_i^-, Y_i^-)\right)\right] + E\!\left[\log\!\left(p_{X_i^-}(X_i^-)\right)\right]. \tag{45}$$

B. Basic Structure of TE Estimators

From Equation (8), assuming that X and Y are jointly strongly ergodic leads to:
$$\mathrm{TE}_{Y \to X} = \lim_{N \to \infty} \frac{1}{N}\sum_{i=1}^{N}\left[-\log\!\left(p_{X_i^-, Y_i^-}(X_i^-, Y_i^-)\right) - \log\!\left(p_{X_i^p, X_i^-}(X_i^p, X_i^-)\right) + \log\!\left(p_{X_i^p, X_i^-, Y_i^-}(X_i^p, X_i^-, Y_i^-)\right) + \log\!\left(p_{X_i^-}(X_i^-)\right)\right], \tag{46}$$
where the convergence holds with probability one. Hence, as a function of an observed occurrence (xi, yi), i = 1, …, N, of (Xi, Yi), i = 1, …, N, a standard estimation TE Y X ^ of TEYX is given by:
$$\widehat{\mathrm{TE}}_{Y \to X} = \widehat{H}(X^-, Y^-) + \widehat{H}(X^p, X^-) - \widehat{H}(X^p, X^-, Y^-) - \widehat{H}(X^-) = -\frac{1}{N}\sum_{n=1}^{N}\widehat{\log\!\left(p_{U_1}(u_{1n})\right)} - \frac{1}{N}\sum_{n=1}^{N}\widehat{\log\!\left(p_{U_2}(u_{2n})\right)} + \frac{1}{N}\sum_{n=1}^{N}\widehat{\log\!\left(p_{U_3}(u_{3n})\right)} + \frac{1}{N}\sum_{n=1}^{N}\widehat{\log\!\left(p_{U_4}(u_{4n})\right)}, \tag{47}$$
where U1, U2, U3 and U4 stand respectively for (X, Y), (Xp, X), (Xp, X, Y) and X.

C. The Bias of Singh’s Estimator

Let us consider the equalities $E(T_1) = E\!\left[-\log\!\left(\widehat{p_X(X_1)}\right)\right] = E\!\left[\log\!\left(\frac{N V_1}{k}\right)\right]$, where $V_1$ is the random volume for which $\upsilon_1$ is an outcome. Conditionally to $X_1 = x_1$, if we have $\frac{k}{N V_1} \to p_X(x_1)$ in probability as $N \to \infty$, then $E(T_1 \mid X_1 = x_1) \to -\log(p_X(x_1))$, and, by deconditioning, we obtain $E(T_1) \to E\!\left[-\log(p_X(X_1))\right] = H(X)$. Therefore, if $\frac{k}{N V_1}$ converges in probability to $p_X(x_1)$, the estimation of $H(X)$ is asymptotically unbiased. Here, this convergence in probability does not hold, even if we assume that $E\!\left(\frac{k}{N V_1}\right) \to p_X(x_1)$ (first-order mean convergence), because we do not have $\mathrm{var}\!\left(\frac{k}{N V_1}\right) \to 0$. The ratio $\frac{k}{N V_1}$ remains fluctuating when $N \to \infty$, because the ratio $\frac{\mathrm{var}(V_1)}{E(V_1)}$ does not tend to zero, even if $V_1$ tends to be smaller: when N increases, the neighborhoods become smaller and smaller, but continue to “fluctuate”. This explains informally (see [37] for a more detailed analysis) why the naive estimator given by Equation (17) is not asymptotically unbiased. It is interesting to note that the Kozachenko–Leonenko entropy estimator avoids this problem, and so it does not need any bias subtraction.

D. Derivation of Equation (22)

As illustrated in Figure 2, for d = 2, there are two cases to be distinguished: (1) εx and εy are determined by the same point; (2) εx and εy are determined by distinct points.
Considering the probability density $q_{i,k}(\epsilon_x, \epsilon_y)$, $(\epsilon_x, \epsilon_y) \in \mathbb{R}^2$, of the pair of random sizes $(\epsilon_x, \epsilon_y)$ (along x and y, respectively), we can extend it to the case d > 2. Hence, let us denote by $q_{x_i,k}^d(\varepsilon_1, \ldots, \varepsilon_d)$, $(\varepsilon_1, \ldots, \varepsilon_d) \in \mathbb{R}^d$, the probability density (conditional to $X_i = x_i$) of the d-dimensional random vector whose d components are respectively the d random sizes of the (hyper-)rectangle built from the random k nearest neighbors, and denote by $h_{x_i}(\varepsilon_1, \ldots, \varepsilon_d) = \int_{u \in D_{x_i}^{\varepsilon_1, \ldots, \varepsilon_d}} dP_X(u)$ the probability mass (conditional to $X_i = x_i$) of the random (hyper-)rectangle $D_{x_i}^{\epsilon_1, \ldots, \epsilon_d}$. In [24], the equality $E\!\left[\log\!\left(h_{x_i}(D_{x_i,k})\right)\right] = \psi(k) - \psi(N)$, obtained for an (hyper-)cube, is extended, for the case d > 2, to:
$$E\!\left[\log\!\left(h_{x_i}(\epsilon_1, \ldots, \epsilon_d)\right)\right] = \psi(k) - \frac{d-1}{k} - \psi(N). \tag{48}$$
Therefore, if $p_X$ is approximately constant on $D_{x_i}^{\varepsilon_1, \ldots, \varepsilon_d}$, we have:
$$h_{x_i}(\varepsilon_1, \ldots, \varepsilon_d) \approx \upsilon_i\, p_X(x_i), \tag{49}$$
where $\upsilon_i = \int_{D_{x_i}^{\varepsilon_1, \ldots, \varepsilon_d}} d\mu_d(\xi)$ is the volume of the (hyper-)rectangle, and we obtain:
$$\log p_X(x_i) \approx \psi(k) - \psi(N) - \frac{d-1}{k} - \log(\upsilon_i). \tag{50}$$
Finally, by taking the empirical mean of the right-hand term in Equation (50), we obtain an estimation of the expectation $E[\log p_X(X)] = -H(X)$, i.e.:
$$\widehat{H}(X) = -\psi(k) + \psi(N) + \frac{d-1}{k} + \frac{1}{N}\sum_{i=1}^{N}\log(\upsilon_i). \tag{51}$$

E. Proof of Property 1

Let us introduce the (hyper-)rectangle $D_{x_1}^{\epsilon'_1, \ldots, \epsilon'_d}$ centered on $x_1$, for which the random sizes along the d directions are defined by $(\epsilon'_1, \ldots, \epsilon'_d) = (\epsilon_1, \ldots, \epsilon_d) \times \left(\frac{\upsilon_r}{\epsilon_1 \times \cdots \times \epsilon_d}\right)^{1/d}$, so that $D_{x_1}^{\epsilon_1, \ldots, \epsilon_d}$ and $D_{x_1}^{\epsilon'_1, \ldots, \epsilon'_d}$ are homothetic and $D_{x_1}^{\epsilon'_1, \ldots, \epsilon'_d}$ has a (hyper-)volume constrained to the value $\upsilon_r$. We have:
$$\int_{x \in D_{x_1}^{\epsilon_1, \ldots, \epsilon_d}} d\mu_d(x) > \upsilon_r \;\Longleftrightarrow\; D_{x_1}^{\epsilon'_1, \ldots, \epsilon'_d} \subsetneq D_{x_1}^{\epsilon_1, \ldots, \epsilon_d} \;\Longleftrightarrow\; \mathrm{card}\{x_j : x_j \in D_{x_1}^{\epsilon'_1, \ldots, \epsilon'_d}\} \le k - \xi_1, \tag{52}$$
where the first equivalence (the inclusion is a strict inclusion) is clearly implied by the construction of $D_{x_1}^{\epsilon'_1, \ldots, \epsilon'_d}$ and the second equivalence expresses the fact that the (hyper-)volume of $D_{x_1}^{\epsilon_1, \ldots, \epsilon_d}$ is larger than $\upsilon_r$ if and only if the normalized domain $D_{x_1}^{\epsilon'_1, \ldots, \epsilon'_d}$ does not contain more than $(k - \xi_1)$ points $x_j$ (as $\xi_1$ of them are on the border of $D_{x_1}^{\epsilon_1, \ldots, \epsilon_d}$, which is necessarily not included in $D_{x_1}^{\epsilon'_1, \ldots, \epsilon'_d}$). These equivalences imply the following equalities between conditional probability values:
P ( T 1 > r | X 1 = x 1 , Ξ 1 = ξ 1 ) = P ( l o g ( N V 1 K ˜ 1 ) > r | X 1 = x 1 , Ξ 1 = ξ 1 ) = P ( V 1 > υ r | X 1 = x 1 , Ξ 1 = ξ 1 ) = P ( c a r d { X j : X j D x 1 ϵ 1 , , ϵ d } k ξ 1 ) .
Only $(N - 1 - \xi_1)$ events $\{X_j : X_j \in D_{x_1}^{\epsilon'_1, \dots, \epsilon'_d}\}$ are to be considered, because the variable $X_1$ and the $\xi_1$ variable(s) on the border of $D_{x_1}^{\epsilon_1, \dots, \epsilon_d}$ must be discarded. Moreover, these events are independent. Hence, the probability value in (53) can be developed as follows:
$$P(T_1 > r \mid X_1 = x_1, \Xi_1 = \xi_1) = \sum_{i=0}^{k-\xi_1} \binom{N-\xi_1-1}{i} \Big(P\big(X \in D_{x_1}^{\epsilon'_1, \dots, \epsilon'_d}\big)\Big)^i \Big(1 - P\big(X \in D_{x_1}^{\epsilon'_1, \dots, \epsilon'_d}\big)\Big)^{N-\xi_1-1-i}.$$
If $p_X$ is approximately constant on $D_{x_1}^{\epsilon'_1, \dots, \epsilon'_d}$, we have $P\big(X \in D_{x_1}^{\epsilon'_1, \dots, \epsilon'_d}\big) \simeq p_X(x_1)\, \upsilon_r$ (note that the randomness of $(\epsilon'_1, \dots, \epsilon'_d)$ does not affect this approximation, as the (hyper-)volume of $D_{x_1}^{\epsilon'_1, \dots, \epsilon'_d}$ is imposed to be equal to $\upsilon_r$). Finally, we can write:
$$P(T_1 > r \mid X_1 = x_1, \Xi_1 = \xi_1) \simeq \sum_{i=0}^{k-\xi_1} \binom{N-\xi_1-1}{i} \big(p_X(x_1)\, \upsilon_r\big)^i \big(1 - p_X(x_1)\, \upsilon_r\big)^{N-\xi_1-1-i}.$$
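To make the link with Appendix F explicit, we add the limiting step (our interpolation, under the usual binomial-to-Poisson argument): since $T_1 = \log(N V_1 / \tilde{K}_1)$, the threshold volume is $\upsilon_r = \tilde{k}_1 e^r / N$, so that $p_X(x_1)\upsilon_r \to 0$ while $(N - \xi_1 - 1)\, p_X(x_1)\upsilon_r \to \tilde{k}_1\, p_X(x_1)\, e^r$ as $N \to \infty$, and the binomial sum above converges to a Poisson tail:
$$\lim_{N \to \infty} P(T_1 > r \mid X_1 = x_1, \Xi_1 = \xi_1) = \sum_{i=0}^{k-\xi_1} \frac{[\tilde{k}_1\, p_X(x_1)\, e^r]^i}{i!}\, e^{-\tilde{k}_1\, p_X(x_1)\, e^r},$$
which is the expression differentiated in Appendix F.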

F. Derivation of Equation (32)

Since $P(T_1 \le r \mid X_1 = x_1, \Xi_1 = \xi_1) = 1 - P(T_1 > r \mid X_1 = x_1, \Xi_1 = \xi_1)$, we take the derivative of $P(T_1 \le r \mid X_1 = x_1, \Xi_1 = \xi_1)$ with respect to $r$ to get the conditional density function of $T_1$:
$$\begin{aligned}
P'(T_1 \le r \mid X_1 = x_1, \Xi_1 = \xi_1) &= -P'(T_1 > r \mid X_1 = x_1, \Xi_1 = \xi_1) \\
&= -\left[\sum_{i=0}^{k-\xi_1} \frac{[\tilde{k}_1 p_X(x_1) e^r]^i}{i!}\, e^{-\tilde{k}_1 p_X(x_1) e^r}\right]' \\
&= -\sum_{i=0}^{k-\xi_1}\left( \left[\frac{[\tilde{k}_1 p_X(x_1) e^r]^i}{i!}\right]' e^{-\tilde{k}_1 p_X(x_1) e^r} + \frac{[\tilde{k}_1 p_X(x_1) e^r]^i}{i!} \left[e^{-\tilde{k}_1 p_X(x_1) e^r}\right]' \right) \\
&= -\sum_{i=0}^{k-\xi_1}\left( \frac{i\,[\tilde{k}_1 p_X(x_1) e^r]^{i-1} (\tilde{k}_1 p_X(x_1) e^r)}{i!}\, e^{-\tilde{k}_1 p_X(x_1) e^r} - \frac{[\tilde{k}_1 p_X(x_1) e^r]^i}{i!}\, e^{-\tilde{k}_1 p_X(x_1) e^r} (\tilde{k}_1 p_X(x_1) e^r) \right) \\
&= -\sum_{i=0}^{k-\xi_1} e^{-\tilde{k}_1 p_X(x_1) e^r} \left( \frac{[\tilde{k}_1 p_X(x_1) e^r]^i}{(i-1)!} - \frac{[\tilde{k}_1 p_X(x_1) e^r]^{i+1}}{i!} \right).
\end{aligned}$$
Defining:
$$a(i) \triangleq \frac{[\tilde{k}_1 p_X(x_1) e^r]^i}{(i-1)!} \ \ (i \ge 1) \quad \text{and} \quad a(0) = 0,$$
we have:
$$\begin{aligned}
P'(T_1 \le r \mid X_1 = x_1, \Xi_1 = \xi_1) &= -\sum_{i=0}^{k-\xi_1} e^{-\tilde{k}_1 p_X(x_1) e^r}\, \big(a(i) - a(i+1)\big) \\
&= -e^{-\tilde{k}_1 p_X(x_1) e^r}\, \big(a(0) - a(k-\xi_1+1)\big) \\
&= e^{-\tilde{k}_1 p_X(x_1) e^r}\, a(k-\xi_1+1) \\
&= \frac{[\tilde{k}_1 p_X(x_1) e^r]^{\,k-\xi_1+1}}{(k-\xi_1)!}\, e^{-\tilde{k}_1 p_X(x_1) e^r}.
\end{aligned}$$
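As a quick consistency check (our remark), substituting $z = \tilde{k}_1\, p_X(x_1)\, e^r$, so that $dz = z\, dr$, the conditional density obtained above becomes
$$\frac{z^{(k-\xi_1+1)-1}}{\Gamma(k-\xi_1+1)}\, e^{-z}, \qquad z > 0,$$
i.e., a Gamma$(k-\xi_1+1, 1)$ density, which integrates to one and is precisely the form integrated against the logarithmic terms in Appendix G.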

G. Derivation of Equation (33)

With the substitution $z = \tilde{k}_1\, p_X(x_1)\, e^r$ (so that $dz = z\, dr$):
$$\begin{aligned}
\lim_{N \to \infty} E(T_1 \mid X_1 = x_1) &= \int_{-\infty}^{+\infty} r\, \frac{[\tilde{k}_1\, p_X(x_1)\, e^r]^{\,k-\xi_1+1}}{(k-\xi_1)!}\, e^{-\tilde{k}_1 p_X(x_1) e^r}\, dr \\
&= \int_{0}^{\infty} \big[\log(z) - \log(\tilde{k}_1) - \log p_X(x_1)\big]\, \frac{z^{\,k-\xi_1}}{(k-\xi_1)!}\, e^{-z}\, dz \\
&= \frac{1}{\Gamma(k-\xi_1+1)} \int_{0}^{\infty} \log(z)\, z^{\,k-\xi_1}\, e^{-z}\, dz - \log(\tilde{k}_1) - \log p_X(x_1) \\
&= \frac{1}{\Gamma(k-\xi_1+1)} \int_{0}^{\infty} \log(z)\, z^{\,(k-\xi_1+1)-1}\, e^{-z}\, dz - \log(\tilde{k}_1) - \log p_X(x_1) \\
&= \frac{\Gamma'(k-\xi_1+1)}{\Gamma(k-\xi_1+1)} - \log(\tilde{k}_1) - \log p_X(x_1) \\
&= \psi(k-\xi_1+1) - \log(\tilde{k}_1) - \log p_X(x_1).
\end{aligned}$$
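The last two steps rely on the classical identity $\int_0^\infty \log(z)\, z^{a-1} e^{-z}\, dz = \Gamma'(a) = \Gamma(a)\,\psi(a)$. A short numerical check (ours; the helper name is arbitrary):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import digamma, gamma

def check_digamma_identity(a):
    """Verify numerically that (1/Gamma(a)) * int_0^inf log(z) z^(a-1) e^(-z) dz = psi(a)."""
    integral, _ = quad(lambda z: np.log(z) * z ** (a - 1) * np.exp(-z), 0, np.inf)
    return integral / gamma(a), digamma(a)

# e.g. k = 5 and xi_1 = 1 give a = k - xi_1 + 1 = 5
print(check_digamma_identity(5))   # both values are approximately 1.5061
```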

H. Transfer Entropy and Granger Causality

TE can be considered as a measure of the degree to which the history $Y^-$ of the process $Y$ disambiguates the future $X^p$ of $X$ beyond the degree to which its own history $X^-$ disambiguates this future [22]. It is an information-theoretic implementation of Wiener's principle of observational causality. Hence, TE bears a natural relation to Granger causality. As is well known, Granger causality emphasizes the concept of reduction of the mean square error of the linear prediction of $X_i^p$ when adding $Y_i^-$ to $X_i^-$, by introducing the Granger causality index:
$$GC_{Y \to X} = \log\left[\frac{\mathrm{var}\big(lpe_{X_i^p \mid X_i^-}\big)}{\mathrm{var}\big(lpe_{X_i^p \mid X_i^-, Y_i^-}\big)}\right],$$
where $lpe_{X_i^p \mid U}$ is the error when predicting $X_i^p$ linearly from $U$. TE is framed in terms of the reduction of the Shannon uncertainty (entropy) of the predictive probability distribution. When the probability distribution of $(X_i^p, X_i^-, Y_i^-)$ is assumed to be Gaussian, TE and Granger causality are entirely equivalent, up to a factor of two [42]:
$$TE_{Y \to X} = \tfrac{1}{2}\, GC_{Y \to X}.$$
Consequently, in the Gaussian case, TE can easily be computed from a second-order statistical characterization of $(X_i^p, X_i^-, Y_i^-)$. This Gaussian assumption obviously holds when the processes $Y$ and $X$ are jointly normally distributed and, more particularly, when they correspond to a Gaussian autoregressive (AR) bivariate process. In [42], Barnett et al. discussed the relation between these two causality measures, bridging information-theoretic and autoregressive approaches.
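As a minimal numerical sketch of this equivalence (ours; the coupled AR(1) model, the order-1 histories and the helper below are illustrative choices, not the models used in the simulations of this paper): we fit the restricted and full linear predictors by least squares, form $GC_{Y \to X}$ from the residual variances, and read off $TE_{Y \to X} = GC_{Y \to X}/2$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate a bivariate Gaussian AR(1) process with a Y -> X coupling.
T = 20000
x = np.zeros(T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.6 * y[t - 1] + rng.standard_normal()
    x[t] = 0.5 * x[t - 1] + 0.4 * y[t - 1] + rng.standard_normal()

def residual_variance(target, regressors):
    """Variance of the residual of the least-squares linear prediction of `target`."""
    A = np.column_stack(regressors + [np.ones(len(target))])  # regressors plus intercept
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return np.var(target - A @ coef)

# Granger causality index with order-1 histories (m = n = 1).
xp, xm, ym = x[1:], x[:-1], y[:-1]
gc_y_to_x = np.log(residual_variance(xp, [xm]) / residual_variance(xp, [xm, ym]))
te_y_to_x = 0.5 * gc_y_to_x            # Gaussian case: TE = GC / 2
print(gc_y_to_x, te_y_to_x)
```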

I. Comparison between Entropy Estimators

Figure 8 displays the values of entropy for a Gaussian $d$-dimensional vector as a function of the number of neighbors $k$, for $d = 3$ in Figure 8a and $d = 8$ in Figure 8b, obtained with different estimators. The theoretical entropy value is compared with its estimation from the Kozachenko–Leonenko reference estimator (Equation (10), red circles), its extension (Equation (22), black stars) and the extension of Singh's estimator (Equation (35), blue squares). It appears clearly that, for the extended Singh's estimator, the bias (true value minus estimated value) increases drastically when the number of neighbors decreases below a threshold slightly lower than the dimension $d$ of the vector. This allows us to interpret some apparently surprising results obtained with this estimator in the estimation of TE, as reported in Figure 6b. TE estimation is a sum of four separate vector entropy estimations, $\widehat{TE}_{Y \to X} = \widehat{H(X^-, Y^-)} + \widehat{H(X^p, X^-)} - \widehat{H(X^p, X^-, Y^-)} - \widehat{H(X^-)}$. Here, the dimensions of the four vectors are $d(X^-, Y^-) = m + n = 4$, $d(X^p, X^-) = 1 + m = 3$, $d(X^p, X^-, Y^-) = 1 + m + n = 5$ and $d(X^-) = m = 2$, respectively. Note that, if we denote by $X_{M2}$ and $Y_{M2}$ the two components in Model 2, the general notation $(X^p, X^-, Y^-)$ corresponds to $(Y_{M2}^p, X_{M2}^-, Y_{M2}^-)$, because in Figure 6b the analyzed direction is $X \to Y$ and not the reverse. When considering the estimation of $H(X^p, X^-, Y^-)$, we have $d = 5$ and $k = 3$, the latter being the imposed number of neighbors in the global space. Consequently, from the results shown in Figure 8, we can expect that, in Model 2, $\widehat{H(X^p, X^-, Y^-)}$ will be drastically underestimated. For the other components, $\widehat{H(X^-, Y^-)}$, $\widehat{H(X^p, X^-)}$ and $\widehat{H(X^-)}$, the numbers of neighbors to consider are generally larger than three (as a consequence of Kraskov's technique, which introduces projected distances) and $d \le 5$, so that we do not expect any underestimation of these terms. Therefore, globally, when summing the four entropy estimations, the resulting positive bias observed in Figure 6b is understandable.
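To make the decomposition above concrete, here is a minimal sketch (ours) that only performs the bookkeeping of the four-term sum; it takes any k-NN entropy function (for instance the Kozachenko–Leonenko sketch given in Appendix C) and deliberately omits the projected-distance mechanism of the standard and extended algorithms, so each term keeps its own bias.

```python
import numpy as np

def naive_te(xp, x_past, y_past, k, entropy):
    """TE_{Y->X} assembled as H(X-,Y-) + H(Xp,X-) - H(Xp,X-,Y-) - H(X-),
    each term estimated independently by the supplied k-NN entropy function.
    xp, x_past, y_past: 2-D arrays with one row per realization of X^p, X^-, Y^-."""
    h_xy  = entropy(np.hstack([x_past, y_past]), k)
    h_px  = entropy(np.hstack([xp, x_past]), k)
    h_pxy = entropy(np.hstack([xp, x_past, y_past]), k)
    h_x   = entropy(x_past, k)
    return h_xy + h_px - h_pxy - h_x
```

Since the four terms live in spaces of dimensions 4, 3, 5 and 2 (for $m = 2$, $n = 2$) while $k$ is fixed, the highest-dimensional term $H(X^p, X^-, Y^-)$ is the one most exposed to the small-$k$ behavior visible in Figure 8, which is precisely what the projected-distance strategy of the standard and extended algorithms is designed to mitigate.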

References

1. Schreiber, T. Measuring information transfer. Phys. Rev. Lett. 2000, 85.
2. Gourévitch, B.; Eggermont, J.J. Evaluating information transfer between auditory cortical neurons. J. Neurophysiol. 2007, 97, 2533–2543.
3. Hlaváčková-Schindler, K.; Paluš, M.; Vejmelka, M.; Bhattacharya, J. Causality detection based on information-theoretic approaches in time series analysis. Phys. Rep. 2007, 441, 1–46.
4. Sabesan, S.; Narayanan, K.; Prasad, A.; Iasemidis, L.; Spanias, A.; Tsakalis, K. Information flow in coupled nonlinear systems: Application to the epileptic human brain. In Data Mining in Biomedicine; Springer: Berlin/Heidelberg, Germany, 2007; pp. 483–503.
5. Ma, C.; Pan, X.; Wang, R.; Sakagami, M. Estimating causal interaction between prefrontal cortex and striatum by transfer entropy. Cogn. Neurodyn. 2013, 7, 253–261.
6. Vakorin, V.A.; Krakovska, O.A.; McIntosh, A.R. Confounding effects of indirect connections on causality estimation. J. Neurosci. Methods 2009, 184, 152–160.
7. Yang, C.; Le Bouquin Jeannès, R.; Bellanger, J.J.; Shu, H. A new strategy for model order identification and its application to transfer entropy for EEG signals analysis. IEEE Trans. Biomed. Eng. 2013, 60, 1318–1327.
8. Zuo, K.; Zhu, J.; Bellanger, J.J.; Le Bouquin Jeannès, R. Adaptive kernels and transfer entropy for neural connectivity analysis in EEG signals. IRBM 2013, 34, 330–336.
9. Faes, L.; Nollo, G. Bivariate nonlinear prediction to quantify the strength of complex dynamical interactions in short-term cardiovascular variability. Med. Biol. Eng. Comput. 2006, 44, 383–392.
10. Faes, L.; Nollo, G.; Porta, A. Non-uniform multivariate embedding to assess the information transfer in cardiovascular and cardiorespiratory variability series. Comput. Biol. Med. 2012, 42, 290–297.
11. Faes, L.; Nollo, G.; Porta, A. Information-based detection of nonlinear Granger causality in multivariate processes via a nonuniform embedding technique. Phys. Rev. E 2011, 83, 051112.
12. Runge, J.; Heitzig, J.; Petoukhov, V.; Kurths, J. Escaping the curse of dimensionality in estimating multivariate transfer entropy. Phys. Rev. Lett. 2012, 108, 258701.
13. Duan, P.; Yang, F.; Chen, T.; Shah, S.L. Direct causality detection via the transfer entropy approach. IEEE Trans. Control Syst. Technol. 2013, 21, 2052–2066.
14. Bauer, M.; Thornhill, N.F.; Meaburn, A. Specifying the directionality of fault propagation paths using transfer entropy. In Proceedings of the 7th International Symposium on Dynamics and Control of Process Systems (DYCOPS 7), Cambridge, MA, USA, 7–9 July 2004; pp. 203–208.
15. Bauer, M.; Cox, J.W.; Caveness, M.H.; Downs, J.J.; Thornhill, N.F. Finding the direction of disturbance propagation in a chemical process using transfer entropy. IEEE Trans. Control Syst. Technol. 2007, 15, 12–21.
16. Kulp, C.; Tracy, E. The application of the transfer entropy to gappy time series. Phys. Lett. A 2009, 373, 1261–1267.
17. Overbey, L.; Todd, M. Dynamic system change detection using a modification of the transfer entropy. J. Sound Vib. 2009, 322, 438–453.
18. Gray, R.M. Entropy and Information Theory; Springer: Berlin/Heidelberg, Germany, 2011.
19. Roman, P. Some Modern Mathematics for Physicists and Other Outsiders: An Introduction to Algebra, Topology, and Functional Analysis; Elsevier: Amsterdam, The Netherlands, 2014.
20. Kugiumtzis, D. Direct-coupling information measure from nonuniform embedding. Phys. Rev. E 2013, 87, 062918.
21. Montalto, A.; Faes, L.; Marinazzo, D. MuTE: A MATLAB toolbox to compare established and novel estimators of the multivariate transfer entropy. PLoS One 2014, 9, e109462.
22. Paluš, M.; Komárek, V.; Hrnčíř, Z.; Štěrbová, K. Synchronization as adjustment of information rates: Detection from bivariate time series. Phys. Rev. E 2001, 63, 046211.
23. Frenzel, S.; Pompe, B. Partial mutual information for coupling analysis of multivariate time series. Phys. Rev. Lett. 2007, 99, 204101.
24. Kraskov, A.; Stögbauer, H.; Grassberger, P. Estimating mutual information. Phys. Rev. E 2004, 69, 066138.
25. Vicente, R.; Wibral, M.; Lindner, M.; Pipa, G. Transfer entropy—A model-free measure of effective connectivity for the neurosciences. J. Comput. Neurosci. 2011, 30, 45–67.
26. Lindner, M.; Vicente, R.; Priesemann, V.; Wibral, M. TRENTOOL: A Matlab open source toolbox to analyse information flow in time series data with transfer entropy. BMC Neurosci. 2011, 12, 119.
27. Wibral, M.; Vicente, R.; Lindner, M. Transfer Entropy in Neuroscience. In Directed Information Measures in Neuroscience; Springer: Berlin/Heidelberg, Germany, 2014; pp. 3–36.
28. Gómez-Herrero, G.; Wu, W.; Rutanen, K.; Soriano, M.C.; Pipa, G.; Vicente, R. Assessing coupling dynamics from an ensemble of time series. 2010; arXiv:1008.0539.
29. Vlachos, I.; Kugiumtzis, D. Nonuniform state-space reconstruction and coupling detection. Phys. Rev. E 2010, 82, 016207.
30. Lizier, J.T. JIDT: An information-theoretic toolkit for studying the dynamics of complex systems. 2014; arXiv:1408.3270.
31. MILCA Toolbox. Available online: http://www.ucl.ac.uk/ion/departments/sobell/Research/RLemon/MILCA/MILCA (accessed on 11 June 2015).
32. Wibral, M.; Rahm, B.; Rieder, M.; Lindner, M.; Vicente, R.; Kaiser, J. Transfer entropy in magnetoencephalographic data: Quantifying information flow in cortical and cerebellar networks. Prog. Biophys. Mol. Biol. 2011, 105, 80–97.
33. Wollstadt, P.; Martínez-Zarzuela, M.; Vicente, R.; Díaz-Pernas, F.J.; Wibral, M. Efficient transfer entropy analysis of non-stationary neural time series. PLoS One 2014, 9, e102833.
34. Wollstadt, P.; Lindner, M.; Vicente, R.; Wibral, M.; Pampu, N.; Martinez-Zarzuela, M. TRENTOOL Toolbox. Available online: http://www.trentool.de (accessed on 11 June 2015).
35. Wibral, M.; Pampu, N.; Priesemann, V.; Siebenhühner, F.; Seiwert, H.; Lindner, M.; Lizier, J.T.; Vicente, R. Measuring information-transfer delays. PLoS One 2013, 8, e55809.
36. Singh, H.; Misra, N.; Hnizdo, V.; Fedorowicz, A.; Demchuk, E. Nearest neighbor estimates of entropy. Am. J. Math. Manag. Sci. 2003, 23, 301–321.
37. Zhu, J.; Bellanger, J.J.; Shu, H.; Yang, C.; Le Bouquin Jeannès, R. Bias reduction in the estimation of mutual information. Phys. Rev. E 2014, 90, 052714.
38. Fukunaga, K.; Hostetler, L. Optimization of k nearest neighbor density estimates. IEEE Trans. Inf. Theory 1973, 19, 320–326.
39. Fukunaga, K. Introduction to Statistical Pattern Recognition; Academic Press: Waltham, MA, USA, 1990.
40. Merkwirth, C.; Parlitz, U.; Lauterborn, W. Fast nearest-neighbor searching for nonlinear signal processing. Phys. Rev. E 2000, 62, 2089–2097.
41. Gao, S.; Steeg, G.V.; Galstyan, A. Efficient Estimation of Mutual Information for Strongly Dependent Variables. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (AISTATS), San Diego, CA, USA, 9–12 May 2015; pp. 277–286.
42. Barnett, L.; Barrett, A.B.; Seth, A.K. Granger causality and transfer entropy are equivalent for Gaussian variables. Phys. Rev. Lett. 2009, 103, 238701.
Figure 1. Concepts and methodology involved in k-nearest-neighbors transfer entropy (TE) estimation. Standard k-nearest-neighbors methods using maximum norm for probability density and entropy non-parametric estimation introduce, around each data point, a minimal (hyper-)cube (Box ①), which includes the first k-nearest neighbors, as is the case for two entropy estimators, namely the well-known Kozachenko–Leonenko estimator (Box ③) and the less commonly used Singh’s estimator (Box ②). The former was used in [24] to measure mutual information (MI) between two signals X and Y by Kraskov et al., who proposed an MI estimator (Kraskov–Stögbauer–Grassberger (KSG) MI Estimator 1, Box ⑪) obtained by summing three entropy estimators (two estimators for the marginal entropies and one for the joint entropy). The strategy was to constrain the three corresponding (hyper-)cubes, including nearest neighbors, respectively in spaces S X, S Y and S X , Y, to have an identical edge length (the idea of projected distances, Box ⑭) for a better cancellation of the three corresponding biases. The same approach was used to derive the standard TE estimator [25–29] (Box ⑩), which has been implemented in the TRENTOOL toolbox, Version 3.0. In [24], Kraskov et al. also suggested, for MI estimation, to replace minimal (hyper-)cubes with smaller minimal (hyper-)rectangles equal to the product of two minimal (hyper-)cubes built separately in subspaces S X and S Y (KSG MI Estimator 2, Box ⑫) to exploit more efficiently the Kozachenko–Leonenko approach. An extended algorithm for TE estimation based on minimal (hyper-)rectangles equal to products of (hyper-)cubes was then proposed in [27] (extended TE estimator, Box ⑨) and implemented in the JIDT toolbox [30]. Boxes ⑩ and ⑨ are marked as “standard algorithm” and “extended algorithm”. The new idea extends the idea of the product of cubes (Box ⑬). It consists of proposing a different construction of the neighborhoods, which are no longer minimal (hyper-)cubes, nor products of (hyper-)cubes, but minimal (hyper-)rectangles (Box ④), with possibly a different length for each dimension, to get two novel entropy estimators (Boxes ⑤ and ⑥), respectively derived from Singh’s entropy estimator and the Kozachenko–Leonenko entropy estimator. These two new entropy estimators lead respectively to two new TE estimators (Box ⑦ and Box ⑧) to be compared with the standard and extended TE estimators.
Figure 2. In this two-dimensional example, k = 5. The origin of the Cartesian axis corresponds to the current point $x_i$. Only the five nearest neighbors of this point, i.e., the points in the set $\chi_i^k$, are represented. The fifth nearest neighbor is symbolized by a star. The neighboring regions ABCD, obtained from the maximum norm around the center point, are squares, with equal edge lengths $\epsilon_x = \epsilon_y$. Reducing one of the edge lengths, $\epsilon_x$ or $\epsilon_y$, until one point falls onto the border (in the present case, in the vertical direction), leads to the minimum-size rectangle A′B′C′D′, where $\epsilon_x \neq \epsilon_y$. Two cases must be considered: (a) the fifth neighbor is not localized on a node, but between two nodes, contrary to (b). This leads to obtaining either two points (respectively the star and the triangle in (a)) or only one point (the star in (b)) on the border of A′B′C′D′. Clearly, it is theoretically possible to have more than two points on the border of A′B′C′D′, but the probability of such an occurrence is equal to zero when the probability distribution of the random points $X_j$ is continuous.
Figure 3. Information transfer from Z to X (Model 1) estimated for two different dimensions with k = 8. The figure displays the mean values and the standard deviations: (a) dY = dZ = 3; (b) dY = dZ = 8.
Figure 4. Information transfer (Model 2), mean values and standard deviations, k = 8. (a) From X to Y; (b) from Y to X.
Figure 5. Information transfer (Model 3), mean values and standard deviations, k = 8. (a) From X to Y; (b) from Y to X; (c) from X to Z; (d) from Z to X; (e) from Y to Z; (f) from Z to Y.
Figure 6. Information transfer from X to Y (Model 2), mean values and standard deviations: (a) k = 4; (b) k = 3.
Figure 7. Box plots of the centered errors obtained with the five methods for Model 2, X → Y: (a) k = 8 (corresponding to Figure 4a); (b) k = 3 (corresponding to Figure 6b).
Figure 8. Comparison between four entropy estimators: (a) d = 3; (b) d = 8. The covariance matrix of the signals is a Toeplitz matrix with first line $\beta^{[0:d-1]}$ (i.e., $[1, \beta, \ldots, \beta^{d-1}]$), where $\beta = 0.5$. “Curve 1” stands for the true value; “Curve 2”, “Curve 3” and “Curve 4” correspond to the values of entropy obtained using respectively Equations (10), (22) and (35).
