A Consistent Nonparametric Test for Granger Non-Causality Based on the Transfer Entropy

To date, testing for Granger non-causality using kernel density-based nonparametric estimates of the transfer entropy has been hindered by the intractability of the asymptotic distribution of the estimators. We overcome this by shifting from the transfer entropy to its first-order Taylor expansion near the null hypothesis, which is also non-negative and zero if and only if Granger causality is absent. The estimated Taylor expansion can be expressed in terms of a U-statistic, demonstrating asymptotic normality. After studying its size and power properties numerically, the resulting test is illustrated empirically with applications to stock indices and exchange rates.


Introduction
Characterizing causal interactions between time series was challenging until Granger, in his pioneering work, brought forward the concept later known as Granger causality [1]. Since then, testing for causal effects has attracted attention not only in economics and econometrics, but also in neuroscience [2,3], biology [4] and physics [5], among others.
The vector autoregressive (VAR) modeling-based test has become a popular methodology over the last decades, with repeated debates on its validity. As we see it, there are at least two critical problems with parametric causality tests. First, being based on a classical linear VAR model, traditional Granger causality tests may overlook significant nonlinear dynamical relationships between variables. As Granger [6] put it, nonlinear models represent the proper way to model the real world which is 'almost certainly nonlinear'. Secondly, parametric approaches to causality testing bear the risk of model mis-specification. A wrong regression model could lead to a lack of power, or worse, unjustified conclusions. For example, Baek and Brock construct an example where nonlinear causal relations cannot be detected by a traditional linear causality test [7].
In a series of studies, authors have tried to relax parametric model assumptions and provide nonparametric versions of Granger causality tests, which essentially are tests for conditional independence. Hiemstra and Jones were among the first to propose a formal nonparametric approach [8]. By modifying Baek and Brock's nonparametric method [7], and developing asymptotic theory, Hiemstra and Jones obtained a nonparametric test for Granger causality. However, as this test suffers from a fundamental inconsistency problem, [9] proposed a modified, consistent version of the test based on kernel density estimators, hereafter referred to as the DP test. Alternative semiparametric and nonparametric tests for conditional independence have been proposed based on, among others, additive models [10], the Hellinger distance measure [11], copulas [12], generalized empirical distribution functions [13], empirical likelihood ratios [14] and characteristic functions [15].
The scope of this paper is to provide a novel test for Granger causality, based on the information-theoretical notion of transfer entropy (hereafter TE), coined by Schreiber [16]. The transfer entropy was initially used to measure asymmetric information exchange in a bivariate system. By using appropriate conditional densities, the transfer entropy is able to measure information transfer from one variable to another. This property makes it attractive for detecting conditional dependence in dynamical settings in a general (distributional) sense. We refer to [17,18] for detailed reviews of the relation between Granger causality and directed information theory.
Despite the attractive properties of the transfer entropy and related information-theoretical notions, such as the mutual information, the application of concepts from information theory to time series analysis has proved difficult due to the lack of asymptotic theory for nonparametric estimators of these information-theoretical measures. For example, Granger and Lin [19] utilize entropy to detect serial dependence using critical values obtained by simulation. Hong and White [20] prove asymptotic normality for an entropy-based statistic, but the asymptotics only hold for a specific kernel function. Barnett [21] established an asymptotic $\chi^2$ distribution for transfer entropy estimators in parametric settings. Establishing asymptotic distribution theory for a fully nonparametric transfer entropy measure is challenging, if not impossible. Diks and Fang [22] provide numerical comparisons to gain some insights into the statistical behavior of nonparametric transfer entropy-based tests.
In this paper, we propose a test statistic based on a first order Taylor expansion of the transfer entropy, which is shown to be asymptotically normally distributed. Instead of directly deriving the limiting distribution of the transfer entropy, which is hard to obtain, we bypass the problem by focusing on a quantity that locally (near the null hypothesis) is similar, but globally different, while still sharing the global positive-definiteness property with the transfer entropy. Furthermore, we show that this new test statistic is closely related to the DP test (Section 2.2), and follow a similar approach to finding the asymptotic normal distribution of the estimator of the Taylor expansion.
This paper is organized as follows. Section 2 provides a short introduction to the nonparametric DP test and its lack of power against certain alternatives. Subsequently, the transfer entropy and a nonparametric test based on its first order Taylor expansion near the null hypothesis are introduced. The close linkage of this novel test statistic with the DP test is shown, and asymptotic normality is proved using an asymptotically equivalent U-statistic representation of the estimator. Section 2 also discusses the optimal bandwidth selection rule for specific cases. Section 3 deals with Monte Carlo simulations; three different data generating processes are considered, enabling a direct comparison of size and power between the modified DP test and the DP test. Section 4 considers two financial applications. In the first, we apply the new test to stock volume and return data to make a direct comparison with the DP test; in the second application high frequency exchange rates of main currencies are tested. Finally, Section 5 summarizes.

Nonparametric Granger Non-Causality Tests
This subsection provides some basic concepts and definitions for Granger causality, and the idea of nonparametrically testing for conditional independence. We restrict ourselves to the bivariate setting as it is the most common implementation, although generalization to multivariate densities is possible.
Intuitively, for a strictly stationary bivariate process $\{(X_t, Y_t)\}$, $t \in \mathbb{Z}$, it is said that $\{X_t\}$ Granger causes $\{Y_t\}$ if current and past values of $\{X_t\}$ contain some additional information, beyond that in current and past values of $\{Y_t\}$, about future values of $\{Y_t\}$. A linear Granger causality test based on a parametric VAR model can be seen as a special case where testing for conditional independence is equivalent to testing a restriction in the conditional mean specification.
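To make the "restriction in the conditional mean" concrete, the following minimal sketch tests whether the coefficient on $X_{t-1}$ is zero in a one-lag linear regression for $Y_t$. The function name `linear_granger_f`, the one-lag specification and the simulated coefficients are illustrative assumptions, not part of the paper.

```python
import numpy as np

def linear_granger_f(x, y):
    """F statistic for H0: b = 0 in y[t] = c + a*y[t-1] + b*x[t-1] + e[t].

    Hypothetical helper for illustration; one lag only.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    target = y[1:]
    ones = np.ones_like(target)
    restricted = np.column_stack([ones, y[:-1]])            # own past only
    unrestricted = np.column_stack([ones, y[:-1], x[:-1]])  # plus past of x

    def rss(design):
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        return float(resid @ resid)

    rss_r, rss_u = rss(restricted), rss(unrestricted)
    dof = len(target) - unrestricted.shape[1]
    # One restriction, so the numerator is not divided further.
    return (rss_r - rss_u) / (rss_u / dof)

# Simulated example (assumed coefficients): x drives y, but not vice versa.
rng = np.random.default_rng(0)
n = 2000
x = rng.standard_normal(n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.3 * y[t - 1] + 0.5 * x[t - 1] + rng.standard_normal()

f_xy = linear_granger_f(x, y)  # tests x -> y
f_yx = linear_granger_f(y, x)  # tests y -> x
```

Such a test only detects dependence that shows up in the conditional mean, which is the limitation the nonparametric approach is meant to remove.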
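In a more general setting, the null hypothesis of Granger non-causality can be rephrased in terms of conditional dependence between two time series: $\{X_t\}$ is a Granger cause of $\{Y_t\}$ if the distribution of $\{Y_t\}$ conditional on its own history is not the same as that conditional on the histories of both $\{X_t\}$ and $\{Y_t\}$. If we denote the information sets of $\{X_t\}$ and $\{Y_t\}$ until time $t$ by $\mathcal{F}_{X,t}$ and $\mathcal{F}_{Y,t}$, respectively, and use '$\sim$' to denote equivalence in distribution, we may give a formal and general definition of Granger causality. For a strictly stationary bivariate process $\{(X_t, Y_t)\}$, $t \in \mathbb{Z}$, $\{X_t\}$ is a Granger cause of $\{Y_t\}$ if, for some $k \geq 1$,
$$(Y_{t+1}, \ldots, Y_{t+k}) \mid (\mathcal{F}_{X,t}, \mathcal{F}_{Y,t}) \not\sim (Y_{t+1}, \ldots, Y_{t+k}) \mid \mathcal{F}_{Y,t}.$$
In the absence of Granger causality, i.e., if
$$(Y_{t+1}, \ldots, Y_{t+k}) \mid (\mathcal{F}_{X,t}, \mathcal{F}_{Y,t}) \sim (Y_{t+1}, \ldots, Y_{t+k}) \mid \mathcal{F}_{Y,t}$$
for all $k \geq 1$, $\{X_t\}$ has no influence on the distribution of future $\{Y_t\}$. This is also referred to as Granger non-causality and often expressed as conditional independence between $\{X_t\}$ and $\{Y_t\}$ as
$$(Y_{t+1}, \ldots, Y_{t+k}) \perp (X_t, \ldots, X_{t-m}) \mid \mathcal{F}_{Y,t} \qquad (1)$$
for $m = 0, 1, 2, \ldots$. Granger non-causality, as expressed in Equation (1), lays the first stone for a nonparametric test without imposing any parametric assumptions, apart from strict stationarity and weak dependence, about the data generating process or underlying distributions for $\{X_t\}$ and $\{Y_t\}$.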
The orthogonality here concerns not only the conditional mean, but also higher conditional moments. We assume two things here. First, $\{(X_t, Y_t)\}$ is a strictly stationary bivariate process. Second, because in nonparametric settings it is infeasible in practice to condition on the entire past of $X_t$ and $Y_t$, we implicitly consider the process to be of finite Markov orders $l_X < \infty$ and $l_Y < \infty$ in the past of $X_t$ and $Y_t$, respectively. The null hypothesis of Granger non-causality is $H_0$: $\{X_t\}$ is not a Granger cause of $\{Y_t\}$. To keep the focus on the main contribution of this paper (the Taylor expansion of the transfer entropy and its asymptotic distribution), we limit ourselves to the bivariate case with single lags in the past, such that $k = l_X = l_Y = 1$, which so far has been the case considered most in the literature on nonparametric Granger non-causality. Extensions to higher lags and/or higher-variate processes are feasible, but require reduction of the bias order by data sharpening or higher-order density estimation kernels (see e.g., Diks and Wolski, 2016). We define the three-variate vector $W_t = (X_t, Y_t, Z_t)$, where $Z_t = Y_{t+1}$, and let $W = (X, Y, Z)$ denote a three-dimensional random vector with distribution equal to the invariant distribution of $W_t$. In terms of density functions $f_{\cdot}(\cdot)$ (which are assumed to exist), and given $k = l_X = l_Y = 1$, Equation (1) can be phrased as
$$\frac{f_{X,Y,Z}(x, y, z)}{f_{X,Y}(x, y)} = \frac{f_{Y,Z}(y, z)}{f_Y(y)} \qquad (2)$$
for all $(x, y, z)$ in the support of $W$, or equivalently as
$$\frac{f_{X,Y,Z}(x, y, z)}{f_Y(y)} = \frac{f_{X,Y}(x, y)}{f_Y(y)} \, \frac{f_{Y,Z}(y, z)}{f_Y(y)} \qquad (3)$$
for all $(x, y, z)$ in the support of $W$. A nonparametric test for Granger non-causality seeks to find statistical evidence of violation of Equation (2) or Equation (3). There are many nonparametric measures available for this purpose, some of which are mentioned above.
However, as far as we know, the DP test, to be described below, currently is the only fully nonparametric test that is known to have correct asymptotic size under the null hypothesis of Granger non-causality.

The DP Test
Hiemstra and Jones [8] proposed to test the condition expressed by Equation (2) by calculating correlation integrals for each density and measuring the discrepancy between the two sides of the equation. However, their test is known to suffer from severe size distortion due to the fact that the quantity on which the test is based is inconsistent with Equation (2). To overcome this problem, DP suggest using a conditional dependence measure that incorporates a local weight function $g(x, y, z)$, formulating Equation (3) as
$$E\left[\left(\frac{f_{X,Y,Z}(X, Y, Z)}{f_Y(Y)} - \frac{f_{X,Y}(X, Y)}{f_Y(Y)} \, \frac{f_{Y,Z}(Y, Z)}{f_Y(Y)}\right) g(X, Y, Z)\right] = 0. \qquad (4)$$
Under the null hypothesis of no Granger causality, the term within the large round brackets vanishes, and the expectation is zero. As noted in [23], Equation (4) can be treated as an infinite number of moment restrictions. Although testing Equation (4) for a specific function $g$, instead of testing Equation (2) or Equation (3), may lead to a loss of power against some specific alternatives, there is also an advantage in doing so. For example, in the DP test, the weight function is taken to be $g(x, y, z) = f_Y^2(y)$, as this leads to a U-statistic representation of the corresponding estimator, which enables the analytical derivation of the asymptotic normality of the test statistic. In principle, other choices for $g(x, y, z)$ will also do, as long as the test has satisfactory power against alternatives of interest. Since in the DP test $g(x, y, z) = f_Y^2(y)$, it tests the implication
$$H'_0: \quad q \equiv E\left[f_{X,Y,Z}(X, Y, Z) f_Y(Y) - f_{X,Y}(X, Y) f_{Y,Z}(Y, Z)\right] = 0 \qquad (5)$$
of $H_0$, rather than $H_0$ itself. Given a local density estimator of a $d_W$-variate random vector $W$ at $W_i$ as
$$\hat{f}_W(W_i) = \frac{1}{(n-1) h^{d_W}} \sum_{j \neq i} K\left(\frac{W_i - W_j}{h}\right), \qquad (6)$$
where $K$ is a finite variance, zero mean, kernel density function (e.g., the standard normal density function) and $h$ is the bandwidth, the DP test develops a third order U-statistic estimator for the functional $q$, given by
$$T_n(h) = \frac{n-1}{n(n-2)} \sum_{i=1}^{n} \left(\hat{f}_{X,Y,Z}(X_i, Y_i, Z_i) \hat{f}_Y(Y_i) - \hat{f}_{X,Y}(X_i, Y_i) \hat{f}_{Y,Z}(Y_i, Z_i)\right), \qquad (7)$$
where the normalization factor $(n-1)/(n(n-2))$ is inherited from the U-statistic representation of $T_n(h)$. It is worth mentioning that a second order square (or rectangular) kernel $K$ is adopted by DP.
However, there are two main drawbacks of using a square kernel. First, a square kernel will yield a discontinuous density estimate $\hat{f}(\cdot)$, which is not attractive from a practical perspective. Second, it weighs all neighbor points $W_j$ equally, overlooking their relative distance to the estimation point $W_i$. Therefore, a smooth kernel function, the Gaussian kernel, is used here, namely the product kernel function defined as
$$K(W) = \prod_{s=1}^{d_W} \kappa(w_s),$$
where $w_s$ is the $s$th element of $W$. Using a standard univariate Gaussian kernel, $\kappa(w_s) = (2\pi)^{-1/2} e^{-\frac{1}{2} w_s^2}$, $K(\cdot)$ is the standard multivariate Gaussian kernel as described in [24,25].
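As an illustration of the local density estimator with the Gaussian product kernel, the sketch below evaluates a leave-one-out estimate at every sample point. The leave-one-out normalization $1/((n-1)h^{d_W})$, the function name `loo_gaussian_kde` and the bandwidth value are assumptions made for this example.

```python
import numpy as np

def loo_gaussian_kde(w, h):
    """Leave-one-out product-Gaussian kernel density estimate at each sample point.

    w: array of shape (n,) or (n, d); h: common bandwidth for all coordinates.
    Returns an array with n local density estimates.
    """
    w = np.asarray(w, dtype=float)
    if w.ndim == 1:
        w = w[:, None]
    n, d = w.shape
    diff = (w[:, None, :] - w[None, :, :]) / h               # shape (n, n, d)
    # Product of univariate standard normal kernels over the d coordinates.
    kern = np.exp(-0.5 * np.sum(diff**2, axis=2)) / (2.0 * np.pi) ** (d / 2)
    np.fill_diagonal(kern, 0.0)                              # exclude j == i
    return kern.sum(axis=1) / ((n - 1) * h**d)

# Example on simulated standard normal data (bandwidth chosen arbitrarily).
rng = np.random.default_rng(1)
sample = rng.standard_normal(1000)
dens = loo_gaussian_kde(sample, h=0.3)
```

The pairwise difference array needs memory of order $n^2 d$, so this direct form is only meant for moderate sample sizes.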
For $l_X = l_Y = 1$, DP prove the asymptotic normality of $T_n(h)$. Namely, if the bandwidth $h = h_n$ depends on the sample size as $h_n = C n^{-\beta}$ for constants $C > 0$ and $\beta \in (\frac{1}{4}, \frac{1}{3})$, then the test statistic in Equation (7) satisfies
$$\sqrt{n} \, \frac{T_n(h_n) - q}{S_n} \xrightarrow{d} N(0, 1), \qquad (8)$$
where $S_n^2$ is a consistent estimator of the asymptotic variance of $T_n(h)$. DP suggest implementing a one-sided version of the test, rejecting $H_0: q = 0$ against the alternative $H_a: q > 0$ if $T_n(h)$ is too large. That is, given the asymptotic critical value $z_{1-\alpha}$, the null hypothesis $H_0$ is rejected at significance level $\alpha$ if $\sqrt{n} \, T_n(h)/S_n > z_{1-\alpha}$.
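The one-sided decision rule can be written in a few lines. The sketch below assumes that the statistic and its standard error have already been computed; the function name `one_sided_reject` and the numerical inputs are placeholders for illustration only.

```python
import math
from statistics import NormalDist

def one_sided_reject(t_n, s_n, n, alpha=0.05):
    """Reject H0 if sqrt(n) * T_n / S_n exceeds the (1 - alpha) normal quantile.

    t_n and s_n are assumed to be supplied by the caller (hypothetical inputs).
    Returns (reject, standardized statistic, critical value).
    """
    z = math.sqrt(n) * t_n / s_n
    critical = NormalDist().inv_cdf(1.0 - alpha)
    return z > critical, z, critical

# Placeholder numbers, not results from the paper.
reject_pos, z_pos, crit = one_sided_reject(t_n=0.1, s_n=1.0, n=400)
reject_neg, z_neg, _ = one_sided_reject(t_n=-0.1, s_n=1.0, n=400)
```

A large negative statistic never leads to rejection under this rule, which matters for the counterexample discussed in the next subsection.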

Inconsistency of the DP Test
The drawback of the DP test arises from the fact that $H'_0$ in Equation (5), obtained for a specific weight function $g(x, y, z)$, need not be equivalent to $H_0$ in Equations (2) and (3); it merely is an implication of $H_0$. For consistently testing $H_0$, an analogue of $q$ is desirable that satisfies the positive definiteness property stated next, which $q$ does not satisfy.

Definition 1.
A functional $s$ of the distribution of $W$ is positive definite if $s \geq 0$, with $s = 0$ if and only if $X_t$ and $Z_t$ are conditionally independent given $Y_t$.
From the previous reasoning, it is obvious that Equation (5) is implied by Equation (3). Definition 1, in contrast, requires that a strictly positive value is attained if and only if $H_0$ is violated. The null hypothesis of Granger non-causality requires that $X_t$ and $Z_t$ are independent conditionally on $Y_t$, which is a sufficient, but not a necessary, condition for $q = 0$. If $q$ satisfied Definition 1, $H'_0$ would coincide with $H_0$, and a test based on a consistent estimator of $q$, such as $T_n(h)$ suggested by DP, would have unit asymptotic power. Since this property is not satisfied, a test for $q = 0$ can deviate from a test of $H_0$. Although [23] identified specific sub-classes of processes for which $q$ is positive definite, we can easily construct a counterexample where the DP test has no power even if $\{X_t\}$ strongly Granger causes $\{Y_t\}$. For completeness, such a counterexample is given next.
Inspired by the example in [26], where a closely-related test for unconditional independence is proposed, we consider a conditional counterpart to illustrate that q is not positive definite. The one-sided DP test will be seen to suffer from a lack of power for this example process. As we show below, in the case where q = 0, this drawback cannot be overcome even with a two-sided DP test.
Consider the process $\{(X_t, Y_t, Z_t)\}$ where, as before, $Z_t \equiv Y_{t+1}$. We assume that the i.i.d. continuous variable $X_t \in [-1, 1]$, with probability $1 - d$ of being positive, where $0 < d < 1$. Further, there is no dependence between $X_t$ and $Y_t$, and $Z_t$ does not depend on $Y_t$ but on $X_t$ in such a way that the conditional density of $(X_t, Z_t \mid Y_t = y)$ is given by for $0 \leq d \leq \frac{1}{2}$. Given Equation (9), the marginal densities of $X_t$, $Y_t$ and $Z_t$ can be calculated to be all equal, For this example, we can explicitly calculate $q$ defined in Equation (5), which is found to be 4 , $q$ has a negative value. In this situation, the one-sided DP test, which rejects for large $q$, is not a consistent test for Granger non-causality. One may argue that this is not a problem if we use a two-sided test at the price of losing some power. However, the inconsistency of the DP test, which tests $H'_0$ rather than $H_0$, would then still be illustrated by the example with $d = \frac{1}{4}$, for which $q = 0$ exactly, while $\{X_t\}$ is clearly a Granger cause of $\{Y_t\}$; the DP test will only have trivial power against this alternative. Figure 1 reports the power of the one-sided DP test as a function of the sample size for different significance levels, based on 10,000 independent simulations. Three nominal sizes are illustrated here: 5%, 10% and 15%, and the sample size ranges from 100 to 20,000. It is striking from Figure 1 that the DP test hardly has power against the alternative with $d = 1/4$, for which $q = 0$. The same conclusion can be drawn from Figure 2, where the size-power plots [27] are given. For almost all sub-panels with different sample sizes, the power of the DP test is close to the diagonal line for this particular example with $q = 0$, which indicates that the DP test has only trivial power to detect Granger causality from $X_t$ to $Y_t$.
The lack of power of the one-sided DP test in this example is hardly alleviated by its two-sided counterpart, as a result of the absence of equivalence between $q = 0$ and conditional independence. The difference between $H_0$ and its implication $H'_0$ gives rise to the lack of power of the DP test, as the estimated quantity is not positive definite. In the next subsection, a new test statistic, based on the information-theoretical concept of transfer entropy, is introduced and shown to be positive definite, which overcomes this inherent drawback of the DP test. In fact, this new test statistic shares many similarities with the DP test statistic, but also has an information-theoretical interpretation for its non-negativity.

Figure 1. Power of the one-sided DP test for the artificial process $\{(X_t, Y_t, Z_t)\}$ for $d = 1/4$, for which $q = 0$, at nominal size, from the bottom to the top, 5%, 10% and 15%, respectively, based on 10,000 independent simulations (horizontal axis: sample size; vertical axis: rejection rate).

Figure 2. Size-power plots for the one-sided DP test for the artificial process $\{(X_t, Y_t, Z_t)\}$ for $d = 1/4$, for which $q = 0$, based on 10,000 independent simulations. Each subplot draws the actual power against the nominal size for different sample sizes, ranging from 100 to 5000. The solid curve represents the actual power and the red dashed line indicates the diagonal, corresponding to the nominal size of a test.

Information-Theoretical Interpretation
In a very different context from testing for conditional independence, the problem of information feedback and impact also has drawn much attention since 1950. Information theory, as a branch of the applied mathematical theory of probability and statistics, studies the transmission of information over a noisy channel. The entropy, also referred to as Shannon entropy, is a key measure in the field of information theory brought forward in [28,29]. The entropy measures the uncertainty and randomness associated with a random variable. Suppose that $S$ is a random vector with density $f_S(s)$; then the Shannon entropy is defined as
$$H(S) = -\int f_S(s) \log f_S(s) \, \mathrm{d}s.$$
There is a long history of applying information measures in econometrics. For example, ref. [30] uses the Kullback-Leibler information criterion (KLIC) [31] to construct a one-sided test for serial independence. Since then, nonparametric tests using entropy-based measures for independence between two time series have become prevalent. Granger and Lin [19] use an entropy measure to identify the lags in a nonlinear bivariate model. Granger et al. [32] study dependence with a transformed metric entropy, which has the additional advantage of allowing multiple comparisons of distances and turns out to be a proper measure of distance. Hong and White [20] provide a new entropy-based test for serial dependence, and show that the test statistic is asymptotically normal.
Although inspiring, those results cannot be applied directly to measure conditional dependence. We consider the transfer entropy (TE) introduced in [16] a suitable measure for this purpose. The TE quantifies the amount of information explained in one series $k$ steps ahead from the state of another series, given the information contained in its own past. We briefly introduce the TE and the KLIC before we further discuss their relation with the modified DP test.
Suppose that we have a bivariate process $\{(X_t, Y_t)\}$, and for brevity we put $X = \{X_t\}$, $Y = \{Y_t\}$ and $Z = \{Y_{t+k}\}$. Again, we limit ourselves to $k = 1$ lag for simplicity, and consider the three-dimensional vector $W = (X, Y, Z)$ as before. The transfer entropy $TE_{X \to Y}$ is a nonlinear and nonparametric measure for the amount of information contained in $X$ about $Z$, in addition to the information about $Z$ that is already contained in $Y$. Although the TE defined in [16] applies to discrete variables, it is easily generalized to continuous variables. Conditional on $Y$, $TE_{X \to Y}$ is defined as
$$TE_{X \to Y} = E_W\left[\log \frac{f_{Z,X|Y}(Z, X \mid Y)}{f_{X|Y}(X \mid Y) \, f_{Z|Y}(Z \mid Y)}\right]. \qquad (10)$$
Using the conditional mutual information $I(Z, X \mid Y = y)$, the TE can be equivalently formulated in terms of four Shannon entropies $H(\cdot)$ as
$$TE_{X \to Y} = H(X, Y) - H(Y) + H(Y, Z) - H(X, Y, Z).$$
In order to construct a test for Granger causality based on the TE, it remains to be shown that the TE is a proper basis for testing the null hypothesis. The following theorem, as a direct result of the properties of the KLIC, lays the quantitative foundation for testing based on the TE.

Theorem 1. The transfer entropy $TE_{X \to Y}$, as a functional of the joint density of $W = (X, Y, Z)$, is positive definite; that is, $TE_{X \to Y} \geq 0$ with equality if and only if $f_{Z,X|Y}(z, x \mid y) = f_{X|Y}(x \mid y) f_{Z|Y}(z \mid y)$ for all $(x, y, z)$ in the support of $W$.
Proof. Theorem 1 follows from generalizing Theorem 3.1 in Chapter 2 of [33], where the divergence between two different densities has been considered. An alternative proof is given in Appendix A.1, using Jensen's inequality and the concavity of the log function.
It is not difficult to verify that the condition for $TE_{X \to Y} = 0$ coincides with Equations (2) and (3) for Granger non-causality under the null hypothesis. This positive definiteness makes $TE_{X \to Y}$ a desirable measure for constructing a one-sided test of Granger causality; any divergence from zero is a sign of conditional dependence of $Y$ on $X$. To estimate $TE_{X \to Y}$, one may follow the recipe in [34] by measuring $k$-nearest neighbor distances. A more natural method, applied in this paper, is to use the plug-in kernel estimates given in Equation (6), and replace unknown expectations by sample averages.
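The four-entropy form of the TE can be checked in a case where everything is available in closed form. The sketch below assumes that $(X, Y, Z)$ is jointly Gaussian, so that each Shannon entropy equals $\frac{1}{2}\log((2\pi e)^d \det \Sigma)$; the Gaussian setting, the function names and the example coefficients are assumptions for illustration and are not used in the paper's nonparametric test.

```python
import numpy as np

def gaussian_entropy(cov):
    """Shannon entropy of a d-variate Gaussian with covariance matrix cov."""
    cov = np.atleast_2d(np.asarray(cov, dtype=float))
    d = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2.0 * np.pi * np.e) + logdet)

def gaussian_te(cov):
    """TE = H(X,Y) - H(Y) + H(Y,Z) - H(X,Y,Z) for cov of (X, Y, Z) in that order."""
    cov = np.asarray(cov, dtype=float)

    def sub(idx):
        return cov[np.ix_(idx, idx)]

    return (gaussian_entropy(sub([0, 1])) - gaussian_entropy(sub([1]))
            + gaussian_entropy(sub([1, 2])) - gaussian_entropy(cov))

def example_cov(a, b):
    """Covariance of (X, Y, Z) with X, Y independent N(0,1), Z = a*Y + b*X + N(0,1)."""
    return np.array([[1.0, 0.0, b],
                     [0.0, 1.0, a],
                     [b, a, a**2 + b**2 + 1.0]])

te_causal = gaussian_te(example_cov(a=0.7, b=0.5))   # X helps predict Z
te_null = gaussian_te(example_cov(a=0.7, b=0.0))     # conditional independence
```

In this example the TE reduces to $\frac{1}{2}\log(1 + b^2)$, which is zero exactly when $X$ does not enter the equation for $Z$.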
However, the direct use of the TE to test Granger non-causality is not easy due to the lack of asymptotic theory for the test statistic. It has been shown [19] that the asymptotic distribution of entropy-based estimators usually depends on strict assumptions regarding the dataset. Over the years several break-throughs have been made with the application of entropy to testing serial independence, e.g., [30] obtains an asymptotic N(0, 1) distribution for an entropy measure by a sample-splitting technique and [20] derives asymptotic normality under bounded support data and quartic kernel assumptions. However, the limiting distribution of the natural nonparametric TE estimator is still unknown under more general conditions.
One may argue in favor of using simulation techniques to overcome the problem of the lack of asymptotic theory. However, as suggested in [11], there exist estimation biases of TE statistics for non-parametric dependence measures under the smoothed bootstrap procedure. Even with parametric test statistics, it has been noticed [21] that the TE-based estimator is generally biased. Surrogate data are also applied widely, for instance in [35,36] to detect information transfer. We therefore consider the direct usage of the TE for nonparametric tests for Granger non-causality difficult, if not impossible.
Below, we show that a first order Taylor expansion of the TE provides a way to obtain an asymptotic distribution for this meaningful information measure. In the next section, we show that the first order Taylor expansion of the TE can form the basis of a modified DP test for conditional independence. This not only helps to circumvent the problem of the asymptotic distribution of entropy-based statistics, but also endows the modified DP test with positive definiteness.
In the remaining part of this section we introduce the first order Taylor expansion of the TE; the positive definiteness of the resulting measure is stated afterwards. Starting with Equation (10), we perform the first order Taylor expansion locally at $TE_{X \to Y} = 0$, which is
$$TE_{X \to Y} = E_W\left[\log \frac{f_{Z,X|Y}(Z, X \mid Y)}{f_{X|Y}(X \mid Y) f_{Z|Y}(Z \mid Y)}\right] = E_W\left[\frac{f_{Z,X|Y}(Z, X \mid Y)}{f_{X|Y}(X \mid Y) f_{Z|Y}(Z \mid Y)} - 1\right] + \text{h.o.t.}, \qquad (11)$$
where 'h.o.t.' stands for 'higher order terms' in $\frac{f_{Z,X|Y}(Z, X \mid Y)}{f_{X|Y}(X \mid Y) f_{Z|Y}(Z \mid Y)} - 1$, which is small close to the null hypothesis. Ignoring the higher order terms in the transfer entropy makes the distribution of the test statistic tractable, without (up to leading order) affecting the dependence measure close to the null hypothesis, where by definition a powerful test is needed.
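By ignoring higher order terms, we define the first order expansion
$$\vartheta = E_W\left[\frac{f_{Z,X|Y}(Z, X \mid Y)}{f_{X|Y}(X \mid Y) f_{Z|Y}(Z \mid Y)} - 1\right]$$
as a measure for conditional dependence. The following theorem states that $\vartheta$, which with slight abuse of language we still refer to as a transfer entropy, inherits the positive definiteness of the TE.

Theorem 2. The first order expansion $\vartheta$, as a functional of the joint density of $W = (X, Y, Z)$, is positive definite; that is, $\vartheta \geq 0$ with equality if and only if $f_{Z,X|Y}(z, x \mid y) = f_{X|Y}(x \mid y) f_{Z|Y}(z \mid y)$ for all $(x, y, z)$ in the support of $W$.

Theorem 2 indicates that the divergence measure $\vartheta$ has the desirable property of positive definiteness, which the measure $q$ used in the DP test is lacking. However, direct estimation of Equation (11) does not lead to a practically useful test statistic without the asymptotic distribution.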
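The positive definiteness of both measures can be illustrated numerically in a discrete analogue, where densities are replaced by probability mass functions and integrals by sums. The discrete setting, the function name `te_and_theta` and the random test distributions are assumptions made only for this sketch.

```python
import numpy as np

def te_and_theta(p):
    """Discrete analogues of TE and of its first order expansion.

    p: strictly positive array of shape (nx, ny, nz), the joint pmf of (X, Y, Z).
    """
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    p_xy = p.sum(axis=2, keepdims=True)        # shape (nx, ny, 1)
    p_yz = p.sum(axis=0, keepdims=True)        # shape (1, ny, nz)
    p_y = p.sum(axis=(0, 2), keepdims=True)    # shape (1, ny, 1)
    # Ratio of the joint conditional pmf to the product of conditional pmfs.
    ratio = p * p_y / (p_xy * p_yz)
    te = float(np.sum(p * np.log(ratio)))
    theta = float(np.sum(p * (ratio - 1.0)))
    return te, theta

rng = np.random.default_rng(0)

# A generic joint pmf: X and Z are conditionally dependent given Y.
te_dep, theta_dep = te_and_theta(rng.random((3, 4, 2)) + 0.1)

# A pmf built as p(y) p(x|y) p(z|y): conditional independence holds exactly.
p_y = rng.random(4) + 0.1
p_y /= p_y.sum()
p_x_given_y = rng.random((3, 4)) + 0.1
p_x_given_y /= p_x_given_y.sum(axis=0, keepdims=True)
p_z_given_y = rng.random((4, 2)) + 0.1
p_z_given_y /= p_z_given_y.sum(axis=1, keepdims=True)
p_ci = p_x_given_y[:, :, None] * p_y[None, :, None] * p_z_given_y[None, :, :]
te_ci, theta_ci = te_and_theta(p_ci)
```

Both quantities are strictly positive for the generic distribution and vanish, up to rounding, for the conditionally independent one.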
In the next subsection we show that the nonparametric estimator of ϑ is asymptotically normal. The key to this result is the fact that the DP statistic and the newly proposed statistic only differ in terms of the weight function g(x, y, z) in Equation (4), and that the proof of asymptotic normality of the DP test can be easily adjusted to accommodate this new weight function.

A Modified DP Test
In comparing Equations (3) and (4), it can be seen that the discrepancy between $H_0$ and $H'_0$ arises from the incorporation of the weight function $g(x, y, z) = f_Y^2(y)$ into the null hypothesis. In principle other positive functions $g(x, y, z)$ can be used, such as those discussed by DP. As long as the corresponding estimator of the divergence measure has a U-statistic representation, asymptotic normality follows from the theory of U-statistics. In particular, we propose to modify the DP test by dividing all terms in the expectation of Equation (5) by the function $v(x, y, z) \equiv f_{X,Y}(x, y) f_{Y,Z}(y, z)$, since then, by Theorem 2,
$$E\left[\frac{f_{X,Y,Z}(X, Y, Z) f_Y(Y) - f_{X,Y}(X, Y) f_{Y,Z}(Y, Z)}{v(X, Y, Z)}\right] = 0 \qquad (12)$$
is not just implied by $H_0$, but equivalent to it. One can also think of Equation (12) as the result of plugging a different weight function into Equation (4). By the choice $g(x, y, z) = f_Y^2(y) / (f_{X,Y}(x, y) f_{Y,Z}(y, z))$, instead of $g(x, y, z) = f_Y^2(y)$, which was used by DP, Equation (12) simplifies to
$$E\left[\frac{f_{X,Y,Z}(X, Y, Z) f_Y(Y)}{f_{X,Y}(X, Y) f_{Y,Z}(Y, Z)} - 1\right] = 0, \qquad (13)$$
the left-hand side of which equals the first order Taylor expansion $\vartheta$ in Equation (11); Equation (13) is hence equivalent to $H_0$ by Theorem 2.
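To estimate $\vartheta$, we propose to use the following statistic, with the density estimator defined in Equation (6):
$$T'_n(h) = \frac{n-1}{n(n-2)} \sum_{i=1}^{n} \frac{\hat{f}_{X,Y,Z}(X_i, Y_i, Z_i) \hat{f}_Y(Y_i) - \hat{f}_{X,Y}(X_i, Y_i) \hat{f}_{Y,Z}(Y_i, Z_i)}{\hat{v}(X_i, Y_i, Z_i)}, \qquad (14)$$
where $\hat{v}(X_i, Y_i, Z_i) = \hat{f}_{X,Y}(X_i, Y_i) \hat{f}_{Y,Z}(Y_i, Z_i)$. The reason for estimating $\vartheta$ in this form is that, with the sample statistic $T'_n(h)$, we can obtain a third order U-statistic representation of $\vartheta$, similar to that for the DP test statistic, from which asymptotic normality follows. The asymptotic normality of $T'_n(h)$ is stated in Theorem 3 below, which relies on the following two lemmas concerning the uniform consistency of density estimators.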
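A plug-in version of this statistic might be sketched as follows. The code assumes leave-one-out Gaussian product-kernel density estimates, a numerator equal to the DP summand, a denominator equal to the estimated $f_{X,Y} f_{Y,Z}$, and the DP normalization factor $(n-1)/(n(n-2))$; the function names, the bandwidth and the simulated data are illustrative assumptions, and no variance estimate is computed.

```python
import numpy as np

def loo_density(w, h):
    """Leave-one-out Gaussian product-kernel density estimate at each sample point."""
    w = np.asarray(w, dtype=float)
    if w.ndim == 1:
        w = w[:, None]
    n, d = w.shape
    diff = (w[:, None, :] - w[None, :, :]) / h
    kern = np.exp(-0.5 * np.sum(diff**2, axis=2)) / (2.0 * np.pi) ** (d / 2)
    np.fill_diagonal(kern, 0.0)
    return kern.sum(axis=1) / ((n - 1) * h**d)

def modified_dp_statistic(x, y, z, h):
    """Plug-in estimate of the first order expansion of the transfer entropy."""
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))
    n = len(x)
    f_xyz = loo_density(np.column_stack([x, y, z]), h)
    f_xy = loo_density(np.column_stack([x, y]), h)
    f_yz = loo_density(np.column_stack([y, z]), h)
    f_y = loo_density(y, h)
    v_hat = f_xy * f_yz                          # estimated weight denominator
    terms = (f_xyz * f_y - v_hat) / v_hat
    return (n - 1) / (n * (n - 2)) * terms.sum()

# Simulated illustration (assumed design): z depends on x in one case only.
rng = np.random.default_rng(2)
n = 300
x = rng.standard_normal(n)
y = rng.standard_normal(n)
z_dep = x + 0.5 * rng.standard_normal(n)
z_ind = rng.standard_normal(n)

t_dep = modified_dp_statistic(x, y, z_dep, h=0.5)
t_ind = modified_dp_statistic(x, y, z_ind, h=0.5)
```

In this toy setting the statistic is clearly larger when $Z$ depends on $X$; turning it into a test additionally requires the variance estimator and bandwidth conditions discussed in the text.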

Lemma 1. (Uniform consistency of $\hat{f}$) Let $\{W_i\}$ be a stationary sequence of 3-variate random variables with a continuous and bounded Lebesgue density $f$, satisfying the strong mixing conditions in Assumption 2 of [37]. If for the estimation of $f$, based on the first $n$ values $W_i$, the kernel density estimator $\hat{f}_n = \hat{f}$ is used with kernel function $K(w)$, as given in Equation (6), with $n$-dependent bandwidth $h = h_n = \text{const.} \times n^{-\beta}$, $\beta < \frac{1}{3}$, and $K(w)$ is bounded and integrable, then
$$\sup_{w} \left|\hat{f}_n(w) - f(w)\right| \to 0 \quad \text{almost surely as } n \to \infty.$$

Proof. Lemma 1 is a special case of Theorem 7 in [37], which more generally concerns the uniform consistency of the kernel estimator of $f$ and its derivatives.

Lemma 1 provides the uniform consistency with probability one for a class of kernel estimators of multivariate density functions. This is a generalization of the consistency result for univariate density estimation of [38,39] to the multivariate case with dependent observations. Note that to serve our purpose here we need uniform convergence, which is stronger than pointwise convergence. We refer to [40] for a detailed discussion of the different types of convergence.
We next consider $\tilde{T}_n$, which differs from $T'_n$ in Equation (14) only in having $\hat{v}(\cdot)$ in the denominator replaced by the true unknown function $v(\cdot)$. In the next lemma, the short-hand notation $v_i = v(W_i)$ and

Lemma 2. Under the conditions of Lemma 1, if in addition Var
and $\sqrt{n}(T'_n(h_n) - \vartheta)$ have the same limiting distribution. More formally stated,

Proof. See Appendix A.3.
Theorem 3. If the bivariate time series $\{(X_t, Y_t)\}_{t=1}^{n}$ is strictly stationary and satisfies at least one of the mixing conditions (a), (b) or (c) in Theorem 1 of [41], the corresponding random vector $(X_t, Y_t, Z_t)$ satisfies the conditions of Lemmas 1 and 2, and the density estimation kernel has bandwidth $h_n = C n^{-\beta}$, $C > 0$, $\beta \in (\frac{1}{4}, \frac{1}{3})$, then $T'_n(h_n)$ is asymptotically normally distributed. In particular,
$$\sqrt{n} \, \frac{T'_n(h_n) - \vartheta}{S_n} \xrightarrow{d} N(0, 1),$$
where $S_n^2$ is an HAC estimator of the long-run variance $\sigma^2$ of $\sqrt{n}(T'_n(h_n) - \vartheta)$.

Proof. See Appendix A.4.
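For readers unfamiliar with HAC estimation, the sketch below shows a generic Bartlett-kernel (Newey-West type) long-run variance estimator for a scalar sequence. It is an assumption that an estimator of this type is meant; the sequence to which it has to be applied for the test statistic, and the choice of truncation lag, are not reproduced here, and the function name is hypothetical.

```python
import numpy as np

def bartlett_long_run_variance(u, lag):
    """Bartlett-kernel HAC estimate of the long-run variance of a scalar series.

    u: 1-D array of observations; lag: truncation lag (non-negative integer).
    """
    u = np.asarray(u, dtype=float)
    n = len(u)
    u = u - u.mean()
    lrv = u @ u / n                                  # lag-0 autocovariance
    for k in range(1, lag + 1):
        gamma_k = u[k:] @ u[:-k] / n                 # lag-k autocovariance
        lrv += 2.0 * (1.0 - k / (lag + 1)) * gamma_k # Bartlett weight
    return lrv

# Deterministic check on an alternating series of length 100.
alternating = np.array([1.0, -1.0] * 50)
lrv_lag0 = bartlett_long_run_variance(alternating, lag=0)
lrv_lag1 = bartlett_long_run_variance(alternating, lag=1)
```

The Bartlett weights keep the estimate non-negative, which is needed when it is used as $S_n^2$ in a standardized statistic.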
When implementing the test based on Equation (14), some comments regarding the treatment of the marginals are in order. Note that $\vartheta$ is invariant under invertible smooth transformations of the marginals due to the form of Equation (13), assuming that $X_t$ and $Y_t$ are continuous (the ratio of densities of the same variables is invariant under marginal transforms). Therefore, the dependence structure between $X_t$ and $Y_t$ remains intact under invertible marginal transforms. Although our testing framework does not depend crucially on the restrictive assumption of a uniform distribution for the time series as in [20,42], we recommend using the probability integral transformation (PIT) on each of the marginals, as suggested by DP, as this usually improves the performance of statistical dependence tests. The reason is that, in contrast to calculating the test statistic directly on the original data, the bounded support obtained after transforming the marginals to a uniform distribution avoids non-existing moments during the bias and variance evaluation, which helps to stabilize the test statistic. There are alternative ways to transform the marginal variables to a bounded support, for example, by using a logistic function as in [20]. Here, we decided to just apply the PIT, as it does not require any user-specified parameters, and always leads to identical (uniform) marginals. The procedure is to transform each original series such that the empirical distribution of the transformed series is uniform. Since the transfer entropy-based measure $\vartheta$ is non-negative, tests based on the statistic $T'_n(h)$ are implemented as one-sided tests, rejecting the null hypothesis if $\sqrt{n} \, T'_n(h)/S_n > z_{1-\alpha}$, where $z_{1-\alpha}$ is the $(1-\alpha)$th quantile of the standard normal distribution for a given significance level $\alpha$.
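A rank-based empirical version of the PIT can be sketched as follows. Dividing the ranks by $n + 1$, so that the transformed values stay strictly inside $(0, 1)$, is an assumption of this example, as is the function name `pit`.

```python
import numpy as np

def pit(series):
    """Empirical probability integral transform via ranks.

    Returns rank / (n + 1); ties, if any, are broken by order of appearance.
    """
    series = np.asarray(series, dtype=float)
    ranks = np.argsort(np.argsort(series)) + 1   # ranks 1, ..., n
    return ranks / (len(series) + 1)

u = pit([3.0, -1.0, 10.0, 0.5])

# The transform only depends on ranks, so it is unchanged by any strictly
# increasing transformation of the data.
rng = np.random.default_rng(3)
raw = rng.standard_normal(200)
same = np.array_equal(pit(raw), pit(np.exp(raw)))
```

Because only ranks are used, the transformed marginals are uniform on an equally spaced grid regardless of the original distribution.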

Bandwidth Selection
In nonparametric settings, there typically is no uniformly most powerful test against all alternatives. Hence, it is unlikely that a uniformly optimal bandwidth exists. As long as the bandwidth tends to zero with $n$ as $h = C n^{-\beta}$, $C > 0$, $\beta \in (\frac{1}{4}, \frac{1}{3})$, our test has unit asymptotic power. Yet, we may define the optimal bandwidth in the sense of asymptotically minimal mean squared error (MSE). When balancing the first and fourth leading terms in Equation (A5) to minimize the sum of the squared bias and the variance, for a second order kernel, it is easy to show that the optimal bandwidth for the DP test is of the form $h_{DP} = C n^{-2/7}$, and that the corresponding bandwidth for the modified test is given by
$$h^* = 0.6 \, h_{DP}, \qquad (16)$$
where the scale factor 0.6 is a result of the bias and variance adjustment for replacing the square kernel by the Gaussian kernel (the standard deviation of the uniform DP kernel was $1/\sqrt{3} \approx 0.57735$, which we rounded off to 0.6). Intuitively, the $q_2$ and $E[s(W)]$ terms are different from those for the DP test; more details can be found in Appendix A.5.
The optimal value for $C$ is process-dependent and difficult to track analytically. For example, for a (G)ARCH process the optimal bandwidth is approximately given by $h_{DP} = Cn^{-2/7}$ with $C \approx 8$ (see DP). Applying Equation (16), we proceed with $h^{*} = 4.8n^{-2/7}$ for (G)ARCH processes. To gain some insight into the bandwidth, we illustrate the test size and power with the bivariate ARCH process given in Equation (17). We let $0 < a < 0.4$ and run 5000 Monte Carlo simulations for time series lengths varying from 200 to 5000. The size is assessed by testing for Granger non-causality from $\{X_t\}$ to $\{Y_t\}$, and for the power we use the same process but test for Granger non-causality from $\{Y_t\}$ to $\{X_t\}$. The results are presented in Table 1, from which it can be seen that the modified DP test is conservative, in the sense that its empirical size is lower than the nominal size 0.05 in all cases, while the power increases with $a$ and with the sample size.

Table 1. Empirical size and power of the modified DP test for the process given in Equation (17), for different sample sizes and values of the parameter $a$. The values represent observed rejection rates over 5000 realizations for nominal size 0.05.
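The bandwidth rule for (G)ARCH processes quoted above can be written as a small helper. This is a sketch only: the function name `modified_dp_bandwidth` and its parameter names are hypothetical, and the default constants ($C \approx 8$, scale factor 0.6, rate 2/7) are simply the values stated in the text.

```python
def modified_dp_bandwidth(n, c_dp=8.0, scale=0.6, rate=2.0 / 7.0):
    """Bandwidth h* = scale * c_dp * n**(-rate); defaults give 4.8 * n**(-2/7).

    The defaults are the (G)ARCH values quoted in the text; for other
    processes the constant c_dp would differ.
    """
    if n <= 0:
        raise ValueError("sample size n must be positive")
    return scale * c_dp * n ** (-rate)

# Bandwidths for a few of the sample sizes used in the simulations.
for n in (200, 500, 5000):
    print(n, modified_dp_bandwidth(n))
```

Because the rule was calibrated for (G)ARCH dynamics, applying it to other data amounts to a rule of thumb rather than an optimal choice.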

Size/Power Simulations
This section investigates the performance of the modified DP test. Before proceeding with new data generating processes, we first revisit the example in Equation (9), for which the DP test fails to detect that $\{X_t\}$ Granger causes $\{Y_t\}$. The modified DP test is performed with 10,000 replications, using the same bandwidth. The counterpart of the power-size plots for the DP test in Figure 2 is shown in Figure 3. In contrast with the lack of power of the DP test, the modified DP test already has very high power for time series length $n = 500$ and larger in this artificial experiment, as expected. Next, we use numerical simulations to study the behavior of the modified DP test, and also give direct comparisons between the modified DP test and the DP test. Three processes are considered. The first is a simple bivariate VAR process, given in Equation (18). The second is a nonlinear VAR process, given in Equation (19). Again, the size and power are investigated by testing for Granger non-causality in two different directions.
The last process is the same as the example used to illustrate the performance of the bandwidth selection rule: the bivariate ARCH process of Equation (17), restated as Equation (20). The results, shown in Figures 4-6, are obtained with 5000 simulations for each process. We present both the empirical size-size and size-power plots of the DP test and the modified DP test for the three processes in Equations (18)-(20), for sample sizes $n = 500$ and $n = 5000$. The control parameter $a$ takes the values 0.1 and 0.4. As before, the empirical size is obtained by testing for Granger non-causality from $\{X_t\}$ to $\{Y_t\}$, and the empirical power is the observed rejection rate when testing for Granger non-causality from $\{Y_t\}$ to $\{X_t\}$.
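The points of a size-size or size-power plot are observed rejection rates evaluated over a grid of nominal sizes. A minimal sketch of that computation is given below; `rejection_rates` is a hypothetical helper, and the standard normal draws merely stand in for standardized test statistics simulated under the null, since computing the actual DP or modified DP statistics is outside the scope of this sketch.

```python
import numpy as np
from statistics import NormalDist

def rejection_rates(stats, nominal_sizes):
    """Fraction of replications rejecting H0 (one-sided) at each nominal size."""
    stats = np.asarray(stats, dtype=float)
    rates = []
    for alpha in nominal_sizes:
        z = NormalDist().inv_cdf(1.0 - alpha)  # one-sided critical value
        rates.append(float(np.mean(stats > z)))
    return np.array(rates)

# Stand-in for 5000 simulated standardized statistics under the null.
rng = np.random.default_rng(0)
null_stats = rng.standard_normal(5000)
alphas = [0.01, 0.05, 0.10]
print(rejection_rates(null_stats, alphas))
```

Plotting the returned rates against `alphas` gives a size-size curve when the statistics come from the null direction and a size-power curve when they come from the alternative direction.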
It can be seen from Figures 4-6 that the modified DP test is slightly more conservative than the DP test under the null hypothesis. However, the size distortion reduces when the sample size increases. The modified DP test is more powerful than the DP test in the linear and nonlinear VAR settings given in Equations (18) and (19). Overall, the larger the sample size and the stronger the causal effect, the better the performance of the modified DP test.

Figure 6. Size-size and size-power plots of Granger non-causality tests, based on 5000 simulations. The DGP is bivariate ARCH as in Equation (20), with $\{Y_t\}$ Granger causing $\{X_t\}$. The left (right) column shows observed rejection rates under the null (alternative) hypothesis; the blue lines represent the DP test and the red lines the modified DP test (MDP). The solid and dashed lines show results for sample sizes $n = 500$ and $n = 5000$, respectively.

Stock Volume-Return Relation
In this section, we first revisit the stock return-volume relation considered in [8] and DP. This topic has a long research history. Early empirical work mainly focused on the positive correlation between volume and stock price change, see [43]. The later literature exposed directional relations; for example, [44] found that large price movements are followed by high volume. The authors of [45] observed a high-volume return premium; namely, periods of extremely high (low) volume tend to be followed by positive (negative) excess returns. More recently, [46] investigated the power law cross-correlations between price changes and volume changes of the S&P 500 Index over a long period.
We use daily volume and returns data for the three most-followed indices in US stock markets, the Standard and Poor's 500 (S&P), the NASDAQ Composite (NASDAQ) and the Dow Jones Industrial Average (DJIA), between January 1985 and October 2016. The daily volume and adjusted daily closing prices were obtained from Yahoo Finance. The time series were converted by taking log returns multiplied by 100. In order to adjust for day-of-the-week and month-of-the-year seasonal effects in both the mean and variance of stock returns and volumes, we performed a two-stage adjustment process, similar to the procedure applied in [8]. We replace Akaike's information criterion used by [8] with the [47] information criterion to be more stringent in selecting variables. We have no intention of provoking a debate over the two criteria; we simply prefer a more parsimonious linear model to avoid potential overfitting. We apply our test not only to the raw data, but also to VAR filtered residuals and EGARCH(1,1,1) filtered residuals. We have tried different error distributions, such as the normal, Student's t, GED and Hansen's skewed t [48]. The differences caused by different distributional assumptions are small; we only report the results based on the Student's t distribution due to space considerations. The idea of filtering is to remove linear dependence and the effect of heteroskedasticity, in order to isolate the nonlinear and higher-moment relationships among series, respectively.
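The conversion of a price series to percentage log returns mentioned above can be sketched as follows. The helper name `log_returns_pct` and the example prices are hypothetical; the seasonal adjustment and the VAR and EGARCH filtering steps are not covered by this sketch.

```python
import numpy as np

def log_returns_pct(prices):
    """Percentage log returns: 100 * (log P_t - log P_{t-1})."""
    prices = np.asarray(prices, dtype=float)
    if np.any(prices <= 0):
        raise ValueError("prices must be strictly positive")
    return 100.0 * np.diff(np.log(prices))

# Hypothetical adjusted closing prices for four consecutive days.
closes = [100.0, 110.0, 99.0, 99.0]
returns = log_returns_pct(closes)  # one fewer element than `closes`
```

The same transformation can be applied to the volume series to obtain percentage volume changes.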
Tables 2-4 report the resulting t statistics for both the DP test and our modified DP test in both directions. The linear Granger F-values based on the optimal VAR models are also given. Two bandwidth values are used, 1.5 and 0.6, after standardization, where the latter value roughly corresponds to the derived optimal bandwidth ($h = 0.6138$) and the larger bandwidth, also used in DP, is added as a robustness check.
Generally speaking, the results indicate that the effect in the return-volume direction is stronger than vice versa. For the test results on the raw data, the F-tests based on the linear VAR model and both nonparametric tests suggest evidence of return affecting volume for all three indices. For the other direction, causality from volume to return, the linear Granger test finds no evidence of a causal impact, while the nonparametric tests indicate a strong causal effect, except for the DJIA, where only the modified DP test finds a causal link from volume to return. As argued above, the results for the linear test are suspect since it only examines linear causal effects in the conditional mean; information exchange from higher moments is completely ignored.
A direct comparison between the DP test and the modified DP test shows that the new test is more powerful overall. For the unfiltered data, both tests find a strong causal effect in both directions for S&P and NASDAQ, but for the DJIA, the t-statistics of the DP test are weaker than those of the modified DP test. The bi-directional causality between return and volume remains unchanged after linear VAR filtration, although the DP test again shows weaker evidence. The result also suggests that the causality is strictly nonlinear. The linear test (F-test) is unable to spot these nonlinear linkages.
Further, in the direction from Volume to Return, these nonlinear causalities tend to vanish after EGARCH filtering. Thus, the bi-directional linkage is reduced to a one-directional relation from return to volume. The modified DP statistics, however, are in general larger than the DP t-values, and indicate more causal relations. In contrast with the DP test, our test suggests that the observed nonlinear causality cannot be completely attributed to second moment effects. Heteroskedasticity modeling may reduce this nonlinear feature to some extent, but its impact is not as strong as the DP test would suggest.

Application to Intraday Exchange Rates
In the second application, we apply the modified DP test to intraday exchange rates. We consider five major currencies: JPY, AUD, GBP, EUR and CHF, all against the USD. The data, obtained from Dukascopy Historical Data Feed, contain 5-min bid and ask quotes for the third quarter of 2016, from July 1 to September 30, with a total of 92 trading days and 26,496 high-frequency observations. We use 5-min data, corresponding to a sampling frequency of 288 quotes per day, which is high enough to avoid measurement errors (see [49]) but also low enough for the micro-structure not to be of major concern.
Although the foreign exchange market is one of the most active financial markets in the world, with trading taking place 24 h per day, intraday trading is not always active. Thus, we delete the thin trading period from Friday 21:00 GMT until Sunday 20:55 GMT, which also keeps the intraday periodicity intact. We calculate the exchange rate returns as in [50]. First, the average of the log bid and log ask prices is calculated; then the differences between the log prices at consecutive times are obtained. Next, we remove the conditional mean dynamics by fitting an MA(1) model and using the residuals as our return series, following [51]. Finally, intraday seasonal effects are filtered out using estimated time-of-day dummies following [50], i.e., $r_{i,n,t} = d_{i,t}\,z_{i,n,t}$, where $r_{i,n,t}$ denotes intraday log returns after MA(1) filtering. The subscript $i = 1, \ldots, 5$ indicates the five different currencies and $n, t$ stands for time $t$ on day $n$. The first component of the return series, $d_{i,t}$, is a deterministic intraday seasonal component, while $z_{i,n,t}$ is the nonseasonal return portion, which is assumed to be independent of $d_{i,t}$. To distinguish $d_{i,t}$ from $z_{i,n,t}$, we fit the time-of-day dummies to $2\log|r_{i,n,t}|$ and use the estimated $\hat{d}_{i,t}$ to standardize the return $r_{i,n,t}$, with the restriction $\sum_{t=1}^{T} d_{i,t} = 1$. Figures 7-9 report the first 200 autocorrelations of returns, absolute returns and squared returns for the raw series, MA(1) residuals and EGARCH residuals, respectively.
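The return construction and seasonal standardization described above can be sketched as follows. Several details are assumptions made for illustration rather than the procedure of [50]: the seasonal component is taken to be multiplicative, the dummy regression on $2\log|r|$ is implemented as a per-slot average (which is what a regression on a full set of time-of-day dummies without intercept reduces to), a small constant `eps` guards against the logarithm of zero returns, and the MA(1) filtering step is omitted. The function names are hypothetical.

```python
import numpy as np

def mid_log_returns(bid, ask):
    """Differences of the average of log bid and log ask quotes."""
    log_mid = 0.5 * (np.log(np.asarray(bid, float)) + np.log(np.asarray(ask, float)))
    return np.diff(log_mid)

def deseasonalize(r, slots_per_day, eps=1e-12):
    """Standardize returns by an estimated time-of-day component.

    Assumptions: multiplicative seasonality, per-slot mean of 2*log|r|
    as the dummy-regression fit, and normalization sum_t d_t = 1.
    """
    r = np.asarray(r, dtype=float)
    n_days = len(r) // slots_per_day
    r = r[: n_days * slots_per_day].reshape(n_days, slots_per_day)
    log_sq = 2.0 * np.log(np.abs(r) + eps)  # eps avoids log(0)
    d = np.exp(0.5 * log_sq.mean(axis=0))   # seasonal scale per time slot
    d = d / d.sum()                         # impose sum_t d_t = 1
    return (r / d).ravel(), d

# Toy example: two days, two slots per day, second slot twice as volatile.
z, d = deseasonalize([1.0, 2.0, -1.0, -2.0], slots_per_day=2)
```

With 5-min data and round-the-clock trading, `slots_per_day` would be 288; incomplete trailing days are dropped by the reshape in this sketch.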
We perform pairwise nonparametric Granger causality tests on the MA(1) filtered and seasonally adjusted data, as well as on the standardized residuals after EGARCH(1,1,1) filtering. We use the skewed t distribution introduced in [48] to model the innovation terms. We choose a bandwidth of 0.2768, according to Equation (16). The test results are shown in Table 5 for both the MA(1) de-meaned and de-seasonalized data and the EGARCH filtered data. Although not reported here, there is statistical evidence for strong bi-directional causality among all currency pairs on raw return data at the 5-min lag. These bi-directional causalities do not disappear after removing the MA(1) component and the seasonal component. However, the observed information spillover is significantly weaker after the EGARCH filtering. When testing based on the EGARCH standardized residuals, only a few pairs still show signs of a strong causal relation. In particular, the directional relation EUR → CHF is the only one detected by both the DP test and the modified DP test at the 1% level of significance. A graphical representation is provided in Figure 10, where one can clearly see that most causal links are gone after EGARCH filtering. The modified DP test exposes five uni-directional linkages among the EGARCH filtered returns at the 5% level. The EUR and GBP are the most important driving currencies. The DP test also points to the importance of JPY and particularly AUD, and shows bi-directional causality between JPY and GBP.
To sum up, we find evidence of strong causal links among exchange returns at an intraday high-frequency timescale. Each currency has predictive power for other currencies, implying high co-movements in the international exchange market. Although those directional linkages are not affected by the de-meaning procedure, most of them are removed by taking the volatility dynamics into account. After filtering out heteroskedasticity by EGARCH estimation, only a few pairs still exhibit spillover effects.

Figure 10. Graphical representation of pairwise causalities on MA and seasonally filtered residuals, as well as EGARCH filtered residuals. The arrows in the graph indicate a directional causality at the 5% level of significance.

Summary and Conclusions
Borrowing the concept of transfer entropy from Information Theory, this paper develops a novel nonparametric test statistic for Granger non-causality. The asymptotic normality of the test statistic is derived by taking advantage of a U-statistic representation, similar to that applied in the DP test. The modified DP statistic, however, improves on the DP statistic in at least two respects: firstly, the positive definiteness of the quantity on which the test statistic is based paves the way for properly testing for differences between conditional densities; secondly, the weight function in our test is motivated from an information-theoretical point of view, while the weight function in the DP test was selected in an ad hoc manner.
The simulation study confirms that the modified DP test has good size and power properties for a wide range of data generating processes. In the first application, a direct comparison with the DP test confirms that the DP test may suffer from a lack of power for specific processes, while the second application to high-frequency exchange return data helps us better understand whether the spillover channel in exchange rate markets arises from the conditional mean, the conditional variance or higher conditional moments. Some obvious extensions for future work include the incorporation of additional lags of the variables and a generalization to higher-variate settings to allow for conditioning on additional, possibly confounding, variables.
Author Contributions: C.D. and H.F. contributed equally to this work. All authors have read and agreed to the published version of the manuscript.
by the uniform convergence of $v(\cdot)$. It follows that, according to the definition of $T_n(h)$, Equation (14) is a re-scaled DP statistic with the scaling factor $1/v(\cdot)$. In a manner similar to Theorem 1 in DP, we can obtain the asymptotic behavior of $T_n(h)$ by making use of the optimal mean squared error (MSE) bandwidth developed in [53] for this point estimator. For the moment, we consider the case where the vectors $W_i$ are assumed to be independent and identically distributed. We later allow for weak dependence in the time series context, as described at the end of this section.
The test statistic $T_n(h)$ can be expressed as a U-statistic of degree three with kernel $K(W_i, W_j, W_k, h)$, obtained by symmetrization with respect to the indices $i, j, k$. Further, defining two kernel functions as $K_1(w_i, h) = E[K(w_i, W_j, W_k, h)]$ and $K_2(w_i, w_j, h) = E[K(w_i, w_j, W_k, h)]$, we assume three mild conditions, adapted from [53], for controlling the rate of convergence of the point-wise bias as well as the series expansions of the kernel functions. These conditions are stated in Equation (A4), where all remainder terms are of higher order, i.e., $E\|s^{*}(W_i, h)\|^{2} = o_P(h^{2\alpha})$, $q_2^{*}(h) = o_P(h^{-\gamma})$ and $q_3^{*}(h) = o_P(h^{-\delta})$, and the convergence rate is controlled by the parameters $\alpha$, $\gamma$ and $\delta$. The conditions in Equation (A4) are satisfied if $\alpha$ is set to the order of the kernel function $K(\cdot)$, which is 2 for the Gaussian kernel, and $\gamma$, $\delta$ depend on the dimensions of the variables under consideration via $\gamma = d_X + d_Y + d_Z$ and $\delta = d_X + 2d_Y + d_Z$. Defining $C_0 = 2\,\mathrm{Cov}\left[\lim_{h\to 0} K_1(W_i, h),\, s(W_i)\right]$, we can show that the mean squared error of the DP statistic, as a function of the sample-size-dependent bandwidth, is given by Equation (A5).
The scaling factor $1/v(\cdot)$ in the modified test statistic $T_n(h)$ enters the MSE in Equation (A5) mainly by changing the bandwidth-independent variance term. For the other, bandwidth-dependent terms, $1/v(\cdot)$ merely re-scales the coefficients without affecting the convergence rates. Thus, we may still require all $h$-dependent terms to be $o_P(n^{-1})$ to ensure that the $\frac{9}{n}\mathrm{Var}\left[\lim_{h\to 0} K_1(W_i, h)\right]$ term dominates asymptotically (in which case asymptotic normality of the test statistic obtains). Therefore, adopting a sample-size-dependent bandwidth $h_n = Cn^{-\beta}$, with $C, \beta > 0$, one finds asymptotic normality, $\sqrt{n}\,(T_n(h_n) - \vartheta)/S_n \xrightarrow{d} N(0, 1)$, where $S_n^2$ is a consistent estimator of the asymptotic variance $9\,\mathrm{Var}\left[\lim_{h\to 0} K_1(W_i, h)\right]$. The bivariate case, for $\alpha = 2$ and $d_X + d_Y + d_Z = 3$, requires $\beta \in (1/4, 1/3)$. In the time series setting, under the assumption that the processes are stationary and satisfy at least one of the mixing conditions (a), (b) or (c) in Theorem 1 in [41], the long-run variance of $\sqrt{n}\,(T_n(h) - \vartheta)$ is given by $\sigma^2 = 9\left[\mathrm{Var}(U_t) + 2\sum_{k=1}^{\infty}\mathrm{Cov}(U_t, U_{t+k})\right]$, where $U_t = \lim_{h\to 0} K_1(W_t, h)$. The variance $\sigma^2$ can then be estimated using an HAC estimator for the long-run variance of $U_t$ [41,54].
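For illustration, one common HAC estimator of a long-run variance is the Bartlett-kernel (Newey-West type) estimator sketched below. The choice of the Bartlett kernel, the truncation lag `max_lag` and the function name `long_run_variance` are assumptions made here; the text does not specify which HAC estimator from [41,54] is used. The input `u` stands for estimated values of $U_t$, which this sketch takes as given.

```python
import numpy as np

def long_run_variance(u, max_lag):
    """Bartlett-kernel HAC estimate of Var(U_t) + 2 * sum_k Cov(U_t, U_{t+k})."""
    u = np.asarray(u, dtype=float)
    n = len(u)
    uc = u - u.mean()
    lrv = np.dot(uc, uc) / n                 # lag-0 autocovariance
    for k in range(1, max_lag + 1):
        weight = 1.0 - k / (max_lag + 1.0)   # Bartlett weights decline linearly
        gamma_k = np.dot(uc[k:], uc[:-k]) / n
        lrv += 2.0 * weight * gamma_k
    return lrv

# Toy series with strong negative first-order autocorrelation.
u = np.array([1.0, -1.0, 1.0, -1.0])
sigma2 = 9.0 * long_run_variance(u, max_lag=1)  # factor 9 from the degree-3 U-statistic
```

The Bartlett weights guarantee a non-negative estimate, which is convenient because the square root of `sigma2` is used to standardize the test statistic.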