Transfer Entropy for Nonparametric Granger Causality Detection: An Evaluation of Different Resampling Methods

Abstract: The information-theoretical concept transfer entropy is an ideal measure for detecting conditional independence, or Granger causality in a time series setting. The recent literature indeed witnesses an increased interest in applications of entropy-based tests in this direction. However, those tests are typically based on nonparametric entropy estimates for which the development of formal asymptotic theory turns out to be challenging. In this paper, we provide numerical comparisons for simulation-based tests to gain some insights into the statistical behavior of nonparametric transfer entropy-based tests. In particular, surrogate algorithms and smoothed bootstrap procedures are described and compared. We conclude this paper with a financial application to the detection of spillover effects in the global equity market.


Introduction
Entropy, introduced by Shannon [1,2], is an information theoretical concept with several appealing properties, and therefore wide applications in information theory, thermodynamics and time series analysis. Based on this classical measure, transfer entropy (TE) has become a popular information theoretical measure for quantifying the flow of information. This concept, which was coined by Schreiber [3], was applied to distinguish a possible asymmetric information exchange between the variables of a bivariate system. When based on appropriate non-parametric density estimates, the TE is a flexible non-parametric measure for conditional dependence, coupling structure, or Granger causality in a general sense.
The notion of Granger causality was developed by the pioneering work of Granger [4] to capture causal interactions in a linear system. In a more general model-free world, the Granger causal effect can be interpreted as the impact of incorporating the history of another variable on the conditional distribution of a future variable in addition to its own history. Recently, various nonparametric measures have been developed to capture such differences between conditional distributions in more complex, and typically nonlinear, systems. There is a growing list of such methods, based on, among others, correlation integrals [5], kernel density estimation [6], Hellinger distances [7], copula functions [8] and empirical likelihood [9].
In contrast with the above-mentioned methods, TE-based causality tests do not attempt to capture the difference between two conditional distributions explicitly. Instead, with the information theoretical interpretation, the TE offers a natural way to measure directional information transfer and Granger causality. We refer to [10,11] for detailed reviews of the relation between Granger causality and directed information theory. However, the direct application of entropy and its variants, though attractive, turns out to be difficult, if not impossible altogether, due to the lack of asymptotic distribution theory for the test statistics. For example, Granger and Lin normalize the entropy to detect serial dependence with critical values obtained from simulations [12]. Hong and White provide the asymptotic distribution for the Granger-Lin statistic with a specific kernel function [13]. Barnett and Bossomaier derive a $\chi^2$ distribution for the TE at the cost of the model-free property [14].
On the other hand, to obviate the asymptotic problem, several resampling methods on TE have been developed for providing empirical distributions of the test statistics. Two popular techniques are bootstrapping and surrogate data. Bootstrapping is a random resampling technique proposed by Efron [15] to estimate the properties of an estimator by measuring those properties from approximating distributions. The "surrogate" approach developed by Theiler et al. [16] is another randomization method initially employing Fourier transforms to provide a benchmark in detecting nonlinearity in a time series setting. It is worth mentioning that the two methods are different with respect to the statistical properties of the resampled data. For the surrogate method the null hypothesis is maintained, while the bootstrap method does not seek to impose the null hypothesis on the bootstrapped samples. We refer to [17,18] for detailed applications of the two methods.
However, not all resampling methods are suitable for entropy-based dependence measures. As Hong and White [13] put it, a standard bootstrap fails to deliver a consistent entropy-based statistic because it does not preserve the statistical properties of a degenerate U-statistic. Similarly, with respect to traditional surrogates based on phase randomization of the Fourier transform, Hinich et al. [19] criticize the particularly restrictive assumption of a linear Gaussian process, and Faes et al. [20] point out that it cannot preserve the whole statistical structure of the original time series.
As far as we are aware, there are several applications of both methods in entropy-based tests. For example, Su and White [7] propose a smoothed local bootstrap for an entropy-based test for serial dependence, Papana et al. [21] apply the stationary bootstrap in partial TE estimation, Quiroga et al. [22] use time-shifted surrogates to test the significance of the asymmetry of directional measures of coupling, and Marschinski and Kantz [23] introduce the effective TE, which relies on random shuffling surrogates in estimation. Kugiumtzis [24] and Papana et al. [21] provide some comparisons between bootstrap and surrogate methods for entropy-based tests.
In this paper, we adopt TE as a test statistic for measuring conditional independence (Granger non-causality). Being aware of the fact that the analytical null distribution may not always be accurate or available in an analytically closed form, we resort to resampling techniques for constructing the empirical null distribution. The techniques under consideration include the smoothed local bootstrap, the stationary bootstrap and time-shifted surrogates, all of which are shown in the literature to be applicable to entropy-based test statistics. Using different dependence structures, the size and power performance of all methods are examined in simulations.
The remainder of this paper is organized as follows. Section 2 first provides the TE-based testing framework and a short introduction to kernel density estimation; then bandwidth selection rules are discussed. It then presents the resampling methods, including the smoothed local bootstrap and time-shifted surrogates for different dependence structure settings. Section 3 examines the empirical performance of different resampling methods, presenting the size and power of the tests. Section 4 considers a financial application with the TE-based nonparametric test and Section 5 summarizes.

Transfer Entropy and Its Estimator
Information theory is a branch of the applied mathematical theory of probability and statistics. The central problem of classical information theory is to measure the transmission of information over a noisy channel. Entropy, also referred to as Shannon entropy, is a key measure in the field of information theory, introduced in [1,2]. Entropy measures the uncertainty and randomness associated with a random variable. Supposing that $S$ is a random vector with density $f_S(s)$, its Shannon entropy is defined as
$$H(S) = -\int f_S(s) \log f_S(s) \, ds. \tag{1}$$
There is a long history of applying information theoretical measures in time series analysis. For example, Robinson [25] applies the Kullback-Leibler information criterion [26] to construct a one-sided test for serial independence. Since then, nonparametric tests using entropy measures for dependence between two time series have become prevalent. Granger and Lin [12] normalize the entropy measure to identify the lags in a nonlinear bivariate time series model. Granger et al. [27] study dependence with a transformed metric entropy, which turns out to be a proper measure of distance. Hong and White [13] provide a new entropy-based test for serial dependence, and the test statistic follows a standard normal distribution asymptotically.
Although those heuristic approaches work for entropy-based measures of dependence, these methodologies do not carry over directly to measures of conditional dependence, i.e., Granger causality. The TE, a term coined by Schreiber [3] although the concept appeared in the literature earlier under different names, is a suitable measure to serve this purpose. The TE quantifies the amount of information explained in one series at $k$ steps ahead by the state of another series, given the current and past state of the series itself. Suppose we have two series $\{X_t\}$ and $\{Y_t\}$; for brevity put $X = \{X_t\}$, $Y = \{Y_t\}$ and $Z = \{Y_{t+k}\}$. Further, we define a three-variate vector $W_t = (X_t, Y_t, Z_t)$, where $Z_t = Y_{t+k}$, and write $W = (X, Y, Z)$ when there is no danger of confusion. Within this bivariate setting, $W$ is a three-dimensional continuous vector. In this paper, we limit ourselves to $k = 1$ for simplicity, but the method can easily be generalized to multiple steps. The quantity $TE_{X \to Y}$ is a nonlinear measure for the amount of information explained in $Z$ (future $Y$) by $X$, accounting for the information on $Z$ already contained in $Y$. Although the TE defined in [3] applies to discrete variables, it is easily generalized to continuous variables. Conditional on $Y$, $TE_{X \to Y}$ is defined as
$$\begin{aligned} TE_{X \to Y} &= E_W\left[\log \frac{f_{Z,X|Y}(Z, X|Y)}{f_{X|Y}(X|Y)\, f_{Z|Y}(Z|Y)}\right] \\ &= E_W\left[\log f_{X,Y,Z}(X, Y, Z) + \log f_Y(Y) - \log f_{X,Y}(X, Y) - \log f_{Y,Z}(Y, Z)\right]. \end{aligned} \tag{2}$$
Using the conditional mutual information $I(Z, X|Y = y)$, the TE can be equivalently formulated in terms of four Shannon entropy terms as
$$TE_{X \to Y} = I(Z, X|Y) = H(Z|Y) - H(Z|X, Y) = H(Z, Y) - H(Y) - H(Z, X, Y) + H(X, Y). \tag{3}$$
In order to construct a test for Granger causality based on the TE, one first needs to show quantitatively that the TE is a proper basis for detecting whether the null hypothesis is satisfied.
The following theorem, as a direct application of the Kullback-Leibler criterion, lays the quantitative foundation for testing based on the TE.

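Theorem 1. $TE_{X \to Y} \ge 0$, with equality if and only if $f_{X,Y,Z}(x, y, z)\, f_Y(y) = f_{X,Y}(x, y)\, f_{Y,Z}(y, z)$ for all $(x, y, z)$ in the support of $W$.
Proof. The proof of Theorem 1 is given in [28].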
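The link with the Kullback-Leibler criterion can be made explicit with a short derivation. The following is only a sketch of the standard argument, not a substitute for the proof in [28]; it writes the joint density as $f_Y(y)\, f_{Z,X|Y}(z, x|y)$ and uses $D_{KL}$ to denote the Kullback-Leibler divergence, a symbol introduced here for illustration.

```latex
\begin{align*}
TE_{X \to Y}
  &= \int f_Y(y) \left[ \iint f_{Z,X|Y}(z,x|y)
     \log \frac{f_{Z,X|Y}(z,x|y)}{f_{X|Y}(x|y)\, f_{Z|Y}(z|y)} \, dx \, dz \right] dy \\
  &= \int f_Y(y)\,
     D_{KL}\!\left( f_{Z,X|Y}(\cdot,\cdot|y) \,\middle\|\, f_{X|Y}(\cdot|y)\, f_{Z|Y}(\cdot|y) \right) dy
     \;\ge\; 0 .
\end{align*}
```

Since the Kullback-Leibler divergence is non-negative and vanishes only when its two arguments coincide almost everywhere, the average over $y$ is zero exactly when $X$ and $Z$ are conditionally independent given $Y$.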
It is not difficult to verify that the condition for $TE_{X \to Y} = 0$ coincides with the null hypothesis of Granger non-causality defined in Equation (4), also referred to as conditional independence or no coupling. Mathematically speaking, the null hypothesis of Granger non-causality, $H_0$: $\{X_t\}$ is not a Granger cause of $\{Y_t\}$, can be phrased as
$$\frac{f_{X,Y,Z}(x, y, z)}{f_Y(y)} = \frac{f_{X,Y}(x, y)}{f_Y(y)} \cdot \frac{f_{Y,Z}(y, z)}{f_Y(y)} \tag{4}$$
for $(x, y, z)$ in the support of $W$. A nonparametric test for Granger non-causality seeks to find statistical evidence of violation of Equation (4). There are many nonparametric measures available for this purpose, some of which are mentioned above. Equation (4) provides the basis for a model-free test without imposing any parametric assumptions about the data generating process or underlying distributions for $\{X_t\}$ and $\{Y_t\}$. We only assume two things here. First, $\{X_t, Y_t\}$ is a strictly stationary bivariate process. Second, the process has finite memory, i.e., variable lags $l_X, l_Y < \infty$. The second (finite Markov order) assumption is needed in this nonparametric setting to make conditioning on past information feasible by conditioning on a finite number of past observations. Moreover, strict stationarity and the mixing properties implied by the finite Markov order assumption ensure that the transfer entropy can be estimated consistently through kernel density estimation of the underlying densities.
As far as we are aware, the direct use of the TE to test Granger non-causality in a nonparametric setting is difficult, if not impossible, due to the lack of asymptotic theory for the test statistic. As Granger and Lin [12] put it, very few asymptotic distribution results for entropy-based estimators are available. Although over the years several breakthroughs have been made with the application of entropy to testing serial independence, the limiting distribution of the TE statistic is still unknown. One may wish to use simulation techniques to overcome the lack of asymptotic distributions. However, as noted by Su and White [7], there are estimation biases of the TE statistics for non-parametric dependence measures under the smoothed bootstrap procedure. Even for the parametric test statistic used by Barnett and Bossomaier [14], the authors noticed that the TE-based estimator is generally biased.

Density Estimation and Bandwidth Selection
The non-negativity property in Theorem 1 makes $TE_{X \to Y}$ a desirable measure for constructing a one-sided test of conditional independence; any positive divergence from zero is a sign of conditional dependence of $Y$ on $X$. To estimate the TE there are several different approaches, such as histogram-based estimators [29], correlation sums [30] and nearest neighbor estimators [31]. However, the optimal rule for the number of neighbor points is unclear, and as Kraskov et al. [31] comment, a small number of neighbor points may lead to large statistical errors. A more natural method, kernel density estimation, the properties of which have been well studied, is applied in this paper. With the plug-in kernel estimates of densities, we may replace the expectation in Equation (2) by a sample average to get an estimate for $TE_{X \to Y}$.
A local density estimator of a $d_W$-variate random vector $W$ at $W_i$ is given by
$$\hat{f}_W(W_i) = \left((n-1)\, h^{d_W}\right)^{-1} \sum_{j \ne i} K\!\left(\frac{W_i - W_j}{h}\right), \tag{5}$$
where $K$ is a kernel function and $h$ is the bandwidth. We take $K(\cdot)$ to be a product kernel function defined as $K(W) = \prod_{s=1}^{d_W} \kappa(w_s)$, where $w_s$ is the $s$-th element of $W$. Using a standard univariate Gaussian kernel, $\kappa(w_s) = (2\pi)^{-1/2} e^{-\frac{1}{2} w_s^2}$, $K(\cdot)$ is the standard multivariate Gaussian kernel as described by Wand and Jones [32] and Silverman [33]. Using Equation (5) as the plug-in density estimator, and replacing the expectation by the sample mean, we obtain the estimator for the TE given by
$$\hat{I}(Z, X|Y) = \frac{1}{n} \sum_{i=1}^{n} \left[\log \hat{f}_{X,Y,Z}(x_i, y_i, z_i) + \log \hat{f}_Y(y_i) - \log \hat{f}_{X,Y}(x_i, y_i) - \log \hat{f}_{Y,Z}(y_i, z_i)\right]. \tag{6a}$$
If we estimate the Shannon entropy in Equation (1) based on a sample of size $n$ from the $d_W$-dimensional random vector $W$ by the sample average of the plug-in density estimates, we obtain
$$\hat{H}(W) = -\frac{1}{n} \sum_{i=1}^{n} \log \hat{f}_W(W_i); \tag{6b}$$
then Equation (6a) can be equivalently expressed in terms of four entropy estimators, that is,
$$\hat{I}(Z, X|Y) = -\hat{H}(X, Y, Z) - \hat{H}(Y) + \hat{H}(X, Y) + \hat{H}(Y, Z). \tag{6c}$$
To construct a statistical test, we develop the asymptotic properties of $\hat{I}(Z, X|Y)$ defined in Equations (6a) and (6c) in two steps. In the first step, given the density estimates, the consistency of the entropy estimates is established; the linear combination of the four entropy estimates then converges in probability to the true value. The following two theorems ensure the consistency of $\hat{I}(Z, X|Y)$.
Theorem 2. Given the kernel density estimate $\hat{f}_W(W_i)$ for $f_W(W_i)$, where $W$ is a $d_W$-dimensional random vector with length $n$, let $\hat{H}(W)$ be the plug-in estimate for the Shannon entropy as defined in Equation (6b). Then $\hat{H}(W) \xrightarrow{P} H(W)$.
Proof. The proof of Theorem 2 is given in [12] using results from [34].
The basic idea of the proof is to take the Taylor series expansion of $\log \hat{f}_W(W_i)$ around the true value $\log f_W(W_i)$ and use the fact that $\hat{f}_W(W_i)$, given an appropriate bandwidth sequence, converges to $f_W(W_i)$ pointwise to obtain consistency. In the next step, the consistency of $\hat{I}(Z, X|Y)$ is provided by the continuous mapping theorem.
Theorem 3. Given $\hat{H}(W) \xrightarrow{P} H(W)$, $\hat{I}(Z, X|Y) \xrightarrow{P} I(Z, X|Y)$.
Proof. The proof is straightforward if one applies the Continuous Mapping Theorem. See Theorem 2.3 in [35].
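To make the plug-in construction concrete, the sketch below estimates the TE as a sample average of log density ratios computed from Gaussian product-kernel density estimates. It is an illustration only: the function names `loo_gaussian_density` and `transfer_entropy` are hypothetical, the leave-one-out form of the density estimate is an assumption of this sketch, and the dense pairwise computation needs memory of order $n^2$.

```python
import numpy as np

def loo_gaussian_density(data, h):
    """Leave-one-out Gaussian product-kernel density estimate at each sample point.

    data has shape (n,) or (n, d); h is a scalar bandwidth (assumed common to
    all coordinates, which presumes standardized inputs).
    """
    data = np.asarray(data, dtype=float)
    if data.ndim == 1:
        data = data[:, None]
    n, d = data.shape
    # Scaled pairwise differences, shape (n, n, d).
    diff = (data[:, None, :] - data[None, :, :]) / h
    # Product of univariate standard Gaussian kernels.
    kern = np.exp(-0.5 * np.sum(diff ** 2, axis=2)) / (2.0 * np.pi) ** (d / 2.0)
    np.fill_diagonal(kern, 0.0)  # exclude the point itself (j == i)
    return kern.sum(axis=1) / ((n - 1) * h ** d)

def transfer_entropy(x, y, h, k=1):
    """Plug-in estimate of TE from x to y with W_t = (x_t, y_t, y_{t+k})."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X, Y, Z = x[:-k], y[:-k], y[k:]
    f_xyz = loo_gaussian_density(np.column_stack([X, Y, Z]), h)
    f_y = loo_gaussian_density(Y, h)
    f_xy = loo_gaussian_density(np.column_stack([X, Y]), h)
    f_yz = loo_gaussian_density(np.column_stack([Y, Z]), h)
    # Sample average of the log density ratio.
    return np.mean(np.log(f_xyz) + np.log(f_y) - np.log(f_xy) - np.log(f_yz))
```

Because each density is evaluated only at the sample points, the estimate can be negative in finite samples even though the population quantity is not; very small bandwidths may also produce zero density estimates and hence infinite logarithms.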
Before we move to the next section, it is worth having a careful look at the issue of bandwidth selection. The bandwidth $h$ for kernel estimation determines how smooth the density estimate is; a smaller bandwidth reveals more structure of the data, whereas a larger bandwidth delivers a smoother density estimate. Bandwidth selection is essentially a trade-off between the bias and variance in density estimation. A very small value of $h$ reduces the estimation bias at the cost of a large variance. On the other hand, a large bandwidth reduces estimation variance at the expense of incorporating more bias. See Chapter 3 in [32] and [36] for details.
However, the bandwidth selection for the TE statistic is more involved. To the best of our knowledge, there is a gap in the literature on optimal bandwidth selection for kernel-based TE estimators. As He et al. [37] show, when estimating entropy, two types of errors are generated, one from entropy estimation and the other from density estimation, and the optimal bandwidth for density estimation may not coincide with the optimal one for entropy estimation. Thus, rather than the rule-of-thumb bandwidth in [33], which aims at optimal density estimation, the bandwidth in our study should provide an accurate estimator for $I(Z, X|Y)$ in the minimal mean squared error (MSE) sense, say. In [28] we develop such a bandwidth rule for a TE-based estimator. In that paper, rather than directly developing asymptotic properties for the TE, we study the properties of the first order Taylor expansion of the TE under the null hypothesis. The suggested bandwidth is shown to be MSE optimal and allows us to obtain asymptotic normality for the test statistic. In principle, the convergence rate of the TE estimator should be the same as that of the leading term of its Taylor approximation. We therefore propose to use the same rate also here, giving
$$h = C n^{-2/7}, \tag{7}$$
where $C$ is an unknown parameter. This bandwidth would deliver a consistent test since the variance of the local estimate of $\hat{I}(Z_i, X_i|Y_i)$ will dominate the MSE. In [28] we suggest using $C = 4.8$ based on simulations, while Diks and Panchenko [6] suggest setting $C \approx 8$ for autoregressive conditional heteroskedasticity (ARCH) processes. Our simulations here may also favor a larger value of $C$ because the squared bias is of higher order and hence of less concern for the TE-based statistic.
A larger bandwidth could better control the estimation variance and deliver a more powerful test.
As a robustness check, we adopt C = 8 as well as C = 4.8 suggested by our other simulation study [28].
To match the Gaussian kernel, we standardize the data before estimating Equations (6a)-(6c) such that the transformed time series have mean zero and unit variance; very similar results are obtained by matching the mean absolute deviation instead of the variance of the standard Gaussian kernel for TE estimation.
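The preprocessing and bandwidth choice can be sketched in a few lines. The helper names `standardize` and `te_bandwidth` are hypothetical, and the exponent $2/7$ is an assumption of this sketch (the rate associated with the Diks and Panchenko [6] test); only the constants $C = 4.8$ and $C = 8$ are taken from the text.

```python
import numpy as np

def standardize(series):
    """Rescale a series to mean zero and unit (sample) variance."""
    s = np.asarray(series, dtype=float)
    return (s - s.mean()) / s.std(ddof=1)

def te_bandwidth(n, C=4.8, rate=2.0 / 7.0):
    """Bandwidth of the form h = C * n**(-rate); the default rate is an assumption."""
    return C * n ** (-rate)

# Bandwidths for the sample sizes and constants considered in the simulations.
for n in (200, 500, 1000, 2000):
    print(n, round(te_bandwidth(n, C=4.8), 3), round(te_bandwidth(n, C=8.0), 3))
```

Under this rule the bandwidth shrinks slowly with the sample size, so on standardized data it stays of the same order as the scale of the data for the sample sizes considered.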

Resampling Methods
To develop simulation-based tests for the null hypothesis, given in Equation (4), of no Granger causality from $X$ to $Y$, or equivalently, for conditional independence, we consider three resampling techniques, i.e., (1) time-shifted surrogates developed by Quiroga et al. [22], (2) the smoothed bootstrap of Su and White [7] and (3) the stationary bootstrap introduced by Politis and Romano [38]. The first technique is widely applied in coupling measures, as for example by Kugiumtzis [39] and Papana et al. [40], while the latter two have already been used for detecting conditional independence for decades. It is worth mentioning that the surrogate and bootstrap methods treat the null quite differently. Surrogate data are supposed to preserve the dependence structure imposed by $H_0$, while bootstrap data are not restricted to $H_0$. It is possible to bootstrap the dataset without imposing the conditional independence structure of $\{X, Y, Z\}$ implied by the null hypothesis; see, for instance, [41] for more details. To avoid resampling errors and to make different methods more comparable, we limit ourselves to methods that impose the null hypothesis on the resampled data. The following three resampling methods are implemented with different sampling details.

Time-Shifted Surrogates
• (TS.a) The first resampling method only deals with the driving variable $X$. Given observations $\{x_1, \ldots, x_n\}$, the time-shifted surrogates are generated by cyclically time-shifting the components of the time series. Specifically, an integer $d$ is randomly generated within the interval ([0.05n], [0.95n]), and then the first $d$ values of $\{x_1, \ldots, x_n\}$ are moved to the end of the series, to deliver the surrogate sample $X^* = \{x_{d+1}, \ldots, x_n, x_1, \ldots, x_d\}$. Compared with the traditional surrogates based on phase randomization of the Fourier transform, the time-shifted surrogates can preserve the whole statistical structure in $X$. The coupling between $X$ and $Y$ is destroyed, while the null hypothesis of $X$ not causing $Y$ is imposed.
• (TS.b) The second scheme resamples both the driving variable $X$ and the response variable $Y$ separately. Similar to (TS.a), $Y^* = \{y_{c+1}, \ldots, y_n, y_1, \ldots, y_c\}$ is created given another random integer $c$ from the range ([0.05n], [0.95n]). In contrast with the standard time-shifted surrogates described in (TS.a), in this setting we add more noise to the coupling between $X$ and $Y$.
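The cyclic shift used in both schemes is simple to express in code. The following is a minimal sketch with a hypothetical helper `time_shift`; treating the end points of the interval as admissible and rounding them down are assumptions of this sketch.

```python
import numpy as np

def time_shift(series, rng, lo=0.05, hi=0.95):
    """Cyclically shift a series by a random lag d drawn between lo*n and hi*n.

    Returns the surrogate {x_{d+1}, ..., x_n, x_1, ..., x_d} and the lag d.
    """
    s = np.asarray(series)
    n = len(s)
    d = int(rng.integers(int(lo * n), int(hi * n), endpoint=True))
    # A negative roll moves the first d values to the end of the series.
    return np.roll(s, -d), d

rng = np.random.default_rng(42)
x = np.arange(20)
x_star, d = time_shift(x, rng)   # (TS.a): shift the driving series only
y_star, c = time_shift(x, rng)   # (TS.b) would apply a second, independent shift to Y
```

For (TS.a) the surrogate of $X$ is paired with the original $Y$ before the statistic is recomputed; for (TS.b) an independent draw supplies the shift for $Y$.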

Smoothed Local Bootstrap
The smoothed bootstrap selects samples from a smoothed distribution instead of drawing observations from the empirical distribution directly.See [42] for a discussion of the smoothed bootstrap procedure.Based on rather mild assumptions, Neumann and Paparoditis [43] show that there is no need to reproduce the whole dependence structure of the stochastic process to get an asymptotically correct nonparametric dependence estimator.Hence a smoothed bootstrap from the estimated conditional density is able to deliver a consistent statistic.Specifically, we consider two versions of the smoothed bootstrap that are different in dependence structure to some extent.
• (SMB.a) In the first setting, $Y^*$ is resampled without replacement from the smoothed local bootstrap. Given the sample $Y = \{y_1, \ldots, y_n\}$, the bootstrap sample is generated by adding a smoothing noise term, $y_i^* = y_i + h_b \varepsilon_i^Y$, where $h_b > 0$ is the bandwidth used in the bootstrap procedure and $\varepsilon_i^Y$ represents a sequence of i.i.d. $N(0, 1)$ random variables. Without random replacement from the original time series, this procedure does not disturb the original dynamics of $Y = \{y_1, \ldots, y_n\}$ at all. After $Y^*$ is resampled, both $X^*$ and $Z^*$ are drawn from the smoothed conditional densities $f(x|Y^*)$ and $f(z|Y^*)$ as described in [44].
• (SMB.b) Secondly, we implement the smoothed local bootstrap as in [7]. The only difference between this setting and (SMB.a) is that the bootstrap sample $Y^*$ is drawn with replacement from the smoothed kernel density.
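One way to read the two smoothed bootstrap variants is sketched below. This is an illustration under stated assumptions rather than the procedure of [7] or [44]: it uses a Gaussian kernel, the same bandwidth `h_b` for all three variables, and draws $X^*$ and $Z^*$ by picking donor observations with kernel weights centred at $y_i^*$ and then adding smoothing noise. The function name and the `resample_y` switch are hypothetical.

```python
import numpy as np

def smoothed_local_bootstrap(x, y, z, h_b, rng, resample_y=True):
    """Draw (X*, Y*, Z*) with X* and Z* conditionally independent given Y*.

    resample_y=False keeps the original order of y (in the spirit of SMB.a);
    resample_y=True draws y with replacement first (in the spirit of SMB.b).
    """
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))
    n = len(y)
    base = y[rng.integers(0, n, size=n)] if resample_y else y
    y_star = base + h_b * rng.standard_normal(n)

    # Kernel weights K((y*_i - y_j) / h_b), normalised over j for each i.
    w = np.exp(-0.5 * ((y_star[:, None] - y[None, :]) / h_b) ** 2)
    w /= w.sum(axis=1, keepdims=True)
    cdf = np.cumsum(w, axis=1)

    def draw_donors():
        # Inverse-CDF sampling of one donor index per row.
        u = rng.random(n)
        idx = (u[:, None] > cdf).sum(axis=1)
        return np.minimum(idx, n - 1)  # guard against rounding in the CDF

    # Independent donor draws for X* and Z* impose conditional independence.
    x_star = x[draw_donors()] + h_b * rng.standard_normal(n)
    z_star = z[draw_donors()] + h_b * rng.standard_normal(n)
    return x_star, y_star, z_star
```

Because the donors for $X^*$ and $Z^*$ are drawn independently given $y_i^*$, any direct dependence between $X$ and $Z$ beyond what is explained by $Y$ is removed in the resampled data, which is what the null hypothesis requires.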

Stationary Bootstrap
Politis and Romano [38] propose the stationary bootstrap to maintain serial dependence within the bootstrap time series. This method replicates the time dependence of the original data by resampling blocks of the data with randomly varying block length. The lengths of the bootstrap blocks follow a geometric distribution. Given a fixed probability $p$, the length $L_i$ of block $i$ is determined by $P(L_i = k) = (1 - p)^{k-1} p$ for $k = 1, 2, \ldots$, and the starting point of block $i$ is drawn randomly and uniformly from the original $n$ observations. To restore the dependence structure exactly under the null, we combine the stationary bootstrap with the smoothed local bootstrap for our simulations.
The resampling procedure works as follows: once the TE statistic $\hat{I}$ for the original data $W = \{(X_i, Y_i, Z_i), i = 1, \ldots, n\}$ is estimated according to Equations (6a)-(6c), we start to generate the resampled data sets, denoted by $W_j^*$ with $j = 1, \ldots, B$, where $B$ is the number of simulations. Using the simulated sample, for each $j$ we compute the TE statistic $\hat{I}_j^*$ in exactly the same way as $\hat{I}$ was computed. The p-value for the one-sided test is calculated as
$$\hat{p} = \frac{\sum_{j=1}^{B} \mathbf{1}\left(\hat{I}_j^* \ge \hat{I}\right) + 1}{B + 1}, \tag{8}$$
where the constant 1 is added to avoid p-values equal to zero.
A final remark concerns the difference between this paper and [21], where both time-shifted surrogates and the stationary bootstrap are implemented for an entropy-based causality test. Our paper provides additional insights in several respects. Firstly, the smoothed bootstrap, which has been shown in the literature to work for nonparametric kernel estimators under general dependence structures, is applied in our paper. Secondly, they treat the bootstrap and surrogate samples in a similar way, but as we noted above, the bootstrap method is not designed to impose the null hypothesis but to keep the dependence structure present in the original data. The stationary bootstrap procedure in [21] might be incompatible with the null hypothesis of conditional independence since it destroys the dependence completely. Because they restore independence between $X$ and $Y$ rather than conditional independence between $X|Y$ and $Z|Y$ during resampling, the distribution of the estimated statistics from the resampled data may not necessarily correspond to that of the statistic under the null of only conditional independence. Thirdly, we provide rigorous size and power results in our simulations, which are missing in their paper.
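The block-length rule of the stationary bootstrap and the one-sided resampling p-value can be sketched as follows. The function names are hypothetical, and the circular wrap-around at the end of the sample is the usual convention of the stationary bootstrap, assumed here; how the blocks are combined with the smoothed local bootstrap is not reproduced.

```python
import numpy as np

def stationary_bootstrap_indices(n, p, rng):
    """Indices for one stationary-bootstrap sample of length n.

    Block lengths are geometric with P(L = k) = (1 - p)**(k - 1) * p and
    block starting points are uniform on the n observations.
    """
    idx = np.empty(n, dtype=int)
    t = 0
    while t < n:
        start = int(rng.integers(0, n))
        length = int(rng.geometric(p))
        take = min(length, n - t)
        idx[t:t + take] = (start + np.arange(take)) % n  # wrap around circularly
        t += take
    return idx

def one_sided_pvalue(stat, resampled_stats):
    """(number of resampled statistics >= stat, plus 1) divided by (B + 1)."""
    resampled_stats = np.asarray(resampled_stats, dtype=float)
    B = len(resampled_stats)
    return (np.sum(resampled_stats >= stat) + 1) / (B + 1)
```

With $B = 999$ the smallest attainable p-value is $1/1000$, which is why the added constant prevents p-values of exactly zero.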

Simulation Study
In this section, we investigate the performance of the five resampling methods in detecting conditional dependence for several data generating processes. In Equations (9)-(16), we use a single parameter $a$ to control the strength of the conditional dependence. The size assessment is obtained based on testing Granger non-causality from $\{X_t\}$ to $\{Y_t\}$, and for the power we use the same process but we test for Granger non-causality from $\{Y_t\}$ to $\{X_t\}$. We set $a = 0.4$ to represent moderate dependence in the size performance investigation and $a = 0.1$ to evaluate the power of the tests. Further, Equation (17) represents a stationary autoregressive process with regime switching, and Equation (18) is included to investigate the power performance in the presence of two-way causal linkages, where the two control parameters are $b = -0.2$ and $c = 0.1$.
In each experiment, we run 500 simulations for sample sizes $n = \{200, 500, 1000, 2000\}$. The surrogate and the bootstrap sample size is set to $B = 999$. For fair comparisons between (TS.a) and (TS.b), as well as between (SMB.a) and (SMB.b), we fix the seeds of the random number generator in the resampling functions to eliminate the potential effect of randomness. In addition, we use the empirical standard deviation of $\{Y_t\}$ as the bootstrapping bandwidth and $C = \{4.8, 8\}$ in the bandwidth rule of Equation (7) for the kernel density estimation.
It is worth mentioning that the data generating processes in Equations (9)-(12), (17) and (18) are stationary and of finite memory as we assumed earlier. However, for robustness it is also important to be aware of the behavior of the proposed non-parametric test when the two assumptions are not satisfied. The finite memory assumption is violated in Equations (13)-(15) since the GARCH process, being equivalent to an infinite ARCH process, strictly speaking is of infinite Markov order; and the stationarity assumption does not hold in Equation (16), where $X_t$ and $Y_t$ are cointegrated of order one. Since for the VECM process the two time series $\{X_t\}$ and $\{Y_t\}$ are not stationary, we cannot directly apply our nonparametric test. In this case, we perform the Engle-Granger approach [45] first to eliminate the influence of the co-integration, and then perform the nonparametric test on the collected stationary residuals from the linear regression of $\Delta X_t$ and $\Delta Y_t$ on a constant and the co-integration term. The procedure is similar to that in [46].
1. Linear vector autoregressive process (VAR).
2. Nonlinear VAR. This process is considered in [47] to show the failure of the linear Granger causality test.
8. VECM process. Note that in this situation neither $\{X_t\}$ nor $\{Y_t\}$ is stationary.
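The pre-filtering step described for the cointegrated case can be sketched with ordinary least squares. This is one possible reading of the two-step procedure, not a reproduction of it: the static cointegrating regression of $Y_t$ on $X_t$, the use of its lagged residual as the co-integration term, and the function name `ecm_residuals` are all assumptions of this sketch.

```python
import numpy as np

def ecm_residuals(x, y):
    """Two-step filter: returns residual series for dX_t and dY_t.

    Step 1: static regression y_t = c + b * x_t + u_t (assumed specification).
    Step 2: regress dX_t and dY_t on a constant and the lagged residual u_{t-1}.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Step 1: estimate the co-integration term.
    A = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    u = y - A @ coef

    # Step 2: remove the error-correction effect from the differenced series.
    dx, dy = np.diff(x), np.diff(y)
    B = np.column_stack([np.ones(len(dx)), u[:-1]])
    residuals = []
    for d in (dx, dy):
        g, *_ = np.linalg.lstsq(B, d, rcond=None)
        residuals.append(d - B @ g)
    return residuals[0], residuals[1]
```

The two residual series would then take the place of $\{X_t\}$ and $\{Y_t\}$ in the nonparametric test.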
The empirical rejection rates are summarized in Tables 1-10. The top panels in each table summarize the empirical rejection rates obtained for the 5% and 10% nominal significance levels for processes (9)-(18) under the null hypothesis, and the bottom panels report the corresponding empirical power under the alternatives. Generally speaking, the size and power are quite satisfactory for almost all combinations of the constant $C$, sample size $n$ and nominal significance level. The performance differences for the various resampling schemes are not substantial.
With respect to the size performance, most of the time we see that the realized rejection rates stay in line with the nominal size. Besides, the bootstrap methods outperform the time-shifted surrogate methods in that their empirical size is slightly closer to the nominal size. Lastly, the size of the tests is not very sensitive to the choice of the constant $C$ apart from the cases for the models given in Equations (13)-(15), where the data generating process has infinite memory.
From the point of view of power, (TS.a) and (SMB.a) seem to outperform their counterparts, yet the differences are subtle. Along the dimension of the sample size, we clearly see that the empirical power increases in the sample size in most cases. Furthermore, the results are very robust with respect to choices for the constant $C$ in the kernel density estimation bandwidth. For the VAR and nonlinear processes given by Equations (9), (10) and (12), a smaller $C$ seems to give more powerful tests, while a larger $C$ is more beneficial for detecting conditional dependence structure in the (G)ARCH processes of Equations (11) and (13)-(15).
Finally, Table 10 presents the empirical power for the two-way VAR process, where the two variables $\{X\}$ and $\{Y\}$ are intertwined with each other. Due to the setting $b = -0.2$ and $c = 0.1$ in Equation (18), it is obvious that $\{Y\}$ is a stronger Granger cause for $\{X\}$ than the other way around. As a consequence, the reported rejection rates in Table 10 are overall higher when testing $Y \to X$ than $X \to Y$.
To visualize the simulation results, Figures 1-10 report the empirical size and power against the nominal size. Since the performance of the five different resampling methods is quite similar, we only show the results for (SMB.a) for simplicity. In each figure, the left (right) panels show the realized size (power), and we choose $C = 4.8$ ($C = 8$) for the top (bottom) two panels. We can see from the figures that the empirical performance of the TE test is overall satisfactory, apart from those (G)ARCH processes where a small $C$ may lead to a conservative testing size for large sample sizes (see Figures 3a, 5a, 6a and 7a). The under-rejection problem is caused by the inappropriate choice $C = 4.8$, which makes the bandwidth for kernel estimation too small. The influence of an inappropriately small bandwidth can also be seen in Figures 5b and 6b, where the test has limited power for the alternative.

Application
In this section, we apply the TE-based nonparametric test to detecting financial market interdependence, in terms of both return and volatility. Diebold and Yilmaz [49] performed a variance decomposition of the covariance matrix of the error terms from a reduced-form VAR model to investigate the spillover effect in the global equity market. More recently, Gamba-Santamaria et al. [50] extended the framework and considered the time-varying feature in global volatility spillovers. Their research, although providing simple and intuitive methods for measuring directional linkages between global stock markets, may suffer from the limitations of linear parametric modeling, as discussed above. We revisit the topic of spillovers in the global equity market using the nonparametric method.
For our analysis, we use daily nominal stock market indexes from January 1992 to March 2017, obtained from Datastream, for six developed countries including the US (DJIA), Japan (Nikkei 225), Hong Kong (Hangseng), the UK (FTSE 100), Germany (DAX 30) and France (CAC 40). The target series are weekly return and volatility for each index. The weekly returns are calculated in terms of differenced log prices multiplied by 100, from Friday to Friday. Where the price for Friday is not available due to a public holiday, we use the Thursday price instead.
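The Friday-to-Friday return construction can be sketched as follows, using only the standard library. The function name `weekly_log_returns` and the input format (a dictionary from dates to closing prices) are hypothetical, and skipping a week in which neither the Friday nor the Thursday price exists is an assumption of this sketch; the text does not say how such weeks are handled.

```python
import math
from datetime import date, timedelta

def weekly_log_returns(daily):
    """Weekly returns: 100 times the differenced log price, Friday to Friday.

    daily maps datetime.date -> closing price for trading days only. If the
    Friday price is missing, the Thursday price is used instead.
    """
    days = sorted(daily)
    first, last = days[0], days[-1]
    # First Friday on or after the first observation (Monday is weekday 0).
    friday = first + timedelta(days=(4 - first.weekday()) % 7)
    weekly = []
    while friday <= last:
        for back in (0, 1):  # try Friday, then Thursday
            d = friday - timedelta(days=back)
            if d in daily:
                weekly.append((friday, daily[d]))
                break
        friday += timedelta(days=7)
    return [(d1, 100.0 * (math.log(p1) - math.log(p0)))
            for (d0, p0), (d1, p1) in zip(weekly, weekly[1:])]
```

Each returned pair holds the Friday that closes the week and the corresponding percentage log return.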
The weekly volatility series are generated following [49]; see Table 11. From the ADF test results, it is clear that all time series are stationary for further analysis (we also performed the Johansen cointegration test pairwise on the price levels, and no cointegration was found for the six market indexes).
We first provide a full-sample analysis of global stock market return and volatility spillovers over the period from January 1992 to March 2017, summarized in Tables 12 and 13. The two tables report the pairwise test statistics for conditional independence between index $X$ and index $Y$, given that the constant $C$ in the bandwidth for kernel estimation is 4.8 or 8, with 999 resampled time series. In other words, we test for the absence of the one-week-ahead directional linkage from index $X$ to $Y$ by using the five resampling methods described in Section 2. For example, the first line in the top panel in Table 12 reports the one-week-ahead influence of DJIA returns upon other indexes by using the first time-shifted surrogates method (TS.a). Given $C = 8$, the DJIA return is shown to be a strong Granger cause for Nikkei, FTSE and CAC at the 1% level, and for DAX at the 5% level. Based on Tables 12 and 13, we may draw several conclusions. Firstly, the US index and the German index are the most important return transmitters and Hong Kong is the largest source of volatility spillover, judged by the numbers of significant linkages. Note that this finding is similar to the result in [49], where the total return (volatility) spillovers from the US (Hong Kong) to others are found to be much higher than from any other country. Figure 11 provides a graphical illustration of the global spillover network based on the result of (SMB.a) from Tables 12 and 13. Apart from the main transmitters, we can clearly see that Nikkei and CAC are the main receivers in the global return spillover network, while DAX is the main receiver of global volatility transmission.
Secondly, the results obtained are very robust, no matter which resampling method is applied. Although the differences between the five resampling methods are small, (TS.a) is seen to be slightly more powerful than (TS.b) in Table 12.
However, the summary results in Tables 12 and 13 are static in the sense that they do not take into account possible time variation. The statistics measure directional linkages averaged over the whole period from 1992 to 2017, whereas the conditional dependence structure of the time series can be very different at any given point in time. Hence, the full-sample analysis is very likely to overlook the cyclical dynamics between each pair of stock indices. To investigate the dynamics in the global stock market, we now move from the full-sample analysis to a rolling-window study. Using a 200-week rolling window that starts at the beginning of the sample and moves forward in 5-week steps, we iteratively evaluate the conditional dependence and thereby assess the variation of the spillovers in the global equity market over time.
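The rolling-window scheme can be sketched as follows. The function name `rolling_windows` and the half-open index convention are illustrative assumptions, as is the choice to drop a final incomplete window; the paper does not spell out these details.

```python
def rolling_windows(n_obs, width=200, step=5):
    """Yield (start, end) index pairs of consecutive rolling windows.

    Each window covers observations start, ..., end - 1. Only full windows
    are produced; a trailing partial window is dropped (an assumption).
    """
    start = 0
    while start + width <= n_obs:
        yield start, start + width
        start += step

# With the 1313 weekly observations of the sample:
windows = list(rolling_windows(1313))
```

Under this indexing, 1313 weekly observations yield 223 windows, the first covering weeks 0 to 199 and the last weeks 1110 to 1309. The test would then be rerun on each window to obtain one p-value per window.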
Taking the return series of the DJIA as an illustration, we iteratively apply the local smoothed bootstrap method for detecting Granger causality from and to the DJIA return in Figure 12. (All volatility series are extremely skewed; see Table 11. In a small-sample analysis, the test statistics turn out to be sensitive to clustered outliers, which typically occur during market turmoil. As a result, the volatility dynamics are more erratic and less informative than those of returns.) The red line represents the p-values for the TE-based test with the DJIA weekly return as the information transmitter, while the blue line shows the p-values for the test with the DJIA as a receiver of weekly information spillover from the other indexes. The plots display an event-dependent pattern, particularly for the recent financial crisis: from early 2009 until the end of 2012, all pairwise tests show the presence of a strong bi-directional linkage. In addition, the DJIA strongly leads the Nikkei, Hangseng and CAC during the first decade of this century. Further, we see that the influence of the other indices on the DJIA varies, typically in response to economic events. For example, the blue line in the second panel of Figure 12 plunges below the 5% level twice before the recent financial crisis, meaning that the Hong Kong Hangseng index Granger-causes fluctuations in the DJIA during those two periods: first in the late 1990s and again by the end of 2004. The timing of the first fall matches the 1997 Asian currency crisis, and the latter one was in fact caused by China's austerity policy in October 2004.
Finally, the dynamic plots provide additional insights into the sample period that the full-sample analysis may miss. The DAX and CAC are found to be less relevant for future fluctuations in the weekly return of the DJIA, according to Table 12 and Figure 11. However, one can clearly see that since 2001 the p-values for DAX→DJIA and CAC→DJIA are below 5% most of the time, suggesting increasing integration of global financial markets.

Conclusions
This paper provides guidelines for the practical application of TE in detecting conditional dependence, i.e., Granger causality in a more general sense, between two time series. Although there is already a large literature that has tried to apply the TE in this context, the asymptotics of the statistic and the performance of the resampling-based tests are still not well understood. We have considered tests based on five different resampling methods, all of which were shown in the literature to be suitable for entropy-related tests, and investigated the size and power of the associated tests numerically. Two time-shifted surrogate methods and three smoothed bootstrap methods are tested on simulated data from several processes. The simulation results in this controlled environment suggest that all five methods achieve reasonable rejection rates under the null as well as the alternative hypotheses. Our results are very robust with respect to the density estimation method, including the procedure used for standardizing the location and scale of the data and the choice of the bandwidth parameter, as long as the convergence rate of the kernel estimator of TE is consistent with its first-order Taylor expansion.
In the empirical application, we have shown how the proposed resampling techniques can be used on real-world data for detecting conditional dependence. We use global equity data to detect pairwise causalities in the return and volatility series among the world's leading stock indexes. Our work can be viewed as a nonparametric extension of the spillover measures considered by Diebold and Yilmaz [49]. In accordance with them, we found evidence that the DJIA and the DAX are the most important return transmitters and that Hong Kong is the largest source of volatility spillover. Furthermore, the rolling-window-based test for Granger causality in pairwise return series demonstrated that the causal linkages in the global equity market are time-varying rather than static. The overall dependence is tighter during the most recent financial crisis, and the fluctuations of the p-values are shown to be event-dependent.
As for future work, there are several directions for potential extensions. On the theoretical side, it would be practically meaningful to consider causal linkage detection beyond a single-period lag and to deal with the infinite-order issue in a nonparametric setting. Further nonparametric techniques need to be developed to play a role similar to that of information criteria for order selection in the parametric world. On the empirical side, it will be interesting to further exploit entropy-based statistics in testing conditional dependence when there exists a so-called common factor, i.e., to look at multivariate systems with more than two variables. One potential candidate for this type of test is the partial TE coined by Vakorin et al. [51], but its statistical properties still need to be thoroughly studied.

Figure 1. Size-size and size-power plots of Granger non-causality tests, based on 500 replications and smoothed local bootstrap (a). The data generating process (DGP) is the bivariate VAR process in Equation (9), with Y affecting X. The left (right) column shows observed rejection rates under the null (alternative) hypothesis. The sample size varies from n = 200 to n = 2000.

Figure 4. Size-size and size-power plots of Granger non-causality tests, based on 500 replications and smoothed local bootstrap (a). The DGP is the bilinear process in Equation (12), with Y affecting X. The left (right) column shows observed rejection rates under the null (alternative) hypothesis. The sample size varies from n = 200 to n = 2000.

Figure 11. Graphical representation of pairwise causalities on global stock returns and volatilities. All "−→" in the graph indicate a significant directional causality at the 5% level.

Figure 12. Time-varying p-values for the TE-based Granger causality test in the return series. The causal linkages from the DJIA to other markets, as well as the linkages from other markets to the DJIA, are tested.
• (STB) In short, first $y^*_1$ is picked randomly from the original $n$ observations of $Y = \{y_1, \ldots, y_n\}$, denoted as $y^*_1 = y_s$ where $s \in [1, n]$. With probability $p$, $y^*_2$ is picked at random from the data set; and with probability $1 - p$, $y^*_2 = y_{s+1}$, so that $y^*_2$ is the observation following $y_s$ in the original series $Y = \{y_1, \ldots, y_n\}$. Proceeding in this way, $\{y^*_1, \ldots, y^*_n\}$ can be generated. If $y^*_i = y_s$ and $s = n$, the "circular boundary condition" kicks in, so that $y^*_{i+1} = y_1$. After $Y^* = \{y^*_1, \ldots, y^*_n\}$ is generated, both $X^*$ and $Z^*$ are randomly drawn from the smoothed conditional densities $f(x|Y^*)$ and $f(z|Y^*)$ as in (SMB.b).
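The index-generation part of the (STB) step can be sketched as below. The function name `stationary_bootstrap_indices` and the 0-based indexing are illustrative assumptions. The sketch covers only the construction of $Y^*$; drawing $X^*$ and $Z^*$ from the smoothed conditional densities is not shown.

```python
import random

def stationary_bootstrap_indices(n, p, rng):
    """Return n indices (0-based) defining a stationary-bootstrap resample.

    The first index is uniform on {0, ..., n-1}. Afterwards, with probability
    p a fresh uniform index is drawn; with probability 1 - p the successor of
    the previous index is taken, wrapping from n-1 back to 0 (the circular
    boundary condition).
    """
    idx = [rng.randrange(n)]
    for _ in range(n - 1):
        if rng.random() < p:
            idx.append(rng.randrange(n))       # restart at a random position
        else:
            idx.append((idx[-1] + 1) % n)      # continue the current block
    return idx

rng = random.Random(0)                          # fixed seed for reproducibility
y = [0.3, -1.2, 0.8, 0.1, -0.5, 1.7, 0.0, -0.9]
indices = stationary_bootstrap_indices(len(y), 0.1, rng)
y_star = [y[i] for i in indices]
```

A small p produces long runs of consecutive observations, which preserves more of the serial dependence in Y; p = 1 reduces to an ordinary i.i.d. bootstrap.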

Table 1. Observed size and power of the TE-based test for the linear VAR process in Equation (9). Note: Empirical size and power of the TE-based test at 5% and 10% significance levels for the process in Equation (9) for different resampling methods. The values represent observed rejection rates over 500 realizations for nominal size 0.05. Sample sizes go from 200 to 2000. The control parameter is a = 0.4 for size evaluation and a = 0.1 for establishing powers. For this simulation study, we consider C = 4.8 and C = 8.

Table 2. Observed size and power of the TE-based test for the nonlinear VAR process in Equation (10). Note: Empirical size and power of the TE-based test at 5% and 10% significance levels for the process in Equation (10) for different resampling methods. The values represent observed rejection rates over 500 realizations for nominal size 0.05. Sample sizes go from 200 to 2000. The control parameter is a = 0.4 for size evaluation and a = 0.1 for establishing powers. For this simulation study, we consider C = 4.8 and C = 8.

Table 3. Observed size and power of the TE-based test for the bivariate ARCH process in Equation (11). Note: Empirical size and power of the TE-based test at 5% and 10% significance levels for the process in Equation (11) for different resampling methods. The values represent observed rejection rates over 500 realizations for nominal size 0.05. Sample sizes go from 200 to 2000. The control parameter is a = 0.4 for size evaluation and a = 0.1 for establishing powers. For this simulation study, we consider C = 4.8 and C = 8.

Table 4. Observed size and power of the TE-based test for the bilinear process in Equation (12). Note: Empirical size and power of the TE-based test at 5% and 10% significance levels for the process in Equation (12) for different resampling methods. The values represent observed rejection rates over 500 realizations for nominal size 0.05. Sample sizes go from 200 to 2000. The control parameter is a = 0.4 for size evaluation and a = 0.1 for establishing powers. For this simulation study, we consider C = 4.8 and C = 8.

Table 5. Observed size and power of the TE-based test for the AR(2)-GARCH process in Equation (13).

Table 6. Observed size and power of the TE-based test for the ARMA-GARCH process in Equation (14).

Table 8. Observed size and power of the TE-based test for the VECM process in Equation (16). Note: Empirical size and power of the TE-based test at 5% and 10% significance levels for the process in Equation (16) for different resampling methods. The values represent observed rejection rates over 500 realizations for nominal size 0.05. Sample sizes go from 200 to 2000. The control parameter is a = 0.4 for size evaluation and a = 0.1 for establishing powers. For this simulation study, we consider C = 4.8 and C = 8.

Table 9. Observed size and power of the TE-based test for the threshold AR(1) process in Equation (17). Note: Empirical size and power of the TE-based test at 5% and 10% significance levels for the process in Equation (17) for different resampling methods. The values represent observed rejection rates over 500 realizations for nominal size 0.05. Sample sizes go from 200 to 2000. The control parameter is a = 0.4 for size evaluation and a = 0.1 for establishing powers. For this simulation study, we consider C = 4.8 and C = 8.

Following [49], the weekly volatility series are generated by making use of the weekly high, low, opening and closing prices, obtained from the underlying daily high, low, opening and closing data. The volatility $\sigma_t^2$ for week $t$ is estimated as

$$\hat{\sigma}_t^2 = 0.511 (H_t - L_t)^2 - 0.019 \left[ (C_t - O_t)(H_t + L_t - 2 O_t) - 2 (H_t - O_t)(L_t - O_t) \right] - 0.383 (C_t - O_t)^2, \qquad (19)$$

where $H_t$ is the Monday-Friday high, $L_t$ is the Monday-Friday low, $O_t$ is the Monday-Friday open and $C_t$ is the Monday-Friday close (in natural logarithms multiplied by 100). Further, after deleting the volatility estimates for the New Year week in 2002, 2008 and 2013 due to the lack of observations for the Nikkei 225 index, we have 1313 observations in total for weekly returns and volatilities. The descriptive statistics, Ljung-Box (LB) test statistics and Augmented Dickey-Fuller (ADF) test statistics for both series are summarized in Table 11.
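Equation (19) translates directly into code. The sketch below is only illustrative: the function names `scaled_log` and `weekly_volatility` are hypothetical, and the inputs are assumed to be the weekly high, low, open and close already expressed as natural logarithms multiplied by 100, as in the text.

```python
import math

def scaled_log(price):
    """Natural logarithm of a raw price, multiplied by 100."""
    return 100.0 * math.log(price)

def weekly_volatility(h, l, o, c):
    """Range-based weekly variance estimate of Equation (19).

    h, l, o, c are the Monday-Friday high, low, open and close,
    given in natural logarithms multiplied by 100.
    """
    cross = (c - o) * (h + l - 2.0 * o) - 2.0 * (h - o) * (l - o)
    return 0.511 * (h - l) ** 2 - 0.019 * cross - 0.383 * (c - o) ** 2

# Toy example on the log scale: open and close coincide, range of +/- 1.
v = weekly_volatility(1.0, -1.0, 0.0, 0.0)

# Toy example starting from raw price levels (made-up numbers).
v_raw = weekly_volatility(scaled_log(102.0), scaled_log(98.0),
                          scaled_log(100.0), scaled_log(101.0))
```

Because the estimate depends only on differences of log prices, it is unchanged if all four prices of a week are rescaled by a common factor.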

Table 11. Descriptive statistics for global stock market return and volatility. Note: Descriptive statistics for six globally leading indexes. The sample size is 1313 for both returns and volatilities. The nominal returns are measured by the weekly Friday-to-Friday log price difference multiplied by 100, and the Monday-to-Friday volatilities are calculated following [49]. For the LB test and the ADF test statistics, the asterisks indicate the significance of the corresponding p-value at the 1% (**) level.

Table 12. Detection of Conditional Dependence in Global Stock Returns. Note: Statistics for the pairwise TE-based test on returns of global stock indexes for one-week-ahead conditional non-independence. The results are shown for the five different resampling methods in Section 2.3. The constant C takes the values 4.8 and 8 as a robustness check. The asterisks indicate the significance of the corresponding p-value at the 5% (*) and 1% (**) levels.

Table 13. Detection of Conditional Dependence in Global Stock Volatilities. Note: Statistics for the pairwise TE-based test on volatilities of global stock indexes for one-week-ahead conditional non-independence. The results are shown for the five different resampling methods in Section 2.3. The constant C takes the values 4.8 and 8 as a robustness check. The asterisks indicate the significance of the corresponding p-values at the 5% (*) and 1% (**) levels.