Simulation Study of Direct Causality Measures in Multivariate Time Series

Measures of the direction and strength of the interdependence among time series from multivariate systems are evaluated based on their statistical significance and discrimination ability. The best-known measures estimating direct causal effects, both linear and nonlinear, are considered, i.e., conditional Granger causality index (CGCI), partial Granger causality index (PGCI), partial directed coherence (PDC), partial transfer entropy (PTE), partial symbolic transfer entropy (PSTE) and partial mutual information on mixed embedding (PMIME). The performance of the multivariate coupling measures is assessed on stochastic and chaotic simulated uncoupled and coupled dynamical systems for different settings of embedding dimension and time series length. The CGCI, PGCI and PDC seem to outperform the other causality measures in the case of the linearly coupled systems, while the PGCI is the most effective one when latent and exogenous variables are present. The PMIME outweighs all others in the case of nonlinear simulation systems.


Introduction
The quantification of the causal effects among simultaneously observed systems from the analysis of time series recordings is essential in many scientific fields, ranging from economics to neurophysiology. Estimating the inter-dependence among the observed variables provides valuable knowledge about the processes that generate the time series. Granger causality has been the leading concept for the identification of directional interactions among variables from their time series, and it has been widely used in economics [1]. However, the last few years, it has become popular also in many different fields, e.g., for the analysis of electroencephalograms.
The mathematical formulation of linear Granger causality is based on linear regression modeling of stochastic processes. Many modifications and extensions of the Granger causality test have been developed; see e.g., [2][3][4][5][6][7]. Most of the non-causality tests, built on the Granger causality concept and applied in economics, are therefore based on the modeling of the multivariate time series. Despite the success of these strategies, the model-based methods may suffer from the shortcomings of model mis-specification.
The majority of the measures determining the interrelationships among variables that have been developed so far are for bivariate data, e.g., state-space based techniques [8,9], information measures [10][11][12] and techniques based on the concept of synchronization [13,14].
Bivariate causality tests may erroneously detect couplings when two variables are conditionally independent. To address this, techniques accounting for the effect of the confounding variables have been introduced, termed direct causality measures, which are more appropriate when dealing with multivariate time series [15][16][17]. Direct causality methods emerged as extensions of bivariate Granger causality. For example, the Granger causality index (GCI), implementing the initial idea for two variables in the time domain, has been extended to the conditional and partial Granger causality index (CGCI and PGCI) [2,18]. Directed coherence (DC) was introduced in the frequency domain, and being a bivariate measure, it cannot discriminate between direct and indirect coupling. The direct transfer function (DTF) is similarly defined as DC [19]. The partial directed coherence (PDC) is an extension of DC to multivariate time series measuring only the direct influences among the variables [20]. Similarly, direct Directed Transfer Function (dDTF) modified DTF to detect only direct information transfer [21].
Information theory sets a natural framework for non-parametric methodologies of several classes of statistical dependencies. Several techniques from information theory have been used in the last few years for the identification of causal relationships in multivariate systems, and the best known is transfer entropy (TE) [11]. Test for causality using the TE has also been suggested [22]. However, the TE is, again, bivariate and its natural extension to account for the presence of confounding variables has been recently introduced, namely, the partial TE (PTE), under different estimating schemes, using bins [23], correlation sums [24] and nearest neighbors [25]. The TE and PTE are actually expressions of conditional mutual information, and with this respect, an improved version of TE making use of a properly restricted non-uniform state space reconstruction was recently developed, termed mutual information on mixed embedding (MIME) [26]. Later, a similar approach to TE/PTE was implemented, which takes into consideration the conditional entropy [27]. Recently MIME was extended for multivariate time series to the partial MIME (PMIME) [28]. Other coupling methods have also been suggested, such as Renyi's information transfer [29]. In a different approach, the TE has been defined on rank vectors instead of sample vectors, called the symbolic transfer entropy (STE) [30], and, respectively, to the multivariate case termed partial STE (PSTE) (for a correction of STE and PSTE, see, respectively, [31,32]).
In this work, we compare model-based methods, both in the time and frequency domain, and information theoretic multivariate causality measures that are able to distinguish between direct and indirect causal effects. We include in the study most of the known direct causality measures of these classes, i.e., CGCI and PGCI (linear in time domain), PDC (linear in frequency domain), PTE, PSTE and PMIME (from information theory). The statistical significance of the test statistics is assessed with resampling methods, bootstraps or randomization tests using appropriate surrogates, whenever it is not theoretically known.
The structure of the paper is as follows. The multivariate causality measures considered in this study are presented in Section 2. The statistical significance of the coupling measures is assessed on simulated systems. The simulation systems and the setup of the simulation study are presented in Section 3, while the results of this study and the performance of the causality measures are discussed in Section 4. Finally, the conclusions are drawn in Section 5.

Direct Causality Measures
Let {x 1,t , . . . , x K,t }, t = 1, . . . , n, denote a K-variate time series, consisting of K simultaneously observed variables, X 1 , . . . , X K , belonging to a dynamical system or representing respective subsystems of a global system. The reconstructed vectors of each X i are formed as and m i and τ i are, respectively, the reconstruction parameters of embedding dimension and time delay for X i . The notation, X 2 → X 1 , denotes the Granger causality from X 2 to X 1 , while X 2 → X 1 |Z denotes the direct Granger causality from X 2 to X 1 , accounting for the presence of the other (confounding) variables, i.e., Z = {X 3 , . . . , X K }. The notation of Granger causality for other pairs of variables is analogous.
Almost all of the causality measures require the time series be stationary, i.e., their mean and variance do not change over time. If the time series are non-stationary, then the data should be pre-processed, e.g., for time series that are non-stationary in mean, one can apply the measures on the first or higher order differences. Different transformations are needed in case of non-stationary data in variance or co-integrated time series.

Conditional Granger Causality Index
Granger causality is based on the concept that if the value of a time series, X 1 , to be predicted is improved by using the values of X 2 , then we say that X 2 is driving X 1 . A vector autoregressive model (VAR) in two variables and of order P , fitted to the time series, {x 1,t }, is: where a 1,j , b 1,j are the coefficients of the model and 1 the residuals from fitting the model with variance s 2 1U . The model in Equation (1) is referred to as the unrestricted model, while the restricted model is obtained by omitting the terms regarding the driving variable [the second sum in Equation (1)] and has residual variance, s 2 1R . According to the concept of Granger causality, the variable, X 2 , Granger causes . The magnitude of the effect of X 2 on X 1 is given by the Granger Causality Index (GCI), defined as: Considering all K variables, the unrestricted model for X 1 is a VAR model in K variables and involves the P lags of all K variables [K sum terms instead of two in Equation (1)]; the restricted model will have all but the P lags of the driving variable, X 2 . Likewise, the conditional Granger causality index (CGCI) is: where s 2 1U and s 2 1R are the residual variances for the unrestricted and restricted model defined for all K variables.
A parametric significance test for GCI and CGCI can be conducted for the null hypothesis that variable X 2 is not driving X 1 , making use of the F -significance test for all P coefficients, b 1,j [41]. When we want to assess collectively the causal effects among all pairs of the K variables, a correction for multiple testing should be performed, e.g., by means of the false discovery rate [42].
The order, P , of the VAR model is usually chosen using an information criterion, such as the Akaike Information Criterion (AIC) [43] and the Bayesian Information Criterion (BIC) [44]. The estimation of the coefficients of the VAR models and the residual variances of the models are described analytically in [45].

Partial Granger Causality Index
The partial Granger causality index (PGCI) is associated with the concept of Granger causality and partial correlation [18]. The PGCI addresses the problem of exogenous inputs and latent variables. The intuition is that the influence of exogenous and/or latent variables on a system will be reflected by correlations among the residuals of a VAR model of the measured variables. Thus, in the PGCI, one makes use of the residual covariance matrix of the VAR unrestricted and restricted model, denoted Σ and ρ, respectively, and not only of the residual variance of the response variable, X 1 , s 2 1U and s 2 1R , subsequently. For example, for X 2 → X 1 |X 3 , denoting the components of Σ as Σ ij , i, j = 1, 2, 3 and the components of ρ as ρ ij , i, j = 1, 2, the PGCI is given as: Note that Σ 11 = s 2 1U and ρ 11 = s 2 1R . The PGCI constitutes an improved estimation of the direct Granger causality as compared to the CGCI when the residuals of the VAR models are correlated; otherwise, it is identical to the CGCI. The estimation procedure for the PGCI is described analytically in [18].

Partial Directed Coherence
The partial directed coherence (PDC) is related to the same VAR model as the CGCI, but is defined in the frequency domain [20]. Denoting the K × K matrix of the Fourier transform of the coefficients of the VAR model in K variables and order P by A(f ), the PDC from X 2 to X 1 at a frequency f is given by [20]:

Partial Transfer Entropy
The transfer entropy (TE) is a nonlinear measure that quantifies the amount of information explained in X 1 at h steps ahead from the state of X 2 , accounting for the concurrent state of X 1 [11]. The TE is given here in terms of entropies. For a discrete variable, X (scalar or vector), the Shannon entropy is H(X) = − p(x i ) ln p(x i ), where p(x i ) is the probability mass function of variable, X, at the value, x i . Further, the TE is expressed as: The first equality is inserted to show that the TE is equivalent to the conditional mutual information is the mutual information (MI) of two variables, X and Y . The time horizon, h, is introduced here instead of the single time step, originally used in the definition of TE. The partial transfer entropy (PTE) is the extension of the TE designed for the direct causality of X 2 to X 1 conditioning on the remaining variables in Z The entropy terms of PTE are estimated here using the k-nearest neighbors method [48].

Symbolic Transfer Entropy
The symbolic transfer entropy (STE) is the continuation of the TE estimated on rank-points formed by the reconstructed vectors of the variables [30]. For each vector, x 2,t , the ranks of its components assign a rank-point,x 2,t = [r 1 , r 2 , . . . , r m 2 ], where r j ∈ {1, 2, . . . , m 2 } for j = 1, . . . , m 2 . Following this sample-point to rank-point conversion, the sample, x 1,t+h , in Equation (7) is taken as the rank point at time, t + h,x 1,t+h , and STE is defined as: where the entropies are computed based on the rank-points.
In complete analogy to the derivation of the PTE from the TE, the partial symbolic transfer entropy (PSTE) extends the STE for multivariate time series and is expressed as: where the rank vector,ẑ t , is the concatenation of the rank vectors for each of the embedding vectors of the variables in Z.

Partial Mutual Information on Mixed Embedding
The mutual information on mixed embedding (MIME) is derived directly from a mixed embedding scheme based on the conditional mutual information (CMI) criterion [26]. In the bivariate case and for the driving of X 2 on X 1 , the scheme gives a mixed embedding of varying delays from the variables, X 1 and X 2 , that explains best the future of X 1 , defined as x h 1,t = [x 1,t+1 , . . . , x 1,t+h ]. The mixed embedding vector, w t , may contain lagged components of X 1 , forming the subset, w X 1 Similarly to the MIME, the PMIME can be considered as a normalized version of the PTE for optimized non-uniform embedding of all K variables. Thus, the PMIME takes values between zero and one, where zero indicates the absence of components of X 2 in the mixed embedding vector and, consequently, no direct Granger causality from X 2 to X 1 .
A maximum lag to search for components in the mixed embedding vector is set for each variable, here being the same maximum lag, L max , for all variables. L max can be set equal to a sufficiently large number without affecting the performance of the measure; however, the larger it is, the higher the computational cost is. For the estimation of the MI and the CMI, the k-nearest neighbors method is used [48].

Simulation Study
The multivariate causality measures are evaluated in a simulation study. All the considered direct coupling measures are computed on 100 realizations of multivariate uncoupled and coupled systems, for increasing coupling strengths and for all directions. The simulation systems that have been used in this study are the following.
• System 1: A vector autoregressive process of order one [VAR(1)] in three variables with X 1 → X 2 and X 2 → X 3 where θ t , η t and t are independent to each other Gaussian white noise processes, with standard deviations one, 0.2 and 0.3, respectively.
• System 2: A VAR(5) process in four variables with where i,t , i = 1, . . . , 4 are independent to each other Gaussian white noise processes with unit standard deviation.
The time series of this system become completely synchronized for coupling strengths, c ≥ 0.7.
In order to investigate the effect of noise on the causality measures, we also consider the case of addition of Gaussian white noise to each variable of System 5, with standard deviation 0.2 times their standard deviation.
• System 6: Three coupled Lorenz systems with nonlinear couplings, X 1 → X 2 and X 2 → X 3 The first variables of the three interacting systems are observed at a sampling time of 0.05 units. The couplings, X 1 → X 2 and X 2 → X 3 , have the same strength, c, and c = 0, 1, 3, 5. The time series of the system become completely synchronized for coupling strengths, c ≥ 8. For a more detailed description of the synchronization of the coupled Systems 5 and 6, see [51].
The time series lengths considered in the simulation study are n = 512 and n = 2048. Regarding the CGCI, the PGCI and the PDC, the order, P , of the VAR model is selected by combining the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), as well as our knowledge for the degrees of freedom of each coupled system, as follows. The range of model orders, for which the AIC and the BIC are calculated, is selected to be at the level of the 'true' model order based on the equations of each system. Specifically, for Systems 1, 4, 5, 6 and 7, we considered the range of model orders, [1,5], for the calculation of AIC and BIC, and for Systems 2 and 3, we considered the range, [1,10]. Further, we estimate the PDC for a range of frequencies determined by the power spectrum of the variables of each system. We specify this range by selecting those frequencies that display the highest values in the auto-spectra of the variables [52]. The p-values from a non-parametric test for the PDC are estimated for the selected range of frequencies (using bootstraps [53]), and in order to decide whether a coupling is significant, at least 80% of the p-values from this range of frequencies should be significant.
The embedding dimension, m, for the PTE and the PSTE and the maximum lag, L max , for the PMIME are set equal to P and τ = 1 for the PTE and the PSTE. Note that this choice of L max may be very restrictive, and the PMIME may not be optimal; but, we adopt it here to make the choice for VAR order and embedding uniform. The time step ahead, h, for the estimation of the PTE, PSTE and PMIME is set to one for the first five systems and System 7 (the common choice for discrete-time systems), while for the continuous-time system (System 6), h is set to be equal to m. The number of nearest neighbors for the estimation of the PTE and the PMIME is set to k = 10. We note that the k-nearest neighbors methods for the estimation of the measures is found to be stable and not significantly affected by the choice of k [48]. The threshold for the stopping criterion for the mixed embedding scheme for the PMIME is set to A = 0.95 (for details, see [26]).

Statistical Significance of the Causality Measures
In the simulation study, we assess the statistical significance of the causality measures by means of parametric tests, when applicable, and nonparametric (resampling) tests, otherwise, in the way these have been suggested in the literature for each measure. The correction for multiple testing regarding the significance of a measure on all possible variable pairs is not considered here, as the interest is in comparing the performance of the different direct causality measures rather than providing rigorous statistical evidence for the significance of each coupling.
Starting with the CGCI, it bears a parametric significance test, and this is the F -test for the null hypothesis that the coefficients of the lagged driving variables in the unrestricted VAR model are all zero [41]. If P 1 and P 2 are the numbers of variables in the restricted and the unrestricted autoregressive model, respectively, (P 2 > P 1 ), and n is the length of the time series, then the test statistic is F = ((RSS 1 − RSS 2 )/(P 1 − P 2 ))/(RSS 2 /(n − P 2 )), where RSS i is the residual sum of squares of model, i. Under the null hypothesis that the unrestricted model does not provide a significantly better fit than the restricted model, the F -statistic follows the Fisher-Snedecor, or F , distribution with (P 2 − P 1 , n − P 2 ) degrees of freedom. The null hypothesis is rejected if the F -statistic calculated on the data is greater than the critical value of the F -distribution for some desired false-rejection probability (here α = 0.05).
The statistical significance of the PGCI is assessed by means of confidence intervals formed by bootstrapping [53], since the null distribution is unknown. The empirical distribution of any statistic using bootstrapping is formed from the values of the statistic computed on a number of new samples obtained by random sampling with replacement from the observed data. In the context of vector autoregressive models, this can be realized by subdividing the data matrix (of the predictor and response jointly) into a number of windows, which are repeatedly sampled with replacement to generate bootstrap data matrices. By this procedure, the causal relationships within each window are not affected. The PGCI is computed for each bootstrapped data matrix. The confidence interval of the PGCI is formed by the lower and upper empirical quantiles of the bootstrap distribution of the PGCI for the significance level, α = 0.05. The bootstrap confidence interval for the PGCI can be considered as a significance test, where the test decision depends on whether zero is included in the confidence interval. The details for the estimation of the bootstrap confidence intervals of the PGCI can be found in [18].
The statistical significance of the PDC can be determined using both parametric testing [46,47], and randomization (surrogate) testing [54]. Here, we choose the parametric approach. The statistical significance of a nonzero value, PDC X 2 →X 1 (f ), is investigated by means of a critical value, c P DC . Under the null hypothesis that there is no Granger causality, X 2 → X 1 , it holds |A 12 (f )| = 0, and c P DC can be derived from theoretical considerations for each frequency, f , at a given α-significance level by: The term, χ 2 1,1−a , denotes the (1 − α)-quantile of the χ 2 distribution with one degree of freedom, and C ij (f ) is an estimate of the expression: where Σ −1 jj denotes the entries of the inverse of the covariance matrix, Σ, of the VAR process [47]. The statistical significance of the PTE and the PSTE is evaluated assuming a randomization test with appropriate surrogate time series, as their null distribution is not known (for the PSTE, in [32], analytic approximations were built, but found to be inferior to approximations using surrogates). We create M surrogate time series consistent with the non-causality null hypotheses, H 0 , i.e., X 2 does not Granger causes X 1 . To destroy any causal effect of X 2 on X 1 without changing the dynamics in each time series, we randomly choose a number, d, less than the time series length, n, and the d-first values of the time series of X 2 are moved to the end, while the other series remain unchanged. The random number, d, for the time-shifted surrogates is an integer within the range, [0.05n, 0.95n], where n is the time series length. This scheme for generating surrogate time series is termed time-shifted surrogates [55]. We estimate the causality measure (PTE or PSTE) from the original multivariate time series, let us denote it q 0 , and for each of the M multivariate surrogate time series, let us denote them q 1 , q 2 , . . . , q M . If q 0 is at the tail of the empirical null distribution formed by q 1 , q 2 , . . . , q M , then H 0 is rejected. For the two-sided test, if r 0 is the rank of q 0 in the ordered list of q 0 , q 1 , . . . , q M , the p-value for the test is 2(r 0 − 0.326)/(M + 1 + 0.348) if r 0 < (M + 1)/2 and 2[1 − (r 0 − 0.326)/(M + 1 + 0.348)] if r 0 ≥ (M + 1)/2, by applying the correction for the empirical cumulative density function in [56].
Finally, PMIME does not rely on any significance test, as it gives zero values in the uncoupled case and positive values, otherwise. This was confirmed using time-shifted surrogates also for the PMIME in the simulation study, and the PMIME values of the surrogate time series were all zero.
For the estimation of CGCI and PGCI and their statistical significance, we used the 'Causal Connectivity Analysis' toolbox [57]. The programs for the computations of the remaining causality measures have been implemented in Matlab.

Evaluation of Causality Measures
In order to evaluate the multivariate causality measures, the percentage of rejection of the null hypothesis of no causal effects (H 0 ) in 100 realizations of the system is calculated for each possible pair of variables and for different time series lengths and free parameters of the measures. The focus when presenting the results is on the sensitivity of the measure or, respectively, the power of the significance test (the percentage of rejection at the significance level 5% or α = 0.05 when there is true direct causality), as well as the specificity of the measure or size of the test (the percentage of rejection at α = 0.05 when there is no direct causality) and how these properties depend on the time series length and the measure-specific parameter.

Results for System 1
For the estimation of the linear measures, the order of the model, P , is set to one, as indicated from the Bayesian Information Criterion (BIC) and the Akaike Information Criterion (AIC), while for the estimation of PTE, m is also set to one. The PDC is estimated for the range of frequencies [0, 0.5], since the auto-spectra of the variables do not suggest a narrower range. Indeed, the p-values of PDC are all significant in [0, 0.5], when there is direct causality, and not significant, when there is no direct causality. The CGCI, PGCI, PDC and PTE correctly detect the direct causal effect for both time series lengths, n = 512 and 2048. All the aforementioned measures indicate 100% rejection of H 0 for the true couplings, X 1 → X 2 and X 2 → X 3 , and low percentages for all other couplings. Their performance is not affected by the time series length, for the time series lengths considered. The estimated percentages are displayed for both n in Table 1.  The PSTE can only be estimated for m ≥ 2, and therefore, results are obtained for m = 2. The PSTE correctly detects the direct causalities for m = 2; however, it also indicates the indirect effect, X 1 → X 3 , and the spurious causal effect, X 2 → X 1 .
Only for this system, the PMIME with the threshold of A = 0.95 failed to detect the true direct effects, and the randomization test gave partial improvement (detection of one of the two true direct effects, X 2 → X 3 ). This is merely a problem of using the fixed threshold, A = 0.95, in this system, and following the adapted threshold proposed in [28], the two true direct effects could be detected for all realizations with n = 512 and n = 2048 with the largest rate of false rejection being 8%.

Results for System 2
For the second simulation system, the model order is set to P = 5, as indicated both by BIC and AIC. The embedding dimension, m, and L max are also set to five. The PDC is estimated for the range of frequencies, [0, 0.4], since the auto-spectra of all the variables are higher in the range, [0, 0.2], while variable, X 3 , exhibits a peak in the range, [0.2, 0.4]. Indicatively, the p-values from one realization of the system for the range of frequencies, [0, 0.5], is displayed in Figure 1a. The CGCI, PGCI, PDC and PMIME correctly detect the direct couplings (X 2 → X 1 , X 1 → X 3 , X 2 → X 3 , X 4 → X 2 ), as shown in Table 2. The performance of the CGCI and PDC is not affected by the time series length. The PGCI is also not affected by n, except for the causal effect, X 2 → X 4 , where the PGCI falsely indicates causality for n = 2048 (20%). The PMIME indicates lower power of the test compared to the linear measures only for X 2 → X 3 and n = 512. Table 2. Percentage of statistically significant values of the causality measures for System 2, P = m = L max = 5. The directions of direct causal effects are displayed in bold face. CGCI, conditional Granger causality index; PGCI, partial Granger causality index; PDC, partial directed coherence; PTE, partial transfer entropy; PSTE, partial symbolic transfer entropy; PMIME, partial mutual information on mixed embedding.    The PTE detects the direct causal relationships, apart from the coupling, X 2 → X 3 , although the percentage of rejection in this direction increases with n (from 14% for n = 512 to 39% for n = 2048). Further, the erroneous relationships, X 1 → X 4 (50%) and X 3 → X 4 (29%), are observed for n = 2048. Focusing on the PTE values, it can be observed that they are much higher for the directions of direct couplings than for the remaining directions. Moreover, the percentages of significant PTE values increase with n for the directions with direct couplings and decrease with n for all other couplings (see Table 3).
We note that the standard deviation of the estimated PTE values from the 100 realizations are low (on the order of 10 −2 ). Thus, the result of having falsely statistically significant PTE values for X 1 → X 4 and X 3 → X 4 is likely due to insufficiency of the randomization test.
PSTE fails to detect the causal effects for the second coupled system for both time series lengths, giving rejections at a rate between 1% and 24% for all directions. The failure of PSTE may be due to the the high dimensionality of the rank vectors (the joint rank vector has dimension 21).

Results for System 3
The CGCI, PGCI, PDC, PTE and PMIME correctly detect all direct causal effects (X 1 → X 2 , (based on BIC and AIC), as shown in Table 4. The PDC is again estimated in the range of frequencies, [0, 0.4], (see in Figure 1b the auto-spectra of the variables and the p-values from the parametric test of PDC from one realization of the system). The CGCI, PGCI and PDC perform similarly for the two time series lengths. The PTE indicates 100% significant values for n = 2048 when direct causality exists. However, the PTE also indicates the spurious causality, X 5 → X 4 , for n = 2048 (37%). The specificity of the PMIME is improved by the increase of n, and the percentage of positive PMIME values in case of no direct causal effects varies from 0% to 22% for n = 512, while for n = 2048, it varies from 0% to 1%. The PSTE again fails to detect the causal effects, giving very low percentage of rejection at all directions (2% to 16%).
Since the linear causality measures CGCI, PGCI and PDC have been developed for the detection of direct causality in linear coupled systems, it was expected that these methods would be successfully applied to all linear systems. The nonlinear measures PMIME and PTE also seem to be able to capture the direct linear couplings in most cases, with PMIME following close the linear measures both in specificity and sensitivity.
In the following systems, we investigate the ability of the causality measures to correctly detect direct causal effects when nonlinearities are present.

Results for System 4
For the fourth coupled system, the BIC and AIC suggest to set P = 1, 2 and 3. The performance of the linear measures does not seem to be affected by the choice of P . The PDC is estimated for frequencies in [0.1, 0.4]. The auto-spectra of the three variables do not display any peaks or any upward/downward trends. No significant differences in the results are observed if a wider or narrower range of frequencies is considered. The linear measures, CGCI, PGCI and PDC, capture only the linear direct causal effect, X 2 → X 3 , while they fail to detect the nonlinear relationships, X 1 → X 2 and X 1 → X 3 , for both time series lengths.
The PTE and the PMIME correctly detect all the direct couplings for the fourth coupled system for m = L max = 1, 2 and 3. The percentage of significant values of the causality measures are displayed in Table 5. The PTE gives equivalent results for m = 1 and m = 2. The PTE correctly detects the causalities for m = 3, but at a smaller power of the significance test for n = 512 (63% for X 1 → X 2 , 46% for X 2 → X 3 and 43% for X 1 → X 3 ). The percentage of significant PMIME values is 100% for the directions of direct couplings, and falls between 0% and 6% for all other couplings, and this holds for any L max = 1, 2 or 3 and for both n. The PSTE indicates the link X 2 → X 3 for both time series lengths, while X 1 → X 2 is detected only for n = 2048. The PSTE fails to point out the causality, X 1 → X 3 . The results for m = 2 and 3 are equivalent. In order to investigate whether the failure of PSTE to show X 1 → X 3 is due to finite sample data, we estimate the PSTE also for n = 4096. For m = 2, it indicates the same results as for n = 2048. For m = 3, the PSTE detects all the direct causal effects, X 1 → X 2 (99%), X 2 → X 3 (100%), X 1 → X 3 (86%), but X 3 → X 1 (62%) is also erroneously detected.

Results for System 5
For the fifth coupled simulation system, we set the model order, P = 2, based on the complexity of the system, and P = 3, 4 and 5 using the AIC and BIC. The auto-spectra of the variables display peaks in [0.1, 0.2] and [0.4, 0.5]. The PDC is estimated for different ranges of frequencies to check its sensitivity with respect to the selection of the frequency range. When small frequencies are considered, the PDC seems to indicate larger percentages of spurious couplings; however, also, the percentages of significant PDC values at the directions of true causal effects are smaller. The results are presented for System 5 considering the range of frequencies, [0.4, 0.5].
The CGCI seems to be sensitive to the selection of the model order P , indicating some spurious couplings for the different P . The best performance for CGCI is achieved for P = 3; therefore, only results for P = 3 are shown. On the other hand, the PGCI turns out to be less dependent on P , giving similar results for P = 2, 3, 4 and 5. The PTE is not substantially affected by the selection of the embedding dimension, m (at least for the examined coupling strengths); therefore, only results for m = 2 are discussed. The PSTE is sensitive to the selection of m, performing best for m = 2 and 3, while for m = 4 and 5, it indicates spurious and indirect causal effects. The PMIME does not seem to depend on L max . Results are displayed for L max = 5. The percentage of significant values for each measure are displayed in Figure 2, for all directions, for increasing coupling strength and for both n.  Most of the measures show good specificity, and the percentage of rejection for all pairs of the variables of the uncoupled system (c = 0) is at the significance level, α = 0.05, with only CGCI scoring a somehow larger percentage of rejection up to 17%.
For the weak coupling strengths, c = 0.05 and 0.1, the causality measures cannot effectively detect the causal relationships or have a low sensitivity. The CGCI and the PTE seem to have the best performance, while the PMIME seems to be effective only for n = 2048 and c = 0.1.
As the coupling strength increases, the sensitivity of the causality measures is improved. For c = 0.2, the CGCI, PTE and PMIME correctly indicate the true couplings for both n, while the PGCI and the PSTE do this only for n = 2048. The PDC has low power, even for n = 2048. For c = 0.3, nearly all measures correctly point out the direct causal effects (see Table 6). The best results are obtained with the PMIME, while the CGCI and PTE display similar performance. The PGCI and the PSTE are sensitive to the time series length and have a high power only for n = 2048. The PDC performs poorly, giving low percentage of significant PDC values, even for n = 2048. All measures have good specificity, with CGCI and PTE giving rejections well above the nominal level for some non-existing couplings.
Considering larger coupling strengths, the causality measures correctly indicate the true couplings, but also some spurious ones. The PMIME outperforms the other measures giving 100% positive values for both n for X 1 → X 2 and X 2 → X 3 and 0% at the remaining directions for c ≥ 0.2. Indicative results for all measures are displayed for the strong coupling strength c = 0.5 in Table 7.
In order to investigate the effect of noise on each measure, we consider the coupled Hénon map (System 5) with the addition of Gaussian white noise with standard deviation 0.2 times the standard deviation of the original time series. Each measure is estimated again from 100 realizations from the noisy system for the same free parameters as considered in the noise-free case.
The CGCI is not significantly affected by the addition of noise, giving equivalent results for P = 3 as for the noise-free system. The CGCI detects the true causal effects even for weak coupling strength (c ≥ 0.05). For different P values (P = 2, 4 or 5), some spurious and/or indirect causal effects are observed for c > 0.3.
The PGCI is also not considerably affected by the addition of noise. The causal effects are detected only for coupling strengths, c ≥ 0.3, for n = 512, and for c ≥ 0.2, for n = 2048, while the power of the test increases with c and with n (see Figure 3a).
The PDC fails in the case of the noisy coupled Hénon maps, detecting only the coupling X 1 → X 2 , for coupling strengths, c ≥ 0.2 and n = 2048 (see Figure 3b).
The PTE seems to be significantly affected by the addition of noise, falsely detecting the coupling, X 2 → X 1 , X 3 → X 2 , and the indirect coupling, X 1 → X 3 , for strong coupling strengths. The performance of PTE is not significantly influenced by the choice of m. Indicative results are presented in Table 8 for m = 2.
Noise addition does not seem to affect the performance of PSTE. Results for m = 2 are equivalent to the results obtained for the noise-free case. The power of the significance test increases with c and n. The PSTE is sensitive to the selection of m; as m increases, the percentage of significant PSTE values in the directions of no causal effects also increases.
The PMIME outperforms the other measures also for the noisy coupled Hénon maps, detecting the true couplings for c ≥ 0.2 for n = 512 (100%) and for c ≥ 0.1 for n = 2048 (for coupling strength c = 0.1 the percentages are 22% and 23% for X 1 → X 2 , X 2 → X 3 , respectively, and for c ≥ 0.2, the percentages are 100%, for both couplings).   For System 6, we set P = 3 based on the complexity of the system and P = 5 regarding the AIC and BIC. The PTE, PSTE and PMIME are estimated for four different combinations of the free parameters, h and m (L max for PMIME), i.e., for h = 1 and m = 3, for h = 3 and m = 3, for h = 1 and m = 5 and for h = 5 and m = 5. The PDC is computed for the range of frequencies, [0, 0.2], based on the auto-spectra of the variables. As this system is a nonlinear flow, the detection of causal effects is more challenging compared to stochastic systems and nonlinear coupled maps. Indicative results for all causality measures are displayed for increasing coupling strengths in Figure 4. The CGCI has poor performance, indicating many spurious causalities. The PGCI improves the specificity of the CGCI, but still, the percentages of statistically significant PGCI values increase with c for non-existing direct couplings (less for larger n). Similar results are obtained for P = 3 and 5. On the other hand, the PDC is sensitive to the selection of P , indicating spurious causal effects for all P . As P increases, the percentage of significant PDC values at the directions of no causal effects is reduced. However, the power of the test is also reduced.
The PTE is sensitive to the embedding dimension m and the number of steps ahead h, performing best for h = 1 and m = 5. It fails to detect the causal effects for small c; however, for c > 1, it effectively indicates the true couplings. The size of the test increases with c (up to 36% for c = 5 and n = 2048), while the power of the test increases with n.
The PSTE is also affected by its free parameters, performing best for h = 3 and m = 3. It is unable to detect the true causal effects for weak coupling strengths (c ≤ 1) and for small time series lengths. The PSTE is effective only for c > 2 and n = 2048. Spurious couplings are observed for strong coupling strengths c and n = 2048.
The PMIME is also influenced by the choice of h and L max , indicating low sensitivity when setting h = 1, but no spurious couplings, while for h = L max , the percentage of significant PMIME values for X 1 → X 2 and X 2 → X 3 is higher, but the indirect coupling X 1 → X 3 is detected for strong coupling strengths. The PMIME has a poor performance for weak coupling strength (c < 2).
As c increases, the percentages of significant values of almost all the causality measures increase, but not only at the true directions, X 1 → X 2 and X 2 → X 3 . Indicative results are presented in Table 9 for strongly coupled systems (c = 5). The CGCI gives high percentages of rejection of H 0 for all couplings (very low specificity). This also holds for the PGCI, but at a lower significance level. The PTE correctly detects the two true direct causal effects for h = 1 and m = 5, but at some significant degree, also the indirect coupling, X 1 → X 3 , and the non-existing coupling, X 3 → X 2 . The PSTE does not detect the direct couplings for n = 512, but it does when n = 2048 (97% for X 1 → X 2 and 80% for X 2 → X 3 ), but then it detects also spurious couplings, most notably X 3 → X 2 (35%). The PMIME points out only the direct causal effects, giving, however, a lower percentage than the other measures for h = 1, L max = 5. Its performance seems to be affected by the selection of h and L max . The nonlinear measures turn out to be more sensitive to their free parameters. For the last coupled simulation system, we set the model order P = 2, 3, 4 based on AIC and BIC, while the PDC is estimated in the range of frequencies, [0.1, 0.2]. The embedding dimension, m, for the estimation of PTE and PSTE, as well as L max for the estimation of PMIME, are set equal to P . Results for all causality measures are displayed in Table 10.
The CGCI (for P = 5) correctly indicates the causal effects for n = 512, giving 100% percentage of significant values at the direction X 1 → X 2 and X 1 → X 4 , but lower percentage at the directions X 1 → X 3 (63%), X 4 → X 5 (37%) and X 5 → X 4 (42%). The power of the test increases with n, but spurious couplings are also detected for n = 2048. For P = 2, 3 and 4, the CGCI indicates more spurious couplings than for P = 5. System 7 favors the PGCI, as it has been specifically defined for systems with latent and exogenous variables. The PGCI denotes the couplings, X 1 → X 2 , X 1 → X 3 , X 1 → X 4 and X 4 → X 5 , even for n = 512 with a high percentage; however, it fails to detect the coupling, X 5 → X 4 , for both n.
The PDC detects the true couplings at low percentage for n = 512. The percentage increases with n = 2048 at the directions of the true couplings. However, there are also false indications of directed couplings.
The PTE does not seem to be effective in this setting for any of the considered m values, since it indicates many spurious causal effects, while it fails to detect X 4 → X 5 . The PSTE completely fails in this case, suggesting significant couplings at all directions. The true causal effects are indicated by the PMIME, but here, as well, many spurious causal effects are also observed.

Discussion
In this paper, we have presented six multivariate causality measures that are able to detect the direct causal effects among simultaneously measured time series. The multivariate direct coupling measures are tested on simulated data from coupled and uncoupled systems of different complexity, linear and nonlinear, maps and flows. The linear causality measures and the PMIME can be used in complex systems with a large number of observed variables, but the PTE and the PSTE fail, because they involve estimation of probability distributions of high dimensional variables.
The simulation results suggest that for real world data, it is crucial to investigate the presence of nonlinearities and confirm the existence of causal effects by estimating more than one causality measure, sensitive to linear, as well as nonlinear causalities. Concerning the specificity of the coupling measures (in absence of direct causality), the PMIME outperforms the other measures, but for weak coupling, it is generally less sensitive than the PTE. In general, the PMIME indicated fewer spurious causal effects. Here, we considered only systems of a few variables, and for larger systems, the PMIME was found to outperform the PTE and, also, the CGCI [28].
Regarding the three first linear coupled systems in the simulation study, the CGCI, PGCI and PDC are superior to the nonlinear causality measures, both in sensitivity and specificity. The PMIME correctly indicates the coupling among the variables, but tends to have smaller sensitivity than the linear tests. The PTE cannot detect the true direct causality in all the examined linear systems and gives also some spurious results. The PSTE was the less effective one. For the last simulation system with the exogenous and latent variables (System 7), all but PGCI measures had low specificity, indicating only the true direct causal effects.
Concerning the nonlinear coupled system, the PTE and the PMIME outperform the other methods. The linear measures (CGCI, PGCI and PDC) fail to consistently detect the true direct causal effects (low sensitivity). The failure of the linear measures may not only be due to the fact that the system is nonlinear, but also due to the small time series lengths and the low model order. For example, the PDC correctly indicated the causal effects on simulated data from the coupled Rössler system for n = 50, 000 and model order, P = 200 (see [49]). The PSTE requires large data sets to have a good power, while it gives spurious couplings at many cases. Though the PSTE performed overall worst in the simulation study, there are other settings in which it can be useful, e.g., in the presence of outliers or non-stationarity in mean, as slow drifts do not have a direct effect on the ranks. The addition of noise does not seem to affect the causality measures, CGCI, PGCI, PSTE and PMIME.
The free parameters were not optimized separately for each measure. For all systems, the parameters of model order, P , embedding dimension, m, and maximum lag, L max , were treated as one free parameter, the values of which were selected according to the complexity of each system and the standard criteria of AIC and BIC. The linear measures tend to be less sensitive to changes on this free parameter than the nonlinear ones. The PTE gave more consistent results than the PSTE for varying m, whereas the PMIME was not dependent on L max . For the nonlinear measures and the continuous-time system (three coupled Lorenz systems), we considered also the causalities at more than one step ahead, and the PTE, PSTE and PMIME were found to be sensitive to the selection of the steps ahead.
A point of concern regarding all direct causality measures, but the PMIME, is that the size of the significance test was high in many settings. This was observed for both types of spurious direct causal effect, i.e., when there is indirect coupling or when there is no causal effect. In many cases of nonexisting direct causalities, although the observed test size was large, the estimated values of the measure were low compared to those in the presence of direct causalities. This raises also the question of the validity of the significance tests. The randomization test used time-shifted surrogates. Although it is simple and straightforward to implement, it may not always be sufficient, and further investigation for other randomization techniques is due for future work.
In conclusion, we considered six of the best-known measures of direct causality and studied their performance for different systems, time series lengths and free parameters. The worst performance was observed for the PSTE, since it completely failed in the case of the linear coupled systems, while for nonlinear systems, it required large data sets. The other measures scored differently in terms of sensitivity and specificity in the different settings. The CGCI, PGCI and PDC outperformed the nonlinear ones in the case of the linear coupled simulation systems, while in the presence of exogenous and latent variables, the PGCI seems to be the most effective one. The PMIME seems to have the best performance for nonlinear and noisy systems, while always obtaining the highest specificity, indicating no spurious effects. It is the intention of the authors to pursue the comparative study on selected real applications.