Open Access
*Entropy*
**2019**,
*21*(6),
610;
https://doi.org/10.3390/e21060610

Article

Kernel Methods for Nonlinear Connectivity Detection

^{1} Instituto de Astronomia, Geofísica e Ciências Atmosféricas, Department of Atmospheric Sciences, University of São Paulo, São Paulo 05508-090, Brazil

^{2} Escola Politécnica, Department of Telecommunications and Control Engineering, University of São Paulo, São Paulo 05508-900, Brazil

^{*} Author to whom correspondence should be addressed.

Received: 1 May 2019 / Accepted: 15 June 2019 / Published: 20 June 2019

## Abstract

In this paper, we show that the presence of nonlinear coupling between time series may be detected using kernel feature space $\mathbb{F}$ representations while dispensing with the need to solve the pre-image problem to gauge model adequacy. This is done by showing that the kernelized auto/cross sequences in $\mathbb{F}$ can be computed from the model rather than from prediction residuals in the original data space $\mathbb{X}$. Furthermore, this allows for reducing the connectivity inference problem to that of fitting a consistent linear model in $\mathbb{F}$ that works even in the case of nonlinear interactions in the $\mathbb{X}$-space, which ordinary linear models may fail to capture. We further show that the resulting $\mathbb{F}$-space parameter asymptotics provide reliable means of model diagnostics in this space and yield straightforward Granger connectivity inference tools, even for relatively short time series records, as opposed to other kernel based methods available in the literature.

Keywords: nonlinear time series; nonlinear Granger causality; inference

## 1. Introduction

Describing ‘connectivity’ has become of paramount interest in many areas of investigation that involve interacting systems. Physiology, climatology, and economics are three good examples where dynamical evolution modelling is often hindered as system manipulation may be difficult or unethical. Consequently, interaction inference is frequently constrained to using time observations alone.

A number of investigative approaches have been put forward [1,2,3,4,5]. However, the most popular and traditional one is still the nonparametric computation of cross-correlation (CC) between pairs of time series, and variants thereof such as coherence analysis [6], despite their many shortcomings [7].

When it comes to connectivity analysis, recent times have seen the rise of Granger Causality (GC) as a unifying concept. This is mostly due to GC’s nonreciprocal character [8] (as opposed to CC), which allows for establishing the direction of information flow between component subsystems.

Most GC approaches rest on fitting parametric models to time series data. Again as opposed to CC, under appropriate conceptualization, GC also holds for more than just pairs of time series, giving rise to the ideas of (a) Granger connectivity and (b) Granger influentiability [9].

GC inferential methodology is dominated by the use of linear multivariate time series models [10]. This is so because linear models have statistical properties (and shortcomings) that are well understood besides having the advantage of sufficing when the data are Gaussian. As an added advantage GC characterization allows immediate frequency domain connectivity characterization via concepts like ‘directed coherence’ (DC) and ‘partial directed coherence’ (PDC) [11].

It is often the case, however, that data Gaussianity does not hold. Whereas nonparametric approaches do exist [1,4,5], parametric nonlinear modelling offers little relief from the need for long observation data sets for reliable estimation, in sharp contrast to linear models, which perform well under the typical scenario of fairly short datasets over which natural phenomena can be considered stable. A case in point is neural data, where animal behaviour changes are associated with relatively short-lived episodic signal modifications.

The motivation for the present development is that reproducing kernel transformations applied to data, as in the support vector machine learning classification case [12], can effectively produce estimates that inherit many of the good convergence properties of linear methods. Because these properties carry over under proper kernelization, it is possible to show that nonlinear links between subsystems can be rigorously detected.

Before proceeding, it is important to keep in mind that the present developments focus exclusively on the connectivity detection issue, for which we clearly show that solving the so-called pre-image reconstruction problem, until now assumed essential, is in fact unnecessary. This leads to a considerably simpler approach.

## 2. Problem Formulation

The most popular approach to investigating GC connectivity is through modeling multivariate time series via linear vector autoregressive models [10], where the central idea is to compare prediction effectiveness for a time series ${x}_{i}\left(n\right)$ when the past of other time series is taken into account in addition to its own past. Namely,

$$\mathbf{x}\left(n\right)=\sum _{k=1}^{p}{\mathbf{A}}_{k}\mathbf{x}(n-k)+\mathbf{w}\left(n\right).$$

Under mild conditions, Equation (1) constitutes a valid representation of a linear stationary stochastic process where the evolution of $\mathbf{x}\left(n\right)={[{x}_{1}\left(n\right),\cdots ,{x}_{D}\left(n\right)]}^{\top}$ is obtained by filtering suitable purely stochastic innovation processes $\mathbf{w}\left(n\right)={[{w}_{1}\left(n\right),\cdots ,{w}_{D}\left(n\right)]}^{\top}$, i.e., where ${w}_{i}\left(n\right)$ and ${w}_{j}\left(m\right)$ are independent provided $n\ne m$ [13]. If the ${w}_{i}\left(n\right)$ are jointly Gaussian, so are the ${x}_{i}\left(n\right)$, and the problem of characterizing connectivity reduces to well-known procedures for estimating the ${\mathbf{A}}_{k}$ parameters in Equation (1) via least squares, which is the applicable maximum likelihood procedure. Nongaussian ${w}_{i}\left(n\right)$ translate into nongaussian ${x}_{i}\left(n\right)$ even if the linear generation mechanism (1) actually holds. Linearity among nongaussian ${x}_{i}\left(n\right)$ time series may be tested with the help of cross-polyspectra [14,15]; if linearity is not rejected, a representation like (1) still holds, whose optimal estimation requires a suitable likelihood function to accommodate the observed non-Gaussianity.

If linearity is rejected, ${x}_{i}\left(n\right)$ non-Gaussianity is a sign of nonlinear generation mechanisms modelled by

$$\mathbf{x}\left(n\right)=\mathbf{g}(\mathbf{x}\left({n}_{-}\right),\mathbf{w}\left(n\right)),$$

which generalizes (1), where $\mathbf{x}\left({n}_{-}\right)$ stands for $\mathbf{x}\left(n\right)$’s past under some suitable dynamical law $\mathbf{g}(\cdot)$.

The distinction between (a) nonlinear ${x}_{i}\left(n\right)$ that are nonetheless linearly coupled as in (1) under nongaussian $\mathbf{w}\left(n\right)$ and (b) fully nonlinearly coupled processes is often overlooked. In the former case, linear methods suffice for connectivity detection [16], but they fail in the latter case [17], calling for the adoption of alternative approaches; indeed, linear approximations may be inadequate to the point of precluding connectivity detection altogether [17].

In the present context, the solution to the connectivity problem entails a suitable data-driven approximation of $\mathbf{g}(\cdot)$ whilst singling out the ${x}_{i}\left(n\right)$ and ${x}_{j}\left(n\right)$ of interest. To do so, we examine the employment of kernel methods [18], where functional characterization is carried out with the help of a high dimensional space representation

$$\mathbf{\varphi}:\mathbb{X}\to \mathbb{F},$$

for $F=\mathrm{dim}\left(\mathbb{F}\right)\gg D=\mathrm{dim}\left(\mathbb{X}\right)$, where $\mathbf{\varphi}\left(\mathbf{x}\left(n\right)\right)$ is a mapping from the input space $\mathbb{X}$ into the feature space $\mathbb{F}$ whose role is to properly unwrap the data and yet ensure that the inner product $\langle \mathbf{\varphi}\left(\mathbf{x}\right)|\mathbf{\varphi}\left(\mathbf{y}\right)\rangle$ can be written as a simple function of $\mathbf{x}$ and $\mathbf{y}$, dispensing with the need for computations in $\mathbb{F}$. This possibility is granted by choosing $\mathbf{\varphi}\left(\mathbf{x}\right)$ to satisfy the so-called Mercer condition [19].

A simple example of (3) is the mapping

$$\varphi :x\mapsto \langle \varphi \left(x\right)| = {[c,\sqrt{2c}\,x,{x}^{2}]}^{\top},$$

for $x\in \mathbb{R}$ and $\langle \varphi \left(x\right)|\in \mathbb{F}$, using Dirac’s bra-ket notation. In this case, the Mercer kernel is given by

$$\kappa (x,y)=\langle \varphi \left(x\right)|\varphi \left(y\right)\rangle ={(c+xy)}^{2},$$

which is the simplest example of a polynomial kernel [18].
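As a quick numerical sanity check of (4) and (5), the explicit feature map and the kernel shortcut can be compared directly; the sketch below assumes $c=1$:

```python
import numpy as np

def phi(x, c=1.0):
    # Explicit feature map of Equation (4) for the degree-2 polynomial kernel
    return np.array([c, np.sqrt(2 * c) * x, x ** 2])

def kappa(x, y, c=1.0):
    # Mercer kernel of Equation (5), computed directly in the input space
    return (c + x * y) ** 2

# The inner product in feature space matches the kernel evaluation
x, y = 0.7, -1.3
assert np.isclose(phi(x) @ phi(y), kappa(x, y))
```

The Mercer condition grants precisely this equivalence: the kernel evaluation never requires forming the $\mathbb{F}$-space vectors explicitly.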

In the multivariate time series case, we consider

$$\mathbf{\varphi}:\mathbf{x}\left(n\right)\mapsto [\langle {\varphi}_{1}\left({x}_{1}\left(n\right)\right)|,\cdots ,\langle {\varphi}_{i}\left({x}_{i}\left(n\right)\right)|,\cdots ,\langle {\varphi}_{D}\left({x}_{D}\left(n\right)\right)|]^{\top},$$

where, for simplicity, we adopt the same transformation $\varphi (\cdot)={\varphi}_{i}(\cdot)={\varphi}_{j}(\cdot)$ for each ${x}_{i}\left(n\right)\in \mathbb{R}$ time series component, so that

$$\langle \mathbf{\varphi}\left(\mathbf{x}\left(n\right)\right)|\mathbf{\varphi}\left(\mathbf{x}\left(m\right)\right)\rangle =\mathbf{K}(\mathbf{x}\left(n\right),\mathbf{x}\left(m\right))$$

is a matrix whose elements are given by ${K}_{ij}(n,m)=\langle \varphi \left({x}_{i}\left(n\right)\right)|\varphi \left({x}_{j}\left(m\right)\right)\rangle $. In the development below, we follow the standard practice of denoting the $\mathbf{K}(\mathbf{x}\left(n\right),\mathbf{x}\left(m\right))$ quantities as $\mathbf{K}(m-n)$ in view of the assumed stationarity of the processes under study.

Rather than going straight into the statement of the general theory, a simple example is more enlightening. In this sense, consider a bivariate stationary time series

$$\begin{array}{ccc}\hfill {x}_{1}\left(n\right)& =& {g}_{1}({x}_{1}(n-1),{w}_{1}\left(n\right)),\hfill \end{array}$$

$$\begin{array}{ccc}\hfill {x}_{2}\left(n\right)& =& {g}_{2}({x}_{1}(n-1),{x}_{2}(n-1),{w}_{2}\left(n\right)),\hfill \end{array}$$

where the ${g}_{i}(\cdot)$ are nonlinear functions and only the previous instant is relevant in producing the present behaviour. An additional feature, through (9), is that ${x}_{1}\left(n\right)$ is connected to (Granger causes) ${x}_{2}\left(n\right)$ but not conversely. Application of the kernel transformation leads to

$$\begin{array}{ccc}\hfill \langle \varphi \left({x}_{1}\left(n\right)\right)|& =& \langle \varphi \left({g}_{1}({x}_{1}(n-1),{w}_{1}\left(n\right))\right)|,\hfill \end{array}$$

$$\begin{array}{ccc}\hfill \langle \varphi \left({x}_{2}\left(n\right)\right)|& =& \langle \varphi \left({g}_{2}({x}_{1}(n-1),{x}_{2}(n-1),{w}_{2}\left(n\right))\right)|.\hfill \end{array}$$

However, if one assumes the possibility of a linear approximation in $\mathbb{F}$, one may write

$$\left[\begin{array}{c}\langle \varphi \left({x}_{1}\left(n\right)\right)|\\ \langle \varphi \left({x}_{2}\left(n\right)\right)|\end{array}\right]=\left[\begin{array}{cc}{\alpha}_{11}& {\alpha}_{12}\\ {\alpha}_{21}& {\alpha}_{22}\end{array}\right]\left[\begin{array}{c}\langle \varphi \left({x}_{1}(n-1)\right)|\\ \langle \varphi \left({x}_{2}(n-1)\right)|\end{array}\right]+\left[\begin{array}{c}\langle {\tilde{w}}_{1}\left(n\right)|\\ \langle {\tilde{w}}_{2}\left(n\right)|\end{array}\right],$$

where ${[\langle {\tilde{w}}_{1}\left(n\right)|\;\langle {\tilde{w}}_{2}\left(n\right)|]}^{\top}$ stands for approximation errors in the form of innovations. Mercer kernel theory allows for taking the external product with respect to ${[|\varphi \left({x}_{1}(n-1)\right)\rangle \;|\varphi \left({x}_{2}(n-1)\right)\rangle ]}^{\top}$ on both sides of (12), leading, after taking expectations on both sides, to

$$\mathbf{K}(\mathbf{x}\left(n\right),\mathbf{x}(n-1))=\mathbf{A}\,\mathbf{K}(\mathbf{x}(n-1),\mathbf{x}(n-1)),$$

where

$$\mathbf{A}=\left[\begin{array}{cc}{\alpha}_{11}& {\alpha}_{12}\\ {\alpha}_{21}& {\alpha}_{22}\end{array}\right]$$

and

$$\mathbf{K}(\mathbf{x}\left(n\right),\mathbf{x}\left(m\right))=\left[\begin{array}{cc}\mathbb{E}\left[\langle \varphi \left({x}_{1}\left(n\right)\right)|\varphi \left({x}_{1}\left(m\right)\right)\rangle \right]& \mathbb{E}\left[\langle \varphi \left({x}_{1}\left(n\right)\right)|\varphi \left({x}_{2}\left(m\right)\right)\rangle \right]\\ \mathbb{E}\left[\langle \varphi \left({x}_{2}\left(n\right)\right)|\varphi \left({x}_{1}\left(m\right)\right)\rangle \right]& \mathbb{E}\left[\langle \varphi \left({x}_{2}\left(n\right)\right)|\varphi \left({x}_{2}\left(m\right)\right)\rangle \right]\end{array}\right],$$

since $\mathbb{E}\left[\langle {\tilde{w}}_{i}\left(n\right)|\varphi \left({x}_{j}\left(m\right)\right)\rangle \right]=0$ for $n>m$, given that $\langle {\tilde{w}}_{i}\left(n\right)|$ plays a zero mean innovations role.

It is easy to obtain $\mathbf{A}$ from sample kernel estimates. Furthermore, it is clear that (8) holds if and only if ${\alpha}_{12}=0$.

Equation (13) plays the role of the Yule–Walker equations and can be written more simply as

$$\mathbf{K}(-1)=\mathbf{A}\mathbf{K}\left(0\right),$$

to which one may add the following equation for computing the innovations covariance,

$${\mathsf{\Sigma}}_{\langle \tilde{\mathbf{w}}\left(n\right)|}=\mathbf{K}(\mathbf{x}\left(n\right),\mathbf{x}\left(n\right))-\mathbf{A}\mathbf{K}(\mathbf{x}\left(n\right),\mathbf{x}\left(n\right)){\mathbf{A}}^{\top},$$

where only the $m-n$ difference is explicitly denoted under the assumed signal stationarity, so that (17) simplifies to

$${\mathsf{\Sigma}}_{\langle \tilde{\mathbf{w}}\left(n\right)|}=\mathbf{K}\left(0\right)-\mathbf{A}\mathbf{K}\left(0\right){\mathbf{A}}^{\top}=\mathbf{K}\left(0\right)-\mathbf{K}(-1){\mathbf{A}}^{\top}.$$
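For the bivariate first order case, (16)–(18) can be solved directly from sample kernel matrices; a minimal sketch, in which the numerical values are purely hypothetical:

```python
import numpy as np

def fit_kernel_var1(K0, Km1):
    """Solve the F-space Yule-Walker relation K(-1) = A K(0) for A and
    compute the innovations covariance Sigma = K(0) - K(-1) A^T, as in
    Equations (16)-(18); K0 and Km1 stand for sample kernel matrices."""
    A = Km1 @ np.linalg.inv(K0)
    Sigma = K0 - Km1 @ A.T
    return A, Sigma

# Purely hypothetical sample kernel matrices for a bivariate series
K0 = np.array([[2.0, 0.3], [0.3, 1.0]])
Km1 = np.array([[0.8, 0.1], [0.0, 0.5]])
A, Sigma = fit_kernel_var1(K0, Km1)
assert np.allclose(A @ K0, Km1)  # A satisfies K(-1) = A K(0)
```

In practice, the inversion is better replaced by a numerically stabler solver, but the structure of the computation is exactly that of the classical Yule–Walker solution.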

This formulation is easy to generalize to model orders $p>1$ and to more time series via

$$\langle \mathbf{\varphi}\left(\mathbf{x}\left(n\right)\right)|=\sum _{k=1}^{p}{\mathbf{A}}_{k}\langle \mathbf{\varphi}\left(\mathbf{x}(n-k)\right)|+\langle \tilde{\mathbf{w}}\left(n\right)|,$$

where

$$\langle \mathbf{\varphi}\left(\mathbf{x}\left(n\right)\right)|=[\langle \varphi \left({x}_{1}\left(n\right)\right)|,\cdots ,\langle \varphi \left({x}_{D}\left(n\right)\right)|]^{\top},$$

which is assumed as due to filtering appropriately modelled innovations $\langle \tilde{\mathbf{w}}\left(n\right)|$. For the present formulation, one must also consider the associated ‘ket’-vector

$$|\mathbf{\varphi}\left(\mathbf{x}\left(m\right)\right)\rangle =[|\varphi \left({x}_{1}\left(m\right)\right)\rangle ,\cdots ,|\varphi \left({x}_{D}\left(m\right)\right)\rangle ]^{\top},$$

which, when applied to (19) for $n>m$ after taking expectations $\mathbb{E}[\cdot]$ under the zero mean innovations nature of $\langle \tilde{\mathbf{w}}\left(n\right)|$, leads to

$${\mathbf{K}}_{\mathbf{x}}^{\varphi}\left(l\right)=\sum _{k=1}^{p}{\mathbf{A}}_{k}{\mathbf{K}}_{\mathbf{x}}^{\varphi}(l+k),$$

where $l=m-n$ and ${\mathbf{K}}_{\mathbf{x}}^{\varphi}\left(m\right)$’s elements are given by $\mathbb{E}\left[\langle \varphi \left({x}_{i}(l-m)\right)|\varphi \left({x}_{j}\left(l\right)\right)\rangle \right]$, so that (22) constitutes a generalization of the Yule–Walker equations. By making $l=m-n=-1,\cdots ,-p$, one may reframe (22) in matrix form as

$${\overline{\mathbf{\kappa}}}_{p}=\left[{\mathbf{K}}_{\mathbf{x}}^{\varphi}(-1)\;\cdots \;{\mathbf{K}}_{\mathbf{x}}^{\varphi}(-p)\right]=\left[{\mathbf{A}}_{1}\;\cdots \;{\mathbf{A}}_{p}\right]\left[\begin{array}{cccc}{\mathbf{K}}_{\mathbf{x}}^{\varphi}\left(0\right)& {\mathbf{K}}_{\mathbf{x}}^{\varphi}(-1)& \cdots & {\mathbf{K}}_{\mathbf{x}}^{\varphi}(-p+1)\\ {\mathbf{K}}_{\mathbf{x}}^{\varphi}\left(1\right)& {\mathbf{K}}_{\mathbf{x}}^{\varphi}\left(0\right)& \cdots & {\mathbf{K}}_{\mathbf{x}}^{\varphi}(-p+2)\\ \vdots & \vdots & \ddots & \vdots \\ {\mathbf{K}}_{\mathbf{x}}^{\varphi}(p-1)& {\mathbf{K}}_{\mathbf{x}}^{\varphi}(p-2)& \cdots & {\mathbf{K}}_{\mathbf{x}}^{\varphi}\left(0\right)\end{array}\right]=\mathcal{A}{\mathcal{K}}_{p}\left(0\right),$$

where ${\mathcal{K}}_{p}\left(0\right)$ is a block Toeplitz matrix containing p Toeplitz blocks. Equation (23) provides $p{D}^{2}$ equations for the same number of unknown parameters in $\mathcal{A}$.

The high model order counterpart to (17) is given by

$${\mathsf{\Sigma}}_{\langle \tilde{\mathbf{w}}\left(n\right)|}={\mathbf{K}}_{\mathbf{x}}^{\varphi}\left(0\right)-\sum _{k=1}^{p}\sum _{l=1}^{p}{\mathbf{A}}_{k}{\mathbf{K}}_{\mathbf{x}}^{\varphi}(k-l){\mathbf{A}}_{l}^{\top}={\mathbf{K}}_{\mathbf{x}}^{\varphi}\left(0\right)-\mathcal{A}{\mathcal{K}}_{p}\left(0\right){\mathcal{A}}^{\top}.$$

It is not difficult to see that the more usual Yule–Walker complete equation form becomes

$$[\mathbf{I}\phantom{\rule{0.277778em}{0ex}}-\mathcal{A}]\phantom{\rule{0.277778em}{0ex}}{\mathcal{K}}_{p+1}\left(0\right)=\left[\begin{array}{c}{\mathsf{\Sigma}}_{\langle \tilde{\mathbf{w}}\left(n\right)|}\\ \mathbf{0}\end{array}\right].$$

There are a variety of ways for solving for the parameters. A simple one is to define $\mathbf{a}=\mathrm{vec}\left(\mathcal{A}\right)$ leading to

$$\mathrm{vec}\left({\overline{\mathbf{\kappa}}}_{p}\right)=({\mathcal{K}}_{p}^{\top}\left(0\right)\otimes \mathbf{I})\phantom{\rule{0.277778em}{0ex}}\mathbf{a}.$$
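The identity behind (26) is the standard relation $\mathrm{vec}(\mathcal{A}\mathcal{K})=({\mathcal{K}}^{\top}\otimes \mathbf{I})\,\mathrm{vec}(\mathcal{A})$, which a short numerical sketch can confirm; the matrices below are arbitrary stand-ins for $\mathcal{A}$ and ${\mathcal{K}}_{p}(0)$:

```python
import numpy as np

# Numerical check of the vectorized form (26): for A K = B,
# vec(B) = (K^T kron I) vec(A), using column-major (Fortran) vec.
rng = np.random.default_rng(0)
A = rng.standard_normal((2, 4))   # stands in for [A_1 A_2] with D = 2, p = 2
K = rng.standard_normal((4, 4))   # stands in for K_p(0)
B = A @ K

a = A.flatten(order="F")          # vec(A)
lhs = B.flatten(order="F")        # vec(A K)
rhs = np.kron(K.T, np.eye(2)) @ a
assert np.allclose(lhs, rhs)
```

Note that the column-major stacking convention must match the one used when building the Kronecker factor, otherwise the coefficient ordering in $\mathbf{a}$ is scrambled.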

Even though one may employ least-squares methods to solve either (26) or (23), a Total-Least-Squares (TLS) approach [20] has proven to be a better choice, since both sides of the equations are affected by estimation inaccuracies that are better dealt with using TLS.

Likewise, (24) can be used in conjunction with generalizations of model order criteria of Akaike’s AIC type,

$${}_{\mathrm{g}}\mathrm{AIC}\left(k\right)=\mathrm{ln}(\mathrm{det}\left({\mathsf{\Sigma}}_{\langle \tilde{\mathbf{w}}\left(n\right)|}\right))+{\displaystyle \frac{{c}_{{n}_{s}}}{{n}_{s}}}k{D}^{2},$$

where ${n}_{s}$ stands for the number of available time observations. In generalizing Akaike’s criterion to the multivariate case, ${c}_{{n}_{s}}=2$, whereas ${c}_{{n}_{s}}=\mathrm{ln}\left(\mathrm{ln}\left({n}_{s}\right)\right)$ for the Hannan–Quinn criterion, our choice in this paper.
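The criterion in (27) is straightforward to compute from the residual covariance; a sketch in which the function name and the toy covariance are illustrative only:

```python
import numpy as np

def g_aic(Sigma, k, D, n_s, criterion="hq"):
    """Generalized order criterion of Equation (27):
    ln det(Sigma) + (c_ns / n_s) k D^2, with c_ns = 2 for the AIC-type
    choice and c_ns = ln(ln(n_s)) for Hannan-Quinn (the paper's choice)."""
    c = 2.0 if criterion == "aic" else np.log(np.log(n_s))
    _, logdet = np.linalg.slogdet(Sigma)  # stable log-determinant
    return logdet + (c / n_s) * k * D ** 2

# Toy residual covariance for a bivariate (D = 2) model of order k = 1
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])
score = g_aic(Sigma, k=1, D=2, n_s=512)
```

In use, one would evaluate the criterion over candidate orders $k$ and keep the minimizer.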

Thus far, we have described procedures for choosing model order in the $\mathbb{F}$ space. In ordinary time series analysis, in addition to model order identification, one must also perform proper model diagnostics. This entails checking for residual whiteness among other things. This is usually done by checking the residual auto/crosscorrelation functions for their conformity to a white noise hypothesis.

In the present formulation, because we do not explicitly compute the $\mathbb{F}$ space series, we must resort to means other than computing the latter correlation functions from the residual data as usual. However, using the same ideas behind the computation of (24), one may obtain estimates of the innovation cross-correlation in the feature space at various lags as

$${\mathsf{\Sigma}}_{\langle \tilde{\mathbf{w}}\left(n\right)|\tilde{\mathbf{w}}\left(m\right)\rangle}={\mathsf{\Sigma}}_{\tilde{\mathbf{w}}}(m-n)={\mathbf{K}}_{\mathbf{x}}^{\varphi}(m-n)-\sum _{k=1}^{p}\sum _{l=1}^{p}{\mathbf{A}}_{k}{\mathbf{K}}_{\mathbf{x}}^{\varphi}(m-n+k-l){\mathbf{A}}_{l}^{\top},$$

by replacing the ${\mathbf{K}}_{\mathbf{x}}^{\varphi}(m-n+k-l)$ by their estimates and using the ${\mathbf{A}}_{k}$ obtained by solving (22), for $m-n$ between a minimum lag $-L$ and a maximum lag $+L$. The usefulness of (28) is to provide a means of testing model accuracy and quality as a function of the choice of $\varphi $ under the best model order provided by the model order criterion.

We next define a suitably normalized estimated lagged kernel correlation function (KCF),

$${\mathrm{KCF}}_{ij}\left(\tau \right)={\displaystyle \frac{{K}_{ij}\left(\tau \right)}{\sqrt{{K}_{ii}\left(0\right){K}_{jj}\left(0\right)}}},$$

which, given the inner product nature of the kernel definition, satisfies the condition

$$|{\mathrm{KCF}}_{ij}\left(\tau \right)|\le 1,$$

as is easily proved using the Cauchy–Schwarz inequality.
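A minimal sketch of (29) for a pair of scalar series, assuming the quadratic kernel $\kappa (u,v)={(uv)}^{2}$ and sample averages in place of expectations; the bound (30) then holds exactly by Cauchy–Schwarz:

```python
import numpy as np

def kernel_corr(xi, xj, tau, kappa=lambda u, v: (u * v) ** 2):
    """Sample lagged kernel value K_ij(tau), normalized as in the KCF
    definition of Equation (29); a sketch for scalar series xi and xj,
    with the quadratic kernel assumed for illustration."""
    lo, hi = max(0, -tau), len(xi) - max(0, tau)
    num = np.mean(kappa(xi[lo:hi], xj[lo + tau:hi + tau]))
    den = np.sqrt(np.mean(kappa(xi[lo:hi], xi[lo:hi]))
                  * np.mean(kappa(xj[lo + tau:hi + tau], xj[lo + tau:hi + tau])))
    return num / den

rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal(512), rng.standard_normal(512)
# Cauchy-Schwarz guarantees the bound of Equation (30)
assert abs(kernel_corr(x1, x2, tau=1)) <= 1.0
```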

The notion of $\mathrm{KCF}\left(\tau \right)$ applies not only to the original kernels but also in connection with the residual kernel values given by (28), which, for explicitness, we write as

$${\mathrm{KCF}}_{ij}^{\left(r\right)}\left(\tau \right)={\displaystyle \frac{{\mathsf{\Sigma}}_{ij}\left(\tau \right)}{\sqrt{{\mathsf{\Sigma}}_{ii}\left(0\right){\mathsf{\Sigma}}_{jj}\left(0\right)}}},$$

where the ${\mathsf{\Sigma}}_{ij}\left(\tau \right)$ are the matrix entries in (28).

In the numerical illustrations that follow, we have assumed that ${\mathrm{KCF}}_{ij}^{\left(r\right)}\left(\tau \right)\sim \mathcal{N}(0,1/{n}_{s})$ asymptotically under the white residual hypothesis

$${\mathcal{H}}_{0}:{\mathrm{KCF}}_{ij}^{\left(r\right)}\left(\tau \right)=0.$$

This choice turned out to be reasonably consistent in practice. Along the same line of reasoning, other familiar tests over residuals, such as the Portmanteau test [10], were also carried out and consistently allowed assessing residual whiteness.

One may say that the present theory follows closely the developments of ordinary second order moment theory with the added advantage that now nonlinear connections can be effectively captured by replacing second order moments by their respective lagged kernel estimates.

#### 2.1. Estimation and Asymptotic Considerations

The essential problem then becomes that of estimating the entries of ${\mathbf{K}}_{\mathbf{x}}^{\varphi}(n,m)$. They can be obtained by averaging kernel values computed over the available data,

$${K}_{ij}(n,m)={\displaystyle \frac{1}{{n}_{s}}}\sum _{s}\langle \varphi \left({x}_{i}(n-s)\right)|\varphi \left({x}_{j}(m-s)\right)\rangle ,$$

for nonzero terms in the $s\in [1,{n}_{s}]$ range.
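In code, the averaging in (33) amounts to a lag-indexed sample mean of kernel evaluations; a sketch for a $D$-channel record, again assuming the quadratic kernel for illustration:

```python
import numpy as np

def lagged_kernel_matrix(x, tau, kappa=lambda u, v: (u * v) ** 2):
    """Sample estimate of the D x D lagged kernel matrix K(tau) by
    averaging kernel evaluations over the record, in the spirit of
    Equation (33); x has shape (D, n_s), and the quadratic kernel
    kappa(u, v) = (u v)^2 is assumed for illustration."""
    D, n_s = x.shape
    lo, hi = max(0, -tau), n_s - max(0, tau)
    K = np.empty((D, D))
    for i in range(D):
        for j in range(D):
            K[i, j] = np.mean(kappa(x[i, lo:hi], x[j, lo + tau:hi + tau]))
    return K

# Estimates over a range of lags feed the kernel Yule-Walker system (23)
rng = np.random.default_rng(0)
data = rng.standard_normal((2, 256))
K_lags = {tau: lagged_kernel_matrix(data, tau) for tau in range(-2, 3)}
```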

Under these conditions, for an appropriately defined kernel function, the feature space becomes linearized and, following [21], it is fair to assume that the estimated vector stacked representation of the model coefficient matrices

$$\mathbf{a}=\mathrm{vec}\left([{\mathbf{A}}_{1}\cdots {\mathbf{A}}_{p}]\right)$$

is asymptotically Gaussian, i.e.,

$$\sqrt{{n}_{s}}(\widehat{\mathbf{a}}-\mathbf{a})\sim \mathcal{N}(\mathbf{0},{\mathsf{\Gamma}}^{-1}\otimes {\mathsf{\Sigma}}_{\langle \tilde{\mathbf{w}}\left(n\right)|}),$$

where ${\mathsf{\Sigma}}_{\langle \tilde{\mathbf{w}}\left(n\right)|}$ is the feature space residual matrix given by (28) and where

$$\mathsf{\Gamma}=\mathbb{E}\left[{\mathbf{y}}_{L}{\mathbf{y}}_{R}^{\top}\right]$$

for the ‘bra’-vector

$${\mathbf{y}}_{L}^{\top}=[\langle \mathbf{\varphi}\left(\mathbf{x}\left(n\right)\right)|,\cdots ,\langle \mathbf{\varphi}\left(\mathbf{x}(n-p+1)\right)|]^{\top}$$

and the ‘ket’-vector

$${\mathbf{y}}_{R}^{\top}=[|\mathbf{\varphi}\left(\mathbf{x}\left(n\right)\right)\rangle ,\cdots ,|\mathbf{\varphi}\left(\mathbf{x}(n-p+1)\right)\rangle ]^{\top},$$

which are used to construct the kernel scalar products. It is immediate to note that (36) is a Toeplitz matrix composed of suitably displaced ${\mathbf{K}}_{\mathbf{x}}^{\varphi}(\cdot)$ blocks.

An immediate consequence of (35) is that one may test for model coefficient nullity and thereby provide a kernel Granger causality test. This is equivalent to testing for ${a}_{ij}\left(k\right)=0$ via the statistic

$${}_{g}{\lambda}_{W}={\widehat{\mathbf{a}}}^{\top}{\mathbf{C}}^{\top}{\left[\mathbf{C}\left({\mathsf{\Gamma}}^{-1}\otimes {\mathsf{\Sigma}}_{\langle \tilde{\mathbf{w}}\left(n\right)|}\right){\mathbf{C}}^{\top}\right]}^{-1}\mathbf{C}\widehat{\mathbf{a}},$$

where $\mathbf{C}$ is a contrast matrix (or structure selection matrix), so that the null hypothesis becomes

$${\mathcal{H}}_{0}:\mathbf{C}\mathbf{a}=\mathbf{0}.$$

Hence, under (35),

$${}_{g}{\lambda}_{W}\stackrel{d}{\to}{\chi}_{\nu}^{2},$$

where $\nu =\mathrm{rank}\left(\mathbf{C}\right)$ corresponds to the number of constraints explicitly imposed on the ${a}_{ij}\left(k\right)$.
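The Wald test of (39)–(41) can be sketched as follows; the fitted quantities below are hypothetical placeholders, and the $\chi^2$ critical value for $\nu = 1$ at $\alpha = 1\%$ (about 6.63) is hard-coded for illustration:

```python
import numpy as np

def wald_statistic(a_hat, Gamma_inv, Sigma, C):
    """Wald statistic of Equation (39) for H0: C a = 0, using the
    asymptotic covariance Gamma^{-1} kron Sigma from Equation (35)."""
    V = np.kron(Gamma_inv, Sigma)
    Ca = C @ a_hat
    return float(Ca @ np.linalg.solve(C @ V @ C.T, Ca))

# Hypothetical bivariate p = 1 fit: test nullity of the a_12 coefficient
a_hat = np.array([0.8, 0.02, 0.05, 0.5])  # vec(A_1), column-major stacking
Gamma_inv = np.eye(2)
Sigma = np.eye(2)
C = np.array([[0.0, 0.0, 1.0, 0.0]])      # contrast selecting a_12
lam = wald_statistic(a_hat, Gamma_inv, Sigma, C)
reject = lam > 6.63  # chi-square(1) critical value at alpha = 1%
```

More rows in $\mathbf{C}$ test several coefficients jointly, with $\nu$ rising accordingly.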

#### Data Workflow

Given ${x}_{i}\left(n\right)$, analysis proceeds as follows:

- A feature space model of order p (starting with $p=1$) is fitted by solving (23), and its residual kernel sequences are computed via (28);
- If ${\mathrm{KCF}}_{ij}^{\left(r\right)}\left(\tau \right)$ analysis does not suggest feature space model residual whiteness, p is increased by 1, and the procedure is repeated from the first step until feature space model residual whiteness is obtained and ${}_{g}\mathrm{AIC}\left(k\right)$ attains its first local minimum, meaning that the ideal model order has been reached;
- Once the best model is attained, one employs (39) to infer connectivity.

These steps closely mirror those of ordinary time series model fitting and analysis.

## 3. Numerical Illustrations

The following examples consist of nonlinearly coupled systems that are simulated with the help of zero mean, unit variance, normal uncorrelated innovations ${w}_{i}\left(n\right)$. All simulations (10,000 realizations each) were preceded by an initial burn-in period of 10,000 data points to avoid transient phenomena. Estimation results are examined as a function of ${n}_{s}=\{32,64,128,256,512,1024,2048\}$ at the $\alpha =1\%$ significance level.

For brevity, Example 1 is carried out in full detail, whereas performance for the other examples is gauged mostly through the computation of observed detection rates, except for Examples 4 and 5, which also portray model order choice criterion behaviour.

Simulation results are displayed in terms of how true and false detection rates depend on realization length ${n}_{s}$.

#### 3.1. Example 1

Consider the simplest possible system whose connectivity cannot be captured by linear methods [17], as there is a unidirectional quadratic coupling from ${x}_{2}\left(n\right)$ to ${x}_{1}\left(n\right)$:

$$\left\{\begin{array}{cc}\hfill {x}_{1}\left(n\right)& =a{x}_{1}(n-1)+c{x}_{2}^{2}(n-1)+{w}_{1}\left(n\right),\hfill \\ \hfill {x}_{2}\left(n\right)& =b{x}_{2}(n-1)+{w}_{2}\left(n\right),\hfill \end{array}\right.$$

with $a=0.2$, $b=0.6$ and $c=0.7$.
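A simulation sketch of system (42), following the burn-in convention described above; the function name and seed are illustrative:

```python
import numpy as np

def simulate_example1(n_s, a=0.2, b=0.6, c=0.7, burn_in=10_000, seed=0):
    """Simulate the quadratically coupled system of Equation (42),
    discarding an initial burn-in stretch to avoid transients;
    the function name and seed are illustrative choices."""
    rng = np.random.default_rng(seed)
    n_tot = n_s + burn_in
    x = np.zeros((2, n_tot))
    w = rng.standard_normal((2, n_tot))
    for n in range(1, n_tot):
        x[0, n] = a * x[0, n - 1] + c * x[1, n - 1] ** 2 + w[0, n]
        x[1, n] = b * x[1, n - 1] + w[1, n]
    return x[:, burn_in:]

x42 = simulate_example1(512)
```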

An interesting aspect of this simple system is the possibility of easily relating its coefficients a, b and c to those in (14) that describe its $\mathbb{F}$ space evolution. This may be carried out explicitly after substituting (42) into the computed kernels of Equation (13). After a little algebra, this leads to

$$\left[\left[\begin{array}{cc}a& 0\\ 0& b\end{array}\right]-\left[\begin{array}{cc}{\alpha}_{11}& {\alpha}_{12}\\ {\alpha}_{21}& {\alpha}_{22}\end{array}\right]\right]=\left[\begin{array}{cc}c& 0\\ 0& 0\end{array}\right]\left[\begin{array}{cc}{\theta}_{11}& {\theta}_{12}\\ 0& 0\end{array}\right],$$

where ${\theta}_{11}$ and ${\theta}_{12}$ depend on the computed kernel values. From (43), it immediately follows, for example, that $b={\alpha}_{22}$ and, more importantly, that ${\alpha}_{21}=0$, as expected. Observing these theoretically determined values in the estimates also provides a means of gauging estimation accuracy.

For illustration’s sake, we write the kernel Yule–Walker Equations (22) with their respective solutions (${n}_{s}=512$) for one given (typical) realization:

$$\left[\begin{array}{cc}210.7583& 23.5416\\ 23.5416& 8.6450\end{array}\right]{\mathbf{A}}^{\left(2\right)}=\left[\begin{array}{cc}125.7501& 37.7803\\ 17.7389& 5.3788\end{array}\right]\to {\mathbf{A}}^{\left(2\right)}=\left[\begin{array}{cc}0.1559& 3.9456\\ 0.0211& 0.5648\end{array}\right],$$

for the quadratic kernel ($\kappa (x,y)={\left(xy\right)}^{2}$) and

$${10}^{5}\times \left[\begin{array}{cc}8.0302& 0.0868\\ 0.0868& 0.0052\end{array}\right]{\mathbf{A}}^{\left(4\right)}={10}^{5}\times \left[\begin{array}{cc}4.1597& 0.1755\\ 0.0594& 0.0025\end{array}\right]\to {\mathbf{A}}^{\left(4\right)}=\left[\begin{array}{cc}0.1843& 30.8922\\ 0.0027& 0.4386\end{array}\right],$$

for the quartic kernel ($\kappa (x,y)={\left(xy\right)}^{4}$). Superscripts indicate kernel order. One may readily notice approximate compliance with the expected ${\alpha}_{ij}$ coefficients.

Further appreciation of this example may be obtained via a plot of the normalized estimated $\mathrm{KCF}\left(\tau \right)$ (29), shown in Figure 1.

The residual normalized kernel sequences (31), computed using (28), are depicted in Figure 2 for each kernel and show an effective decrease below the null hypothesis decision threshold line, vindicating adequate modelling.

Moreover, for this realization, one may show that the Hannan–Quinn Information Criterion (27) points to the correct order of $p=1$. In addition, Portmanteau tests do not reject whiteness in the $\mathbb{F}$ space for either kernel further confirming successful modelling in both cases.

To illustrate and confirm the Gaussian asymptotic behaviour discussed in Section 2.1, normal probability plots for ${\widehat{a}}_{21}$ are presented in Figure 3. Further objective quantification of the convergence speed towards normality is provided by the evolution towards 1 of the Filliben squared-correlation coefficient [22,23,24] as a function of ${n}_{s}$ (Figure 4).

#### 3.2. Example 2

Consider ${x}_{1}\left(n\right)$, a highly resonant ($R=0.99$) linear oscillator (at a normalized frequency of $f=0.1$), unidirectionally coupled to a low pass system ${x}_{2}\left(n\right)$ through a delayed squared term:

$$\left\{\begin{array}{cc}\hfill {x}_{1}\left(n\right)& =2R\mathrm{cos}\left(2\pi f\right){x}_{1}(n-1)-{R}^{2}{x}_{1}(n-2)+{w}_{1}\left(n\right),\hfill \\ \hfill {x}_{2}\left(n\right)& =-0.9{x}_{2}(n-1)+c{x}_{1}^{2}(n-1)+{w}_{2}\left(n\right),\hfill \end{array}\right.$$

where $c=0.1$ [17].

#### 3.3. Example 3

The present example comes from a model in [27]:

$$\left\{\begin{array}{cc}\hfill {x}_{1}\left(n\right)& =3.4{x}_{1}(n-1){[1-{x}_{1}^{2}(n-1)]\mathrm{e}}^{-{x}_{1}^{2}(n-1)}+{w}_{1}\left(n\right),\hfill \\ \hfill {x}_{2}\left(n\right)& =3.4{x}_{2}(n-1){[1-{x}_{2}^{2}(n-1)]\mathrm{e}}^{-{x}_{2}^{2}(n-1)}+{c}_{1}{x}_{1}^{2}(n-1)+{w}_{2}\left(n\right),\hfill \\ \hfill {x}_{3}\left(n\right)& =3.4{x}_{3}(n-1){[1-{x}_{3}^{2}(n-1)]\mathrm{e}}^{-{x}_{3}^{2}(n-1)}+{c}_{2}{x}_{2}^{4}(n-1)+{w}_{3}\left(n\right).\hfill \end{array}\right.$$

This choice was dictated by the nonlinear wideband character of its signals. The values ${c}_{1}=0.7$ and ${c}_{2}=0.9$ were adopted.

Figure 7 shows that connection detectability improves as the signal duration ${n}_{s}$ increases, except for the nonexistent ${x}_{3}\left(n\right)\leftarrow {x}_{1}\left(n\right)$ connection, whose performance stays more or less constant with a false positive rate slightly above $\alpha =1\%$. All computations used quadratic kernels.

#### 3.4. Example 4

For this numerical illustration, consider the model presented in [28]

$$\left\{\begin{array}{cc}\hfill {x}_{1}\left(n\right)& =3.4{x}_{1}(n-1){[1-{x}_{1}^{2}(n-1)]\mathrm{e}}^{-{x}_{1}^{2}(n-1)}+0.8{x}_{1}(n-2)+{w}_{1}\left(n\right),\hfill \\ \hfill {x}_{2}\left(n\right)& =3.4{x}_{2}(n-1){[1-{x}_{2}^{2}(n-1)]\mathrm{e}}^{-{x}_{2}^{2}(n-1)}+0.5{x}_{2}(n-2)+c{x}_{1}^{2}(n-2)+{w}_{2}\left(n\right).\hfill \end{array}\right.$$

System (48) produces nonlinear wideband signals with a quadratic ($1\to 2$) coupling whose intensity is given by c, taken here as $0.5$.

It is worth noting that the kernelized Granger causality true positive rate improves as the sample size (${n}_{s}$) increases (Figure 8). Moreover, using the generalized Hannan–Quinn criterion, the order of the kernelized vector autoregressive model identified for a typical realization was correctly found to equal 2, as expected (see Figure 9).

#### 3.5. Example 5

As a last numerical illustration, consider data generated by the model below, with ${c}_{1}=0.9$ and ${c}_{2}=0.4$.

$$\left\{\begin{array}{ll}x_{1}(n)&=3.4\,x_{1}(n-3)\left[1-x_{1}^{2}(n-3)\right]\mathrm{e}^{-x_{1}^{2}(n-3)}+0.4\,x_{1}(n-4)+w_{1}(n),\\ x_{2}(n)&=3.4\,x_{2}(n-1)\left[1-x_{2}^{2}(n-1)\right]\mathrm{e}^{-x_{2}^{2}(n-1)}+c_{1}x_{1}^{2}(n-2)+w_{2}(n),\\ x_{3}(n)&=3.4\,x_{3}(n-2)\left[1-x_{3}^{2}(n-2)\right]\mathrm{e}^{-x_{3}^{2}(n-2)}+c_{2}x_{2}^{2}(n-3)+w_{3}(n).\end{array}\right.$$

Under the quadratic kernel and employing the kernelized Hannan–Quinn information criterion (27) (see Figure 10), the estimated model order is $p=3$, as expected from the ${x}_{2}^{2}(n-3)$ term in (49). In addition, kernelized Granger causality detectability improves as record length ${n}_{s}$ increases (Figure 11).

## 4. Conclusions and Future Work

After a brief theoretical presentation (Section 2), we have shown that canonical model fitting procedures involving (a) model specification with order determination and (b) explicit model diagnostic testing can be successfully carried out in the feature space $\mathbb{F}$ to detect connectivity via reproducing kernels. This stands in sharp contrast to kernel-based Granger causality detection as in [29,30], which depends on solving the reconstruction/pre-image problem to provide prediction error estimates in the original data space $\mathbb{X}$. In fact, part of the challenge in pre-image determination lies in its frequently associated numerical ill-conditioning [31].

The key result enabling model diagnostics and inference in $\mathbb{F}$ is (28), which follows from realizing that kernel quantities may be normalized much as correlation coefficients are. It should be noted that (28) holds even for (nonkernel) linear modelling if the $\mathbf{K}$ matrices are replaced by auto/crosscorrelation matrices; in practice, this is never adopted in classical linear time series modelling because the necessary auto/crosscorrelations are computed more efficiently from model residuals, which are easy to obtain since no pre-image problem is involved there.
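
One standard way kernel quantities can be normalized "much as correlation coefficients" is cosine-style normalization of a feature-space-centered Gram matrix. The sketch below illustrates that generic idea only; it is not the paper's exact construction behind (28), and the function name is ours.

```python
import numpy as np

def normalized_gram(X, kernel):
    """Centre a Gram matrix in feature space, then rescale its entries so they
    lie in [-1, 1] like correlation coefficients (cosine normalization)."""
    n = X.shape[0]
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    Kc = H @ K @ H                        # centred Gram matrix (still PSD)
    d = np.sqrt(np.clip(np.diag(Kc), 1e-12, None))
    return Kc / np.outer(d, d)            # unit diagonal, |entries| <= 1
```

By the Cauchy–Schwarz inequality in $\mathbb{F}$, the rescaled entries are bounded by 1 in magnitude, which is what makes correlation-style significance thresholds applicable.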

Thus, what importantly sets the present approach apart from previous work is that there is no need to return to the original input space $\mathbb{X}$ to gauge model quality: the reconstruction/pre-image problem is fully circumvented, thereby bypassing its unnecessary uncertainties.

As such, we showed that, because model adequacy testing can be performed directly in the feature space $\mathbb{F}$, directional Granger type connectivity can be detected for a variety of multivariate nonlinear coupling scenarios, thereby totally dispensing with the need for detailed ‘a priori’ model knowledge.

We observed that successful connectivity detection is achievable even with relatively short time series. A systematic comparison with other approaches [4,5,32,33,34,35] is planned for future work but, at least for the cases we tested so far, savings of at least one order of magnitude in record length appear feasible.

One of the basic tenets of the present work is that model coefficients in the feature space are asymptotically normal, a property whose consistency was successfully illustrated here, though the need for a more formal proof remains, especially in connection with explicit kernel estimates under the total-least-squares (TLS) solution to (23). Our choice of TLS was dictated by its apparent superiority over the ‘kernel trick’ [32], whose multivariate version we employed in [26,36,37].
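
For reference, the classical SVD-based total-least-squares solution (in the Golub–Van Loan sense) can be sketched as follows. This is a generic single-output version for ordinary matrices, not the paper's multivariate kernel implementation of (23); the function name is ours.

```python
import numpy as np

def tls(A, b):
    """Total-least-squares solution of A x ~ b via the SVD of the augmented
    matrix [A | b] (classical construction; errors allowed in both A and b)."""
    n = A.shape[1]
    C = np.hstack([A, b.reshape(-1, 1)])
    _, _, Vt = np.linalg.svd(C)
    v = Vt[-1]                 # right singular vector of the smallest singular value
    return -v[:n] / v[n]       # assumes v[n] != 0 (the generic, solvable case)
```

Unlike ordinary least squares, TLS attributes errors to the regressors as well, which is the usual motivation for preferring it when both sides of the regression are estimated quantities.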

In this context, it is important to note that, contrary to other methods that require time-consuming resampling procedures for adequate inference, the present approach relies on asymptotic statistics and is thus less susceptible to the problems that data shuffling may introduce.

One of the advantages of the present development is that the procedure allows for determining how far in the past to look via the model order criteria we employed (27).

Even though order estimation and model testing were successful upon borrowing from the usual linear modelling practices, further systematic examination is still needed and is under way.

One may rightfully argue that the kernels we chose for illustrating the present work are equivalent to modelling the original time series after applying a suitable $\varphi \left(\cdot \right)$ transformation to the data, and that they look for causality evidence present in higher-order moments. This, in fact, explains why quadratic kernels converge much faster than quartic ones in Example 1. The merit of framing the time series transformation discussion for connectivity detection in terms of kernels is that it produces a simple workflow and paves the way to developing future data-driven criteria for choosing the optimum data transformation for a given problem. Other kernel choices are being investigated.
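
The equivalence mentioned above is easy to make concrete for the homogeneous quadratic kernel: $k(u,v)={(u^{\top}v)}^{2}$ equals the Euclidean inner product of the degree-2 monomial features $\varphi(u)$. A minimal sketch (the function name is ours):

```python
import numpy as np

def phi_quadratic(u):
    """Explicit feature map for the homogeneous quadratic kernel
    k(u, v) = (u . v)**2: squared terms plus sqrt(2)-scaled cross terms."""
    d = len(u)
    squares = [u[i] * u[i] for i in range(d)]
    crosses = [np.sqrt(2.0) * u[i] * u[j]
               for i in range(d) for j in range(i + 1, d)]
    return np.array(squares + crosses)
```

With this map, $k(u,v)=\varphi(u)^{\top}\varphi(v)$ exactly; the quartic kernel corresponds analogously to degree-4 monomials, whose larger feature dimension is consistent with its slower convergence in Example 1.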

The signal model used in the present development does not include additive measurement noise, whose impact on connectivity detection we also leave for future examination.

One thing the present type of analysis cannot do is expose details of how the nonlinearity takes place; for example, coupling may be quadratic, involve higher exponential powers, or take some other functional form. What the present approach can do, however, is expose existing connections, so that modelling efforts can be concentrated on them, thereby avoiding wasting model parameters on irrelevant links.

Finally, the present systematic empirical investigation places the proposal of using feature space-frequency domain descriptions of connectivity, like kernel partial directed coherence [26,36] and the kernel directed transfer function [37], on a sound footing, especially with respect to their asymptotic connectivity behaviour.

## Author Contributions

All authors conceptualized, curated the data, did the formal analysis, acquired the funding, did the investigation, obtained the resources, analyzed and interpreted the data, created the software used in the work, did the supervision, validation, visualization, drafted the work, and revised it.

## Funding

L.M. was funded in part by an institutional Ph.D. CAPES Grant at Escola Politécnica, University of São Paulo, São Paulo, Brazil, and by CAPES, grant number 88887.161474/2017-00 (PALEOCEANO Project) and by FAPESP, grant number 2015/50686-1 (PACMEDY Project), both at Instituto de Astronomia, Geofísica e Ciências Atmosféricas, University of São Paulo, São Paulo, Brazil. L.A.B. was funded by CNPq, grant number 308073/2017-7.

## Conflicts of Interest

The authors declare no conflict of interest.

## References

- Schumacker, R.E.; Lomax, R.G. A Beginner’S Guide to Structural Equation Modeling, 4th ed.; Taylor & Francis: New York, NY, USA, 2016. [Google Scholar]
- Applebaum, D. Probability and Information: An Integrated Approach, 2nd ed.; Cambridge University Press: Cambridge, UK; New York, NY, USA, 2008; p. 273. [Google Scholar] [CrossRef]
- Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley Series in Telecommunications; Wiley: New York, NY, USA, 2006; p. 774. [Google Scholar]
- Hlaváčková-Schindler, K.; Paluš, M.; Vejmelka, M.; Bhattacharya, J. Causality detection based on information-theoretic approaches in time series analysis. Phys. Rep. Rev. Sect. Phys. Lett.
**2007**, 441, 1–46. [Google Scholar] [CrossRef] - Schreiber, T. Measuring information transfer. Phys. Rev. Lett.
**2000**, 85, 461–464. [Google Scholar] [CrossRef] [PubMed] - Bendat, J.S.; Piersol, A.G. Engineering Applications of Correlation and Spectral Analysis; Wiley: New York, NY, USA; Chichester, UK, 1980. [Google Scholar]
- Baccalá, L.A.; Sameshima, K. Overcoming the limitations of correlation analysis for many simultaneously processed neural structures. Prog. Brain Res.
**2001**, 130, 33–47. [Google Scholar] [CrossRef] [PubMed] - Granger, C.W.J. Investigating causal relations by econometric models and cross-spectral methods. Econometrica
**1969**, 37, 424–438. [Google Scholar] [CrossRef] - Baccalá, L.A.; Sameshima, K. Multivariate time series brain connectivity: A sum up. Methods in Brain Connectivity Inference Through Multivariate Time Series Analysis; CRC Press: Boca Raton, FL, USA, 2014; pp. 245–251. [Google Scholar]
- Lütkepohl, H. New Introduction to Multiple Time Series Analysis; Springer: Berlin, Germany, 2005. [Google Scholar]
- Baccalá, L.A.; Sameshima, K. Partial directed coherence: A new concept in neural structure determination. Biol. Cybern.
**2001**, 84, 463–474. [Google Scholar] [CrossRef] [PubMed] - Vapnik, V.N. Statistical Learning Theory, 1st ed.; John Wiley and Sons: Hoboken, NJ, USA, 1998; p. 736. [Google Scholar]
- Priestley, M.B. Spectral Analysis and Time Series; Probability and Mathematical Statistics; Academic Press London: New York, NY, USA, 1981; p. 890. [Google Scholar]
- Nikias, C.; Petropulu, A.P. Higher Order Spectra Analysis: A Non-linear Signal Processing Framework; Prentice Hall Signal Processing Series; Prentice Hall: Upper Saddle River, NJ, USA, 1993; p. 528. [Google Scholar]
- Subba-Rao, T.; Gabr, M.M. An Introduction to Bispectral Analysis and Bilinear Time Series Models; Lecture Notes in Statistics; Springer: New York, NY, USA, 1984. [Google Scholar]
- Schelter, B.; Winterhalder, M.; Eichler, M.; Peifer, M.; Hellwig, B.; Guschlbauer, B.; Lücking, C.H.; Dahlhaus, R.; Timmer, J. Testing for directed influences among neural signals using partial directed coherence. J. Neurosci. Methods
**2006**, 152, 210–219. [Google Scholar] [CrossRef] [PubMed] - Massaroppe, L.; Baccalá, L.A.; Sameshima, K. Semiparametric detection of nonlinear causal coupling using partial directed coherence. In Proceedings of the 2011 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Boston, MA, USA, 30 August–3 September 2011; pp. 5927–5930. [Google Scholar] [CrossRef]
- Schölkopf, B.; Smola, A.J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond; MIT Press: Cambridge, MA, USA, 2002. [Google Scholar]
- Mercer, J. Functions of positive and negative type, and their connection with the theory of integral equations. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci.
**1909**, 209, 415–446. [Google Scholar] [CrossRef] - Golub, G.H.; van Loan, C.F. Matrix Computations, 4th ed.; Number 3 in Johns Hopkins Studies in the Mathematical Sciences; Johns Hopkins University Press: Baltimore, MD, USA, 2013; p. 784. [Google Scholar]
- Hable, R. Asymptotic normality of support vector machine variants and other regularized kernel methods. J. Multivar. Anal.
**2012**, 106, 92–117. [Google Scholar] [CrossRef] - Filliben, J.J. The probability plot correlation coefficient test for normality. Technometrics
**1975**, 17, 111–117. [Google Scholar] [CrossRef] - Vogel, R.M. The probability plot correlation coefficient test for the normal, lognormal, and Gumbel distributional hypotheses. Water Resour. Res.
**1986**, 22, 587–590. [Google Scholar] [CrossRef] - Vogel, R.M. Correction to “The probability plot correlation coefficient test for the normal, lognormal, and Gumbel distributional hypotheses”. Water Resour. Res.
**1987**, 23, 2013. [Google Scholar] [CrossRef] - Massaroppe, L.; Baccalá, L.A. Método semi-paramétrico para inferência de conectividade não-linear entre séries temporais. In Proceedings of the Anais do I Congresso de Matemática e Computacional da Região Sudeste, I CMAC Sudeste, Uberlândia, Brazil, 20–23 September 2011; pp. 293–296. [Google Scholar]
- Massaroppe, L.; Baccalá, L.A. Detecting nonlinear Granger causality via the kernelization of partial directed coherence. In Proceedings of the 60th World Statistics Congress of the International Statistical Institute, ISI2015, Rio de Janeiro, Brazil, 26–31 July 2015; pp. 2036–2041. [Google Scholar]
- Gourévitch, B.; Bouquin-Jeannès, R.L.; Faucon, G. Linear and nonlinear causality between signals: Methods, examples and neurophysiological applications. Biol. Cybern.
**2006**, 95, 349–369. [Google Scholar] [CrossRef] [PubMed] - Chen, Y.; Rangarajan, G.; Feng, J.; Ding, M. Analyzing multiple nonlinear time series with extended Granger causality. Phys. Lett. A
**2004**, 324, 26–35. [Google Scholar] [CrossRef] - Amblard, P.O.; Vincent, R.; Michel, O.J.J.; Richard, C. Kernelizing Geweke’s measures of Granger causality. In Proceedings of the 2012 IEEE International Workshop on Machine Learning for Signal Processing, Santander, Spain, 23–26 September 2012; pp. 1–6. [Google Scholar] [CrossRef]
- Kallas, M.; Honeine, P.; Francis, C.; Amoud, H. Kernel autoregressive models using Yule-Walker equations. Signal Process.
**2013**, 93, 3053–3061. [Google Scholar] [CrossRef] - Honeine, P.; Richard, C. Preimage problem in kernel-based machine learning. IEEE Signal Process. Mag.
**2011**, 28, 77–88. [Google Scholar] [CrossRef] - Kumar, R.; Jawahar, C.V. Kernel approach to autoregressive modeling. In Proceedings of the Thirteenth National Conference on Communications (NCC 2007), Kanpur, India, 26–28 January 2007; pp. 99–102. [Google Scholar]
- Marinazzo, D.; Pellicoro, M.; Stramaglia, S. Kernel method for nonlinear Granger causality. Phys. Rev. Lett.
**2008**, 100, 144103. [Google Scholar] [CrossRef] [PubMed] - Park, I.; Príncipe, J.C. Correntropy based Granger causality. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Las Vegas, NV, USA, 31 March–4 April 2008; pp. 3605–3608. [Google Scholar] [CrossRef]
- Príncipe, J.C. Information Theoretic Learning: Rényi’s Entropy and Kernel Perspectives, 1st ed.; Number XIV in Information Science and Statistics; Springer Publishing Company, Incorporated: New York, NY, USA, 2010; p. 448. [Google Scholar]
- Massaroppe, L.; Baccalá, L.A. Kernel-nonlinear-PDC extends Partial Directed Coherence to detecting nonlinear causal coupling. In Proceedings of the 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Milan, Italy, 25–29 August 2015; pp. 2864–2867. [Google Scholar] [CrossRef]
- Massaroppe, L.; Baccalá, L.A. Causal connectivity via kernel methods: Advances and challenges. In Proceedings of the 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Orlando, FL, USA, 16–20 August 2016. [Google Scholar]

**Figure 1.**Kernel correlation functions (KCF$\left(\tau \right)$) for the (**a**) quadratic and (**b**) quartic kernels in Example 1. Horizontal dashed lines represent the $95\%$ significance threshold interval outside of which the null hypothesis ${\mathcal{H}}_{0}$ of no correlation is rejected. Asterisks (*) mark significant values.

**Figure 2.**Residue kernel correlation functions (KCF${}^{\left(r\right)}\left(\tau \right)$) for the (**a**) quadratic and (**b**) quartic kernels in Example 1. Compared to Figure 1, the kernel correlations are clearly reduced after modelling: KCF${}^{\left(r\right)}\left(\tau \right)$ nullity can no longer be rejected at the $95\%$ level, as no more than $5\%$ of the values lie outside the dashed interval around zero. Asterisks (*) mark significant values.

**Figure 3.**Ensemble normal probability plots for ${\widehat{a}}_{21}$ under the (**a**) quadratic and (**b**) quartic kernels illustrate and confirm asymptotic normality.

**Figure 4.**Filliben squared-correlation coefficient convergence to Gaussianity as a function of ${n}_{s}$ for both kernels used in Example 1.

**Figure 5.**True positive and false positive rates from the kernelized Granger causality test for various sample sizes (${n}_{s}$) at $\alpha =1\%$. Note that the false positive rates for both kernels overlap.

**Figure 6.**True positive (${x}_{1}\to {x}_{2}$) and false positive (${x}_{2}\to {x}_{1}$) rates from the kernelized Granger causality test under a quadratic kernel as a function of ${n}_{s}$ in Example 2.

**Figure 7.**True positive and false positive rates (Example 3) from the kernelized Granger causality test using a quadratic kernel as a function of ${n}_{s}$. Note that the false positive rates for the connections $1\leftarrow 2$, $2\leftarrow 3$ and $1\leftarrow 3$ overlap over the investigated ${n}_{s}$ range.

**Figure 8.**True positive and false positive rates from the kernelized Granger causality test under a quadratic kernel as a function of record length ${n}_{s}$ in Example 4.

**Figure 9.**Generalized Hannan–Quinn criterion (${}_{\mathrm{g}}\mathrm{AIC}\left(k\right)$) with ${c}_{{n}_{s}}=\mathrm{ln}\left(\mathrm{ln}\left({n}_{s}\right)\right)$ as a function of model order for various observed record lengths ${n}_{s}$ using a typical realization from (48).

**Figure 10.**Generalized Hannan–Quinn criterion (${}_{\mathrm{g}}\mathrm{AIC}\left(k\right)$) as a function of model order for the various data lengths ${n}_{s}$ from a typical realization from (49).

**Figure 11.**Observed true positive and false positive rates from the kernelized Granger causality test under a quadratic kernel for various record lengths ${n}_{s}$ in Example 5. Note that the false positive rates for the connections $2\leftarrow 3$, $3\leftarrow 1$ and $1\leftarrow 3$ overlap over the ${n}_{s}$ range, except for $1\leftarrow 2$, which attains the same level as the others after ${n}_{s}=128$.

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).