Kernel-Based Phase Transfer Entropy with Enhanced Feature Relevance Analysis for Brain Computer Interfaces

: Neural oscillations are present in the brain at different spatial and temporal scales, and they are linked to several cognitive functions. Furthermore, the information carried by their phases is fundamental for the coordination of anatomically distributed processing in the brain. The concept of phase transfer entropy refers to an information theory-based measure of directed connectivity among neural oscillations that allows studying such distributed processes. Phase TE is commonly obtained from probability estimations carried out over data from multiple trials, which bars its use as a characterization strategy in brain–computer interfaces. In this work, we propose a novel methodology to estimate TE between single pairs of instantaneous phase time series. Our approach combines a kernel-based TE estimator deﬁned in terms of Renyi’s α entropy, which sidesteps the need for probability distribution computation with phase time series obtained by complex ﬁltering the neural signals. Besides, a kernel-alignment-based relevance analysis is added to highlight relevant features from effective connectivity-based representation supporting further classiﬁcation stages in EEG-based brain–computer interface systems. Our proposal is tested on simulated coupled data and two publicly available databases containing EEG signals recorded under motor imagery and visual working memory paradigms. Attained results demonstrate how the introduced effective connectivity succeeds in detecting the interactions present in the data for the former, with statistically signiﬁcant results around the frequencies of interest. It also reﬂects differences in coupling strength, is robust to realistic noise and signal mixing levels, and captures bidirectional interactions of localized frequency content. Obtained results for the motor imagery and working memory databases show that our approach, combined with the relevance analysis strategy, codes discriminant spatial and frequency-dependent patterns for the different conditions in each experimental paradigm, with classiﬁcation performances that do well in comparison with those of alternative methods of similar nature.


Introduction
Neural oscillations are observed in the mammalian brain at different temporal and spatial scales [1].Oscillations in specific frequency bands are present in distinct neural networks, and their interactions have been linked to fundamental cognitive processes such as attention and memory [2,3] and to information processing at large [4].Three properties characterize such oscillations: amplitude, frequency, and phase, the latter referring to the position of a signal within an oscillation cycle [5].Oscillation amplitudes are related to neural synchrony expansion in a local assembly, while the relationships between the phases of neural oscillations, such as phase synchronization, are involved in the coordination of anatomically distributed processing [6].Moreover, from a functional perspective, phase synchronization and amplitude correlations are independent phenomena [7], hence the interest in studying phase-based interactions independently from other spectral relationships.Additionally, phase relationships are linked to neural synchronization and information flow within networks of connected neural assemblies [8].Therefore, a measure that aims to capture phase-based interactions among signals from distributed brain regions should ideally include a description of the direction of interaction.A fitting framework for such measure is that of brain effective connectivity [9].
Effective brain connectivity, also known as directed functional connectivity, measures the influence that a neural assembly has over another one, establishing a direction for their interaction by estimating statistical causation from their signals [10].Directed interactions between oscillations of similar frequency can be captured through measures such as Geweke-Granger causality statistics, partially directed coherence, and directed transfer function [9,11].However, since these metrics depend on both amplitude and phase signal components, they do not identify phase-specific information flow [8].The phase slope index (PSI), introduced in [12], measures the direction of coupling between oscillations from the slope of their phases; still, it only captures linear phase relationships [13].In this context arises the concept of phase transfer entropy, a phase-specific nonlinear directed connectivity measure introduced in [8].Transfer entropy (TE) is an information-theoretic quantity, based on Wiener's definition of causality, that estimates the directed interaction, or information flow, between two dynamical systems [14,15].In [8], the authors first extract instantaneous phase time series by complex filtering the signals of interest in a particular frequency, since a signal's phase is only physically meaningful when its spectrum is narrowbanded [16].Such filtering-based approach has also been explored to obtain phase-specific versions of other information-theoretic metrics such as permutation entropy and timedelayed mutual information [7,16].Then, the authors compute TE from the obtained phase time series.Nonetheless, since conventional TE estimators are not well suited for periodical variables, in [8] phase TE estimates are obtained through a binning approach performed over multiple trials simultaneously, in a procedure termed trial collapsing.
Phase TE has found multiple applications in neuroscience, such as gaining insight into reduced levels of consciousness by evaluating brain connectivity [17], analyzing resting-state networks [18], and assessing brain connectivity changes in children diagnosed with attention deficit hyperactivity disorder following neurofeedback training [19].It has even been used to detect fluctuations in financial markets data [20].Nonetheless, phase TE, estimated as in [8], cannot be employed as a characterization strategy for braincomputer interfaces (BCI) since they require features extracted on an independent trial basis, i.e., each trial must be associated with a set of features.Effective connectivity measures, such as phase TE, can be used to assess the induced physiological variations in the brain occurring during BCI tasks [21].Discriminative information may be hidden in the dynamical interactions among spatially separated brain regions that characterization methods commonly employed in BCI are not able detect [22].This information could be relevant to address issues such as the inefficiency problem in some BCI systems [23].In that context, authors in [6] applied a binning strategy to estimate single-trial phase TE to set up classification systems for visual attention.Nonetheless, binning estimators for single trial-based estimation of information-theoretic measures exhibit systematic bias [8].Furthermore, spectrally resolved TE estimation methods that can obtain single-trial TE estimates have been recently proposed in the literature [24,25].Yet, phase TE is conceptually different from them [25], as they are not phase-specific metrics.
Here, we propose a novel methodology to estimate TE between single pairs of instantaneous phase time series.Our approach combines the kernel-based TE estimator we introduced in [10], with phase time series obtained by convolving neural signals with a Morlet Wavelet.The kernel-based TE estimator expresses TE as a linear combination of Renyi's entropy measures of order α [26,27] and then approximates them through functionals defined on positive definite and infinitely divisible kernel matrices [28].Its most important property is that it sidesteps the need to obtain the probability distributions underlying the data.Instead, the estimator computes TE directly from kernel matrices that, in turn, capture the similarity relations among data.It is robust to varying noise levels and data sizes and to the presence of multiple interaction delays in a network [10].In this work, we hypothesize that the above-described estimator could overcome the hurdles other single-trial TE estimators face when obtaining TE values from instantaneous phase time series since it would not have to explicitly obtain probability distributions from circular variables [8].Additionally, since our primary motivation to introduce a robust phase TE estimation methodology is the use of such measures in the context of BCI applications, we also explore a relevance analysis strategy based on centered kernel alignment (CKA) [29].The CKA-based analysis allows us to identify the set of pairwise channel connectivities relevant to discriminate between specific conditions, favoring the neurophysiological interpretation of our results and providing an option to avoid carrying out all to all channel connectivity estimations in practical BCI systems based on phase TE.
We employ simulated and real-world EEG data to test the introduced effective connectivity measure.The simulated data are obtained from neural mass models, mathematical models of neural mechanisms that generate time series with oscillatory behavior similar to electrophysiological signals.Obtained results for such data show that the proposed kernelbased phase TE estimation method successfully detects the direction of interaction imposed by the model.Indeed, it detects statistically significant connections in the frequency bands of interest, even for weak couplings and narrowband bidirectional interactions.It also displays robustness to realistic levels of noise and signal mixing.Regarding the EEG data, we consider two databases containing signals recorded under two different cognitive paradigms, consisting of motor imagery tasks and a change detection task designed to study working memory.Attained classification results demonstrate that our approach is competitive compared to real-valued and phase-based directed connectivity measures.Thus, this proposal extends the approach described in [10] by introducing a measure that captures directed interactions between the phases of oscillations at specific frequencies.Unlike alternative approaches in the literature, it can be obtained from single trial data, which allows it to be used as a characterization strategy in BCI applications.In addition, the results obtained for the EEG data show that our approach, coupled with the CKAbased relevance analysis, largely outperforms the real-valued kernel-based transfer entropy in [10] as characterization strategy for cognitive tasks such as working memory.
The remainder of the paper is organized as follows: in Section 2 we formally introduce the concept of phase TE and our kernel-based approach for single-trial phase TE estimation.We also describe the proposed CKA-based relevance analysis.Section 3 details the experiments we carried out using simulated and real EEG data in order to evaluate the performance of our proposal.In Section 4 we present and discuss our results, and finally, Section 5 contains our conclusions.

Phase Transfer Entropy
Transfer entropy (TE) is a Wiener-causal measure of directed interactions between two dynamical systems [14,15].Given two time series x = {x t } T t=1 and y = {y t } T t=1 , with t ∈ N a discrete time index, T ∈ N, the TE from x to y estimates whether the ability to predict the future of y improves by considering the past of both x and y, as compared to the case when only the past of y is considered.Formally, TE can be defined as: where x dx t , y dy t ∈ R D×d are time embedded versions of x and y, D = T − (τ(d − 1)) with d, τ ∈ N the embedding dimension and delay, respectively; u ∈ N represents the interaction delay between the driving and the driven systems, and p(•) indicates a probability density function [30] (Henceforth, the summation symbol is to be interpreted in an extended way, that is to say, as a summation or an integral depending on whether the variable is discrete or continuous).Regarding the time embeddings, we have that [31,32].Furthermore, using the definition of Shannon entropy, H S (X) = − ∑ x p(x)log(p(x)), where X is a discrete random variable (x ∈ X), we can also express Equation (1) as: where H S (•, •), and H S (•) stand for joint and marginal entropies.In phase TE, the time series x and y are replaced by instantaneous phase time series T and s y = ς y e iθ y ( f ) ∈ C T , which contain the complex-filtered values of x and y at frequency f , respectively, and with ς x , ς y ∈ R T the amplitude envelopes of the filtered time series [8].Thus, we have that where θ x,dx t and θ y,dy t are time embedded versions of θ x and θ y .Note that for the sake of notation simplicity we have dropped the explicit dependency of the phase time series on f .

Kernel-Based Renyi's Phase Transfer Entropy
In [10] we propose a TE estimator based on kernel matrices that approximate Renyi's entropy measures of order α.This data-driven approach has the advantage of sidestepping the need for probability distribution estimation in TE computation.First, we show that TE can be expressed as where H α (X) stands for Renyi's α entropy, a generalization of Shannon's entropy [26,27], defined as with α = 1 and α ≥ 0. In the limiting case where α → 1, it tends to Shannon's entropy.Then, using the kernel-based formulation for Renyi's α entropy introduced in [28], where A ∈ R n×n is a Gram matrix with elements a ij = κ(x i , x j ), κ(•, •) ∈ R stands for a positive definite and infinitely divisible kernel function, n for the number of realizations of X, and tr(•) for the matrix trace; along with the accompanying formulation for the Renyi's α entropy of joint probability distributions, where B ∈ R n×n is a second Gram matrix and the operator • stands for the Hadamard product, we estimate the TE α from x to y as: where the kernel matrices K y t , K For K y t , a i , a j ∈ R are the values of the time series y at times i and j.  8), while replacing the time series x and y for their instantaneous phase time series θ x and θ y at frequency f , respectively.

Phase Transfer Entropy
We obtain phase TE values through three different estimators that allow computing TE from individual signal pairs.First, the proposed kernel-based Renyi's phase TE estimator (TE θ κα ), defined in Equation (9).Second, the Kraskov-Stögbauer-Grassberger TE estimator (TE θ KSG ), a method that relies on a local approximation of the probability distributions needed to estimate the entropies in TE from the distances of every data point to its neighbors [33,34].Thirdly, an approach termed symbolic TE (TE θ Sym ) that relies on a symbolization scheme based on ordinal patterns.The symbolization scheme allows estimating the probabilities involved in the computation of TE directly from the symbols' relative frequencies [35].
In all cases, θ x and θ y are obtained by convolving the real-valued time series with a Morlet wavelet, defined as where f stands for the filter frequency, ξ t = m/2π f is the time domain standard deviation of the wavelet, and m defines the compromise between time and frequency resolution [8].

Phase Slope Index
The phase slope index (PSI) is an effective brain connectivity measure that assesses the direction of coupling between two oscillatory signals of similar frequencies [13].Given two time series x = {x i } l t=1 and y = {y i } l t=1 , the PSI is defined as the slope of the phase of the cross-spectra between x and y: where C xy ( f ) = S xy / S xy , S xy is the complex coherence, S xy ∈ C is the cross-spectrum between x and y, S xx , S yy ∈ C are the auto-spectrums of x and y, d f ∈ R + is the frequency resolution, F stands for the set of frequencies over which the slope is summed, and indicates selecting only the imaginary part of the sum [12].If the PSI, as defined in Equation (11), is positive, then there is directed interaction from x to y in F. Conversely, if the PSI is negative, the directed interaction goes from y to x.Note that by definition the PSI is an antisymmetric measure: PSI(x → y) = −PSI(y → x).

Granger Causality
We also characterize the simulated and EEG data using Granger causality (GC).Like TE, GC is derived from Wiener's definition of causality, and the two measures, in their original forms, are equivalent for Gaussian variables [36].Briefly, for two stationary time series x = {x t } T i=1 and y = {y i } T t=1 , the Granger causality from x to y is defined as: where e, e ∈ R T−o are vectors holding the residual or prediction errors of two autoregressive models, and var{•} stands for the variance operator.The errors in e come from an autoregressive process of order o that predicts y from its past values alone.On the other hand, the errors in e come from a bivariate autoregressive process of order o that predicts y from the past values of y and x [11].If the past of x improves the prediction of y then var{e} var{e } and GC(x → y) 0, if it does not, then var{e} ≈ var{e } and GC(x → y) → 0. In addition, in analogy to the concept of phase TE, we define GC θ (x → y, f ) = GC(θ x → θ y ), where θ x and θ y are instantaneous phase time series obtained by filtering x and y at frequency f , as a measure within the framework of GC that captures phase-based interactions.

Kernel-Based Relevance Analysis
When characterizing EEG data through effective brain connectivity measures for BCI-related applications, two common and related issues can arise.First, all to all channel connectivity analyses result in a large number of features, many of which may not provide useful information to discriminate between the conditions of the BCI paradigm of interest [10].This can add noise and complexity to any subsequent analysis stage.Second, estimating such a large number of pairwise channel connectivities can be computationally expensive, especially for measures such as TE and single-trial TE θ [8], which can hinder their inclusion in practical BCI systems.Both problems could be addressed by identifying the set of pairwise channel connectivities that are relevant to discriminate between specific conditions, which would also lead to a clearer neurophysiological interpretation of the obtained results [6,10].To that end, we explore a relevance analysis strategy based on centered kernel alignment (CKA).CKA allows quantifying the similarity between two sample spaces by comparing two characterizing kernel functions [29].First, assume we have a feature matrix Φ ∈ R N×P , and a corresponding vector of labels l ∈ Z N , with N the number of observations and P the number of features.For the case of connectivity-based EEG analysis, each element in Φ holds a connectivity value for a pair of channels, with each row of Φ containing multiple connectivity values (features) estimated for a single trial or observation.The corresponding element in l holds a label identifying the condition associated to that trial.Next, we define two kernel matrices K Φ ∈ R N×N and K l ∈ R N×N .The first matrix holds elements k Φ ij = κ Φ (φ i , φ j ) with φ i , φ j ∈ R P row vectors belonging to Φ, and a radial basis function (RBF) kernel [37], where d 2 (•, •) is a distance operator and σ ∈ R + is the kernel's bandwidth.The second matrix has elements k l ij = κ l (l i , l j ) with l i , l j ∈ l, and a dirac kernel, where δ(•) stands for the Dirac delta.Then, the CKA can be estimated as: where K ∈ R N×N is the centered version of K, obtained as K = IK I, where I = I − 1 1/N is the empirical centering matrix, I ∈ R N×N is the identity matrix, 1 ∈ R N is an all-ones vector, and K, K F = tr( K KT ) denotes the matrix-based Frobenius norm.Now, for κ Φ we select as distance operator the the Mahalanobis distance where Γ ∈ R P×Q , Q ≤ P, is a linear projection matrix, and ΓΓ is the corresponding inverse covariance matrix.Afterward, the projection matrix Γ is obtained by solving the following optimization problem: where the logarithm function is used for mathematical convenience.Γ can be estimated through standard stochastic gradient descent, as detailed in [38], through the update rule where µ ∈ R + is the step size of the learning rule, and r indicates a time step.Finally, we quantify the contribution of each feature to the projection matrix Γ, which maximizes the alignment between the feature and label spaces, by building a relevance vector index ∈ R P , whose elements are defined as: can then be used to rank the features in Φ according to their discrimination capability.A high p value indicates that the p-th feature in Φ, in our case a connection between a specific pair of channels, is relevant when it comes to distinguishing between the conditions contained in the label vector l.

Experiments
In order to test the performance of our single-trial phase TE estimator we carry out experiments on simulated data from neural mass models, and on real EEG data, obtained under motor imagery and visual working memory paradigms.We then compare our results with those obtained with the alternative approaches for phase-based effective connectivity estimation detailed in Section 2.3.

Neural Mass Models
Neural mass models (NMM) are biologically plausible mathematical descriptions of neural mechanisms [39].They represent the electrical activity of neural populations at a macroscopic level through a set of stochastic differential equations [40].NMMs allow generating mildly nonlinear time series with properties that resemble the oscillatory dynamics of electrophysiological signals, such as EEG, and how they change as a result of coupling between different cortical areas.Therefore, NMMs are useful to study the behavior of brain connectivity measures that aim to capture such interactions [8,24,40,41].Figure 1A shows a schematic representation of an NMM with two interacting cortical areas from which two signals, x and y (see Figure 1B), can be obtained.The parameters C 12 and C 21 are known as coupling coefficients, and they determine the strength of the coupling from Area 1 to Area 2, and from Area 2 to Area 1, respectively.The parameter ν represents the interaction lag between the two areas, while p 1 and p 2 are external inputs coming from other cortical regions.In this work, we use NMMs to generate interacting time series with known oscillatory properties in order to test the performance of the proposed phase TE estimator.In particular, we test our proposal in terms of its ability to detect directed interactions for different levels of coupling strength, under the presence of noise and signal mixing, and for bidirectional narrowband couplings.We proceed as follows: first, we set the model parameters describing Areas 1 and 2 as in [40], so as to generate signals with power spectrums peaking in the α (8 Hz-12 Hz) and lower β bands (12-20 Hz), as depicted in Figure 1C.Then, in order to generate unidirectionally coupled signals, with interactions from x to y, we set the parameter C 21 to 0 for all simulations.Also, the parameter ν is set to 20 ms, and the extrinsic inputs p 1 and p 2 are modeled as Gaussian noise [40].Afterward, we generate 50 pairs (trials) of 3 s long signals, using a simulation time step of 1 ms, equivalent to a sampling frequency of 1000 Hz, for each condition in the three scenarios detailed in Sections 3.1.1-3.1.3.Next, we select a 2 s long segment from the signals, from 0.5 s to 2.5 s, and downsample them to 250 Hz.Then, we compute connectivity estimates for the simulated data in the frequency range between 2 Hz and 60 Hz, in steps of 2 Hz.After that, we obtain net connectivity values, defined as where λ stands for any of the phase-based effective connectivity measures studied, except for the PSI, in which case all subsequent analyses are performed directly on the PSI values.Lastly, for each condition in the three scenarios and at each frequency evaluated, we perform permutation tests based on randomized surrogate trials [34,42] to determine which net couplings or directed connections are statistically significant.The permutation test employed uses the trial structure of the data to generate surrogate datasets for the null hypothesis (absence of directed interactions).It does so by shuffling the data from different trials.The significance level for the tests was set to 3.3 × 10 −4 after applying the Bonferroni correction to an initial alpha level of 0.01 in order to account for 30 independent tests, one for each evaluated frequency per condition.

Coupling Strength
In order to test the ability of our phase TE estimation method to detect phase-based directed interactions of varying intensity, we modify the coupling strength between the simulated signals, x and y, by varying the parameter C 12 in the range {0, 0.2, 0.5, 0.8}, with 0 indicating the absence of coupling and 0.8 a strong interaction between the two signals.

Noise and Signal Mixing
To asses the robustness of our proposal to realistic levels of noise and signal mixing, we do the following: we generate a noise time series η, with the same power spectrum of x, through the methodology proposed in [8].Then, we add x and η to generate a noisy version of x, x η = x + 10 − SNR 20 η, where SNR is the signal to noise ratio.Likewise for y.Then, we mix x η and y η to simulate one of the effects of volume conduction, by doing x w η = 1 − w 2 x η + w 2 y η , and y w η = 1 − w 2 y η + w 2 x η , with w the mixing strength.We set the parameters SNR and w to 3 and 0.25 respectively, based on the results obtained in [8] for realistic values of noise and mixing for EEG signals.The coupling coefficient C 12 is held constant at a value of 0.5 to simulate couplings of medium strength.

Narrowband Bidirectional Interactions
In this experiment, we aim to evaluate how our proposal deals with bidirectional interactions of localized frequency content.Particularly, we want to assess its performance for signals x and y containing a directed interaction from x to y at 10 Hz and an interaction in the opposite direction, from y to x, at 40 Hz.To generate such signals, first, we modify the model parameters of Area 2 so that it produces a signal y with a power spectrum peaking in the γ band [39].The power spectrum of x remains as before.The coupling coefficient C 12 is again held constant at a value of 0.5.The change in the parameters of Area 2 leads to strong directed interactions from x to y around 10 Hz and 40 Hz.Then, we use a Morlet wavelet (Equation ( 10)) to filter both x and y at those frequencies (10 Hz and 40 Hz).The obtained real-valued narrowband time series are then combined as follows: x * = x 10 Hz + y 40 Hz and y * = y 10 Hz + x 40 Hz .Next, x * and y * are added to broadband noise generated following the same approach described in Section 3.1.2,with an SNR of 6.

EEG Data
In order to test the performance of our phase TE estimator in the context of BCI, we obtain effective connectivity features from EEG signals recorded under two different cognitive paradigms: the first one consisting of motor imagery (MI) tasks and the second one of a change detection task designed to study working memory (WM).Our aims are to set up classification systems that allow discriminating between the conditions in each paradigm, using as inputs relevant directed interactions among EEG signals and then evaluate their performance in relation to the connectivity measures used to train them.To those ends, we employ two publicly available databases: the BCI Competition IV database 2a (http://www.bbci.de/competition/iv/index.html, accessed on 2 June 2021) and a database from brain activity during visual working memory (https://data.mendeley.com/datasets/j2v7btchdy/2, accessed on 2 June 2021).

Motor Imagery
Motor imagery (MI) is the process of mentally rehearsing a motor action, such as moving a limb, without actually executing it [43].The BCI Competition IV database 2a [44] comprises EEG data from 9 healthy subjects recorded during an MI paradigm consisting of four different MI tasks, namely, imagining the movement of the left hand, the right hand, both feet, or the tongue.Each trial of the paradigm starts with a fixation cross displayed on a computer screen, along with a beep.At second 2, a visual cue appears on the screen for a period of 1.25 s (an arrow pointing left, right, down, or up, corresponding to one of the four MI tasks).The cue prompts the subject to perform the indicated MI task until the cross vanishes from the screen at second 6.A representation of the paradigm's time course is shown in Figure 2A.Each subject performed 144 trials per MI task.The EEG data are acquired at a sampling rate of 250 Hz, from 22 Ag/AgCl electrodes (C = 22) placed according to the international 10/20 system, as depicted in Figure 2B.Next, the data are bandpass-filtered between 0.5 Hz and 100 Hz.A 50 Hz Notch filter is also applied.For each subject, the database contains a training dataset and a testing dataset, obtained following the same experimental paradigm [44].In this study, we consider a bi-class classification problem involving the left and right hand MI tasks, so we drop the trials associated with the feet and the tongue.Afterward, we also drop the trials marked for rejection in the database itself [44].Then, for all trials we select a 2 s long time window stretching from second 3 to second 5 (M = 500 samples), as schematized in Figure 2A.Finally, we compute the surface Laplacian of each remaining trial through the spherical spline method for source current density estimation, in order to reduce the deleterious effects of volume conduction on connectivity analyses [21,45,46].

Working Memory
The concept of working memory (WM) refers to a cognitive system of limited capacity that allows for temporary storage and manipulation of information [47].The database from brain activity during visual working memory, presented in [48], contains EEG data recorded from twenty-three subjects, with normal or corrected-to-normal vision, and without colorvision deficiency, while performing multiple trials of a change detection task [49].The task consists of remembering the colors of a set of squares, termed memory array, and then comparing them with the colors of a second set of squares located in the same positions, termed test array.A trial of the task begins with an arrow indicating either the left or the right side of the screen.Then, a memory array appears on the screen for 0.1 s.For every trial, memory arrays are displayed on both hemifields, but the subject must remember only those appearing on the side indicated by the arrow cue.Next, after a retention period lasting 0.9 s, a test array appears.The subject then reports if the colors of all the items in the memory and test arrays match.The task has three levels according to the number of elements in the memory array: low memory load (one square), medium memory load (two squares), and high memory load (four squares).A representation of the above-described experimental paradigm is depicted in Figure 3A.The color of one of the squares in the test array differs from its counterpart in the memory array in 50% of the trials.Each subject performed a total of 96 trials, with 32 trials for each memory load level.The EEG data are acquired at a sampling rate of 2048 Hz, using 64 electrodes (Biosemi ActiveTwo) arranged according to the international 10/20 extended system, as depicted in Figure 2B.Besides the EEG data, the database provides recordings from four EOG channels and two externals electrodes located on the left and right mastoids.In this study, we perform the following preprocessing steps before any further data analysis.First, we re-reference the data to the average of the mastoid channels.Next, we bandpass-filter the data between 0.01 Hz and 20 Hz using a Butterworth filter of order 2. Afterward, we extract the trial information from the continuous EEG data using a 1.4 s squared window.Each trial segment starts 0.2 s before the presentation of the memory array.Then, we perform a visual inspection of the data and discard two subjects (subjects number 11 and 17) because of the presence of strong artifacts in a very large number of trials.Subjects number 22 and 23 are reassigned as subjects 17 and 11, respectively.After that, we remove ocular artifacts from the EEG data by performing independent component analysis (ICA) on it and then eliminating the components that more closely resemble the information provided by the EOG data [48].Then, we discard all incorrect trials, i.e., trials for which the subjects incorrectly matched the memory and test arrays.Next, we select 32 out of the 64 channels in the EEG data (C = 32), as shown in Figure 3B.Then, we downsample the data to 1024 Hz, and segment, for each trial, the time window starting 0.3 s after the onset of the memory array and ending just before the presentation of the test array (see Figure 3A).The 0.7 s long segments (M = 717) cover most of the retention interval, the period when the subjects should maintain the stimulus information in their working memories, while leaving out any purely sensory responses elicited immediately after the presentation of the stimulus.Finally, with the aim of reducing the presence of spurious connections associated with volume conduction effects, we compute the surface Laplacian of each trial.

Classification Setup Feature Extraction
Let Ψ = {X n ∈ R C×M } N n=1 be an EEG set holding N trials from either an MI or a WM dataset, recorded from a single subject, where C stands for the number of channels and M corresponds to the number of samples.In addition, let {l n } N n=1 be a set whose n-th element is the label associated with trial X n .For the MI database l n can take the values of 1 and 2, corresponding to right hand and left hand motor imagination, respectively.Similarly, for the WM database, l n can take the values of 1, 2, and 3 corresponding to low, medium, and high memory loads.In both cases, our goal is to estimate the class label from relevant effective connectivity features extracted from X n .Because of the results obtained for the simulated data (see Section 4.1 for details), here we consider features from only three approaches for phase-based effective connectivity estimation, namely, TE θ κα , GC θ , and PSI.Additionally, we also characterize the data through the real-valued versions of TE κα and GC.
For the real-valued effective connectivity measures considered, we do the following: let λ(x c → x c ) be a measure of effective connectivity between channels x c , x c ∈ R M .By computing λ(x c → x c ) for each pairwise combination of channels in X n we obtain a connectivity matrix Λ ∈ R C×C .In the case when c = c , we set λ(x c → x c ) = 0.Then, we normalize Λ to the range [0, 1].After performing the above procedure for the N trials, we get set of connectivity matrices {Λ n ∈ R C×C } N n=1 .Then, we apply vector concatenation to Λ n to yield a vector φ n ∈ R 1×(C×C) .Next, we stack the N vectors φ n , corresponding to each trial, to obtain a matrix Φ ∈ R N×(C×C) holding all directed interactions, estimated through λ, for the EEG set Ψ. A graphical representation of the above-described steps, as well as of our overall classification setup, is depicted in Figure 4.For the phase-based effective connectivity measures of interest, we proceed in a similar fashion: let λ θ (x c → x c , f ) be a measure of phase-based effective connectivity between channels x c , x c at frequency f .By computing λ θ (x c → x c , f ) for each pairwise combination of channels in X n we obtain a connectivity matrix Λ( f ) ∈ R C×C (when c = c , we set λ θ (x c → x c , f ) = 0).For the MI database, we vary the values of f in the range from 8 Hz to 18 Hz, in 2 Hz steps, since activity in that frequency range has been associated with MI tasks [43].Then we define two bandwidths of interest ∆ f ∈ {α ∈ [8 − 12], β l ∈ [14 − 18]} Hz.Afterward, we average the matrices Λ( f ) within each bandwidth, normalize the resulting matrices to the range [0, 1], and stack them together, so that for each trial we have a connectivity matrix Λ ∈ R C×C×2 .Therefore, for the N trials, we get set of connectivity matrices {Λ n ∈ R C×C×2 } N n=1 .Then, we apply vector concatenation to Λ n to yield a vector φ n ∈ R 1×(C×C×2) .After that, we stack the N vectors φ n in order to obtain a single matrix Φ ∈ R N×(C×C×2) characterizing Ψ for the MI data.For the WM we follow the same steps, only that in this case we vary the values of f in the range from 4 Hz to 18 Hz, in 2 Hz steps, since oscillatory activity at those frequencies has been shown to play a role in the interactions between different brain regions during WM [50,51].Next, we define three bandwidths of interest } Hz, which leads to a connectivity matrix Λ ∈ R C×C×3 for each trial and ultimately to a matrix Φ ∈ R N×(C×C×3) characterizing Ψ for the WM data.Note that since the PSI is an antisymmetric connectivity measure, we only use the upper triangular part of the connectivity matrix associated with each trial to build Φ.

Feature Selection and Classification
After characterizing the EEG data, either through real-valued or phase-based effective connectivity measures, we set up a subject-dependent classification system for the MI and WM databases.
For the MI data, we do the following: Since the MI database has training and testing datasets, we divide our classification system into a training-validation stage and a testing stage.For the training-validation stage, we first specify a cross-validation scheme of 10 iterations.For each iteration, 70% of the trials of the training dataset are randomly assigned to a training set and the remaining 30% to a validation set.Then, we use CKA (see Section 2.4) over the connectivity features obtained from the training set to generate a relevance vector ∈ [0, 1] P , where P equals the number of features in Φ. P varies according to the connectivity measure used to characterize the data.Then, we use to rank Φ. Next, we select a varying percentage of the ranked features, from 5% to 100% in 5% steps, and input them to the classification algorithm.The features associated with the highest values of are input first, and as the percentage of features increases those associated with lower values of are progressively included.In this work, we use a support vector classifier (SVC) with an RBF kernel [52].All classification parameters, including the percentage of discriminant features, are tuned at this stage through a grid search.We select the parameters according to the classification accuracy, aiming to improve the system's performance.Then, for the testing stage, we train an SVC using the connectivity features from all trials in the training dataset as well as the parameters found in the previous stage.Lastly, we quantify the performance of the trained system in terms of the classification accuracy, obtained after predicting the MI task class labels of the testing dataset from its connectivity features.
The classification system we set up for the WM data closely resembles the one previously detailed for the MI data, with three changes.First, the WM database consists of one set of data for each subject, instead of two, so there is only a training-validation stage.Second, given the reduced number of trials available for each memory load level, each of the 10 iterations of the cross-validation scheme follows an 80-20% split for the training and validation sets (instead of a 70-30% split).Third, since the results provided by CKA are not stable for the low number of trials available from each subject (27.7 trials per class, on average), we opted to add an auxiliary cross-validation step, with the same characteristics as the one described above, and use it to estimate a single relevance vector ¯ , obtained as the average of the relevance vectors of each data split.Then, we use ¯ to perform feature selection in every iteration of the main cross-validation scheme.

Parameter Selection
We used in-house Python implementations of the algorithms for all the connectivity measures studied (The TE θ κα implementation is available at https://github.com/ide2704/Kernel_Phase_Transfer_Entropy, accessed on 13 July 2021), except for TE θ KSG .In that case, we used the implementation provided by the open access toolbox TRENTOOL, a TE estimation and analysis toolbox for Matlab [34].
Regarding the selection of parameters involved in the different effective connectivity estimation methods, we proceeded as follows: For the TE methods, we estimated all parameters from the real-valued time series, i.e., before extracting the phase time series.The embedding delay τ was set to 1 autocorrelation time (ACT), as proposed in [31].The embedding dimension d was selected from the range d = {1, 2, . . ., 10} using Cao's criterion [34,53].Note that for any signal pair, the embedding parameters selected are those of the driven or target time series, i.e., to estimate TE(x → y) we use for both time series the embedding parameters found for y.The interaction delay u was set as the value generating the largest TE from ranges that varied depending on the experiment: u = {1, 2, . . ., 10} for the NMMs, u = {1, 4, . . ., 25} for the MI data, and u = {50, 60, . . ., 250} for the WM data.Note that the meaning of u in terms of the time delay of the directed interaction between the driving and driven systems is associated with the sampling frequency, e.g., u = {1, 2, . . ., 10} for data sampled at 250 Hz translates to a time range between 4 ms and 40 ms.For TE θ κα we select a value of α = 2, which is neutral to weighting, a convenient choice when there is no previous knowledge about the values of the α parameter better suited for a particular application [10,28].In addition, as kernel function, we employ an RBF kernel with Euclidean distance (see Equation ( 13)).The bandwidth σ was set in each case as the median distance of the data [54].For TE θ KSG the Theiler correction window and the number of neighbors were left at their default values in TRENTOOL, 4 and 1 ACT, respectively [34].For the GC methods the order of the autoregressive models o was selected from the range o = {1, 3, . . ., 9} using Akaike information criterion [55,56].Furthermore, in order to estimate the PSI we employed a sliding window 5 frequency bins long (3 bins long for the WM data), centered on the frequency of interest.Finally, for all the connectivity methods involving the extraction of phase time series through Morlet wavelets, we varied the parameter m (see Equation ( 10)) from 3 to 10 in a logarithmic scale, according to the selected frequency of the filter.

Neural Mass Models Results
The experiments described in Section 3.1 are intended to assess whether the phasebased connectivity measures considered in this study correctly detect the direction of interaction between two time series of known oscillatory properties.Figure 5 presents the results obtained from such experiments.Namely, column A shows the connectivity values obtained for different levels of coupling strength, column B compares the connectivities estimated for ideal signals with those of signals contaminated with noise and mixing, and column C displays the results obtained for bidirectional narrowband couplings.The rows in Figure 5 correspond to each of the phase-based connectivity measures studied.The first row contains average PSI values computed on the frequency range between 2 Hz and 60 Hz, while rows two to five display average net connectivity values for TE θ κα , TE θ KSG , TE θ Sym , and GC θ , respectively.Circled values indicate statistically significant connectivities at a particular frequency, according to a permutation test based on randomized surrogate trials.The test identifies connectivity values that are, on average, significantly different from those expected for that connectivity measure applied to non-interacting signals.For the three experiments involving simulated data from NMMs, we use the PSI as a comparison standard, since it is a robust and well-stablished measure of linear directed interactions defined in terms of phase relations [12,13].Therefore, it is suited to analyze the coupled, mildly nonlinear time series generated by NMMs.
Regarding the first experiment, which modifies the coupling strength between the simulated signals, the obtained results (Figure 5, column A) show that all the measures studied satisfactorily detect the coupling direction of the simulated data.Note that since we set the NMMs to generate unidirectional interactions from x to y, and because of the way we defined ∆λ, then all net connectivity values for the simulated coupled signals should be positive.The same is true for the PSI(x → y).On the other hand, only the PSI, TE θ κα , and GC θ fulfill the criteria for an overall description of the phase-based interactions present in the data.First, we observe higher net connectivity values at higher coupling strengths, that is to say, stronger interactions lead to larger connectivity estimates.Second, for each coupling strength, there are higher net connectivity values around the frequencies corresponding to the main oscillatory components of the time series generated by the NMMs, in this case, oscillations in the range between 8 Hz and 20 Hz.Thirdly, there are statistically significant results for all the coupling strengths explored, except for noninteracting time series (C 12 = 0).TE θ KSG does not capture statistically significant interactions for a coupling coefficient value of 0.2, indicating a lower sensitivity to weak couplings.

While TE θ
Sym exhibits a very distorted connectivity profile when compared with the PSI.In addition, it has much larger standard deviations for all the coupling strengths considered.
The second experiment assesses the robustness of our proposal to realistic levels of noise and signal mixing, two sources of signal degradation that can lead to spurious connectivity results.In electrophysiological signals, such as EEG, signal mixing arises as a result of field spread, while noise is the result of technical and physiological artifacts [9,57,58].The results in Figure 5, column B, show that PSI, TE θ κα , and GC θ capture statistically significant interactions in the frequencies of interest for both the ideal (no noise or signal mixing) and realistic conditions.The smaller connectivity values for the data contaminated with noise and signal mixing, as compared with the ideal signals, are mostly explained by the reduction in asymmetry between the driving and driven signals caused by mixing [8].On the contrary, we observe that neither TE θ Sym nor TE θ KSG produce statistically significant results under the realistic scenario, indicating that those estimators are less robust to signal degradation.Finally, the third experiment aims to evaluate how our proposal deals with bidirectional interactions of localized frequency content.Because of our experimental setup, the obtained results should exhibit a positive deflection around 10 Hz in order to capture the directed interaction from x to y and a negative deflection around 40 Hz to represent the directed interaction from y to x. Figure 5, column C, shows that both PSI and TE θ κα successfully detect the change in the direction of interaction in localized frequency bands, with statistically significant connectivity values around the frequencies of interest.However, under this scenario, TE θ κα is less frequency specific for high-frequency interactions than PSI, with statistically significant connections present for a large range of frequency values around 40 Hz.This is probably due to the filtering step involved in the estimation of TE θ κα , while PSI is directly defined on the data spectra.Additionally, TE θ KSG and TE θ Sym fail to produce any significant results, while GC θ shows a statistically significant, non-existing coupling from y to x for frequencies under 10 Hz.Note that, ultimately, the permutation test indicates whether the connectivity values obtained are unlikely to be the result of chance and not whether they correctly capture the directed interactions present in the data.In this case, the statistically significant results mean that GC θ consistently found a directed interaction from y to x in the range mentioned before.
The results discussed above indicate that the proposed phase TE estimator is able to detect directed interactions between time series resembling electrophysiological data for different levels of coupling strength, under the presence of noise and signal mixing and for bidirectional narrowband couplings.Furthermore, they show that it is competitive with well-established approaches for phase-based net connectivity estimation, such as PSI, in the case of weakly nonlinear signals.Lastly, our results also show that commonly used single-trial TE estimators, such TE KSG and TE Sym , are ill-suited to measure directed interactions between instantaneous phase time series.

EEG Data Results
Table 1 presents the average accuracies achieved by the proposed classification systems for both the MI and WM databases, for each effective connectivity method studied.For the MI database, in the training-validation stage, the classifier based on TE θ κα features exhibited the highest average performance, closely followed by the one based on GC θ .In the testing stage, we observe the same overall accuracy ranking, although a smaller drop in the classification accuracy occurs for TE θ κα than for GC θ , which points to a better generalization capacity by the system trained using features extracted through phase TE.For the WM database, the classifier trained from TE θ κα features also displays the highest average accuracy.However, in this case, there is a large gap in performance between the TE θ κα -based classification system and the closest results from an alternative approach.Furthermore, the results in Table 1 show a consistent improvement in performance between the classifiers that use real-valued TE estimates and those that are trained from phase TE values.They also show relatively low accuracies for the classifiers trained using PSI features.We believe the latter can be explained by two factors.First, by definition, the PSI is unable to explicitly detect bidirectional interactions.It measures connectivity in terms of lead/lag relations, which leads to ambiguity regarding the meaning of PSI values close to zero, since they can be the result of either the lack of interaction or evenly balanced bidirectional connections.If the relevant information to discriminate among the conditions of a cognitive paradigm is related to the bidirectionality of interactions, such as those present in WM [50,51], then the PSI might not be an adequate characterization strategy.Secondly, the PSI, like GC, is a linear measure; its performance degrades for strongly nonlinear phase relationships.
In the sections below, we detail and further discuss the results obtained for each database.Figure 6 depicts the average classification accuracy for all subjects in MI database as a function of the number of selected features during the training-validation stage, for TE κα and TE θ κα .These results show there is a small improvement in the ability to discriminate between the MI tasks when using features extracted through phase TE, as compared with real-valued TE.In addition, they reveal that the CKA-based feature selection strategy successfully identified the most relevant connections for MI task classification.That is to say, the classification system has a stable performance even for a very reduced number of connectivity features.This is fundamental for any practical BCI application that intends to use phase TE as a characterization strategy, since estimating single-trial phase TE is computationally expensive [8].Therefore, it is important to reduce as much as possible the number of channel pair connectivity features required to achieve peak classification performance.Additionally, it is important to highlight that while classification accuracies in Figure 6, and in Table 1, are in the same range of those obtained through other connectivitybased characterization approaches [10,23], they are far below those obtained from methods such as common spatial patterns [59][60][61].A possible explanation is that bivariate TE might be more robust at describing long-range interactions rather than local ones [41], like those arising from MI-related activity, centered on the sensorimotor area.In addition, the differences with the results in [10], where we used TE κα to characterize the same database, lay mostly in the fact that in this study we select and analyze one 2 s long time window covering the period right after the end of the visual cue, while in [10] we report results from multiple overlapping windows covering the entirety of the task.Lastly, the large standard deviations from the average accuracies in Figure 6 point to disparate performances for different subjects.Figure 7A shows the highest average classification accuracy per subject for TE θ κα , GC θ and PSI, during the training-validation stage.The subjects are ordered from highest to lowest performance.The analogous information for the testing stage is presented in Figure 7.In both stages, the TE θ κα -based classifier performs slightly better than those based on alternative connectivity estimation strategies in most subjects.In addition, as inferred from Figure 6, there are large variations in performance for the different subjects in the database, consistent across the two classification stages.This behavior has been reported elsewhere [10,[59][60][61][62].In order to gain insight into the observed performance differences, in the case of TE θ κα , we exploited the second advantage provided by the CKA-based relevance analysis.The relevance vector index not only allows us to perform feature selection but also provides a one-to-one relevance mapping to each connectivity feature.That is to say, we can reconstruct normalized relevance connectivity matrices by properly reshaping , so as to visualize the connectivity pairs and frequency ranges that are discriminant for the task of interest.In that line, we followed the approach proposed in [23] to interpret the relevance information by clustering the subjects according to common relevance patterns.
First, for each subject and frequency band of interest, we obtained a relevance vector n,∆ f ∈ R C whose elements were associated with each node (EEG channel) in the data by computing the relevance of the total information flow of every node.Such magnitude was defined as the sum of the relevance values , obtained from all data in the training dataset, corresponding to all directed interactions targeting and originating from a particular node.Then, we concatenated the vectors n,∆ f ∈ R C for all frequency bands to obtain a single relevance vector n ∈ R 2C .Next, we reduced the dimension of the relevance vectors n of each subject through t-Distributed Stochastic Neighbor Embedding (t-SNE), which preserves the spatial relationships existing in the initial higher-dimensional space [63].Figure 8A shows the obtained two-dimensional representation of the relevance vectors for each subject in the MI database, colored according to their respective classification accuracy.
Note that the distribution of the subjects in the plot is related to their classification accuracies.This indicates that shared relevance patterns are related to the obtained classification results, meaning that subjects with similar n had close performances.Then, we grouped the subjects into two clusters using the k-means algorithm.The number of clusters was selected by visual inspection of the t-SNE results.Figure 8B  Finally, Figure 9 shows the average nodal relevance, as defined by n , and the most relevant connectivities for each group, discriminated by frequency band.For G. I we observe high node relevance mostly in the α band in right fronto-central, left-central, and centro-parietal regions.The most relevant connections in the α band tend to originate or target fronto-central nodes, while the ones in the β l band favor parietal and centroparietal areas.For G. II, the node relevance is concentrated around the right centro-parietal region, particularly channel CP4, for both frequency bands.The most relevant connections in the α band involve short-range interactions mainly between centro-parietal and central regions.The most relevant connections in the β l band, which display higher values than those of α, originate from CP3 and CP4 and target central and fronto-central nodes.Since the G. II includes all the subjects with good classification performances, we can conclude that the information that allows to satisfactorily classify the left and right hand MI tasks from TE θ κα features corresponds mostly to the incoming and outgoing information flow coded in the phases of the oscillatory activity in the centro-parietal region.These results are in line, in terms of spatial location, with those we found in [10], and with physiological interpretations that argue that MI activates motor representations in the parietal areal and the premotor cortex [64].

Working Memory Results
Figure 10 presents the average classification accuracy for all subjects in the WM database as a function of the number of selected features, for TE κα and TE θ κα .The results show that the classifier trained from phase TE features markedly outperforms the one trained using real-valued TE estimates, as long as the appropriate percentage of features is selected.This difference might be attributed to the hypothesized phase-based nature of directed interactions during WM tasks [35,50], which would be better captured by phase TE.Furthermore, both accuracy curves highlight the importance of feature selection, since they show a steep performance degradation as more features are used to train the classifiers.In this case, the CKA-based relevance analysis not only allows reducing the number of features needed to successfully classify the three cognitive load levels present in the WM data but also prevents the classifiers from being confounded by connections that do not hold relevant information to discriminate between the target conditions.obtained for the MI database, we do not observe an underperforming group of subjects, especially after considering the fact that for the WM database the classifiers must discriminate among three classes instead of two.On the other hand, in this case, the TE θ κα -based classifier largely outperforms those based on alternative connectivity estimation strategies in most subjects.Here, we must point out that the auxiliary cross-validation step introduced for feature selection, aiming to obtain stable CKA results for the reduced number of available trials, leads to data leakage.This is because, ultimately, it requires all the available data to estimate ¯ , which renders it a nonviable approach for practical BCI implementations and can inflate performance evaluations, such as the accuracy results previously discussed.However, since the same strategy was implemented for all classification systems and connectivity measures considered for the WM database, comparisons among them remain valid, and the relative differences in performance are still informative.In order to elucidate the pairwise connectivities, and their corresponding frequency bands, that allow the TE θ κα -based classification system to successfully discriminate among different memory loads, we proceeded as described in Section 4.2.1 and from ¯ obtained a node relevance vector ¯ n ∈ R 3C .Then, we applied t-SNE on ¯ n .Figure 12A shows the obtained two-dimensional representation of the relevance vectors for each subject in the WM database.Unlike the results observed before for the MI database, there is not a clear association between the subject distribution on the plot and their classification accuracies.Nonetheless, Figure 12A shows the presence of well-defined groups sharing similar relevance patterns.As before, we grouped the subjects into clusters using the k-means algorithm.The number of clusters was selected as three by visual inspection of the t-SNE results.Figure 12B displays the three groups, termed G. I, G. II, and G. III.The TE θ κα -based classifier has average accuracies of 0.94 ± 0.04, 0.92 ± 0.08, and 0.93 ± 0.08 for the subjects in G. I, G. II, and G. III, respectively.
Lastly, Figure 13 shows the average nodal relevance, as defined by n , and the most relevant connectivities for each group, discriminated by frequency band.For G. I we observe widespread high node relevance in both the α and β l bands and low node relevance in the θ band.Most relevant connections are present in the β l band with many connections originating in the parieto-occipital region and targeting frontal and centro-frontal areas.For G. II and G. III node relevance is more evenly distributed across the three frequency bands considered.Spatially, it is more prominent around some pre-frontal, frontal, centroparietal, and parietal nodes.In terms of the most relevant connections, we observe longrange contralateral interactions involving mostly the regions previously listed, as well as some connections to and from temporal areas.Therefore, we argue that the information flow between frontal, parietal, and temporal regions, coded in the phases of oscillatory activity in the θ, α, and β l bands, is what allowed us to discriminate among different memory loads from TE θ κα features.These results agree with several studies that identify fronto-parietal and fronto-temporal neural circuits operating in frequency ranges spanning from θ to β as key during the activation of working memory [35,50,51].

Limitations
In this study, we employed Morlet wavelets as filters for instantaneous phase extraction prior to phase TE estimation, as proposed in [8].However, as discussed by the authors in [8], the choice of filter can influence the behavior of phase TE.This is an aspect we have yet to explore for our proposal.In the same line, in [42] the authors showed, using the Kraskov-Stögbauer-Grassberger TE estimator on real-valued filtered signals, that filtering and downsampling are deleterious for TE, since they can lead to altered time delays and hide certain causal interactions.Furthermore, from a conceptual perspective, while filtering dampens spectral power, it does not always remove the information contained in specific frequencies [25].This would hinder the isolation of frequency specific interactions in TE estimates from real-valued filtered data, the most common approach to obtain spectrally resolved TE values.Whether those effects are also present in the case of phase TE is yet to be analyzed; however, as pointed out in [25], phase TE is conceptually different from spectrally resolved TE.Additionally, the results obtained with our phase TE estimator for the NMM data closely follow those obtained with the PSI, a measure that does not rely on data filtering, which points to a certain degree of robustness to the negative effects that might be associated with phase extraction through complex filtering.A related issue is the possible effects on our proposal of the preprocessing pipelines employed on the EEG data, which involve spectral and spatial filtering.Regarding the former, we have not studied its effects in this work; while for the latter, surface Laplacian positively impacted the discrimination capability of the connectivity features obtained from the different measures considered.
In addition, we are yet to examine the effects of the parameter α in Renyi's entropy on the proposed phase TE estimator.In [10], we showed that the choice of α indeed modified the performance of the TE κα .The same must hold true for TE θ κα .Additionally, we selected the autocorrelation time and Cao's criterion to obtain the embedding parameters for all the TE estimation methods.More complex approaches such as time-delayed mutual information and Ragwitz criterion may yield better results [34].However, since our motivation was to propose a single-trial phase TE estimator suited as characterization method for BCI applications, the choice of simple parameter estimation methods is justified.As a matter of fact, a practical implementation of a phase TE-based BCI system would likely require further simplifications regarding parameter estimation, in order to facilitate the computation of phase TE in real time.Furthermore, our proposed phase TE estimator inherits the limitations of TE κα [10].Namely, it is ill suited to analyze long time series (several thousands of data points) because of the increase in computational cost, especially for non-integer values of the parameter α.In addition, it assumes stationary or weakly non-stationary data.Finally, since the definition of causality underlying TE is observational, the proposed phase TE estimator is blind to unobserved common causes, including those resulting from different driving delays.

Conclusions
In this work, we proposed a single-trial phase TE estimator.Our method combines a kernel-based TE estimation approach, which defines effectivity connectivity as a linear combination of Renyi's entropy measures of order α, with instantaneous phase time series extracted from the data under analysis.We tested the performance of our proposal on synthetic data generated through NMMs and on two EEG databases obtained under MI and WM paradigms.We compared it with commonly used single-trial TE estimators, applied to phase time series, and the PSI and GC.Our results show that the proposed phase TE estimator successfully detects the direction of interaction between individual pairs of signals, capturing the differences in coupling strength and displaying statistically significant results around the frequencies corresponding to the main oscillatory components present in the data.It also succeeds in detecting bidirectional interactions of localized frequency content and is robust to realistic noise and signal mixing levels.Moreover, our method, coupled with a CKA-based relevance analysis, revealed discriminant spatial and frequency-dependent patterns for both the MI and WM databases, leading to improved classification performance compared with approaches based on real-valued TE estimation.In all our experiments, the proposed single-trial kernel-based phase TE estimator is competitive with the comparison methods previously listed in terms of the performance assessment metrics employed.
As future work, we will look into developing a cross-spectral representation for our phase TE estimator to study directed interactions between oscillations of different frequencies [65].We will also explore the effects of the choice of filter on the proposed estimator as well as those of the parameters involved in time embedding and in our kernel-based TE estimation approach.

Figure 1 .
Figure 1.(A) Schematic representation of a neural mass model.(B) 1 s long unidirectionally coupled time series generated by the model.(C) Average power spectrums peaking in the α and lower β frequency bands.

Figure 2 .
Figure 2. (A) Schematic representation of the MI protocol.(B) EEG channel montage used for the acquisition of the MI database.

Figure 3 .
Figure 3. (A) Schematic representation of the WM protocol.(B) EEG channel montage used for the acquisition of the WM database.

Figure 4 .
Figure 4. Schematic representation of our overall classification setup.

Figure 5 .
Figure 5. Obtained results for the experiments performed using simulated data from NMMs.Column (A) shows the average connectivity values obtained for different levels of coupling strength.Column (B) presents the average connectivity values estimated for ideal signals and for signals contaminated with noise and signal mixing.Column (C) displays the average connectivity values obtained for bidirectional narrowband couplings.The rows correspond to each of the net phase-based effective connectivity estimation approaches considered for the aforementioned experiments.Circled values indicate statistically significant results at a Bonferroni-corrected alpha level of 3.3 × 10 −4 , according to a permutation test based on randomized surrogate trials.

Figure 6 .
Figure 6.Average classification accuracies, and their standard deviations, for all subjects in the MI database as a function of the number features selected to train the classifiers.

Figure 7 .
Figure 7. (A) Highest average classification accuracy for each subject in the MI database during the training-validation stage.(B) Accuracies obtained for each subject during the testing stage.The subjects are ordered from highest to lowest performance according to the accuracies obtained for the TE θ κα -based classifier in the training-validation stage.

Figure 8 .
Figure 8. (A) Two-dimensional representation of the relevance vectors for each subject in the MI database obtained after applying t-SNE on n .(B) Groups identified by k-means.For the TE θ κα -based classifier the subjects grouped in G.I have an average accuracy of 0.59 ± 0.05, while those in G. II have an average accuracy of 0.80 ± 0.09.

Figure 9 .
Figure 9. Topoplots of the average node (channel) relevance for each group of clustered subjects and frequency band of interest in the MI database (see Figure8).The arrows represent the most relevant connectivities for each group.For visualization purposes, only 3% of the connections, those with the highest average relevance values per group, are depicted.

Figure 10 .
Figure 10.Average classification accuracies, and their standard deviations, for all subjects in the WM database as a function of the number features selected to train the classifiers.

Figure 11
Figure 11 depicts the highest average classification accuracy per subject for TE θ κα , GC θ and PSI.The subjects are ordered from highest to lowest performance.Unlike the results

Figure 11 .
Figure 11.Highest average classification accuracy for each subject in the WM database.The subjects are ordered from highest to lowest performance according to the accuracies obtained for the TE θ κα -based classifier.

Figure 12 .
Figure 12. (A) Two-dimensional representation of the relevance vectors for each subject in the WM database obtained after applying t-SNE on n .(B) Groups identified by k-means.For the TE θ κα -based classifier the subjects grouped in G. I, have an average accuracy of 0.94 ± 0.04, while those in G. II and G.III have average accuracies of 0.92 ± 0.08 and 0.93 ± 0.08, respectively.

Figure 13 .
Figure 13.Topoplots of the average node (channel) relevance for each group of clustered subjects and frequency band of interest in the WM database (see Figure12).The arrows represent the most relevant connectivities for each group.For visualization purposes, only the 1% of the connections, those with the highest average relevance values per group, are depicted.
While for K , the vectors a i , a j ∈ R d contain the time embedded version of y, y

Table 1 .
MI and WM classification results in terms of the classification accuracy for all the effective connectivity measures considered.