Estimating Directed Phase-Amplitude Interactions from EEG Data through Kernel-Based Phase Transfer Entropy

: Cross-frequency interactions, a form of oscillatory neural activity, are thought to play an essential role in the integration of distributed information in the brain. Indeed, phase-amplitude interactions are believed to allow for the transfer of information from large-scale brain networks, oscillating at low frequencies, to local, rapidly oscillating neural assemblies. A promising approach to estimating such interactions is the use of transfer entropy (TE), a non-linear, information-theory-based effective connectivity measure. The conventional method involves feeding instantaneous phase and amplitude time series, extracted at the target frequencies, to a TE estimator. In this work, we propose that the problem of directed phase-amplitude interaction detection is recast as a phase TE estimation problem, under the hypothesis that estimating TE from data of the same nature, i.e., two phase time series, will improve the robustness to the common confounding factors that affect connectivity measures, such as the presence of high noise levels. We implement our proposal using a kernel-based TE estimator, deﬁned in terms of Renyi’s α entropy, which has successfully been used to compute single-trial phase TE. We tested our approach on the synthetic data generated through a simulation model capable of producing a time series with directed phase-amplitude interactions at two given frequencies, and on EEG data from a cognitive task designed to activate working memory, a memory system whose underpinning mechanisms are thought to include phase–amplitude couplings. Our proposal detected statistically signiﬁcant interactions between the simulated signals at the desired frequencies for the synthetic data, identifying the correct direction of the interaction. It also displayed higher robustness to noise than the alternative methods. The results attained for the working memory data showed that the proposed approach codes connectivity patterns based on directed phase– amplitude interactions, that allow for the different cognitive load levels of the working memory task to be differentiated.


Introduction
Biological neural systems exhibit rhythmic activity across many scales, from the spiking activity of individual neurons to the electric potentials generated by large populations of neurons, measurable from the scalp in the form of electroencephalographic (EEG) recordings [1]. Interactions between oscillations of different frequencies, known as cross-frequency couplings (CFCs), have been hypothesized to be directly related to the integration of distributed information in the brain [2] by regulating and synchronizing multi-scale communication within and across neural ensembles [3]. The most widely studied instance of CFC is the modulation of the amplitude envelope of high-frequency oscillations by the phase evolution of low-frequency activity, known as phase-amplitude coupling (PAC) [2][3][4]. These phase-amplitude interactions seem to be linked to normal and pathological brain processes in different mammalian species, including humans [5], and they were locally and interregionally observed across a wide range of cognitive tasks [6]. Theoretically, PAC allows for information transfer from large-scale brain networks associated with low-frequency oscillations to local, fast cortical processing areas exhibiting high-frequency activity [7].
Phase-amplitude interactions are commonly assessed using electrophysiological data through metrics of statistical dependency, such as the modulation index, mean vector length, and variations in the concept of mutual information [8][9][10]. However, they are unable to capture the directionality and delay of phase-amplitude interactions, quantities that are intrinsic to the concept of information being sent from one neural assembly to another [1,5,11]. A natural solution to this limitation, within the framework of information theory, would be to assess PACs using transfer entropy [3,9,[11][12][13]. Transfer entropy (TE) is a model-free connectivity measure that estimates the directed interaction, or information flow, between two dynamical systems [14,15]. It is specially well-suited to exploratory analysis in neuroscience because of its ability to detect unknown non-linear interactions [16,17]. However, as a model-free information-theoretic measure, it does not capture the details of how the information transfer is carried out [1]. Standard TE will not reveal anything about the spectral characteristics of the interactions it detects. The most common approach explored in the literature to assess directed phase-amplitude interactions through TE involves bandpass-filtering the signals in the target frequency bands before the extraction of the instantaneous phase and amplitude time series, which are then fed to a TE analysis [5,18]. Nonetheless, it has been argued that filtering before TE computation negatively affects the results of TE [19,20], leading to false positives, and that it may not have the desired frequency-specific effects [1].
In this work, we reframe the problem of estimating directed phase-amplitude interactions through TE as the computation of TE between two instantaneous phase time series, known as phase TE [21], by borrowing the underlying premise of the cross-frequency directionality (CFD), a linear connectivity measure capable of estimating the PAC direction [11]. We previously addressed the problem of phase TE estimation from single trial data [22] and hypothesized that, for directed PAC, the proposed approach can correctly identify the interacting frequencies, as well as the direction of interaction, while being robust to common factors that degrade the performance of connectivity estimation methods, such as the presence of high levels of noise and volume conduction effects [23].
To test our proposal, we use a simulation model that allows generating synthetic data with unidirectionally phase-amplitude couplings at two target frequencies [11], and real EEG data from a change detection task with several difficulty levels designed to study visuospatial working memory [24,25]. Working memory (WM) is a memory system of limited capacity with the ability to store and manipulate information for a short period of time [26,27]. Current hypotheses for the mechanisms underpinning WM highlight the role of PAC [28], pointing to bidirectional interactions between the θ (4 Hz-7 Hz), α (8 Hz-12 Hz), and β bands (13 Hz-30 Hz) (particularly around 13.5 Hz to 16 Hz for the latter) linking the prefrontal cortex to parieto-occipital and medial temporal regions during the activation of WM [6,29]. Obtained results for the simulated data show that the proposed approach successfully captures statistically significant phase-amplitude interactions, correctly identifying the direction of interaction and the target frequencies under noisy and signal-mixing conditions. Furthermore, the results for the WM data reveal that our proposal captures discriminant phase-amplitude connectivity patterns that allow the cognitive load associated with a trial of the change-detection task to be detected.
The remainder of the paper is organized as follows: in Section 2, we present our proposal to estimate directed phase-amplitude interactions through a phase TE-based approach. Section 3 describes the two experiments we carried out to test the performance of our method. In Section 4, we present and discuss our results and, finally, we provide conclusions in Section 5.

Transfer Entropy
Transfer entropy (TE) is an information-theoretic measure of directed interactions between two dynamical systems [14,15]. Given two time-series x = {x t } T t=1 and y = {y t } T t=1 , with t ∈ N a discrete time index, T ∈ N, TE is defined as: where x dx t , y dy t ∈ R D×d are time-embedded versions of x and y, D = T − (τ(d − 1)) with d, τ ∈ N the embedding dimension and delay, respectively; u ∈ N stands for the interaction delay between the two systems, and p(·) represents a probability density function [30]. The time embeddings are defined as [31,32]. TE, as in Equation (1), evaluates the statistical causality from x to y by measuring whether the information contained in the previous x, alongside that of the previous y, is better at predicting the future of y than the information from the past of y alone. If this is the case, then x causes y (in the sense of Wiener's definition of causality [14]). For estimation convenience, we can also express TE as a linear combination of Shannon entropies: where H S (X) = − ∑ x p(x)log(p(x)), X is a discrete random variable (x ∈ X), and H S (·, ·), H S (·) stand for joint and marginal entropies.

Transfer Entropy for Directed Phase-Amplitude Interactions
The conventional approach to estimate directed phase-amplitude interactions through TE consists of two stages. The first stage is a component extraction stage, which involves complex-filtering, or performing a phase/amplitude decomposition, to extract instantaneous phase and amplitude time series (see Figure 1A). Then, there is a TE computation stage that simply consists of estimating the information flow between the previously extracted data [5,18]. Formally, given two time series x and y, to estimate the TE from the phase of x at a frequency f l (usually a low frequency) to the amplitude envelope of y at a frequency f h (commonly a higher frequency than f l ), we obtain complex time series s x ( f l ) = ς x e iθ x ∈ C T and s y ( f h ) = ς y e iθ y ∈ C T , which contain the filtered values of x and y at f l and f h , respectively; where θ x , θ y ∈ [−π, π] T t=1 are instantaneous phase time series, and ς x , ς y ∈ R T are amplitude envelopes [21]. Then, we compute the desired TE as: where θ x,dx t and ς y,dy t are time-embedded versions of θ x and ς y .

Cross-Frequency Directionality
The cross-frequency directionality (CFD) estimates the direction of interaction between the phase of low-frequency ( f l ) oscillations and the amplitude of faster, higher-frequency ( f h ) oscillations [11]. This is based on the phase slope index (PSI), which measures the coupling directionality between two oscillatory signals of similar frequencies [33]. Given two time series x and y, the CFD from x to y is computed as the PSI between x, the time series containing the slow oscillations of interest, and the amplitude envelope of y at f h (ς y ). Thus: where C xς = S xς / S xx , S ςς is the complex coherence, S xς ∈ C is the cross-spectrum between x and ς y , S xx , S ςς ∈ C are the auto-spectrums of x and ς y , d f ∈ R + corresponds to the frequency resolution, F indicates the frequency range over which the slope is summed, and stands for the fact that only the imaginary part of the sum is selected [33]. Therefore, the CDF estimates the slope of the phase difference, between the phases of x and ς y , as a function of frequency. That is to say, it translates the problem of estimating the direction of phase-amplitude interactions to estimating interaction between phases.

Phase Transfer Entropy and Directed Phase-Amplitude Interactions
We begin by noting that the conventional approach to estimate directed phaseamplitude interactions though TE, as expressed by Equation (3), implies the computation of TE from data of different properties, a phase time series θ x , which represents a circular variable, and a smooth, continuous amplitude envelope ς y . In this work, we reformulate the problem of directed phase-amplitude interaction detection using TE as a phase TE-estimation task [21]. We do this by applying the idea behind the CFD: obtaining the directionality of interactions between phase and amplitude time series can be redefined as estimating the direction of interaction between two-phase time series. We hypothesize that such a change can improve the robustness of phase-amplitude TE estimates to signal degradation by noise and volume conduction effects.
Given two time series x and y, we want to estimate the TE from the phase of x at a frequency of f l to the amplitude envelope of y at a frequency of f h . As before, θ x ∈ [−π, π] T t=1 and ς y ∈ R T are the corresponding phase and amplitude time series, obtained at the adequate frequencies. However, before TE computation, we obtain a complex representation of ς y at f l , s ς ( f l ) = ς ς e iθ ς ∈ C T , where θ ς , ∈ [−π, π] T t=1 is an instantaneous phase time series, and ς ς ∈ R T is an amplitude envelope (See Figure 1B). Next, we define: where θ x,dx t and θ ς,dς, t are time-embedded versions of θ x and θ ς .
x y Figure 1. (A) The conventional approach to capture directed phase-amplitude interactions through TE consists of estimating TE from the instantaneous phase and amplitude time series that were extracted at frequencies f l and f h , respectively. (B) Our approach implies a further step, comprising the extraction of the phase of the amplitude time series at frequency f l , to reformulate the problem as a phase TE estimation task.
The quantity in Equation (5) indicates the estimation of TE from two-phase time series extracted at the same frequency ( f l ). Thus, it corresponds to the definition of phase TE, a phase-specific, nonlinear directed connectivity measure introduced in [21]. In that sense, a robust, kernel-based approach to estimating phase TE from single-trial data was recently proposed by our group [22]. Following this approach, we can redefine Equation (5) as: with α = 1 and α ≥ 0, and X a discrete random variable. This is a generalization of Shannon's entropy, and tends towards it in the limiting case where α → 1. Its kernel-based formulation can bypass the need for direct probability estimation from the data, instead relying on the kernel matrices that capture similarity relationships [34]. It is defined as: where A ∈ R n×n is a kernel matrix containing elements a ij = κ(x i , x j ), n is the number of realizations of X, and tr(·) stands for the matrix trace. We used this formulation, along with its definition for joint probability distributions: where B ∈ R n×n is another kernel matrix, and the operator • stands for the Hadamard product, to successfully and robustly estimate TE for real-valued and instantaneous phase time series [16,22]. In this study, we apply it in the context of directed phase-amplitude interaction estimation, following the TE estimation approach presented in Equation (6).

Simulation Model
To evaluate the performance of our proposal, we generate simulated time series using a modified version of the PAC modeling strategy introduced in [11]. The model simulates a directed interaction from the phase of a time series x ∈ R N , at a low frequency f l , to the amplitude of a time series y ∈ R N , at a high frequency f h . We implement the model as follows: first, we build d time series segments x i corresponding to a sinusoidal signal period: . . , T i }, and dt = 1/ f s , with f s the sampling frequency in Hz; as shown in Figure 2A. For each segment i, the amplitude A i and the period T i are drawn from Gaussian distributions with means A = 1 and T = 1/ f l and standard deviations of 0.1A and 0.2T, respectively. Then, we concatenated the d segments to obtain a continuous signal of varying amplitude, as depicted in Figure 2B, x ∈ R N = x 1 , x 2 , . . . , x d , and with a power spectrum peaking at around f l . Next, we generate a signal oscillating at f h , whose amplitude is a function of x , by defining: where t = {0, dt, 2dt, . . . , (N − 1)dt}, ζ = f l / f h , c = 0.6, and a = 10 (see Figure 2C). The constant c represents a threshold value, so that when x < c, the amplitude of y increases, a controls the steepness of that increase. Next, we imposed a directionality of interaction from x to y by time-shifting y by ∆t seconds y ∆t = y (t + ∆t). Afterward, we construct two pairs of auxiliary signals following the steps described above. From one pair, we select the signal with low-frequency components, x . From the remaining pair, we select the signal oscillating at the highest frequency, y . Finally, we define x = x + y , and y = x + y ∆t , so that, by design, x and y have closely resembling power spectra (see Figure  2D-F).
x d x y ∆t

Experimental Setup
We simulate 100 pairs (trials) of 2 s long signals, sampled at 1000 Hz, with directed phase-amplitude interactions from the θ band ( f l = 6 Hz) to the β band ( f h = 24 Hz), with a timeshift of 20 ms. We detrend and normalized the simulated signals, before contaminating them with normalized white (η) and pink (η p ) noise, as follows: where the parameter SNR controls the signal to noise ratio. We vary this to simulate low (SNR = 6), moderate (SNR = 3), and high (SNR = 1) noise conditions. For each scenario, we also mix the noisy signals, x η and y η , aiming to reproduce the effects of volume conduction, by defining x w η = 1 − w 2 x η + w 2 y η , and y w η = 1 − w 2 y η + w 2 x η , with w = 0.25 the mixing strength. Then, we downsample x w η and y w η by a factor of 4 (we keep only every fourth sample) to 250 Hz, and estimate the directed phase-amplitude interactions present in the data for a square frequency grid ranging from 3 Hz to 45 Hz, in 3 Hz steps, using the three approaches described in Section 2. Finally, to determine whether the estimated interactions are statistically significant, we perform permutation tests based on randomized surrogate trials [19,36] for each evaluated condition and frequency pair. The Bonferroni-corrected significance level for the tests is set at 4.4 × 10 −5 .

Database
The database of brain activity during visual working memory [24] consists of the EEG data recorded from twenty-three subjects while they performed a change detection task [25] (https://data.mendeley.com/datasets/j2v7btchdy/2, accessed on 4 September 2021). The subjects had normal or corrected-to-normal vision, and did not have color-vision deficiency. The EEG data were acquired from 64 electrodes (Biosemi ActiveTwo), arranged according to the international 10/20 extended system, at a sampling rate of 2048 Hz. In addition to the EEG data, the database contains recordings from two external electrodes placed on the left and right mastoids, and four EOG channels. The goal of the change-detection task is to remember the colors of a set of squares (the memory array), and then compare them with the colors of a second set of squares (the test array), which appear in the same locations as the first set. The task has three levels: low-, medium-, and high-memory load. Each level has a different number of elements in the memory array: one square (low-memory load), two squares (medium-memory load), and four squares (high-memory load). At the beginning of each task trial, an arrow pointing to the left or right hemifield appears on the screen, signaling to the subject that they must remember only the stimuli that will be displayed on that side of the screen. Next, a memory array is presented for 0.1 s, followed by a retention period of 0.9 s. Afterward, a test array appears, and the subject reports whether the colors of the squares in the memory and test arrays are the same. Figure 3A depicts the task's experimental paradigm. Each subject performed 96 trials (32 trials for each memory load level). The colors of the squares in the test and memory arrays were different in 50% of the trials.

Preprocessing
First, the data were re-referenced to the average of the recordings from the electrodes located on the mastoids, bandpass-filtered between 0.01 Hz and 20 Hz using an order 2 Butterworth filter, and segmented using a 1.4 s squared window in order to extract the data corresponding to each trial. A trial segment started 0.2 s before the presentation of the memory array. Then, independent component analysis (ICA) was performed on the trial data to eliminate ocular artifacts using the fastICA algorithm, as implemented in the MNE python package, and the EOG information [24]. Next, all trials for which the subjects incorrectly matched the memory and test arrays were dropped from further analysis. Afterward, the 32 channels (C = 32) depicted in Figure 3B were chosen from the 64 channels contained in the EEG data. Then, the trial data were downsampled by a factor of 2 (only every second sample is kept) to 1024 Hz, and underwent a final segmentation stage to select the part of the retention interval when the subject's working memories held the stimulus information. To that end, a 0.7 s long time window (M = 717) was used, starting 0.3 s after the onset of the memory array and ending before the test array appearsedon the screen, as schematized in Figure 3A. Finally, to reduce the effects of volume conduction, the surface Laplacian of each trial was computed through the spherical spline method for source current density estimation [37][38][39].     Subjects number 11 and 17 were discarded during preprocessing because their data contained strong artifacts in a very large number of trials. Subjects number 22 and 23 were reassigned as subjects 17 and 11, respectively.

Classification Setup
Our aim is to set up a subject-dependent classification system (one classifier per subject) that allows the different levels of the change detection task, to be differentiated using relevant directed phase-amplitude interactions captured through the proposed TE θθ ς κα approach as inputs.

Feature Extraction
Let Ψ = {X n ∈ R C×M } N n=1 be the EEG set holding N trials from the WM dataset, recorded from a single subject, with C as the number of channels and M as the number of samples of each trial. Let {l n } N n=1 be a set whose n-th element corresponds to a label assigned to the n-th trial X n . The elements of l n can take the values of 1, 2, and 3 to indicate low-, medium-, and high-memory loads, respectively. Our aim is to predict the label l n from the transfer entropy features TE θθ ς κα that capture the directed phase-amplitude interactions present in X n .
Let λ(x c → x c , f c , f c ) be a TE measure between the phase of channel x c at frequency f c , and the amplitude envelope of x c at frequency f c , as defined in Equation (6). For all pairwise combinations of channels in X n , computing λ( The values of f c , f c vary in the range from 4 Hz to 18 Hz, in 2 Hz steps, since activity in that frequency range has been associated with interactions between different brain regions during WM [6]. Then, three bandwidths are defined ∆ f ∈ {θ ∈ [4 − 6], α ∈ [8 − 12], β l ∈ [14 − 18]} Hz, and the matrices Λ( f c , f c ) are averaged within each pairwise combination of bandwidths (θ − α, θ − β l , etc.). After that, each of the averaged matrices are normalized to the range [0, 1], and stacked together. Therefore, each trial is characterized by a connectivity matrix Λ ∈ R C×C×6 . For the N trials, a set of connectivity matrices {Λ n ∈ R C×C×6 } N n=1 is used. Then, by applying vector concatenation to Λ n a vector φ n ∈ R 1×(C×C×6) is obtained. Finally, the N vectors φ n are stacked together in order to obtain a single bi-dimensional matrix Φ ∈ R N×P , P = C × C × 6, that characterizes Ψ in terms of phase-amplitude TE measures. In practice, for each subject, the training data Φ have the following dimensions: N rows, which correspond to the number of trials of the task correctly performed by the subject (N ≤ 96), and P = C × (C − 1) × 6 = 5952 features, where C − 1 accounts for the fact that the TE of a time series is not defined, and, therefore, the main diagonals of the Λ n matrices are empty (or all zeros, as in our case).

Feature Selection and Classification
In order to identify the TE features relevant to the discrimination between the cognitive load levels in the WM database, we rely on a relevance analysis methodology based on centered kernel alignment (CKA). CKA quantifies the similarity between two sample spaces by comparing two kernel functions that characterize each space [40]. This allows for a relevance vector index ∈ [0, 1] P to be built, which can be used to rank the features in the characterizing matrix Φ according to their discrimination ability. Each element p in signals how relevant the p-th feature is regarding its usefulness in distinguishing between the class labels in {l n } N n=1 , with higher values of p indicating higher relevance [22]. However, estimating CKA from data with a very low number of trials per class leads to unstable results. Given the fact that, in the WM database, each subject's dataset contains an average of less than 30 correct trials per class, we began by setting up an auxiliary cross-validation scheme to obtain an average per subject, termed¯ . This auxiliary crossvalidation scheme has 10 iterations. For each iteration, 80% of the trials were randomly assigned to a training set and the remaining 20% were assigned to a validation set (80/20). After the 10 iterations were performed,¯ was obtained as the average of the vectors computed in each iteration. Then, we set up a 10 iteration, 80/20, cross-validation scheme for use in subject-dependent classification. In this study, we used a support vector classifier (SVC) with an RBF kernel [41]. All classification parameters were tuned through a grid search and selected according to the classification accuracy, which was used as the system performance evaluation metric. For each iteration, we used¯ to rank the features in Φ from the highest to lowest relevance. Then, we selected a percentage of the ranked features, raging from 5% to 100% in 5% increments, and input the chosen features to the classification algorithm, with the most relevant features being input first. The percentage of discriminant features was tuned along with the other classification parameters.

Parameter Selection
For the TE methods, all parameters were estimated before extracting the phase and amplitude time series, that is to say, from the initial, real-valued time series data. The value of the embedding delay τ was set to 1 autocorrelation time (ACT) [31]. The embedding dimension d was automatically selected using Cao's criterion [36,42] from the range d = {1, 2, . . . , 10} [22]. The parameter u was selected as the delay producing the largest TE from the following ranges : u = {4, 8, . . . , 40} for the simulated data, and u = {50, 60, . . . , 250} for the WM data. The parameter α in Renyi's entropy was set to 2, which is neutral to weighting, an a reasonable, a priori choice [16,34]. A radial basis function (RBF) kernel with Euclidean distance [43], defined as: was employed as kernel function, where d 2 (·, ·) is a distance operator, and σ ∈ R + is the kernel's bandwidth. The bandwidth σ was set in each case as the median distance of the data [44]. For CFD estimation, we used a sliding window 5 frequency bins long. Furthermore, for all measures, the required phase and amplitude decompositions we carried out by convolving the real-valued data with a Morlet wavelet, defined as: where f is for the filter frequency, ξ t = m/2π f is the standard deviation of the wavelet in the time domain, and m controls the time/frequency resolution [21]. The parameter m was varied from 3 to 10 in a logarithmic scale, according to the selected frequency of the filter. Finally, all connectivity values were obtained through in-house implementations of the algorithms for the different measures studied. The Python implementation of the proposed TE θθ ς κα approach is available at https://github.com/ide2704/Directed_PAC_ through_kernel-based_Phase_Transfer_Entropy (accessed on 7 September 2021). Figure 4 presents the results obtained through the proposed TE θθ ς κα approach for the simulated data in the case where SNR = 3. Figure 4A shows the average values obtained for TE θθ ς κα (x w η → y w η , f l , f h ), with f l and f h varying in the range from 3 Hz to 45 Hz, in 3 Hz steps. That is to say, it shows the TE values that were computed following our proposal, assuming that the oscillation phases in x w η drove the amplitude envelopes of the oscillations in y w η . Similarly, Figure 4B shows the average obtained TE θθ ς κα (y w η → x w η , f l , f h ) values, where the underlying assumption is that oscillation phases in y w η are causal to the amplitude envelopes of the oscillations in x w η . Figure 4C,D display the results returned by the permutation tests carried out over the TE data that were estimated for all the simulated trials, and whose average values are displayed in Figure 4A,B, respectively (statistically significant results are indicated in white). The results show that the proposed approach captures strong and statistically significant phase-amplitude-directed interactions around the target frequencies used to generate the simulated data. Statistically significant results were only obtained when TE was estimated from the phase of x w η in the θ band, at around 6 Hz, to the amplitude of y w η in the β band, at around 24 Hz, and not when estimated assuming causality in the opposite direction.  Figure 5 shows the results of the permutation tests carried out on the connectivity values estimated for all trials from the simulated data using, from top to bottom, CFD, TE θς κα (conventional approach to estimate directed phase-amplitude through TE, using the kernel-based Renyi's α TE estimator proposed in [16] ), and TE θθς κα , under the three modeled noise conditions, from left to right, low (SNR = 6), moderate (SNR = 3) and high (SNR = 1) noise levels. In this case, all connectivity measures were obtained assuming the correct direction of causality, from x w η to y w η . As before, statistically significant results are displayed in white. For the low and moderate noise levels, the three connectivity estimation methods successfully capture statistically significant, phase-amplitude directed interactions around the target frequencies. Note that the CFD and TE θς κα display significant results on narrower frequency ranges around the interacting frequencies that were actually present in the data than TE θθς κα . This is likely to be associated with the additional filtering stages involved in the computation of TE θθ ς κα and, in the case of CFD, its high-frequency specificity is probably linked to its robustness against false positives [11]. However, unlike TE θθ ς κα , both CFD and TE θς κα failed to capture any statistically significant interactions for the high-noise scenario. Therefore, our proposal exhibits higher robustness to the noise present in the data. Thus, from the results presented in Figures 4 and 5, we can argue that the proposed TE θθ ς κα approach allows for directed phase-amplitude interactions to be uncovered, capturing their strength and direction, even in the presence of high levels of noise and a confounding factor such as signal-mixing due to volume conduction.

Working Memory Data
The ability to store and manipulate information for short periods of time, provided by WM, plays a key role in complex cognitive tasks such as comprehension, reasoning, planning and learning [29,45]. WM consists of three distinct stages of information-processing: encoding, maintenance or retention, and retrieval [6], with the retention interval being considered as a defining component of WM, since it differentiates it from other memory types [27]. The most widely recognized model of WM [26] describes it as a several separate but interacting subsystems: a central component (central executive), two stimuli-dependent storage subsystems (the phonological loop and the visuo-spatial sketchpad), and a system of limited capacity that allows for interaction between the other components (episodic buffer) [46]. Moreover, multiple studies have found that neural oscillatory activity in a wide range of frequencies is modulated during WM [47]. High-frequency activity, in the β and γ (>30 Hz) bands, seems to play a role in the encoding, retrieval, and maintenance of the stimulus; activity at lower frequencies, in the θ and α ranges, especially in frontal areas, is associated with the coordination and integration of different cognitive processes during the execution of WM tasks [48]. This has led to hypotheses about the cross-frequency coupling mechanisms underpinning WM, with oscillatory activity in the central executive component interacting with oscillations at other frequencies in the peripheral storage systems [12,49]. PAC interactions are thought to play a crucial role in WM [28], with bidirectional interactions linking the prefrontal cortex to parieto-occipital and medial temporal regions. [6,29] In this study, we test the proposed TE θθ ς κα approach for the estimation of directed phase-amplitude interactions in the context of an important topic in the study of WM: how the characteristics of brain activity are modulated by changes in memory loads, as induced by WM tasks with different difficulty levels [48]. This topic is closely related to the concept of WM capacity, the amount of information that can be maintained and manipulated in WM [45]. WM capacity is, in turn, linked to important abilities, including non-verbal reasoning and control of attention, among others, and can be altered in people with psychiatric disorders [50].
)HDWXUHV $FFXUDF\ We built a subject-dependent classification system based on features extracted through the proposed TE θθ ς κα method from the EEG data recorded during the retention interval of a change-detection task: a task designed to test visuospatial working memory capacity. The goal of the classifier is to assign a particular cognitive load (number of elements in the memory array) to EEG data from a task trial characterized using our proposal. Figure  6 presents the classification accuracy for all subjects in the WM database as a function of the percentage of features used to train the classifier. The best results are obtained when 5% of the features are selected. These are those with the with the highest relevance values according to the CKA-based relevance vector¯ , achieving an average classification accuracy of 95.9 ± 3.1% for the three classes corresponding to each cognitive load level in the change detection task. Figure 7 shows the highest accuracy obtained for each subject in the WM database, as well as the precision, recall, and F1 score values. Although the subjects differ in performance, the proposed classification system exhibits accuracies well above what would be expected in a three-class classification task for all of them. The precision, recall and F1 score values indicate that the accuracies obtained by the classifiers for each subject are not the result of class imbalances in the data or biased class selection by the classifiers themselves. Additionally, all subjects, except subject 3, achieve peak performance when the classifier is trained with only 5% of the connectivity features (subject 3 does so when 10% of the features are selected). These results imply that our TE θθ ς κα characterization strategy successfully captures the discriminant directed phase-amplitude interactions elicited during the change-detection task. Moreover, only a small fraction of those interactions are required to discriminate between the task's levels. In fact, employing a higher percentage of features leads to a pronounced decrease in classification performance, as shown in Figure 6. This phenomenon can be explained because all channel connectivity analyses lead to datasets with a large number of features (see Section 3.2.3), which, in turn, leads to a well-known problem in machine learning: the curse of dimensionality. The larger the dimensionality of the data, the higher the chance that most training instances will be far away from each other, and that new instances will also be far away from those used to train the machine learning system. This makes it difficult to make good predictions [51]. In additiona, many of the obtained connectivity features will not provide useful information to discriminate between the conditions of the cognitive paradigm of interest [16], and will only add noise and complexity to the classification stage. Adequate feature selection reduces the dimensionality of the data by getting rid of non-relevant features (connectivity values), which can, as in our case, help with classification performance. At this point, it is worth noting that the vector¯ , used to rank the features, is obtained as the average of the relevance vectors stemming from the different folds of an auxiliary cross-validation scheme, outside of the cross-validation process for classification. Therefore, our classification setup suffers from data leakage, and, as a consequence, the above results area is probably higher than that which could be obtained from a classification system suitable for real-world applications. Nonetheless, our results still show that relatively few connectivity values describing directed phase-amplitude interactions can successfully discriminate between the brain activity elicited at different levels of the change-detection task. Furthermore, the vector¯ provides valuable insights regarding the directed phaseamplitude interactions that arise while the subjects perform the task. It does this by assigning a relevance value p to each column, or feature, in the characterizing matrix Φ. In the case of our TE θθ ς κα analysis, p indicates whether a particular connectivity between two channels, for a specific frequency-band pair (θ − α, θ − β l , etc), allows for discrimination of the different cognitive loads of the task. A feature's discriminant ability is tantamount to its variation across classes. That is to say, the most relevant connectivities in¯ are those that change consistently across trials as a function of the cognitive load, which points to the involvement of those directed phase-amplitude interactions in the underlying working memory systems activated during the task.  Figure 8 displays the most relevant connectivities (average for all subjects), according to the relevance vector¯ , discriminated by the frequency band pair. The background topoplots show the average nodal relevance, which corresponds to the relevance of the total information flow of every node. This is defined as the sum of all the p in¯ associated with all directed interactions originating from and targeting a specific node or EEG channel [22]. We note that, in general, there is high nodal relevance for interactions where the phases of oscillations in the θ band drive the amplitudes of oscillations in the α and β l bands. However, the highest nodal relevance is achieved by the phase-amplitude interactions from α to β l band activity. As expected, the relevant connectivities are distributed in a similar way in terms of frequency. Spatially, they tend to involve long-range interactions linking frontal, temporal and parietal regions. In particular, for the phase-amplitude interactions from α to β l , there are multiple bidirectional connections between channels on yhr frontal and prefrontal areas and channels on yhr parietal and parieto-occipital areas. There are also several relevant long-range connections targeting the right temporal region. These results coincide with those previously obtained from the within-frequency, phase-based connectivity analysis carried out on the same database [22], as well as with studies that identified the presence of fronto-parietal and fronto-temporal interactions during cognitive tasks that activate visuo-spatial working memory [6,29,49].

Limitations
The results obtained for the simulated data show that our proposal is, unlike the other tested approaches, able to detect statistically significant, directed phase-amplitude interactions in data with high levels of noise. However, it tends to be less frequency-specific than the CFD and the conventional approach for the estimation of directed phase-amplitude interactions through TE. Therefore, our method is well-suited to the analysis of noisy data, but care should be taken with results under more ideal conditions. The proposed approach also relies on the kernel-based, phase TE-estimation method introduced in [22]; therefore, it suffers from the same limitations, especially regarding the selection of the many parameters involved in TE computation. Here, we employed relatively simple parameter selection strategies, as described in Section 3.3, but it is possible that more elaborate strategies [52] may lead to better results. Additionally, the TE estimator used assumes stationary or weakly non-stationary data, and cannot distinguish direct interactions from those originating from unobserved common causes [16]. Finally, our proposal is strictly limited to the detection of directed phase-amplitude interactions. It is unable to capture any of the other types of cross-frequency interactions that arise in oscillatory neural activities (phase-phase coupling, phase-frequency coupling, etc.) [3]. it was also not designed to be full information-theoretic analysis that simultaneously accounts for the multiple rhythms that can interact at both the sender and receiver side of coupled dynamical systems [1].

Conclusions
In this work, we proposed a novel approach to estimate directed phase-amplitude interactions through TE. The central idea behind our proposal is to recast the problem of detecting directed PAC as that of estimating directed interactions between phase time series. Doing this allowed us to employ a single-trial, kernel-based phase TE-estimator to assess the cross-frequency interactions of interest. We tested the performance of our proposal on synthetic data containing directed phase-amplitude interactions and on an EEG database obtained under a WM paradigm. The obtained results for the synthetic data show that our approach successfully detects the direction of interaction and the interacting frequencies, while being more robust to noise than alternative methods. Additionally, for the WM data, our proposal revealed discriminant, directed, phase-amplitude interactions associated with the different cognitive loads of the task.
In future work, we will attempt to optimize the strategies for selecting the parameters involved in the filtering and time-embedding stages of the proposed approach and the kernel-based TE estimator. Álvarez-Meza thanks the project Prototipo de interfaz cerebro-computador multimodal para la detección de patrones relevantes relacionados con trastornos de impulsividad-HERMES 50835, funded by Universidad Nacional de Colombia.

Institutional Review Board Statement:
In this study, we use a public-access EEG database introduced in a previously published work, and made freely available by its authors [24]. We did not collect any data from human participants ourselves.

Informed Consent Statement:
This study uses an anonymized public database introduced in a previously published work by another group [24].

Data Availability Statement:
The database used in this study is public and can be found at the following link: database from brain activity during visual working memory https://data.mendeley. com/datasets/j2v7btchdy/2 (accessed on 4 September 2021).

Conflicts of Interest:
The authors declare that this research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.