A Discriminative Multi-Output Gaussian Processes Scheme for Brain Electrical Activity Analysis

: The study of brain electrical activity (BEA) from different cognitive conditions has attracted a lot of interest in the last decade due to the high number of possible applications that could be generated from it. In this work, a discriminative framework for BEA via electroencephalography (EEG) is proposed based on multi-output Gaussian Processes (MOGPs) with a specialized spectral kernel. First, a signal segmentation stage is executed, and the channels from the EEG are used as the model outputs. Then, a novel covariance function within the MOGP known as the multispectral mixture kernel (MOSM) allows us to find and quantify the relationships between different channels. Several MOGPs are trained from different conditions grouped in bi-class problems, and the discrimination is performed based on the likelihood score of the test signals against all the models. Finally, the mean likelihood is computed to predict the correspondence of new inputs with each class’s existing models. Results show that this framework allows us to model the EEG signals adequately using generative models and allows analyzing the relationships between channels of the EEG for a particular condition. At the same time, the set of trained MOGPs is well suited to discriminate new input data.


Introduction
From the neuroscience perspective, each physiological or cognitive process will produce a particular pattern of electrical interactions linking neurons from different brain regions. Therefore, the electrical response related to the interactions between neuron assemblies allows studying the brain function. Under such a perspective, the neuroscientists have found brain electrical activity (BEA) patterns associated with motor functions, cognitive processes, and neuropathologies [1]. However, the wide range of possible mental conditions hampers the BEA analysis and poses very challenging tasks [2]. In particular, capturing BEA by placing a set of electrodes over the scalp, known as electroencephalography (EEG), gathers and amplifies currents reflected in the brain cortex from all the possible brain sources, yielding a mixture of latent activity sources at each channel. For discovering such a latent activity, the literature considers different types of EEG analyses ranging from time and spectral domain processing [3,4], through connectivity measures between channels [5], to the complex network analysis [6].
Due to the non-stationarity behavior of EEG data, the classical temporal analyses cannot decode differences among several mental states or conditions. Besides, several studies demonstrate that the variations in the EEG oscillatory patterns play a fundamental role in the maintenance of brain functions and the identification of different neural conditions [7]. Hence, spectral decomposition approaches look for relevant information at the brain rhythms (namely alpha, beta, theta, delta, and gamma). For instance, performing working memory or creative tasks evokes discriminative oscillatory patterns [8,9]. In addition, neuropathologies, such as Alzheimer's disease, cause abnormal cortical neural synchronization at resting-state rhythms [10]. The general spectral analysis procedure consists of a channel-wise frequency band splitting, followed by an identification of relevant time intervals, usually supervised by a specialist. The subsequent stages compute descriptors from the time-frequency representation. The power spectral density (PSD) and spectral entropy are among the most considered descriptors due to their straightforward interpretation [11,12]. Although EEG frequency analysis has proven to extract useful information for brain function understanding, the channel-wise approach still lacks the interpretability of several cognitive processes because each channel only holds a reflected version of BEA from a neural assembly [13]. Besides, the low spatial resolution of EEG restricts the extracted information about the behavior of some brain regions involved in high-level cognitive or physiological processes [4].
In recent years, connectivity analysis techniques account for channel interdependencies to enhance EEG spatial resolution through functional relationships captured in the BEA. Metrics from several domains attempt to quantify different properties of the BEA. Specifically, the coherence (COH), phase value index (PVI), and phase-locking value (PLV) capture pair-wise channel dependencies in the frequency domain. Moreover, the latter two gained attention due to their non-linear capability for unraveling latent connectivity patterns, which have proven valuable for applications, such as motor imagery and emotion recognition [14,15]. Nevertheless, the selection of the connectivity measure is not entirely straightforward. Their dependence on an estimated cross-spectrum and subject data variability yield to low generalization capability in a wide range of scenarios [16,17].
The above issues demand reliable brain connectivity approaches to automatically identify the relevant spectral information from the inherent EEG uncertainty. In this regard, a probabilistic modeling framework, such as Gaussian Processes (GPs), along with an appropriate covariance function possesses the capability for characterizing the latent processes from BEA data [18,19]. Moreover, the extension to vector-valued processes or multi-output GPs (MOGPs) adjusts the probabilistic model to map inputs into a multidimensional output space as an EEG channel array [20,21]. Former GP applications to EEG analysis in stress detection and cognitive stimulus recognition demonstrated the potential of GPs for BEA data modeling [22,23]. Furthermore, a recently proposed covariance function, known as the multi-output spectral mixture (MOSM) analyzes dependencies with spectral information for multidimensional output processes [24]. The proposal novelty relies on the PSD design, ensuring the frequency constraints for real-valued signals. Additionally, the inverse Fourier transform of the MOSM covariance results in a temporal kernel with positive definiteness conditions that are harder to accomplish during kernel design.
In this work, we propose the extension of MOGPs with an MOSM kernel for BEA discrimination, termed MOSM-GP. To this end, we learn an MOSM-GP for each EEG trial in a training set to model and quantify channel relationship patterns associated with particular EEG conditions. Then, we implement a likelihood measure to label new trial data into a specific class. The proposed framework of discriminative MOSM-GP, termed DMOSM-GP, is tested in two publicly available EEG datasets acquired under emotional [25] and motor imagery [26] conditions. The attained results show that the likelihood measure of testing data on the trained MOSM-GPs corresponds to a reliable discriminative index.
The manuscript organization is as follows: Section 2.2 describes the theoretical background of MOSM-GPs and introduces the proposed framework. Section 3 presents the attained results on both EEG modeling and classification performance. Finally, Section 4 concludes the works with the most relevant findings.

EEG Databases
This work considers two publicly available EEG datasets for testing the proposed DMOSM-GP methodology. The two datasets are widely used in the development of neuroscience techniques as they hold challenging cognitive conditions. Contained EEG data allows testing the quantification of channel dependencies at the frequency level by the DMOSM-GP for binary classification experiments.
DEAP dataset "A Database for Emotional Analysis using Physiological data" (DEAP) contains EEG data from 32 subjects acquired under 40 emotion elicitation experiments, with one-minute recordings at 128 Hz and 32 channels distributed over the scalp [25]. Each participant rated the emotional stimulus at the end of a video, following two dimensions: arousal and valence.
Other scores, such as dominance and liking, were also reported, although arousal and valence are the most prominent dimensions for affective computing works. These dimensions characterize a more extensive range of emotions than just the classical categorical description in six basic emotions. For our experiments, we consider the classification valence dimension as either low (ranging from one to five) or high (from five to nine) valence. MI dataset The "Brain Computer Interface (BCI) competition 2008-Graz data set A" contains motor imagery experiments from nine subjects performing four specific tasks involving movements of hands and feet [26] under the motor imagery (MI) paradigm. The set of 22 EEG channels was band-pass filtered between 0.5 Hz and 100 Hz, and further down-sampled at 128 Hz. Each subject performed two experiment sessions, consisting of six runs and 48 six second-long trials per run and task. From the BCI dataset, we select the left-and right-hand movement imagination tasks for evaluating the proposed discriminative framework in a subject-wise scheme.

Cross-Spectral Estimation from Kernel Mixtures
The communication between neuron cells is the basis of every neuronal processing task. The electrical impulses result in every possible cognitive or physiological condition, such as behavior, sensation, thoughts, and emotions [27]. Due to equally measuring normal and abnormal BEA, EEG is considered a well-suited neuroimaging technique for diagnosis, treatment, and clinical procedures across several neurological pathologies [1]. Equation (1) presents the mathematical representation of the BEA from an EEG of C channels holding T time instants [28].
with t as the time sample positions of the recordings, and X = {x i } C i=1 holding the brain electrical responses measured by the EEG array at channel i. To quantify the spectral content between channels, the introduced cross-spectrum estimation relies on specific covariance functions. The Cramer's theorem states that a family of integrable functions {κ ij (τ)} C i,j=1 are the covariance functions of a weakly-stationary stochastic process if and only if they admit the following representation: being ι the imaginary unit, each S ij (ω) an integrate complex-valued function S ij : R → C that is also positive definite, and i, j the indices of two EEG channels. This relationship between covariance functions κ ij in the time domain with argument τ ∈ R and their corresponding spectral density S ij with arguments ω in the Fourier domain allows designing a desired spectral density and obtaining a covariance function [24]. Now, a family S = {S ij } C i,j=1 ∈ R C×C of positive-definite complex-valued functions can be used as cross-spectral densities for multi-output data [24]. These functions are designed by including specific parameters that allow physical interpretation of the obtained covariance kernel regarding the input data. Moreover, complex-valued and positive-definite matrices can be decomposed in the form S(ω) = R H (ω)R(ω) where R(ω) ∈ R Q×C , Q represents the rank of decomposition, and (·) H denotes the Hermitian operator. Since Fourier transforms and multiplications of squared exponential (SE) functions are also SE, the autocovariance function R i (ω) of i-th channel is modeled as the complex-valued SE in Equation (3).
With such a choice of functions, the cross-spectral density between channels i and j is given by Equation (4) Finally, in order to guarantee that the model is restricted to real-valued stochastic processes, the spectral density is reassigned to become symmetric with respect to ω by S ij (ω) → 1 2 (S ij (ω) + S ij (−ω)). Then, the inverse Fourier transform of the resulting cross-spectral density becomes the corresponding temporal domain real-valued kernel; the kernel and the symmetric version of the spectral density are presented in Equations (5) and (6), respectively.
where the term α ij = w ij √ 2π|σ ij | 1/2 absorbs the constant resulting from the inverse Fourier transform. Equation (5) allows computing the real-valued autocovariances (i = j) and cross-covariances (i = j) with negatively and positively correlated channels through the magnitude parameter α ij ∈ R; delayed channels through the θ ij = 0 delay parameter, and channels out-of-phase through the φ ij = 0 phase parameter. Moreover, increasing the rank of decomposition Q corresponds to considering more components in the multiple-output spectral mixture (MOSM) kernel as shown in Equations (7) and (8).
denoting the superindex (q) the q−th spectral component. Then, MOSM effectively computes autocovariance and cross-covariances through the spectral-mixture of positive-definite kernels from the Fourier transform of spectral functions S ij (ω). In practice, the adjustment of the cross-spectrum parameters should be performed on the evidence of the EEG data.

Multi-Output Spectral Mixture Gaussian Process
Given an EEG trial, the Gaussian Process (GP) probabilistic framework computes the mixture parameters by maximizing the data likelihood as the cost function. A Gaussian Process (GP) is a real-valued stochastic process ( f (t)) over a input set t, such that for any finite subset of inputs t ∈ {1, . . . , T}, the random variables f (t) are jointly Gaussian [20]. Additionally, the GP is uniquely determined by its mean function m(t) := E t ( f (t)), typically assumed m(t) = 0 and its covariance function κ(t, t) := cov( f (t), f (t )) ∈ R T×T known as the kernel.
Then, the multivariate extension of GPs is derived by assembling C different scalar-valued stochastic processes, one for each EEG channel. Any finite collection of values across all such processes are jointly Gaussian, termed multiple-output Gaussian Process (MOGP). This extension results in a vector-valued process f ∼ GP (m, K), where m(t) ∈ R TC is a concatenated vector from the mean vectors associated to the outputs and K ∈ R TC×TC a block partitioned matrix of the form [20]: where each block K(X i , X j ) is a T × T matrix denoting the covariance between output channels i, j. Furthermore, a multivariate kernel Therefore, we use the MOSM kernel in Equation (7) as the covariance function to be implemented within the MOGP. By defining such a process as the MOSM kernel, the model adjustment to the data is performed by maximizing the data log-probability.
Since the observations in the multiouput case are jointly Gaussian, they are concatenated into the vector y = [x 1 , x 2 · · · , x C ] ∈ R CT the channel observed value. Then, the negative log-likelihood (NLL) can be expressed as in Equation (10).
holding the complete set of parameters. As a result, minimization of NLL with respect to Θ designs an spectral kernel quantifying the EEG channel relationships at automatically tuned frequency bands.

Discriminative Scheme Using MOSM-GP
Let a set of N labeled BEA trials {χ n , l n } N n=1 , each of them belonging either class A or B, that is, l n ∈ {A, B}. In the case of DEAP dataset, classes correspond to low and high valence, while, for the MI dataset, left-and right-hand movement imagination are considered. As stated in Section 2.3, a single MOSM-GP models each BEA trial as the stochastic process f (l) n , resulting in N A and N B MOSM-GPs for classes A and B, respectively. Furthermore, on the evidence of a new BEA trial X * , the marginal likelihood for each learned MOSM-GP is computed as: n (X * ), K(X * , X * )). (11) By evaluating the marginal likelihoods on all training MOSM-GPs, the new BEA trial label is estimated as follows: where E{· : l n = l} denotes the expectation operator over training trials belonging to class l. Figure 1 illustrates the proposed discriminative MOSM-GP framework, termed DMOSM-GP.

Implementation Details
Before the model training stage, a channel selection is carried out to reduce the training computational cost. For the MI dataset, channels are selected based on the evidence that body movement triggers neural activity in the opposite brain hemisphere within the sensorimotor area. Regarding the DEAP dataset, channels related to brain regions more likely to participate in affective states are considered. Figure 2 depicts the subset of selected channels for both EEG datasets. Regarding the DMOSM-GP free parameter, the rank of decomposition is chosen from a grid search within the range Q ∈ {1, . . . , 10} to minimize the mean absolute error (MAE) of the model prediction against the original EEG data. The GPflow framework is employed for the model definition [29], and the kernel function is optimized via the minimization of NLL cost function using the autograd library. Finally, for the statistical significance assessment of the classification performance, an 10-fold cross-validation scheme was applied.

Parameter Tuning and Spectral Modeling
To tune the rank of decomposition Q, we evaluate the MOSM-GP performance for modeling the selected output channels from the data posterior distribution at specific temporal locations. Moreover, the mean absolute error (MAE) quantifies the difference between the target and predicted outputs as a function of the rank. Figure 3 presents the mean MAE across the GP outputs against the number of spectral components defined for the MOSM kernel. As a first insight, the MAE values evidence that the MOSM-GP effectively reconstruct EEG recordings. Nonetheless, the error increases over six spectral components, due to a large number of kernel parameters to be tuned, which in turn increases the computational complexity without providing relevant information to the probabilistic model. On the contrary, a single component lacks the complexity to account for the brain activity changes. Therefore, a rank of decomposition between three and five benefits the most the MAE performance, implying a balance between model complexity and generalization capability. Consequently, for the remaining of the work, we selected three spectral mixtures as the optimal Q for testing the MOSM-GP scheme. For the purpose of visualization, Figure 4 exemplifies the MOSM-GP output for channels FP1 and AF3 using Q = 3. As seen, the posterior MOSM distribution suitably models EEG data at all time locations with bounded deviations.
An analysis of the spectral information quantified by the MOSM-GP is carried out using Equation (6) that decomposes the spectral content shared by two channels into Q terms. Figure 5 plots the component-wise spectral distribution between channels FC2 and FC6 in a trial from left- (Figure 5a) and right-hand movement (Figure 5b). Attained spectra prove that each component automatically fits a particular frequency band. Moreover, the component magnitude α   Later, Equation (8) is used to compute the cross-spectrum visualizing the computed covariances by the MOSM kernel for every channel pair. Since there is valuable information on the analysis of the cross-covariances between channel pairs, a complete trial visualization of the quantified spectral information is presented in Figure 6a,b for left-and right-hand movement trials, respectively. Each one of the horizontal axis sections, corresponds to a particular channel i and its MOSM PSD against all the channels. The vertical axis represents the frequency bin at which the connectivity is assessed. Then, lighter green colors are related with strong spectral densities shared between i, j channels, while darker blue colors are related with lower interdependencies. Despite most of the strong spectral density being constrained to the [15,40] Hz band, there is clear evidence that not all the channels are synchronized at this frequency when performing MI activity. Particularly for Figure 6a, there are channels sections that seems to be uncorrelated in the complete spectrum for this particular task. For example, channels Cz, C2, and CP6 seem to have low frequency dependencies shared with the rest of the EEG array. On the other hand, channel CP2 presents strong connections with most of the channels. Furthermore, for the opposite MI task, there exist variations on the captured frequency relationships among the EEG array. C2 and C4 result as the most highly correlated the other channels when performing the MI task.   Figure 5. Normalized power spectral density S ij (ω) for a given pair of channels (FC2-FC6) from MI database for two opposite conditions. Top row for left-hand movement condition and low row for right-hand movement condition. Figure 5a,b are the spectral content per spectral component.

Discriminative MOSM-GP
For the DMOSM-GP framework, each EEG trial is trimmed into two-seconds lasting time series to find the desired spectral relationships. Table 1 presents the absolute value of the average likelihood as the class dependency measure, depicted in Equation (12). The first column corresponds to the test condition level regarding the emotional content (valence for this experiment), columns two and three are the values obtained for the model testing, and fourth is the valid tag of the corresponding test signal. Finally, the fifth column is the predicted tag from the magnitude of the mean test likelihood. Similarly, in Table 2, the results for the DMOSM-GP strategy over the BCI database are presented. The first column corresponds to the movement associated with the experiment. Columns two and three are the average likelihoods obtained by testing the new input against each class's models. The fourth column is associated with the resulting tag. Each row in Tables 1 and 2 corresponds to a particular trial, for emotional or motor imagery experiments.
From this test on complete datasets, it can be evidenced that the prediction regarding the likelihood of the trained model with the test signals is promising. The accuracy of the test data is around 73.3% for the DEAP database and about 87.33 for the BCI database. Specifically, for comparison purposes against works using the DEAP dataset, a selection of 9 subjects is performed. This selection is based on the evidence of an uncertainty test performed in [17], where the authors concluded that the subject itself did not adequately tag some experiments of the DEAP database. The scheme of implicit tagging used in this particular database allows the subjects to rate their affective result, so some of the acquired signals seem not correctly related to the emotional tests. The results reported in Table 3 show the accuracy resulting from the cross-validated DMOSM-GP scheme. The subject IDs reported are the same in the database, and some high accuracy (state-of-art comparable) ones were achieved around 78%, which is the case of subject 18.
On the other hand, for the BCI database, the complete set of 9 subjects is employed, and Table 4 reports the accuracy of classification within 5 folds of DMOSM-GP. Some higher accuracy was obtained for the MI database in the subject-dependent experiments around 87% for subject ID 2. In general terms, the results associated with the BCI database are higher than the DEAP database, and it can be related directly with the condition that is tested. In the case of emotional experiments, there is a high degree of subjectivity in the elicitation of the states. In contrast, in the case of motor imagery, the conditions are somehow more consistently replicated. Table 1. DMOSM-GP Test for brain electrical activity (BEA) analysis of emotional conditions of two classes. The mean absolute value of the likelihood from the test signals against the trained models is included-"A Database for Emotional Analysis using Physiological data" (DEAP) database. The boldface indicates the class with the highest log-likelihood.

Results Comparison
The results obtained from the DMOSM-GP strategy are compared with some state-of-the-art results in terms of classification accuracy. However, since the proposed methodology uses generative models for discriminative purposes, it has to be stated that the comparison should be made on a few additional items than the classification accuracy. As can be seen in Tables 5 and 6, the average classification results obtained by the DMOSM-GP strategy are competitive among the state-of-art works. It has to be noticed that this work does not implement a preprocessing or feature extraction stage but uses the acquired data to train the MOSM-GPs and perform the discriminative task.  [17].

Discussion and Concluding Remarks
Brain information processing is a complex task that is not yet entirely mapped and understood. Despite the previous knowledge about brain regions interactions in motor or emotional means, works that allow improving the interpretability of results in different BEA scenarios would lead to the development of more precise frameworks for analysis, diagnosis, and treatment of mental pathologies, among other tasks. In this work, we proposed a framework for discriminate BEA using raw data with the support of a spectral kernel that identifies relationships between channels on behalf of a probabilistic methodology of multi-output Gaussian process. One of the essential remarks of this framework is the capability of learning EEG connectivity patterns by estimating raw data spectral components without a feature extraction stage. Further, this proposal of generative models working directly with EEG data has the advantage of adjusting the model relying on the optimization of kernel hyperparameters just from the channels information in a data-driven framework.
The results presented in Section 3 evidenced that introducing the MOSM kernel to MOGPs becomes a reliable tool for BEA modeling, due to the spectral designing properties. It is well known that EEG channels share complex frequency relationships that can be exploited using the design of a covariance function in that particular domain. Regarding this property, the posterior distribution over the data measured by the MAE in Figure 3 shows an adequate adjustment of the model on the original data. It also allows us to conclude that the inclusion of more spectral components into the covariance function benefits the model adjusting to the data.
Moreover, the identification of the spectral relationships between channels, performed by the MOSM kernel, allows gaining a better understanding of the latent functional connectivity between brain regions. As Figure 6 evidenced, the MOSM-GP identifies representative frequency bands for the cognitive process. The positive linear relationships are grouped among channels from the same hemisphere, with strong specific dependencies between channels, like P3 − P7, and Fc6 − A f 4. The negative relationships are explained by lower PSD amplitudes at Cz, C2, and CP6 for the left-hand MI task, and C6, CP1, CP6 for the right-hand MI task. All these interactions quantified by the MOSM kernel in terms of higher or lower values of the PSD can be directly related with the activations of neural cells in different regions of the brain related with emotions (amygdala, hippocampus, and frontal cortex) or related with motor activities (primary motor cortex and posterior parietal cortex). Nevertheless, further studies must be completed from the evidence of these relationships to an accurate source reconstruction before determining the specific brain region of the acquired neural activity.
Finally, the discriminative results regarding the emotional and motor imagery conditions conclude that probabilistic models can be efficiently employed as a classification tool for EEG data. In this case, the probability distribution of tested data against the trained models directly becomes a classification algorithm by following a direct comparison of the mean likelihood value between the models from two classes. Despite lacking a feature extraction stage, the proposed DMOSM-GP produces discriminative information from the data. Moreover, the total accuracy of the subject-dependent and condition-dependent tests is comparable with state-of-art works, as Tables 5 and 6 illustrate.
One of the drawbacks of this framework is the computational complexity of training a considerable number of MOSM-GP models. In addition, increasing the number of MOSM kernel mixtures and the size of BEA data will derive into an exponential growth of the training time. Further improvements of this methodology will be directed towards using lighter versions of probabilistic models, such as the sparse GPs aiming at solving more difficult supervised learning tasks from EEG data as multilabel classification or regression.