Optimization of Alpha-Beta Log-Det Divergences and their Application in the Spatial Filtering of Two Class Motor Imagery Movements

The Alpha-Beta Log-Det divergences for positive definite matrices are flexible divergences, parameterized by two real constants, that specialize to several relevant classical cases such as the squared Riemannian metric, Stein's loss and the S-divergence. A novel classification criterion based on these divergences is optimized to address the problem of classifying motor imagery movements. This paper addresses the problem in three main steps: (1) first, it is proven that a suitable scaling of the class conditional covariance matrices can be used to link the Common Spatial Pattern (CSP) solution, with a predefined number of spatial filters for each class, to its representation as a divergence optimization problem, by making their different filter selection policies compatible; (2) a closed-form formula for the gradient of the Alpha-Beta Log-Det divergences is derived, which enables their optimization and eases their use in many practical applications; (3) finally, in the spirit of the work of Samek et al. (2014), which proposed robust spatial filtering of motor imagery movements based on the beta divergence, the optimization of the Alpha-Beta Log-Det divergences is applied to this problem. The resulting subspace algorithm provides a unified framework for testing the performance and robustness of the different divergences in several scenarios.


Introduction
Over the last few years, the use of specialized metrics and divergence measures in the design of dimensionality reduction techniques has progressively gained recognition [1][2][3][4][5]. There are numerous real scenarios and applications in which the parameters of interest belong to non-flat manifolds, and where the Euclidean geometry is unsuitable for evaluating similarities. Indeed, this is the usual case in the comparison of probability density functions and of their associated covariance matrices. The present contribution may be seen as a continuation of the work in [6], where we defined the Alpha-Beta Log-Det family of divergences between Symmetric Positive Definite (SPD) matrices and studied its properties. The Alpha-Beta Log-Det family unifies many existing Log-Det divergences under the same framework and connects them smoothly, through intermediate versions, with the help of two real hyperparameters: α and β. In [7], a recent extension of the Alpha-Beta Log-Det divergences to the infinite-dimensional setting was also proposed.
The evaluation of the Alpha-Beta Log-Det divergences depends on the generalized eigenvalues of the compared SPD matrices, which makes their optimization non-trivial. In this paper, we motivate the use of these divergences with the illustrative application of dimensionality reduction in Brain-Computer Interfaces (BCI) and explain how to perform their optimization. Electroencephalogram (EEG) data are typically high-dimensional, have a low signal-to-noise ratio and may contain artifacts/outliers. Dimensionality reduction is therefore a necessary processing step for extracting from the EEG signals those subspaces where the features have the highest discriminative power.
Brain-Computer Interfaces have gained considerable interest in neuroscience and rehabilitation engineering. BCI systems [8,9] enable a person to operate external devices by using brain signals. Among them, motor imagery (MI) based BCI systems are the most popular. They use the brain signals of the MI movements as control commands for external devices, without involving the peripheral nervous system. During the imagination process, an alteration in the rhythmic activity of the brain can be observed in the mu and β rhythms at the corresponding area of the sensory-motor cortex. This phenomenon is known as event-related synchronization (ERS) or event-related desynchronization (ERD) [10]. MI-based BCI systems use these activities as control commands. Such a system can potentially serve as a communication aid for people suffering from amyotrophic lateral sclerosis, multiple sclerosis or locked-in syndrome.
One of the most popular and efficient algorithms used in MI-based BCI applications is the common spatial pattern (CSP) algorithm [11]. It was first used to detect abnormalities in EEG signals [12] and was later introduced in BCI applications [13]. The main objective of CSP is to obtain spatial filters that maximize the variance of one class while, at the same time, minimizing the variance of the other class. It has been reported that this algorithm provides excellent classification accuracy for MI-based BCI systems. However, despite being the most popular method, its performance is easily affected by the presence of artifacts and nonstationarities. Since the computation of the spatial filters mainly depends on the covariance matrices, artifacts such as eye blinks, eye movements and improper placement of the electrodes lead to poorly estimated covariance matrices and, consequently, to poor classification performance.
The main contributions of this work are the following:

• The existing link between the CSP method and the symmetric KL divergence (see [1]) is extended to the case of the minimax optimization of the AB log-det divergences. In the absence of regularization, their solutions are shown to be equivalent whenever these methods apply the same divergence-based criterion for choosing the spatial filters. Although, in general, this is not the case when the CSP method adopts the popular practical criterion of fixing a priori the number of spatial filters for each class, we show that the equivalence with the solution of the optimization of AB log-det divergences can still be preserved if a suitable scaling factor κ is used in one of the arguments of the divergence.

• The details of how to perform the optimization of the AB log-det divergence are presented. The explicit expression of the gradient of this divergence with respect to the spatial filters is obtained. This expression generalizes and extends the gradient of several well-established divergences, for instance, the gradient of the Alpha-Gamma divergence and the gradient of the Kullback-Leibler divergence between SPD matrices.

• The robustness of the AB log-det divergence with respect to outliers is analyzed. The study reveals that the hyperparameters of the divergence can be chosen to underweight or overweight, at convenience, the influence of the larger and smaller generalized eigenvalues in the estimating equations.

• Motivated by the success of criteria based on the Beta divergence [1] in the robust spatial filtering of motor imagery movements, we consider the use of a criterion based on AB log-det divergences for the same purpose. A subspace optimization algorithm based on regularized AB log-det divergences is proposed for obtaining the relevant subset of spatial filters. Some exemplary simulations illustrate its robustness over synthetic and real datasets.
This article is organized as follows: Section 2 presents the fundamental model of the observations and the paper notation. Section 3 reviews the CSP algorithm, while Section 4 discusses CSP from the viewpoint of divergence optimization. In Section 5, we present the family of AB log-det divergences and provide a new upper-bound and conditions for the equivalence between this divergence optimization and the robust CSP solution. Section 6 explains how to obtain closed-form formulas for computing the gradient of the AB log-det divergence, which is useful for its optimization. The analysis of the robustness of the divergence in terms of its hyperparameters is the objective of Section 7. Section 8 briefly reviews several related techniques, while Section 9 presents the regularized version of the criterion based on AB log-det divergences, as well as the subspace algorithm that optimizes it. Section 10 presents the experimental datasets and the steps involved in preprocessing, feature extraction and classification. The results of the simulations are presented and discussed in Section 11. Finally, Section 12 summarizes the main results.

Notation and Model of the Measurements
Throughout this paper, the following notation is adopted. Vectors are typically denoted by bold letters, capital bold letters are reserved for matrices, while random variables appear in italic capital letters. The operators ⌊·⌋ and ⌈·⌉ round the value of their argument to the nearest lower and higher integers, respectively. All the covariance matrices, which are denoted by Cov(·), are assumed to be positive definite and hence invertible.
Let us now describe the statistical model of the observations. As usual, the raw EEG observations are initially preprocessed by a bandpass filter that retains the activity in the bands of the mu and β rhythms, and are later normalized for each trial so as to keep their total spatial power constant. One can define a statistical model of these "normalized" observations x(t) = [x_1(t), . . ., x_n(t)]^T conditioned on the true imagery movement, which here will be represented by a member of the class c ∈ {c_1, c_2}. In general, the EEG observations are noisy and high-dimensional, while the number of recorded trials is quite limited. Therefore, the learning of the discriminative features is quite sensitive to overfitting, a situation that would severely degrade the prediction accuracy over test samples. In this case, it is worth sacrificing the bias by choosing a simpler (less complex) model whose parameters can be estimated with a smaller variance. For this reason, we adopt the usual convention [14] of considering the observations from each class as drawn from independent and identically distributed (i.i.d.) Gaussian random vectors X|c of zero mean and with covariance matrix Cov(X|c), which in turn is set equal to the sample covariance matrix of the class. The observations are then modeled by the mixture distribution p(x) = Σ_c p(c) p(x|c), where p(c) refers to the sample probabilities of each class in the training data. When x̄ denotes the sample mean of the observations, their sample covariance matrix Cov(x) is obtained in the standard way, and its eigenvalue decomposition is Cov(x) = U_1 ∆ U_1^T, where ∆ and U_1 respectively denote the matrices of eigenvalues and eigenvectors of Cov(x).
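The estimation step above is simple to express in code. The following minimal numpy sketch (the variable names and the synthetic two-class data are our own illustrative assumptions) computes the zero-mean class-conditional sample covariances, the sample class probabilities, the mixture covariance Cov(x) = Σ_c p(c) Cov(x|c), and its eigenvalue decomposition:

```python
import numpy as np

rng = np.random.default_rng(0)

def class_covariances(trials_by_class):
    """Zero-mean Gaussian model: one sample covariance per class, plus the
    mixture covariance Cov(x) = sum_c p(c) Cov(x|c)."""
    covs, probs = {}, {}
    total = sum(x.shape[1] for x in trials_by_class.values())
    for c, x in trials_by_class.items():
        covs[c] = (x @ x.T) / x.shape[1]   # zero-mean sample covariance
        probs[c] = x.shape[1] / total      # sample class probability
    cov_x = sum(probs[c] * covs[c] for c in covs)
    return covs, probs, cov_x

# synthetic two-class data: n = 4 channels, T = 500 samples per class
n, T = 4, 500
X = {"c1": rng.standard_normal((n, T)),
     "c2": 2.0 * rng.standard_normal((n, T))}
covs, probs, cov_x = class_covariances(X)

# eigenvalue decomposition Cov(x) = U1 diag(delta) U1^T
delta, U1 = np.linalg.eigh(cov_x)
```

In a real pipeline, the entries of `X` would be the bandpass-filtered and power-normalized trials of each imagery class.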
We define w_i = [w_{1i}, w_{2i}, . . ., w_{ni}]^T as the vector with the coefficients of the i-th spatial filter, for i = 1, . . ., p. The collection of p spatial filters forms the overall filter matrix W = [w_1, w_2, . . ., w_p], which is used to reduce the dimensionality of the observations by projecting them, through y = W^T x, onto the p-dimensional subspace spanned by the filter outputs, where p ≪ n. The model for the estimated conditional distribution p(y|c) is, for each class, a multidimensional Gaussian of zero mean and covariance matrix Cov(Y|c) = W^T Cov(x|c) W.

The Common Spatial Patterns Algorithm
The development of the CSP algorithm as a technique for feature selection in classification problems can be traced back to the work of [11], while later [12,13] considered its practical application to the study of EEG signals. This technique exploits the event-related desynchronization during the imagination of limb movements, which alters the rhythmic activity in a class-dependent area of the motor cortex. The objective of the algorithm is to obtain a set of most discriminative spatial filters, i.e., those that hierarchically maximize the output activity of one class while, at the same time, minimizing the activity of the other class. Since only the directions of the spatial filters (i.e., not their scale) are of interest, the technique starts with a linear transformation y = W^T x that whitens the sample covariance of the outputs. With the help of the eigenvalue decomposition of Cov(x), the general expression of the spatial filter matrix that preserves the whitening constraint can be found. Note that W ∈ R^{n×p} is specified up to the ambiguity in the choice of the semi-orthogonal matrix Ω ∈ R^{n×p} (i.e., Ω^T Ω = I_p), which parameterizes the relevant degrees of freedom for finding the most discriminative directions. Then, the objective of the CSP criterion [11] is implemented by first choosing one part of the spatial filters from the constrained maximization of the conditional covariances of the outputs of the first class, and later choosing the other part of the filters to hierarchically maximize the conditional covariances of the outputs of the second class, where, in both cases, the maximization with respect to the spatial filters takes place under the whitening (or Cov(x)-orthonormality) constraints. The number of spatial filters k that hierarchically maximize (12) can be determined by a chosen filter selection policy. For simplicity, in most cases k is set close to p/2 with the aim of balancing the number of spatial filters devoted to each of the classes.
The maximization in (12) can alternatively be posed as the constrained optimization of a quotient which, in terms of the transformed and normalized spatial vectors, is rewritten as a quadratic optimization under orthogonality constraints. At this point, the straightforward application of the Courant-Fischer-Weyl minimax principle ([15], p. 58) yields a variational description of the desired spatial filters as the minimax solutions of the Rayleigh quotients for each class. By the same principle, the generalized eigenvectors v_i^{(c)} of the matrix pencil (p(c) Cov(y|c), Cov(y)) are the minimax solutions of the Rayleigh quotient, while the values that the criterion takes at these solutions are the generalized eigenvalues, which are sorted in descending order of magnitude, λ_1 ≥ · · · ≥ λ_n. The generalized eigenvectors of the two quotients (one for each class) coincide, except for their ordering, which is reversed [11], i.e., v_i^{(c_2)} = v_{n−i+1}^{(c_1)}, while the weighted sum of the generalized eigenvalues is bounded. Therefore, a direction of maximum variance for one class will simultaneously minimize the variance of the other class, and vice versa. Hence, the standard CSP solution is obtained when the spatial filters match the principal and minor eigenvectors of the generalized eigendecomposition problem [11][12][13]. After sorting the eigenvalues according to their magnitude, CSP explicitly selects k spatial filters v_i^{(c_1)} from the principal eigenvectors and p − k spatial filters from the minor eigenvectors, to form the spatial filter matrix.
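The CSP computation described above (whitening with Cov(x)^{-1/2}, eigendecomposition of the whitened class covariance, then selection of k principal and p − k minor eigenvectors) can be sketched as follows; the helper name and the synthetic matrices are illustrative assumptions, not code from the cited works:

```python
import numpy as np

def csp_filters(P, Q, p1, p2, p, k):
    """Spatial filters from the pencil (p1*P, Cov(x)), with Cov(x) = p1*P + p2*Q:
    k principal and p - k minor generalized eigenvectors, back-projected so that
    the outputs satisfy the whitening constraint W^T Cov(x) W = I_p."""
    n = P.shape[0]
    cov_x = p1 * P + p2 * Q
    d, U = np.linalg.eigh(cov_x)
    whiten = U @ np.diag(d ** -0.5) @ U.T            # Cov(x)^{-1/2}
    lam, V = np.linalg.eigh(whiten @ (p1 * P) @ whiten)
    order = np.argsort(lam)[::-1]                    # sort eigenvalues descending
    lam, V = lam[order], V[:, order]
    idx = list(range(k)) + list(range(n - (p - k), n))
    return whiten @ V[:, idx], lam

rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n)); P = A @ A.T + n * np.eye(n)
B = rng.standard_normal((n, n)); Q = B @ B.T + n * np.eye(n)
W, lam = csp_filters(P, Q, 0.5, 0.5, p=4, k=2)
cov_x = 0.5 * P + 0.5 * Q                            # mixture covariance
```

The returned filters satisfy the Cov(x)-orthonormality constraint, and the filtered class covariance W^T p(c_1)P W is diagonal, with the selected generalized eigenvalues on the diagonal.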

The Divergence Optimization Interpretation of CSP
Under an appropriate selection policy for the number of spatial filters of each class, the solution obtained by the CSP algorithm admits an interpretation in terms of the optimization of divergence measures (here denoted by Div(·‖·)) between the output Gaussian pdfs of each class, except for a possible permutation in the ordering of some of the spatial filters. The problem can be formulated as an optimization problem where D(·‖·) refers to a divergence between the covariances of the conditional densities of the outputs. As a consequence of the assumption of zero-mean Gaussian densities, the covariances are the only necessary statistics that summarize all the relevant information of the conditional data.
In particular, the solution of the CSP algorithm was linked in [1,11,16,17] with the optimization of the symmetric Kullback-Leibler divergence (sKL). In this paper, we propose an extension of this KL-based criterion to a criterion based on the Alpha-Beta log-det divergence (AB log-det) between the class-conditional covariances, defined as in [6] for α ≠ 0, β ≠ 0, α + β ≠ 0, where [·]_+ denotes the non-negative truncation operator. When the argument covariances are scalars and α, β > 0, the AB log-det divergence can also be rewritten as the logarithmic ratio between the weighted arithmetic mean and the weighted geometric mean of the scaled covariances Cov^{α+β}(y_i|c_1) and Cov^{α+β}(y_i|c_2). Additionally, if α + β = 1, the AB log-det divergence between the covariances is proportional to the Alpha-Gamma divergence [18] between the conditional densities, for α > 0, β > 0, α + β = 1.
In Section 5.3, it is proven that, under certain conditions, the plain optimization of an AB log-det divergence also leads to the solution of the CSP algorithm. However, the potential of these divergences does not lie in their plain optimization, but rather in their optimization in the presence of regularization terms that help to specify the desired solutions.
Recently, several divergence criteria have been proposed for the extraction of the spatial dimensions with maximum discriminative power. Among these, the multiclass approach based on the maximization of the harmonic mean of Kullback-Leibler divergences [16] and the regularization framework based on the beta divergences [1,17] are the most noteworthy methods. Another approach, based on the Bhattacharyya distance and the Gamma divergence, has also been proposed for the classification of motor imagery movements [19]. Our proposal is motivated by the success of these methods in improving the classification accuracy and the robustness against outliers. The distinctive property of the AB log-det divergence is that it smoothly connects (through its hyperparameters) a quite broad family of log-det divergences for SPD matrices, covering several relevant classical cases such as the KL divergence, the dual KL divergence, the Beta log-det family, the Alpha log-det family, the Power log-det family, as well as the Affine Invariant Riemannian divergence.

The Definition of the AB Log-Det Divergence
Henceforth, we will work with multidimensional observation vectors x = [x_1, . . ., x_n]^T ∈ R^n. In order to simplify the notation, the covariance matrices of the two classes are renamed as P ≡ Cov(x|c_1) and Q ≡ Cov(x|c_2). The AB log-det divergence is a directed divergence that evaluates the dissimilarity between two multidimensional covariance matrices. It was defined in [6] for α ≠ 0, β ≠ 0, α + β ≠ 0, while, for the singular cases, its definition is given by the corresponding limits. The divergence depends only on the eigenvalues of Q^{-1/2} P Q^{-1/2}, which also coincide with the eigenvalues of the matrix Q^{-1}P, although their eigenspaces differ. The corresponding eigenvalue decomposition is Q^{-1/2} P Q^{-1/2} = V_1 Λ V_1^T, where V_1 is the orthogonal matrix of eigenvectors and Λ = diag{λ_1, λ_2, . . ., λ_n} is the diagonal matrix with positive eigenvalues λ_i > 0, i = 1, 2, . . ., n. One of the properties of the AB log-det divergence is that it is invariant under a common change of basis of its matrix arguments, i.e., an invertible congruence transformation. With the help of this specific transformation, it can be inferred that the divergence is separable (over the generalized eigenvalues of the matrix pencil (P, Q)) into a sum of marginal divergences that measure how far each of the generalized eigenvalues is from unity, i.e.,

D_AB^{(α,β)}(P‖Q) = Σ_{i=1}^n D_AB^{(α,β)}(λ_i‖1), and a similar separable expression holds for the singular cases. This divergence compares two symmetric positive definite matrices and returns their dissimilarity, i.e., a positive value when they are non-coincident and D_AB^{(α,β)}(P‖Q) = 0 only when P = Q. As can be observed in Figure 1, the AB log-det divergence generalizes several existing log-det matrix divergences, such as Stein's loss, the S-divergence, the Alpha and Beta log-det families of divergences and the geodesic distance between covariance matrices (the squared Riemannian metric), among others (see Table 1 in [6] for a comprehensive list).
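As a practical illustration of this separability, the divergence can be computed directly from the generalized eigenvalues of the pencil (P, Q). The sketch below is our own hedged implementation (not code from [6]); it covers the generic case and the analytic limits for the singular cases (α, 0), (0, β) and (0, 0):

```python
import numpy as np

def gen_eigvals(P, Q):
    """Generalized eigenvalues of the pencil (P, Q), i.e. of Q^{-1} P."""
    d, U = np.linalg.eigh(Q)
    Qis = U @ np.diag(d ** -0.5) @ U.T               # Q^{-1/2}
    return np.linalg.eigvalsh(Qis @ P @ Qis)

def ab_logdet(P, Q, a, b):
    """AB log-det divergence D_AB^{(a,b)}(P || Q); the singular cases are
    handled through their analytic limits."""
    lam = gen_eigvals(P, Q)
    if a != 0 and b != 0 and a + b != 0:
        return np.sum(np.log((a * lam**b + b * lam**(-a)) / (a + b))) / (a * b)
    if a == 0 and b == 0:          # proportional to the squared Riemannian metric
        return 0.5 * np.sum(np.log(lam) ** 2)
    if b == 0:                     # limit beta -> 0
        return np.sum(lam**(-a) + a * np.log(lam) - 1) / a**2
    if a == 0:                     # limit alpha -> 0 (b = 1 gives Stein's loss)
        return np.sum(lam**b - b * np.log(lam) - 1) / b**2
    raise ValueError("the case alpha = -beta != 0 is not covered by this sketch")

rng = np.random.default_rng(2)
n = 5
A = rng.standard_normal((n, n)); P = A @ A.T + np.eye(n)
B = rng.standard_normal((n, n)); Q = B @ B.T + np.eye(n)
```

For instance, `ab_logdet(P, Q, 0.0, 1.0)` reproduces Stein's loss, while the identity of indiscernibles and the invariance under congruence transformations can be checked numerically.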

A Tight Upper-Bound for the AB Log-Det Divergences
The divergence D_AB^{(α,β)}(P‖Q) depends on the generalized eigenvalues λ_1, . . ., λ_n of the matrix pencil (P, Q), which, for convenience, are assumed to have a simple spectrum (the eigenvalues are unique or non-coincident) and are sorted in descending order, λ_1 > λ_2 > · · · > λ_n > 0. In practice, the assumption is plausible because the real symmetric matrices with unique eigenvalues are known to form an open dense set in the space of all real symmetric matrices [20].
Although the space of the observations is high-dimensional, most of the discriminative information between the two conditions is confined to a low-dimensional subspace. Thus, the spatial filter matrix W ∈ R^{n×p} is used to reduce the dimensionality of the samples from n to p with the linear compression transformation y = W^T x ∈ R^p. It is shown in [6] that, after applying this compression to the arguments of the divergence, the resulting output covariance matrices W^T PW and W^T QW are more similar than in the original space, i.e., D_AB^{(α,β)}(W^T PW‖W^T QW) ≤ D_AB^{(α,β)}(P‖Q), where µ_1 ≥ · · · ≥ µ_p > 0 are the generalized eigenvalues of the matrix pencil (W^T PW, W^T QW). However, this upper bound is loose for the case of interest (dimensionality reduction), i.e., when p < n. In Appendix A.1, it is shown how to tighten the previous upper-bound with the new proposal D_AB^{(α,β)}(W^T PW‖W^T QW) ≤ Σ_{i=1}^p D_AB^{(α,β)}(λ_{π_i}‖1), where π defines the permutation of the indices 1, . . ., n that sorts the marginal divergences of the eigenvalues from unity in descending order. Moreover, the equality with the upper-bound is only obtained for those extraction matrices W that lie within the span of the p generalized eigenvectors of the matrix pencil (P, Q) which are associated with the eigenvalues λ_{π_1}, . . ., λ_{π_p} that maximize the divergence from unity in (49).
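Both bounds are easy to verify numerically. The sketch below (our own check, with illustrative data) uses the representative case (α, β) = (0, 0), i.e., half the squared Riemannian metric, and a random compression matrix W, and confirms that the compressed divergence exceeds neither the sum of the p largest marginal divergences nor the full divergence:

```python
import numpy as np

def riem_marginals(P, Q):
    """Generalized eigenvalues of (P, Q) and their (0,0) marginal divergences
    0.5 * log(lambda)^2 from unity."""
    d, U = np.linalg.eigh(Q)
    Qis = U @ np.diag(d ** -0.5) @ U.T
    lam = np.linalg.eigvalsh(Qis @ P @ Qis)
    return 0.5 * np.log(lam) ** 2

rng = np.random.default_rng(3)
n, p = 6, 2
A = rng.standard_normal((n, n)); P = A @ A.T + np.eye(n)
B = rng.standard_normal((n, n)); Q = B @ B.T + np.eye(n)
W = rng.standard_normal((n, p))              # arbitrary full-rank compression

marg = riem_marginals(P, Q)
full = marg.sum()                            # D(P || Q), the loose bound
tight = np.sort(marg)[::-1][:p].sum()        # sum of the p largest marginals
compressed = riem_marginals(W.T @ P @ W, W.T @ Q @ W).sum()
```

For an arbitrary W, `compressed` stays below `tight`, and the gap closes when the columns of W span the eigenvectors associated with λ_{π_1}, . . ., λ_{π_p}.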

Relationship between the Generalized Eigenvalues and Eigenvectors of the Matrix Pencils
We have seen in the previous section that the tight upper-bound of the divergence is attained by a subset of the generalized eigenvectors of the matrix pencil (P, Q), whereas the CSP solution in (24) depends on a subset of the generalized eigenvectors of another matrix pencil, (p(c_1)P, Cov(x)). In this section, we address the close relationship between both eigendecompositions. For this purpose, we recognize Λ as the matrix of eigenvalues of Q^{-1}P and Λ^{(c_1)} as the matrix of eigenvalues of (Cov(x))^{-1} p(c_1)P. Then, we use the decomposition of Cov(x) in (4) to substitute p(c_2)Q = Cov(x) − p(c_1)P in the defining equation. The matrix of eigenvectors V of Q^{-1}P diagonalizes both sides of the resulting equation. Hence, we have the explicit relationship λ_i = g(λ_i^{(c_1)}) between the two sets of eigenvalues, where g(λ), as can be seen in Figure 2, is a strictly monotonically increasing function over the domain λ^{(c_1)} ∈ (0, 1). Moreover, Equations (52)-(55) imply that the matrix V of generalized eigenvectors of the matrix pencil (P, Q) exactly coincides with the matrix V^{(c_1)} of generalized eigenvectors of the other matrix pencil (p(c_1)P, Cov(x)). Figure 2 illustrates how the function g(λ), defined in (56), maps eigenvalues of the matrix pencil (p(c_1)P, Cov(x)) into the eigenvalues of the matrix pencil (P, Q) in a case where the sample probabilities of the classes are uniform, p(c_1) = p(c_2). Note that the eigenvalues of the first pencil are bounded in the interval (0, 1), while the domain of the eigenvalues of the second pencil is (0, ∞).
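Using the mixture relation Cov(x) = p(c_1)P + p(c_2)Q, a short calculation in whitened coordinates gives one consistent closed form for this map, g(t) = (p(c_2)/p(c_1)) · t/(1 − t), which is indeed strictly increasing on (0, 1) with range (0, ∞). The following numerical sketch (our own check, with illustrative data) verifies both the eigenvalue map and the coincidence of the eigenvector matrices:

```python
import numpy as np

def pencil_eig(A, B):
    """Eigenpairs of the pencil (A, B) in descending eigenvalue order; the
    eigenvectors are returned in the original (sensor) coordinates."""
    d, U = np.linalg.eigh(B)
    Bis = U @ np.diag(d ** -0.5) @ U.T
    lam, V = np.linalg.eigh(Bis @ A @ Bis)
    return lam[::-1], Bis @ V[:, ::-1]

rng = np.random.default_rng(4)
n, p1, p2 = 5, 0.4, 0.6
A = rng.standard_normal((n, n)); P = A @ A.T + np.eye(n)
B = rng.standard_normal((n, n)); Q = B @ B.T + np.eye(n)
cov_x = p1 * P + p2 * Q

lam_c1, V_c1 = pencil_eig(p1 * P, cov_x)   # eigenvalues confined to (0, 1)
lam, V = pencil_eig(P, Q)                  # eigenvalues anywhere in (0, inf)

g = lambda t: (p2 / p1) * t / (1.0 - t)    # monotone map between the spectra
```

Since g is monotone, the descending orderings of both spectra correspond to each other, and the eigenvector columns match up to scale.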

Linking the Optimization of the Divergence and the CSP Solution
There is a link between the solutions of the CSP method and the solutions obtained with the optimization of the symmetric KL divergence between the class conditional covariances, which was studied in previous works [1,11,16]. This subsection shows that, under the appropriate filter selection criteria, the link also extends to the optimization of other divergences, such as the AB log-det family of divergences.
We have previously assumed that the generalized eigenvalues are ordered and can be regarded as non-equal. Therefore, we can cluster them into three sets of principal, inner and minor eigenvalues of the matrix pencil (P, Q). A sequence of minimax optimizations induces an alternative ordering of the generalized eigenvalues, according to a permutation π that sorts their marginal divergences from 1 in descending order. For building the matrix of spatial filters W_Div ≡ [w_1, w_2, . . ., w_p], one possible selection policy is to retain only the p most discriminative spatial filters for the considered divergence optimization problem, i.e., those that solve (58) for i = 1, . . ., p. The filters consist of the p eigenvectors (v_{π_i} with i = 1, . . ., p) of the matrix pencil (P, Q) that are arranged according to the permutation π. From the one-to-one relationship that exists between the generalized eigenvalues and eigenvectors of the matrix pencils (P, Q) and (p(c_1)P, Cov(x)) (see the previous subsection), the solution takes the form given in (61). This result tells us that the optimization of different divergences (in the absence of other regularizing terms) only differs in the selection criteria for the spatial filters, which eventually determine the chosen subindices π_1, . . ., π_p. Now, the question of whether the spatial filters that solve the sequence of minimax divergence optimization problems essentially coincide (up to a possible permutation in the order of the spatial filters) with the spatial filters of the CSP solution in (63) has a simple answer. The straightforward comparison between (61) and (63) reveals that both solutions essentially coincide when the subindices π_1, . . ., π_p are a permutation of the integers 1, . . ., k, n − (p − k) + 1, . . ., n. Thus, the link between both techniques happens whenever the CSP method adopts the filter selection policy of the divergence criterion in (59). However, many CSP implementations find it satisfactory to choose the number of spatial filters for each class a priori, respectively as k and p − k (we will refer to this case as the original CSP filter selection policy), where k is close to p/2 in order to approximately balance the number of spatial filters for each class [13,21].
In general, the use of a divergence-based selection policy does not ensure a balanced representation of the spatial filters for each of the classes. For instance, consider the synthetic but illustrative situation for n = 100, where we wish to select p = 8 spatial filters, and where the generalized eigenvalues of the matrix pencil (P, Q) are shifted towards zero, for instance, equal to {10, 0.99, 0.98, . . ., 0.02, 0.01}. In most cases, the solution W_Div will then select as its columns only k = 1 principal eigenvector and p − k = 7 minor eigenvectors, an unbalanced choice.
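This unbalanced behavior is easy to reproduce. The sketch below scores the eigenvalues of the example by their marginal divergence from unity, using, purely for illustration, the (α, β) = (0, 1) marginal (Stein's loss), and keeps the p = 8 most divergent ones:

```python
import numpy as np

# example spectrum of the pencil (P, Q): n = 100 eigenvalues,
# one principal eigenvalue and the rest drifting towards zero
lam = np.concatenate([[10.0], np.arange(99, 0, -1) / 100.0])  # 10, 0.99, ..., 0.01

def marginal_div(t):
    """Marginal divergence from unity; here the (0, 1) case (Stein's loss)."""
    return t - np.log(t) - 1.0

p = 8
order = np.argsort(marginal_div(lam))[::-1]   # permutation pi: most divergent first
selected = lam[order[:p]]
k_principal = int(np.sum(selected > 1.0))     # filters devoted to class c1
k_minor = p - k_principal                     # filters devoted to class c2
```

With this spectrum, the policy keeps the single principal eigenvalue and seven of the smallest ones, so the two classes end up unevenly represented.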
In view of this potential limitation, an interesting question is whether it would be possible to modify the AB log-det divergence criterion so as to enforce that its solution essentially coincides with the one obtained by the CSP method with its original filter selection policy. We will show in the following that this only requires a suitable scaling κ ∈ R_+ in one of the arguments of the divergence. Without loss of generality, we assume that the scaling is applied in the second argument of the divergence. As shown in Appendix A.2, there is a permutation π of the indices of the spatial filters 1, . . ., p that links the CSP solution in (24) with the optimization of the divergence, where the function K(a, b) determines the value of the constant κ = K(a, b) ∈ R that equalizes the value of the AB log-det divergences between any arbitrary constants a, b ∈ R (in the first argument) and κ (in the second argument). Note that the only role of the scaling factor κ is to adjust the reference value in one of the arguments of the divergence so as to ensure an exact balance in the number of spatial filters that are specialized in each class. As shown in Appendix A.2, this scaling factor prevents the minimax solution, for i = 1, . . ., p, from being attained by eigenvectors associated with elements of the inner set of eigenvalues in (57), so the chosen subset of eigenvectors has to essentially coincide with the principal and minor eigenvectors that form the CSP solution in (63). In practice, a value of κ which is close to unity and meets the required bounds can be obtained from the truncated choice for an arbitrarily small value of the constant κ_sup − κ_inf.

The Gradient of the AB Log-Det Divergence
The AB log-det divergence between the conditional covariances of the outputs Y = W^T x for each of the classes is a function of the matrix W ∈ R^{n×p}. The optimization of this function with respect to W is non-trivial, so in this section we show how the gradient of the AB log-det divergence can be derived. One may note that this is not only of natural interest for the optimization that we would like to perform in this work, but it also paves the way for the potential practical use of the AB log-det divergence in other scenarios and applications.
As we have shown previously, the divergence is separable over the eigenvalues of the matrix M built from the compressed covariances W^T PW and W^T QW, with eigendecomposition M = UΛU^T, where Λ and U respectively denote the matrices of eigenvalues and eigenvectors of M, which are functions of the matrix W.
The differential of f(W) can be expressed in terms of the gradient of the function. The divergence directly depends on the generalized eigenvalues, which in turn depend on the matrix W. The suitable tool to obtain the gradient of this composition of functions is the chain rule. So, the gradient can be evaluated after finding ∂f/∂µ_i and ∂µ_i/∂W. Since the divergence is a separable function of the generalized eigenvalues, the first term is easy to obtain. Obtaining the second term, ∂µ_i/∂W, is not so easy and requires employing our previous plausible assumption that the generalized eigenvalues have a simple spectrum. Under this condition, the Hadamard first variation formula can be used to write the differential of the eigenvalues as dµ_i = u_i^T dM u_i, where u_i denotes the normalized eigenvector (‖u_i‖_2 = 1) corresponding to each eigenvalue µ_i.
With the help of the product rule for differentials, we obtain the differential of M. As we show in Appendix A.3, it can be simplified, and, after substituting (84) in (80) and using the invariance of the trace under transpositions (tr{A} = tr{A^T}) and under cyclic shifts (tr{AB} = tr{BA}), the required terms are obtained. At this point, we can use the identity for the differential in (88) to identify the second desired term, ∂µ_i/∂W. Substituting the expressions (79) and (90) in (78), we obtain the gradient, where, for convenience, an auxiliary matrix Z is defined. The matrix Z can also be represented directly in terms of the matrix M (which we have defined previously in Equation (74)), where log(·) for matrix arguments denotes the matrix logarithm. After grouping common terms in (91), we obtain the final gradient expression, which is given in (95).
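Although the intermediate formulas are lengthy, the resulting gradient is simple to implement: it is the chain-rule combination of ψ(µ_i) = ∂f/∂µ_i with the first-variation term ∂µ_i/∂W = 2(P − µ_i Q)W v_i v_i^T, where v_i are the generalized eigenvectors normalized with respect to W^T QW. The sketch below is our own implementation, written directly from the chain rule rather than from the grouped matrix form (95), and it validates the gradient against central finite differences:

```python
import numpy as np

def gen_eig(A, B):
    """Generalized eigenpairs of the pencil (A, B); the eigenvectors are
    B-orthonormal, i.e. v_i^T B v_j = delta_ij."""
    d, U = np.linalg.eigh(B)
    Bis = U @ np.diag(d ** -0.5) @ U.T
    mu, V = np.linalg.eigh(Bis @ A @ Bis)
    return mu, Bis @ V

def ab_div(P, Q, W, a, b):
    """AB log-det divergence between W^T P W and W^T Q W (a, b, a + b != 0)."""
    mu, _ = gen_eig(W.T @ P @ W, W.T @ Q @ W)
    return np.sum(np.log((a * mu**b + b * mu**(-a)) / (a + b))) / (a * b)

def ab_grad(P, Q, W, a, b):
    """Chain-rule gradient: sum_i psi(mu_i) dmu_i/dW, with the influence term
    psi(mu) = (mu^(b-1) - mu^(-a-1)) / (a mu^b + b mu^(-a)) and the variation
    dmu_i/dW = 2 (P - mu_i Q) W v_i v_i^T (Hadamard first variation)."""
    mu, V = gen_eig(W.T @ P @ W, W.T @ Q @ W)
    psi = (mu**(b - 1) - mu**(-a - 1)) / (a * mu**b + b * mu**(-a))
    G = np.zeros_like(W)
    for mu_i, v_i, psi_i in zip(mu, V.T, psi):
        G += psi_i * 2.0 * (P - mu_i * Q) @ W @ np.outer(v_i, v_i)
    return G

rng = np.random.default_rng(5)
n, p, a, b = 6, 3, 0.5, 0.5
A = rng.standard_normal((n, n)); P = A @ A.T + np.eye(n)
C = rng.standard_normal((n, n)); Q = C @ C.T + np.eye(n)
W = rng.standard_normal((n, p))
G = ab_grad(P, Q, W, a, b)
```

Because the divergence only depends on the eigenvalues, sign or ordering changes of the eigenvectors across perturbations do not affect the check.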

Validation of Equation (95) with the Gradient of the KL Divergence
The Kullback-Leibler (KL) divergence between the zero-mean Gaussian densities of the outputs of the two classes, with covariance matrices Cov(Y|c_1) and Cov(Y|c_2), only involves trace and log-det operators, so, as shown in Appendix A.4, its gradient with respect to W is relatively easy to obtain. Then, we can use the fact that the KL divergence is proportional to the AB log-det divergence between the class conditional covariance matrices, as long as the conditional covariance matrices appear in the AB log-det divergence interchanged in position with respect to the class conditional density arguments of the KL divergence. So, for the specific case of α = 1 and β = 0, i.e., for D_AB^{(1,0)}, we can test whether there is coherence between the obtained gradient formula in (95) and twice the gradient of the KL divergence that was independently obtained in Appendix A.4. For this purpose, in this specific case, the auxiliary matrices from (94) are evaluated and substituted into the expression of the gradient of the AB log-det divergence (95). After straightforward simplifications, the proportionality between the gradient of D_AB^{(1,0)} and the gradient of the KL divergence in (97) is confirmed.

Validation of Equation (95) with the Gradient of the AG Divergence
The Alpha-Gamma (AG) divergence between the Gaussian densities p(x|c_2) and p(x|c_1), of zero mean and with respective covariance matrices Cov(Y|c_1) = W^T PW and Cov(Y|c_2) = W^T QW, is subject to the constraint α + β = 1, so throughout this subsection we assume that β is determined by α, i.e., β = 1 − α. The gradient of the AG divergence with respect to W is given by (108).
Then, we can use the equivalence between the AG divergence and the AB log-det divergence between the class conditional covariance matrices, which is valid in the specific case of α + β = 1 with α, β > 0, to also test the coherence of the obtained gradient formula (95) with twice the gradient of the AG divergence. For α + β = 1, the auxiliary matrices in the definition of the gradient are evaluated and substituted into the gradient of the AB log-det divergence (95). With the help of the particular form of the Woodbury identity for the matrix inverse, we simplify the terms within the brackets. Finally, we use the fact that α + β = 1 to confirm the proportionality with the gradient of the AG divergence given in (108).

Robustness of the AB Log-Det Divergence in Terms of α and β
The squared Riemannian metric is known to be the natural distance in the manifold of SPD matrices, as it measures the squared length of the geodesic path between the arguments of the divergence [3]. However, real data usually suffer from several model contaminations (mismatches), including outliers or artifacts, that can make other, more robust divergences preferable. In this section, we study how the hyperparameters α and β influence the robustness of the AB log-det divergence, taking the behavior of the squared Riemannian metric as a reference.
For convenience, we denote the AB log-det divergence as a function of the spatial filter matrix W and consider its gradient expression given by Equation (78). The spatial filters that maximize this divergence should satisfy the estimating equations (121), where μ_i, i = 1, …, p, are the eigenvalues of the matrix M defined in Equation (74), and ψ^(α,β)(μ_i) may be regarded as influence functions for each pair (α, β) that account for the variation of the penalty in the divergence with respect to μ_i. The complementary term to ψ^(α,β)(μ_i) in (121), i.e., ∂μ_i/∂W, is a matrix of partial derivatives of the generalized eigenvalues μ_i with respect to the elements of the spatial filters W and is, therefore, independent of the considered divergence. It is easy to observe that, in the particular case of α = β = 0, the expression in (121) reduces to the estimating equation for the squared Riemannian metric. In order to study the relative robustness to outliers, one can rewrite the estimating equation for a chosen pair of hyperparameters (α, β) in terms of the influence function for the squared Riemannian metric, as in (124), where the scalar term w^(α,β)(μ_i) acts as a weight function that controls, for a given pair (α, β), the magnitude of the effect in the estimating equation of departures of μ_i from unity. The presence of outliers in real data typically results in eigenvalues μ_i that lie far from unity. Depending on the problem, however, the outliers may predominantly affect only the greatest eigenvalues, only the smallest eigenvalues, or both simultaneously. Those hyperparameters (α, β) that are able to down-weight the contribution of the outliers are considered more robust. Therefore, the shape of the weight functions w^(α,β)(μ_i) is useful for studying the relative immunity of the AB log-det divergence to outliers.
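To make the role of the weight function concrete, the following sketch evaluates ψ^(α,β)(μ) and w^(α,β)(μ) = ψ^(α,β)(μ)/ψ^(0,0)(μ), assuming that the per-eigenvalue penalty of the AB log-det divergence takes the scalar form f(μ) = (1/(αβ)) log((α μ^β + β μ^(−α))/(α + β)) for α, β ≠ 0, consistent with the definition in [6]; ψ^(α,β) is then the derivative of f, and ψ^(0,0)(μ) = log(μ)/μ is the reference influence function of the squared Riemannian metric.

```python
import numpy as np

def influence(mu, a, b):
    """psi^(a,b)(mu): derivative of the per-eigenvalue AB log-det penalty
    f(mu) = 1/(a*b) * log((a*mu**b + b*mu**(-a)) / (a+b)), valid for a, b != 0."""
    return (mu**b - mu**(-a)) / (mu * (a * mu**b + b * mu**(-a)))

def weight(mu, a, b):
    """w^(a,b)(mu) = psi^(a,b)(mu) / psi^(0,0)(mu),
    where psi^(0,0)(mu) = log(mu)/mu is the squared-Riemannian reference."""
    return influence(mu, a, b) / (np.log(mu) / mu)

mu = np.array([0.1, 0.5, 2.0, 10.0, 100.0])
print(weight(mu, 1.0, 1.0))    # bell-shaped: decays for mu far from 1
print(weight(mu, 1e-4, 1e-4))  # ~1 everywhere: recovers the flat reference
```

For α = β = 1 the weights decay for μ far from unity (the bell shape described for Figure 3b), while for α = β → 0 the weights approach 1, recovering the flat reference of the squared Riemannian metric.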
Figure 3a shows the squared Riemannian metric (α = β = 0) and its weight function, which is flat since this divergence is taken as the reference. Figure 3b presents a similar plot for the Power Log-Det divergence with α = β = 1. In this case, the bell shape of the weight function indicates robustness with respect to the presence of outliers in the greatest and smallest eigenvalues, since they are down-weighted in the estimating Equation (124). Similar plots can be obtained by increasing the magnitude of α = β, which progressively enhances the robustness. When α ≠ β, the divergence is asymmetric. Figure 4a,b respectively present the Kullback-Leibler divergence for SPD matrices (α = 1, β = 0) and its dual version (α = 0, β = 1), together with their associated weight functions. These plots illustrate the asymmetric cases, where α ≠ β, and reveal that, when α < β, the AB log-det divergences tend to be more robust against outliers in the large eigenvalues while, for α > β, the robustness tends to be with respect to outliers in the small eigenvalues.

Review of Some Related Techniques for the Spatial Filtering of Motor Imagery Movements
In this section, we review the regularized variants of CSP that have been proposed to improve classification performance. The regularization approaches for CSP act mainly either on the estimation of the covariance matrices or on the CSP objective function.
Most of them combine the estimation of the covariance matrices for each class with the regularization of the CSP objective function using penalty terms. Some of the approaches incorporate prior information [22], data from other subjects [23,24] or data from previous sessions [25] for estimating the class covariance matrices. Another approach used M-estimators to compute robust class covariance matrices [26], and yet another obtained the covariance matrices by minimizing the squared error [27]. The authors of [28] applied Multiple Kernel Learning (MKL) to combine the information from different subjects.
It has been shown in [29] that regularizing the objective function is more useful than regularizing the estimated covariance matrix, and several approaches of this kind have been proposed. The authors of [21] additionally incorporated electrooculogram (EOG) signals to reduce the ocular artifacts. Other authors have tried to ensure robustness by selecting only the important channels and producing sparse spatial filters [30][31][32]. Another way to robustify the system is to retain only the stationary features. A robust maximin CSP method was proposed that used a set of covariance matrices instead of an individual covariance matrix, without relying on other users' data or on data from previous sessions [33,34]. In order to mitigate the influence of outliers, the CSP objective function has been formulated using the l_p-norm in [35,36]. The Stationary Subspace Analysis (SSA) algorithm was proposed to obtain the stationary subspaces of the EEG time series by considering only the stationary components of the signals; its limitation is that dissimilarities between the classes are detected as non-stationary features [37]. The group-wise SSA (gwSSA) algorithm aims at obtaining the non-stationarities by dividing the dataset into different groups and minimizing the KL divergence between the estimated source distribution of each trial in a group and the average distribution of the corresponding group. This algorithm allows the combination not only of multi-subject data but also of multi-class data [38]. However, the gwSSA algorithm cannot find the discriminative information between the classes. The same group proposed a new approach for extracting the discriminative information by subtracting the inter-class divergences from the gwSSA objective function [39]. To overcome the limitation of the SSA algorithm, two-step approaches have been proposed where the initial extraction of the stationary sources was done using the
SSA method and, later, CSP was used for the computation of the spatial filters [40]. Another approach to extracting stationary features is to reduce the non-stationarities between the two sessions. Supervised and unsupervised methods for the adaptation of the data space have been proposed using the KL divergence between the inter-session data [41]. Recently, the authors of [14] presented the maximum a posteriori CSP (MAP-CSP) algorithm, which derives a probabilistic model of CSP to resolve the overfitting issue of the baseline CSP algorithm.
One of the limitations of the CSP algorithm is that it is mainly suitable for the discrimination of two classes, while, in general, an efficient BCI system requires more than two motor imagery movements. In order to formulate it for the multiclass setting, the authors of [42,43] reduced the multiclass problem to a set of binary problems. The authors of [44] proposed two approaches for the multiclass problem: first, finding the spatial filters for one class with respect to all the other classes and, second, using simultaneous diagonalization methods. Other approaches, like [45], proposed to solve multiclass problems by combining information-theoretic criteria with joint diagonalization methods. Several other methods have been proposed for the multiclass paradigm using independent component analysis [46] and Riemannian geometry [47] to obtain the spatial filters. The authors of [48] derived a relation between the Bayes classification error and the Rayleigh quotient and used it to solve the multiclass problem. In spite of all these different approaches, the performance of MI-based BCI systems is degraded by the presence of non-stationarities and outliers, which is a challenge for BCI systems in real applications. Hence, a robust feature extraction algorithm is needed to increase the overall performance of the system.

Proposed Criterion and Algorithm for Spatial Filtering
For the presentation of the proposed criterion, some additional notation needs to be defined. Let x^(j)(t)|c denote the output of the passband filtering of the raw observations at time t and for the jth trial of class c ∈ {c_1, c_2}. The power of the trials of a given class c is normalized through an operation based on the sample covariance matrix of the jth trial x^(j) of class c, where L is the size in samples of each trial. In order to simplify the notation, the covariance matrices of the two classes are renamed P_j and Q_j, and their averaged versions (the centroids of each class) are denoted P and Q. The classification of imagery movements involves extracting the relevant features of the observations and classifying the observed patterns in the feature space. In the considered application, the data are high-dimensional, but only a few features are sufficient to capture the discriminative information about the intended movements. Thus, the extraction of the relevant features involves a dimensionality reduction step for the observations from R^n to R^p, where p ≪ n. This step is implemented through spatial filtering, i.e., by projecting the n-dimensional observations onto a p-dimensional subspace that should allow a good discrimination of the cluster centroids and, at the same time, guarantee a compact representation of the clusters.
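The trial covariances and class centroids described above can be sketched as follows. Trace normalization is assumed here as the power-normalization step, which is a common choice but not necessarily the exact operation of the paper.

```python
import numpy as np

def trial_covariances(trials):
    """Trace-normalized sample covariance for each trial.

    trials: list of arrays of shape (n_channels, L), the band-passed
    EEG segments. Scaling each covariance so that tr(C_j) = 1 removes
    trial-to-trial power differences (an assumed normalization)."""
    covs = []
    for X in trials:
        X = X - X.mean(axis=1, keepdims=True)  # remove channel means
        C = X @ X.T / X.shape[1]               # sample covariance over L samples
        covs.append(C / np.trace(C))           # power normalization
    return covs

def centroid(covs):
    """Class centroid: average of the normalized trial covariances."""
    return np.mean(covs, axis=0)
```

Applying `trial_covariances` to the trials of classes c_1 and c_2 and then `centroid` to each list yields the matrices P_j, Q_j and the centroids P, Q used in the criterion.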
As mentioned earlier, the CSP solution will be obtained by a minimax optimization of the divergence between the projected and scaled centroids of the classes, i.e., of D_AB(w_i^T P w_i ∥ κ w_i^T Q w_i). However, since this solution completely ignores the within-class dispersion of the samples, it is quite sensitive to artifacts and outliers in the training dataset. In similarity with the divergence framework presented in [1] and with some variants of Fisher LDA (p. 366 in [49]), one can regularize the previous problem by controlling the dispersion of the trials of each class around their centroids and by exploiting the degrees of freedom in the selection of the hyperparameters of the divergences. Then, a robust criterion based on the AB log-det divergence takes the form of (130), where the penalties associated with the within-class dispersion involve the averaged divergences of the trials from their centroids, and the parameter η ∈ R+ controls the balance between the maximization of the between-class scatter and the minimization of the within-class scatter. Note that in (132) we have used the invariance of the AB log-det divergence under a common scaling of its arguments to simplify the expression. The optimization of the criterion in (130) can be performed simultaneously, for all the spatial filters, with the use of subspace techniques [1]. In the next section, we present a subspace optimization algorithm based on AB log-det divergences.

The Subspace Optimization Algorithm (Sub-ABLD)
The subspace method aims to extract the desired set of p spatial filters in two steps. The idea is to first use a robust method to determine the discriminative subspace of the spatial filters, for instance, by optimizing a robust criterion like (130). A second criterion is then used to identify the individual spatial filters within the subspace. Since the influence of outliers on the solution is significantly reduced once the discriminative subspace has been determined, the standard CSP criterion can be safely used in the second step to determine the final spatial directions within the chosen subspace.
The input parameters of the subspace optimization algorithm based on AB log-det divergences (Sub-ABLD) are the sets of covariance matrices for each class (P_j, Q_j), the dimension p of the subspace to be extracted, and the hyperparameters α, β and η. The method starts with the computation of the sample prior probabilities and the average covariance matrices for each class, i.e., p(c_1), p(c_2) and (P, Q). The spatial filter matrix decomposes as W^T = Ω^T T into the product of a whitening transformation matrix T of the observations and a semi-orthogonal matrix Ω^T, which satisfies Ω^T Ω = I_p. The whitening transformation is obtained from the eigenvalue decomposition Cov(x) = p(c_1)P + p(c_2)Q = U_1 ∆ U_1^T, where ∆ and U_1 represent the matrices of eigenvalues and eigenvectors. This transformation is applied to both sides of the covariance matrices to obtain the whitened trial covariances and their averaged versions. The scaling parameter κ, which pursues the balance of the number of features for each class in the absence of regularizers, is determined with the truncation procedure proposed in (70). The semi-orthogonal matrix Ω^T that projects the whitened observations onto a p-dimensional subspace is initialized with the identity matrix of dimension n × p. This is equivalent to starting the optimization by projecting onto the principal p-dimensional subspace of the observations, which ensures a good initial signal-to-noise ratio. Once the whitening transformation is fixed, the criterion F(W) to optimize can be rewritten, in terms of Ω, as a function whose ordinary gradient can be determined from (95), where the matrices Z_i should be defined for each case (i = 1, …, 3) as in (94). However, this gradient is not the fastest-ascent direction in the structured manifold of semi-orthogonal matrices (the Stiefel manifold). Instead, the fastest-ascent direction is given by the "natural" gradient in this manifold [50,51]. Let Ω^(i) denote the semi-orthogonal matrix at iteration i and let μ^(i) denote the step size; the gradient ascent update is then performed in the tangent space. The resulting matrix Ω_tg^(i+1) belongs to the tangent space of the manifold at Ω^(i) and asymptotically follows the geodesic path of maximum ascent for a sufficiently small step size μ → 0. However, for practical step sizes, like the one that we consider next, the resulting updates Ω_tg^(i+1) are not exactly semi-orthogonal and, in order to restore this property, a retraction onto the manifold is necessary after each iteration. The retraction can be implemented with the help of the MatLab command for a "thin" singular value decomposition. The procedure is repeated until convergence to a maximum of the criterion at a given iteration i_max. After that, the solution (Ω^(i_max))^T T identifies the subspace of the spatial filters, but not each of their individual directions. In order to determine them, one can solve a CSP problem within the previously identified subspace. We compute the generalized eigenvalues of the matrix pencil ((Ω^(i_max))^T P Ω^(i_max), (Ω^(i_max))^T Q Ω^(i_max)) and use the resulting principal and minor eigenvectors v_j to form the rotation matrix V. The final matrix of spatial filters that solves the problem is the product of the whitening matrix T, the projection matrix (Ω^(i_max))^T and the CSP rotation matrix V^T, which operates within the subspace. The main steps of the Sub-ABLD iteration are summarized in Algorithm 1.
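The tangent update and the thin-SVD retraction can be sketched as follows. The projection G − Ω G^T Ω is one standard form of the natural gradient on the Stiefel manifold [50,51], assumed here to be the one intended; the Euclidean gradient below is a random placeholder standing in for the gradient of the criterion in (95).

```python
import numpy as np

def natural_gradient(Omega, G):
    """Natural (Riemannian) gradient on the Stiefel manifold St(n, p):
    project the Euclidean gradient G onto the tangent space at Omega.
    (The canonical projection G - Omega G^T Omega is assumed here.)"""
    return G - Omega @ (G.T @ Omega)

def retract(Omega_tg):
    """Retraction onto the manifold via a thin SVD (the svd 'econ' step):
    the nearest semi-orthogonal matrix to Omega_tg is U @ Vt."""
    U, _, Vt = np.linalg.svd(Omega_tg, full_matrices=False)
    return U @ Vt

# one tangent-space ascent step followed by the retraction
n, p, mu = 8, 3, 0.1
rng = np.random.default_rng(1)
Omega = np.linalg.qr(rng.standard_normal((n, p)))[0]  # initial semi-orthogonal point
G = rng.standard_normal((n, p))                       # placeholder Euclidean gradient
Omega_new = retract(Omega + mu * natural_gradient(Omega, G))
```

After the retraction, `Omega_new.T @ Omega_new` is again the p × p identity up to floating-point error, so the iterate stays on the manifold.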

Algorithm 1 (listing abridged): among its main steps, the trial and average covariance matrices are whitened to respectively obtain { P̃_j }, { Q̃_j } and P̃, Q̃.
The proposed subspace algorithm (Sub-ABLD) is similar in structure to the one presented in [1] for Beta divergences. Although they optimize different criteria, the main difference between the two subspace algorithms lies in the specific way the updates of the estimates are implemented. In [1], the authors opted for multiplicative updates that require the determination of the gradient of the criterion in the space of skew-symmetric matrices, whereas our proposal performs updates in the tangent space of the manifold of semi-orthogonal matrices, followed by a projection or retraction onto the manifold. Such updates are quite common in the research field of Independent Component Analysis [50][51][52][53].

Experimental Study
The discrimination of two-class MI movements consists of the following steps. The MI EEG signals are acquired, preprocessed and spatially filtered. The filtered signals are then used to extract the required features, which are classified with a linear classifier. In the following sections, we explain the experimental steps used in the testing and comparison of the proposed algorithm.

Simulations Data and Preprocessing
Initially, we explored the robustness of the proposed algorithm in a controlled situation with synthetic data. Two sets of symmetric positive definite (SPD) matrices, representing the trial covariance matrices of the two classes, were randomly generated. Each set consists of 200 trials. For further preprocessing, both sets of matrices were concatenated. The concatenated data were cross-validated using k-fold cross-validation (k = 10), which divides the data into 10 equal subsets: a single subset was used as testing data and the remaining 9 subsets were used for training the classifier. The performance of the proposed algorithm was studied in the presence of outliers, which consist of matrices with abnormally high variances that were inserted in the training set of both classes. The proposed Sub-ABLD algorithm was tested by progressively varying the percentage of outlier trials from 0% to 30%. The robustness of Sub-ABLD and its comparison with respect to the other algorithms will be addressed in Section 11.
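A setup of this kind can be sketched as follows. The base class covariances, the outlier scale and the number of samples per trial are illustrative assumptions, not the exact values used in the experiment.

```python
import numpy as np

rng = np.random.default_rng(42)

def random_trials(n, n_trials, base):
    """Random SPD trial covariances scattered around a class-specific base matrix."""
    L = 200  # assumed number of samples per trial
    Bh = np.linalg.cholesky(base)
    return [Bh @ (X @ X.T / L) @ Bh.T
            for X in (rng.standard_normal((n, L)) for _ in range(n_trials))]

def add_outliers(trials, fraction, scale=10.0):
    """Replace a fraction of the trials with abnormally high-variance covariances."""
    trials = list(trials)
    k = int(fraction * len(trials))
    for i in rng.choice(len(trials), size=k, replace=False):
        trials[i] = scale * trials[i]
    return trials

n = 6
P_true = np.diag([3.0, 2.0, 1.0, 1.0, 1.0, 1.0])  # assumed class-1 base covariance
Q_true = np.diag([1.0, 1.0, 1.0, 1.0, 2.0, 3.0])  # assumed class-2 base covariance
class1 = add_outliers(random_trials(n, 200, P_true), fraction=0.1)
class2 = add_outliers(random_trials(n, 200, Q_true), fraction=0.1)
```

Sweeping `fraction` from 0.0 to 0.3 reproduces the kind of outlier-contamination curve discussed above.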

EEG Dataset and Preprocessing
To evaluate the proposed Sub-ABLD algorithm on BCI competition datasets, we utilized two datasets from competition III, data set 3a and data set 4a (which can be downloaded from [54]), and one dataset from competition IV, data set 2a (which can be downloaded from [55]). The data were acquired during MI movements. The first dataset, data set 3a [56] from BCI competition III [57], was acquired from 3 healthy subjects, namely K3, K6 and L1, using a 60-channel EEG acquisition system. The signals were recorded while executing MI movements of the left hand, right hand, foot and tongue; they were sampled at a frequency of 250 Hz and bandpass filtered in the frequency range between 1 and 50 Hz. The data set consists of two sessions, i.e., training and testing. For subject K3, both sessions consist of 45 trials per class, whereas the other two subjects, K6 and L1, performed 30 trials per class in both sessions. For the second dataset, data set 4a [44] of BCI competition III [57], the signals were acquired from five subjects, namely AA, AL, AV, AW and AY, using a 118-channel EEG system. The acquisition was done during imagery movements of the left hand, right hand and right foot. The recorded signals were down-sampled to 100 Hz and band-pass filtered in the 0.05 to 200 Hz frequency band. The data set of each subject consists of 280 trials in total, with training sessions of a different size than the testing sessions: the training sessions consist of 168, 224, 84, 56 and 28 trials for subjects AA, AL, AV, AW and AY, respectively, and the remaining trials form the corresponding testing sets. The last dataset, data set 2a [46] of BCI competition IV [58], was acquired from nine subjects (A1 to A9) while performing left hand, right hand, foot and tongue MI movements, using 22 electrodes. The sampling frequency of the signals was 250 Hz, and band-pass filtering of the acquired signals was performed between 0.5 and 100 Hz. For each subject, the data were acquired on different days, and each set consists of 72 trials per class.
In this approach, the performances were obtained using only two MI movements and considering all the channels of each dataset. The preprocessing step was implemented in the same way for all the algorithms. First, a fifth-order band-pass filter with cut-off frequencies of 8 and 30 Hz was applied to the raw EEG signals. A time window of 2 s during the imagination of the movements was extracted from each trial. The extracted trials were concatenated for each class, and k-fold cross-validation (k = 10) was applied to the concatenated data. The CV process divides the data into 10 equal sets, where one set was used as testing data and the remaining 9 sets were used for training. Finally, the optimal spatial filters were obtained using the training dataset. The number of filters selected for each class is k = 3, so the total number is p = 6.
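A minimal version of this preprocessing could look like the sketch below, assuming a zero-phase Butterworth implementation and a window starting at cue onset (both assumptions on our part).

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess(raw, fs=250.0, band=(8.0, 30.0), t_window=(0.0, 2.0)):
    """Band-pass filter a raw trial and extract a 2 s imagery window.

    raw: array of shape (n_channels, n_samples). The fifth-order Butterworth
    band-pass and the 2 s window follow the text; the zero-phase filtering
    and the window onset are assumptions."""
    b, a = butter(5, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="bandpass")
    filtered = filtfilt(b, a, raw, axis=1)  # zero-phase filtering
    i0, i1 = int(t_window[0] * fs), int(t_window[1] * fs)
    return filtered[:, i0:i1]

trial = np.random.default_rng(0).standard_normal((22, 1000))  # 4 s at 250 Hz
segment = preprocess(trial)  # shape (22, 500): 2 s window of 22 channels
```

The extracted segments would then be concatenated per class and passed to the covariance estimation and cross-validation steps described above.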

Feature Extraction and Feature Classification
For both the synthetic and the BCI datasets, the obtained spatial filters were used for filtering the training and testing data. The training and testing features were obtained by taking the log-variance of the filtered data, so that their distribution is closer to Gaussian. The Linear Discriminant Analysis (LDA) [59] classifier was used for discriminating the features of the two classes. The classifier was trained using the training features, and its performance was obtained using the testing features. The preprocessing, feature extraction and classification steps were repeated 10 times, and finally the average performance was obtained.
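The feature extraction and classification steps can be sketched as follows; the `FisherLDA` class is a minimal stand-in for the LDA classifier cited as [59], not the exact implementation used in the experiments.

```python
import numpy as np

def log_variance_features(W, trials):
    """Features = log of the variance of each spatially filtered channel.
    W: (n_channels, p) spatial filter matrix; trials: list of (n_channels, L) arrays."""
    return np.array([np.log(np.var(W.T @ X, axis=1)) for X in trials])

class FisherLDA:
    """Minimal two-class Fisher LDA (a simple stand-in for the cited classifier)."""
    def fit(self, F1, F2):
        m1, m2 = F1.mean(axis=0), F2.mean(axis=0)
        Sw = np.cov(F1.T) + np.cov(F2.T)        # pooled within-class scatter
        self.w = np.linalg.solve(Sw, m1 - m2)   # discriminant direction
        self.b = -0.5 * self.w @ (m1 + m2)      # threshold at the midpoint
        return self
    def predict(self, F):
        return (F @ self.w + self.b > 0).astype(int)  # 1 -> class 1
```

Training features from class c_1 and c_2 are fed to `fit`, and `predict` on the testing features yields the labels from which the accuracy is computed.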

Selection of α, β and η Values
The selection of α and β is one of the crucial steps of the proposed algorithm. Depending on the values of α and β, the AB Log-Det divergence specializes into different divergences [6]. The proposed algorithm performed better when α = β, a situation in which the AB Log-Det divergence is symmetric, i.e., invariant under the permutation of its arguments. In this experiment, we observed the performance for various values of α = β and η, and a suitable configuration of parameters was selected for each dataset.

Results and Discussion
The performance of the proposed Sub-ABLD algorithm is compared with that of the existing algorithms CSP, JADE, MAPCSP and divCSP-WS, for both the synthetic and the real BCI competition datasets. The JADE algorithm performs a joint approximate diagonalization of the trial covariance matrices of the classes [45]. MAPCSP is a Bayesian algorithm that finds the maximum a posteriori estimates of the patterns and sources in a generative model with additive Gaussian isotropic noise [14]. The subspace implementation of divCSP-WS seeks a balance between the maximization of the Beta divergence between the conditional covariances of the filtered outputs for each class and the minimization of the variability within each class [1]. This algorithm contains two hyperparameters: the regularization factor λ and the real scalar β that specifies the chosen Beta divergence. The factor λ admits an equivalence in terms of the regularization parameter η in Sub-ABLD, which links them through the mapping λ ≡ η/(1 + η), while the parameter β* of the Beta divergence was chosen in the simulations to maximize the performance.
In order to carry out a fair performance comparison, a total of six features (i.e., p = 6) has been selected for all the algorithms. The implementations of the JADE and divCSP-WS algorithms were taken from the webpages of their authors: the baseline divCSP-WS algorithm was downloaded from [60], while the implementation of the JADE algorithm can be found at [61]. The performance comparison between all the algorithms is presented in the following subsections.

Observations for Simulated Data
To study the performance of the proposed algorithm in the presence of outliers, the experiment was carried out by increasing the percentage of outlier trials in the training sets of both classes. The performance of the proposed Sub-ABLD algorithm was obtained for (α, β) = (1.5, 1.5) and η = 1. The performances of the algorithms under an increasing percentage of outliers in the training set are presented in Figure 5. It can be observed that CSP and MAPCSP perform worse in the presence of outliers. The performances of JADE and divCSP-WS are much more robust than those of CSP and MAPCSP but, overall, the proposed Sub-ABLD algorithm seems to outperform the compared algorithms in the presence of outliers.

Observations for BCI Competition Datasets
In this section, the proposed algorithm is tested using three BCI competition datasets. For each dataset, the performance of the proposed algorithm was observed for different values of (α, β) and η, and the maximum performance of the Sub-ABLD algorithm, attained for particular (α, β) and η values, was selected. The selected performance is compared with the performances of the other existing algorithms. Further analysis is done by means of a box plot comparison for all the algorithms, which shows the distribution of the performances. In a box plot representation, the line inside the box represents the median performance, the upper and lower hinges of the box denote the 75th and 25th percentiles of the overall performance distribution, and the whiskers, symbolized by the two lines outside the box, represent the maximum and minimum performances observed.
For BCI competition III dataset 3a, Figure 6a compares the highest average performance of the Sub-ABLD algorithm with the average performances of the other existing algorithms. From the figure, it is observed that the Sub-ABLD algorithm outperforms the other algorithms, with an average performance accuracy of 89% on this dataset. The box plot comparison is shown in Figure 6b. Although the median performance is slightly higher for CSP, JADE and divCSP, their 25th-percentile performance is much smaller than that of the Sub-ABLD algorithm. As we will see later, this is a consequence of the fact that, with the Sub-ABLD algorithm, the most difficult subjects attain a significant improvement in their classification performance.
Figure 7 shows the observed average performances for BCI competition III dataset 4a. On this dataset, the algorithms JADE, Sub-ABLD and divCSP-WS perform essentially similarly and slightly above the average performance of CSP, which is 88.1%. From the box plot of the results, we can observe that the 25th percentiles of these four algorithms are also quite close.
Similar results have been obtained for the BCI competition IV dataset 2a, as shown in Figure 8. Again, the algorithms JADE, Sub-ABLD and divCSP-WS perform essentially the same as CSP, whose average performance is 81%. In the box plot, we can observe that the quartiles of these algorithms approximately coincide. To analyze the effect of the different divergences on the performance, we varied the parameters (α, β) for a single subject (subject K6 from BCI competition III dataset 3a, one of the subjects with the worst performance in the experiment) and obtained the corresponding performance. The values of (α, β) were varied to cover the interval [0, 2] × [0, 2] with a mesh of 0.1 spacing. The observed performance is shown in Figure 9. This figure reveals a tendency to improve the classification accuracy of the worst user for values of α and β that are close to the diagonal and large enough to effectively down-weight the contribution, in the estimating equations, of the largest and smallest generalized eigenvalues. The proposed Sub-ABLD algorithm has been tested on both simulated and real EEG signals. On the one hand, the results with synthetic data indicate that the proposed Sub-ABLD exhibits a certain robustness to the presence of outlier trials in the dataset. On the other hand, the analysis of real EEG signals is also challenging because of the possible presence of artifacts and non-stationarities. We have presented the performance of the Sub-ABLD algorithm on several real BCI datasets. For BCI competition III dataset 3a, we can observe that the proposed Sub-ABLD algorithm also outperforms the other algorithms, whereas its performance is almost identical to that obtained by JADE, divCSP-WS and CSP on the other two datasets, i.e., BCI competition III dataset 4a and BCI competition IV dataset 2a. Additionally, the analysis of the box plots reveals that the proposed Sub-ABLD algorithm increased the classification performance of the subjects that do not perform well with the other methods, while retaining an almost identical performance for the remaining subjects. These observations meet our initial goal of developing a robust algorithm. The classification performance is also affected by the regularization parameter η that controls the penalty term. In general, data with outliers give the best performance for higher values of η while, otherwise, smaller values are preferable. In this study, the value of η has been kept constant across the subjects of each dataset.

Conclusions
In this paper, we have explained how to use and optimize the recently proposed family of Alpha-Beta Log-Det divergences. For this purpose, we have summarized the key properties of these divergences and derived an original explicit formula for their gradient. As an illustrative application, we have adopted the problem of spatial filter selection for the classification of two-class imagery movements. We have reexamined the relation between the Common Spatial Pattern criterion with a predefined number of spatial filters for each class and its interpretation as an Alpha-Beta Log-Det divergence optimization problem, showing that a scaling factor in one of the arguments of the divergence is necessary for the equivalence of the solutions. We have proposed a subspace algorithm (Sub-ABLD) for obtaining the spatial filters that retain the discriminative information of two-class MI movements. This algorithm has been tested on synthetic and real datasets and compared with other existing algorithms. The simulations have confirmed the possibility of tuning the hyperparameters of the divergence so as to improve the robustness of the obtained solutions without deteriorating the expected accuracy.
With the help of the permutation π of the indices 1, …, n that sorts the divergences of the eigenvalues from unity in descending order, and of the fact that any Rayleigh quotient is bounded by the maximum and minimum eigenvalues of the associated matrix pencil, it follows that, for the spatial filters that define W_CSP, the eigenvalues λ_{π_1}, …, λ_{π_p} that maximize their divergence from κ should all belong to the upper and lower sets of eigenvalues defined in (57). For this to be true, it is necessary and sufficient that the divergence of the last selected eigenvalue λ_{π_p} from κ upper-bounds, with strict inequality, all the divergences between any inner eigenvalue λ_i and κ.
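This selection rule can be illustrated with a small sketch that sorts the generalized eigenvalues of the pencil (P, Q) by their deviation from unity. Here the absolute log-deviation is used as a simple surrogate for the divergence-based ordering of the paper, and κ = 1 is assumed.

```python
import numpy as np

def csp_filters(P, Q, p):
    """CSP-style selection: generalized eigenvectors of the pencil (P, Q)
    whose eigenvalues deviate most from unity, the deviation being measured
    on a log scale (a surrogate for the per-eigenvalue divergence from kappa)."""
    L = np.linalg.cholesky(Q)
    Li = np.linalg.inv(L)
    lam, Vw = np.linalg.eigh(Li @ P @ Li.T)   # whitened eigenproblem
    V = Li.T @ Vw                             # generalized eigenvectors: P v = lam Q v
    order = np.argsort(-np.abs(np.log(lam)))  # most divergent from unity first
    return V[:, order[:p]], lam[order[:p]]
```

For a diagonal example with eigenvalues {4, 1, 1, 0.25}, the rule picks the extreme eigenvalues 4 and 0.25, i.e., one filter from the upper set and one from the lower set, consistent with the selection policy described above.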

Figure 2.
Figure 2. Illustration of the strictly monotonically increasing transformation g(·) that, through Equation (56), maps the eigenvalues of the matrix pencil (p(c_1)P, Cov(x)) into the eigenvalues of the matrix pencil (P, Q), in a case where the sample probabilities of the classes are uniform, p(c_1) = p(c_2). Note that the eigenvalues of the first pencil are bounded in the interval (0, 1), while the domain of the eigenvalues of the second pencil is (0, ∞).

Figure 9.
Figure 9. Results of the Sub-ABLD algorithm for subject K6 from BCI competition III dataset 3a. This figure illustrates the changes in the average classification performance with respect to the variation of the parameters α and β. Relatively good performance is obtained close to the diagonal and for moderately large values of the parameters.