Motor Imagery Classification via Kernel-Based Domain Adaptation on an SPD Manifold

Background: Recording the calibration data of a brain–computer interface is a laborious process and an unpleasant experience for subjects. Domain adaptation is an effective technology for remedying the shortage of target data by leveraging rich labeled data from source domains. However, most prior methods require extracting features from the EEG signal first, which triggers another challenge in BCI classification due to the small sample sets or the lack of labels for the target. Methods: In this paper, we propose a novel domain adaptation framework, referred to as kernel-based Riemannian manifold domain adaptation (KMDA). KMDA circumvents the tedious feature extraction process by analyzing the covariance matrices of electroencephalogram (EEG) signals. Covariance matrices define a symmetric positive definite (SPD) space that can be described by Riemannian metrics. In KMDA, the covariance matrices are aligned in the Riemannian manifold and then mapped into a high-dimensional space by a log-Euclidean metric Gaussian kernel, where subspace learning is performed by minimizing the conditional distribution distance between the sources and the target while preserving the target discriminative information. We also present an approach for converting EEG trials into 2D frames (E-frames) to further lower the dimension of the covariance descriptors. Results: Experiments on three EEG datasets demonstrated that KMDA outperforms several state-of-the-art domain adaptation methods in classification accuracy, with an average Kappa of 0.56 for BCI competition IV dataset IIa, 0.75 for BCI competition III dataset IIIa, and an average accuracy of 81.56% for BCI competition III dataset IVa. Moreover, the overall accuracy was further improved by 5.28% with the E-frames. KMDA shows potential for addressing subject dependence and shortening the calibration time of motor imagery-based brain–computer interfaces.


Introduction
A brain–computer interface (BCI) provides a direct control pathway between the human brain and external devices, without relying on the peripheral nerve and muscle systems [1]. BCIs have demonstrated potential in medical rehabilitation, education, smart homes, and so on. Most non-invasive BCIs are based on EEG signals, and the neural response patterns are decoded by well-designed algorithms that convert movement intentions into computer commands to control external devices, such as a wheelchair [2], an artificial limb [3], a spelling system [4,5], or a quadcopter [6]. Steady-state visual evoked potential (SSVEP), P300, and motor imagery (MI) are widely studied neural response paradigms for BCIs. SSVEP and P300 have shown breakthroughs in spelling applications [4][5][6], while MI is prized for its simple stimulus paradigm design and allows subjects to express motor intention in a natural way [7]. In this work, we also present a feature construction scheme for converting the EEG time series into 2D frames, which not only fully exploits the electrode positions and frequency bands of the signal, but also further reduces the dimension of the covariance matrix.
To sum up, the main contributions of this paper include the following: (1) KMDA classifies motor imagery tasks without any target labels. (2) KMDA defines a subspace learning framework in an RKHS induced by a kernel on the SPD manifold. (3) KMDA not only minimizes the marginal and conditional distribution discrepancies, but also considers the intra-/inter-class discriminative information of the sources and the principal components of the target. (4) A feature construction scheme is presented to reduce the dimension of the SPD matrices and the computational cost.
The rest of the paper proceeds as follows: Section 2 introduces the Riemannian metric theory of the SPD manifold and the definition of a Gaussian kernel applicable to Riemannian manifolds. Section 3 details our proposed framework, Section 4 provides a detailed description of the experiment design and the results on three datasets, Section 5 presents a series of discussions, and a conclusion is drawn in Section 6.

Preliminaries
This section provides an overview of the geometry of the symmetric positive definite (SPD) manifold and some Riemannian metrics for the kernel method. $\mathrm{Sym}_d^+$ denotes the space of $d \times d$ SPD matrices, and $T_P$ is the tangent space at the point $P \in \mathrm{Sym}_d^+$. $X_i \in \mathbb{R}^{c \times t}$ represents a single trial of recorded EEG signal with $c$ electrodes and $t$ time samples. $C_i$ represents a covariance matrix in Euclidean space, and $P_i$ is a point on the Riemannian manifold. $\|X\|_F = \sqrt{\mathrm{Tr}(X^T X)}$ designates the Frobenius norm, $(\cdot)^T$ denotes the transpose operator, and $\mathrm{Tr}(\cdot)$ is the sum of the diagonal elements. The principal matrix exponential, $\exp(\cdot): \mathrm{Sym}_d \to \mathrm{Sym}_d^+$, is defined as $\exp(X) = U\,\mathrm{diag}(\exp(\lambda_1), \ldots, \exp(\lambda_n))\,U^T$; similarly, the matrix logarithm, $\log(\cdot): \mathrm{Sym}_d^+ \to \mathrm{Sym}_d$, is defined as $\log(X) = U\,\mathrm{diag}(\log(\lambda_1), \ldots, \log(\lambda_n))\,U^T$, with $X = U\,\mathrm{diag}(\lambda_1, \ldots, \lambda_n)\,U^T$. $\mathrm{Exp}_P(\cdot)$ and $\mathrm{Log}_P(\cdot)$ denote the exponential and logarithmic maps at the reference point $P$, respectively.
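As a concrete illustration of these definitions, the following minimal NumPy sketch implements the eigendecomposition-based matrix logarithm and exponential; the helper names are ours, not part of the paper:

```python
import numpy as np

def spd_log(X):
    """Principal matrix logarithm of an SPD matrix: U diag(log(lambda_i)) U^T."""
    lam, U = np.linalg.eigh(X)              # symmetric eigendecomposition
    return U @ np.diag(np.log(lam)) @ U.T

def sym_exp(S):
    """Matrix exponential of a symmetric matrix: U diag(exp(lambda_i)) U^T."""
    lam, U = np.linalg.eigh(S)
    return U @ np.diag(np.exp(lam)) @ U.T

# Round trip: exp(log(P)) recovers P for any SPD matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
P = A @ A.T + 4 * np.eye(4)                 # SPD by construction
assert np.allclose(sym_exp(spd_log(P)), P)
```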

Riemannian Metrics
The covariance matrix of a single trial, $X_i$, was normalized with the total variance, as follows:

$$P_i = \frac{X_i X_i^T}{\mathrm{Tr}(X_i X_i^T)} \quad (1)$$

A covariance matrix is a typical symmetric positive definite (SPD) matrix, $P_i \in \mathrm{Sym}_d^+$, and its determinant is a direct measure of the dispersion of the associated multivariate Gaussian [43]. However, under Euclidean geometry the SPD matrices form a non-complete space [35], which often leads to a swelling effect in interpolation or averaging operations [44]: the determinant of the Euclidean mean can be strictly larger than the original determinants [35,43], introducing spurious variation into the data. To circumvent these issues, Riemannian metrics are adopted for the SPD manifold.
Tangent Space: The covariance matrices of multi-channel EEG signals define an SPD space, which is locally homeomorphic to Euclidean space, i.e., the topological manifold is a locally differentiable manifold [43,45]. The tangent vectors of the curves that pass through each point of the smooth differentiable manifold define a linear approximation space, known as the tangent space. For the SPD manifold, there exists a pair of mappings transporting points from the manifold to its tangent space, and vice versa. Specifically, the logarithmic map embeds the neighbors of a given point into the tangent space with the point as reference, and the exponential map carries a tangent vector back to the manifold:

$$\mathrm{Sym}_d^+ \to T_P: \quad S_i = \mathrm{Log}_P(P_i) = P^{1/2} \log\!\left(P^{-1/2} P_i P^{-1/2}\right) P^{1/2}$$
$$T_P \to \mathrm{Sym}_d^+: \quad P_i = \mathrm{Exp}_P(S_i) = P^{1/2} \exp\!\left(P^{-1/2} S_i P^{-1/2}\right) P^{1/2} \quad (2)$$

As depicted in Figure 1, any vector in $T_P$ is identified with a geodesic starting at point $P$ on the manifold; conversely, any bipoint $(P, P_i)$ can be mapped to a vector of $T_P$. It is worth noting that the tangent space of a manifold is not unique and depends on the reference point. Conventionally, the reference point is either the identity matrix or the Riemannian mean.
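The two maps in (2) can be implemented directly from their definitions. The following NumPy sketch (helper names ours) verifies the round trip $\mathrm{Exp}_P(\mathrm{Log}_P(P_i)) = P_i$:

```python
import numpy as np

def _eig_fun(X, fun):
    """Apply a scalar function to a symmetric matrix via its eigendecomposition."""
    lam, U = np.linalg.eigh(X)
    return U @ np.diag(fun(lam)) @ U.T

def log_map(P, Pi):
    """Log_P(Pi) = P^{1/2} log(P^{-1/2} Pi P^{-1/2}) P^{1/2}, per Eq. (2)."""
    Ph = _eig_fun(P, np.sqrt)
    Pn = _eig_fun(P, lambda lam: 1.0 / np.sqrt(lam))
    return Ph @ _eig_fun(Pn @ Pi @ Pn, np.log) @ Ph

def exp_map(P, Si):
    """Exp_P(Si) = P^{1/2} exp(P^{-1/2} Si P^{-1/2}) P^{1/2}, per Eq. (2)."""
    Ph = _eig_fun(P, np.sqrt)
    Pn = _eig_fun(P, lambda lam: 1.0 / np.sqrt(lam))
    return Ph @ _eig_fun(Pn @ Si @ Pn, np.exp) @ Ph

# Round trip: Exp_P(Log_P(Pi)) recovers Pi.
rng = np.random.default_rng(0)
A, B = rng.standard_normal((2, 4, 4))
P, Pi = A @ A.T + 4 * np.eye(4), B @ B.T + 4 * np.eye(4)
assert np.allclose(exp_map(P, log_map(P, Pi)), Pi)
```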
Riemannian mean: The Riemannian mean is defined as the point minimizing the following metric dispersion:

$$P^* = \mathop{\arg\min}_{P \in \mathrm{Sym}_d^+} \sum_{i=1}^{N} \delta_R^2(P, P_i) \quad (3)$$

where $\delta_R(P^*, P_i)$ denotes a distance suitable for the SPD manifold. Formula (3) does not admit a closed-form solution [41], but it can be solved through an iterative algorithm [30].

Affine-Invariant Riemannian Metric: The affine-invariant Riemannian metric (AIRM) is a powerful and pervasive metric endowed on the SPD manifold, with the properties of uniquely defining the geodesic between two matrices and the mean of a set of matrices.
An arbitrary affine-invariant distance on $\mathrm{Sym}_d^+$ satisfies $\delta_R(P_i, P_j) = \delta_R(A P_i A^T, A P_j A^T)$ for any invertible matrix $A$. Choosing $A = P_i^{-1/2}$, this distance is transformed into a distance to the identity: $\delta_R(P_i, P_j) = \delta_R(I_n, P_i^{-1/2} P_j P_i^{-1/2})$. Based on this, the distance $\delta_R(P_i, P_j)$ can be obtained as the geodesic distance of $P_i^{-1/2} P_j P_i^{-1/2}$ from the identity matrix, which amounts to computing the vector of $P_i^{-1/2} P_j P_i^{-1/2}$ in the tangent space at $I_n$. Hence, the AIRM distance, $\delta_{air}$, between $P_i$ and $P_j$ is defined as:

$$\delta_{air}(P_i, P_j) = \left\| \log\!\left(P_i^{-1/2} P_j P_i^{-1/2}\right) \right\|_F \quad (4)$$

Equivalently, we can write (4) as:

$$\delta_{air}(P_i, P_j) = \left( \sum_{i=1}^{n} \log^2 \lambda_i \right)^{1/2} \quad (5)$$

where $\lambda_i$ are the eigenvalues of $P_i^{-1} P_j$.

Log-Euclidean Metric: The log-Euclidean metric (LEM) also defines a true geodesic distance between two SPD matrices [46], by computing the distance between their corresponding tangent vectors at the identity matrix:

$$\delta_{le}(P_i, P_j) = \left\| \log(P_i) - \log(P_j) \right\|_F \quad (6)$$

Let $P = U \Sigma U^T$ be the eigendecomposition of an SPD matrix $P$, where $\Sigma$ is the diagonal matrix of eigenvalues. Its logarithm is then computed easily as $\log(P) = U \log(\Sigma) U^T$. Compared with the AIRM, the log-Euclidean metric is computationally cheaper, while preserving excellent theoretical properties.
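Both distances reduce to a few lines of NumPy. The sketch below (function names ours) implements (5) and (6):

```python
import numpy as np

def _logm_spd(X):
    """log of an SPD matrix: U diag(log(lambda)) U^T."""
    lam, U = np.linalg.eigh(X)
    return U @ np.diag(np.log(lam)) @ U.T

def airm_dist(Pi, Pj):
    """Eq. (5): sqrt(sum_k log^2 lambda_k), lambda_k the eigenvalues of Pi^{-1} Pj."""
    lam = np.linalg.eigvals(np.linalg.solve(Pi, Pj)).real  # real, positive for SPD inputs
    return np.sqrt(np.sum(np.log(lam) ** 2))

def lem_dist(Pi, Pj):
    """Eq. (6): ||log(Pi) - log(Pj)||_F."""
    return np.linalg.norm(_logm_spd(Pi) - _logm_spd(Pj), 'fro')
```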
In addition to the LEM and AIRM, two further metrics derived from Bregman divergences, namely the Stein and Jeffrey divergences, are extensively used in manifold analysis. The Stein and Jeffrey divergences are symmetric and affine invariant [41], which motivates their use in computing the Riemannian mean. Algorithm 1 illustrates the iterative process of estimating the Riemannian mean under the AIRM.

Algorithm 1. Riemannian mean by AIRM.
Input: a set of SPD matrices $\{P_i\}_{i=1}^{N}$, the maximum iteration number Num, and the termination criterion ε
Output: the Riemannian mean $M_R \in S_n^+$
1. Initialize the reference matrix C with the identity matrix.
2. Calculate the covariance matrices of the training samples $P_i \in S_n^+$ by (1).
3. for k = 1 to Num do
4. Map each matrix $P_i$ to the tangent space at C by (2).
5. Obtain their arithmetic mean $\bar{S}$ in the tangent space.
6. Embed the arithmetic mean $\bar{S}$ back into the manifold by (2), obtaining the corresponding matrix $C_i$.
7. C ← $C_i$.
8. if $\|\bar{S}\|_F < ε$ then break.
9. end for
10. $M_R$ ← C.
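A minimal NumPy sketch of Algorithm 1 might look as follows; the termination test on the tangent-space mean follows our reconstruction of step 8:

```python
import numpy as np

def _eig_fun(X, fun):
    lam, U = np.linalg.eigh(X)
    return U @ np.diag(fun(lam)) @ U.T

def riemannian_mean(mats, max_iter=50, eps=1e-8):
    """Karcher/Frechet mean under the AIRM, following Algorithm 1:
    repeatedly average in the tangent space at the current estimate C."""
    C = np.eye(mats[0].shape[0])            # step 1: initialize with identity
    for _ in range(max_iter):
        Ch = _eig_fun(C, np.sqrt)
        Cn = _eig_fun(C, lambda lam: 1.0 / np.sqrt(lam))
        # steps 4-5: map every P_i to the tangent space at C, then average
        S = np.mean([Ch @ _eig_fun(Cn @ P @ Cn, np.log) @ Ch for P in mats], axis=0)
        # steps 6-7: map the tangent mean back onto the manifold
        C = Ch @ _eig_fun(Cn @ S @ Cn, np.exp) @ Ch
        if np.linalg.norm(S, 'fro') < eps:  # step 8: termination criterion
            break
    return C
```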

Positive Definite Kernel on Manifold
Embedding into an RKHS through kernel methods is a well-established and prevalent approach in machine learning [14]. However, embedding SPD manifolds into an RKHS requires the kernel function to be positive definite. The Gaussian kernel has worked well for mapping data from Euclidean space into an infinite-dimensional Hilbert space. In Euclidean space, the Gaussian kernel is expressed as $\kappa(x_i, x_j) := \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$, which relies heavily on the Euclidean distance between two points. To define a Gaussian kernel applicable to the Riemannian manifold, a natural approach is to replace the Euclidean distance with the geodesic distance, on the premise that the resulting kernel is positive definite.
Therefore, we defined the kernel $\kappa: (\mathrm{Sym}_d^+ \times \mathrm{Sym}_d^+) \to \mathbb{R}$ by $\kappa(P_i, P_j) := \exp(-\delta^2(P_i, P_j)/2\sigma^2)$ for all points $P_1, \ldots, P_N \in \mathrm{Sym}_d^+$. κ is a positive definite kernel for all σ > 0 if and only if the squared geodesic distance $\delta^2(P_i, P_j)$ is negative definite [42]. Herein, we take the log-Euclidean metric as the geodesic distance, so we need to show that $\sum_{i,j=1}^{m} c_i c_j \delta^2(P_i, P_j) \le 0$ for all $m \in \mathbb{N}$ and all $c_1, \ldots, c_m \in \mathbb{R}$ with $\sum_{i=1}^{m} c_i = 0$. It is easy to verify that κ is a symmetric function, $\kappa(P_i, P_j) = \kappa(P_j, P_i)$, for all matrices on the SPD manifold.
We analyzed the negative definiteness of the squared log-Euclidean distance as follows. For any coefficients $c_1, \ldots, c_m$ with $\sum_i c_i = 0$:

$$\sum_{i,j=1}^{m} c_i c_j \, \delta_{le}^2(P_i, P_j) = \sum_{i,j=1}^{m} c_i c_j \left\| \log P_i - \log P_j \right\|_F^2 = -2 \left\| \sum_{i=1}^{m} c_i \log P_i \right\|_F^2 \le 0 \quad (7)$$

where the middle equality follows by expanding the squared norm and using $\sum_i c_i = 0$. Equation (7) proves that the log-Euclidean metric yields a positive definite Riemannian kernel, satisfying Mercer's theorem.
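The identity in (7) is easy to verify numerically. The following sketch (ours) draws random SPD matrices and zero-sum coefficients, and checks that the quadratic form is non-positive and matches $-2\|\sum_i c_i \log P_i\|_F^2$:

```python
import numpy as np

def _logm_spd(X):
    lam, U = np.linalg.eigh(X)
    return U @ np.diag(np.log(lam)) @ U.T

rng = np.random.default_rng(1)
mats = [a @ a.T + 5 * np.eye(5) for a in rng.standard_normal((6, 5, 5))]

c = rng.standard_normal(6)
c -= c.mean()                     # enforce sum(c) = 0, as required in (7)

quad = sum(c[i] * c[j] * np.linalg.norm(_logm_spd(mats[i]) - _logm_spd(mats[j]), 'fro') ** 2
           for i in range(6) for j in range(6))
# Identity from (7): the quadratic form equals -2 * ||sum_i c_i log(P_i)||_F^2 <= 0.
lhs = -2 * np.linalg.norm(sum(c[i] * _logm_spd(mats[i]) for i in range(6)), 'fro') ** 2
assert quad <= 1e-9 and np.isclose(quad, lhs)
```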

Proposed Framework
We assume that the sources have $N_s$ labeled instances $\{(X_i^s, y_i)\}_{i=1}^{N_s}$, where $X_i^s \in \mathbb{R}^{c \times t}$ denotes a single recorded EEG trial in the source domain, and $y_i \in \{1, \ldots, l\}$ is the corresponding label. $X_i^s$ may be collected from one subject or from multiple subjects. $\{X_i^t\}_{i=1}^{N_t}$ is a collection of unlabeled trials from the target. We assume the source and target share the same feature space and label space but, due to dataset shift, differ in both the marginal and conditional probability distributions. We use $\phi(x)$, $x \in \mathrm{Sym}_d^+$, to map a feature to the RKHS.
In this section, we elaborate on the proposed KMDA framework. KMDA aims to classify the unlabeled target data by exploiting the labeled data from multiple source domains. For the sake of simplicity, only one source domain is considered.
In KMDA, we take the covariance matrix of each EEG record as the feature. Covariance matrices define a symmetric positive definite (SPD) space that can be described by Riemannian metrics. Due to individual differences in response patterns and deviations in electrode placement, there is a domain shift between the source and target covariance matrices. Hence, we first performed an alignment in the Riemannian manifold (RA). Subsequently, we embedded the manifold into a high-dimensional Euclidean space through the log-Euclidean Gaussian kernel, where a discriminative subspace was learned. Alternatively, the SPD matrices can be computed from a set of 2D frames converted from the EEG records. Figure 2 shows the overall workflow of KMDA.

Alignment of the SPD Matrices
The correlation alignment (CORAL) method has shown that aligning second-order statistics can effectively mitigate distribution differences across domains [47]. Inspired by CORAL, we propose Riemannian alignment (RA), which aligns the symmetric positive definite matrices directly on the Riemannian manifold and thereby skips the tedious process of extracting features from the EEG signal.
In RA, we first whitened the source domain to remove its correlations:

$$\hat{P}_i^s = \left(\xi_R^s\right)^{-1/2} P_i^s \left(\xi_R^s\right)^{-1/2} \quad (8)$$

Then, we recolored the source with the correlations of the target domain:

$$P_i^{st} = \left(\xi_R^t\right)^{1/2} \hat{P}_i^s \left(\xi_R^t\right)^{1/2} \quad (9)$$

where $P_i^{st}$ denotes the source matrix after recoloring, $\xi_R^s$ is the Riemannian mean of the source, and $\xi_R^t$ is the Riemannian mean of the target, both obtained through Algorithm 1. Equation (9) reconstructs the source matrices using the target Riemannian mean; afterwards, the source and target distributions differ little, so they can be considered to share an identical marginal probability distribution.
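Assuming the whitening/recoloring form of (8) and (9) reconstructed above, RA amounts to two congruence transformations per source matrix, as in this NumPy sketch (names ours):

```python
import numpy as np

def _eig_pow(X, p):
    """Matrix power of an SPD matrix via its eigendecomposition."""
    lam, U = np.linalg.eigh(X)
    return U @ np.diag(lam ** p) @ U.T

def riemannian_alignment(source_mats, xi_s, xi_t):
    """RA sketch: whiten each source SPD matrix with the source Riemannian
    mean xi_s (Eq. (8)), then recolor with the target Riemannian mean xi_t (Eq. (9))."""
    ws = _eig_pow(xi_s, -0.5)    # whitening transform
    ct = _eig_pow(xi_t, 0.5)     # recoloring transform
    return [ct @ (ws @ P @ ws) @ ct for P in source_mats]
```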

Kernel on Riemannian Manifold
Due to the non-Euclidean geometry of Riemannian manifolds, Euclidean algorithms yield inferior results on SPD matrices. We defined a Gaussian radial basis function-based positive definite kernel on the Riemannian manifold to embed the SPD manifold in a high-dimensional Reproducing Kernel Hilbert Space. The kernel makes it possible to utilize algorithms developed for linear spaces on nonlinear manifold data.
We employed the log-Euclidean distance as the Riemannian metric. One reason for this is that the log-Euclidean distance defines the real geodesic distance between two symmetric positive definite matrices, and more importantly, the Gaussian kernel with the log-Euclidean metric yields a positive definite kernel, satisfying the conditions of Mercer's theorem, as proven by (7).
The SPD matrices can be transformed into the RKHS with:

$$\kappa(P_i, P_j) = \exp\!\left(-\frac{\delta_{le}^2(P_i, P_j)}{2\sigma^2}\right) = \exp\!\left(-\frac{\left\|\log P_i - \log P_j\right\|_F^2}{2\sigma^2}\right) \quad (10)$$
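Since the log-Euclidean distance is the Frobenius distance between matrix logarithms, the full Gram matrix of (10) can be computed by flattening the logarithms, as in this sketch (names ours):

```python
import numpy as np

def _logm_spd(X):
    lam, U = np.linalg.eigh(X)
    return U @ np.diag(np.log(lam)) @ U.T

def log_euclidean_gram(mats, sigma=1.0):
    """Gram matrix of the log-Euclidean Gaussian kernel, Eq. (10):
    k(Pi, Pj) = exp(-||log Pi - log Pj||_F^2 / (2 sigma^2))."""
    logs = np.stack([_logm_spd(P).ravel() for P in mats])    # vectorized log-matrices
    sq = np.sum(logs ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * logs @ logs.T     # pairwise squared LEM distances
    return np.exp(-np.clip(d2, 0.0, None) / (2.0 * sigma ** 2))
```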

Learning Mapping Matrix
Since the RA reconstructs the source using the eigenvectors and eigenvalues of the target, we assumed that the marginal distribution of the source and target remains identical in RKHS. The purpose of KMDA is to learn a transformation matrix, W, in RKHS, so as to minimize the conditional divergence of the source and target while maximizing the variance of the target domain and preserving the discriminative information of the source domain as much as possible.
(1) Target Variance: An effective subspace should maximally preserve the principal components of the target and avoid projecting the features onto irrelevant dimensions. Since the target labels are unknown, variance is used to measure the distinguishable information of the target features. Hence, the objective is:

$$\max_{W} \; \mathrm{Tr}\!\left(W^T K_t H_t K_t^T W\right) \quad (11)$$

where $K_t$ is the target kernel matrix and $H_t$ is the centering matrix defined below.

(2) Source Discriminative Information: The discriminative information of the source domain should be preserved in the new subspace. To this end, we exploit the source labels to maximize the distance between classes while minimizing the distance within classes:

$$\max_{W} \; \mathrm{Tr}\!\left(W^T S_b W\right), \qquad \min_{W} \; \mathrm{Tr}\!\left(W^T S_w W\right) \quad (12)$$

with

$$S_w = \sum_{c=1}^{l} \sum_{i=1}^{N_s^{(c)}} \left(\phi(P_i^{(c)}) - m_s^{(c)}\right)\left(\phi(P_i^{(c)}) - m_s^{(c)}\right)^T, \qquad S_b = \sum_{c=1}^{l} N_s^{(c)} \left(m_s^{(c)} - m_s\right)\left(m_s^{(c)} - m_s\right)^T \quad (13)$$

where $S_w$ is the within-class scatter matrix of the source data, $S_b$ is the between-class scatter matrix, $N_s^{(c)}$ is the number of source samples of class $c$, $m_s^{(c)}$ is the mean of the samples from class $c$, and $m_s$ is the mean of all source data.

(3) Conditional Distribution: In the new subspace, the discrepancy between same-class samples of the source and target domains should be small, i.e., the conditional distribution distance should be minimized. We use the maximum mean discrepancy (MMD) as the criterion, with the target class memberships given by pseudo-labels. The resulting objective is:

$$\min_{W} \; \sum_{c=1}^{l} \left\| \frac{1}{N_s^{(c)}} \sum_{P_i \in \mathcal{D}_s^{(c)}} W^T \phi(P_i) - \frac{1}{N_t^{(c)}} \sum_{P_j \in \mathcal{D}_t^{(c)}} W^T \phi(P_j) \right\|^2 = \min_{W} \; \mathrm{Tr}\!\left(W^T K L K^T W\right) \quad (14)$$

where $L$ is the MMD matrix built from the class memberships, and $K_t$, $K_s$, and $K_{ts}$ are the kernel matrices defined by (10) within the target domain, within the source domain, and across the two domains, respectively; $K$ collects them over all $N = N_s + N_t$ samples.
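A JDA-style construction of the MMD matrix $L$, consistent with the description above, is sketched below; the function name and exact interface are ours:

```python
import numpy as np

def conditional_mmd_matrix(ys, yt_pseudo, classes):
    """Build the class-conditional MMD matrix L (JDA-style) so that
    Tr(W^T K L K^T W) is the summed conditional MMD of the projected features."""
    ys, yt = np.asarray(ys), np.asarray(yt_pseudo)
    ns, nt = len(ys), len(yt)
    L = np.zeros((ns + nt, ns + nt))
    for c in classes:
        e = np.zeros(ns + nt)
        src = np.where(ys == c)[0]
        tgt = ns + np.where(yt == c)[0]
        if len(src) == 0 or len(tgt) == 0:
            continue                      # class absent under current pseudo-labels
        e[src] = 1.0 / len(src)
        e[tgt] = -1.0 / len(tgt)
        L += np.outer(e, e)
    return L
```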

(4) Overall Objective Function

Combining all of the above optimization objectives, we formulated the overall objective function of the proposed KMDA method:

$$\max_{W} \; \frac{\alpha \, \mathrm{Tr}\!\left(W^T K_t H_t K_t^T W\right) + \beta \, \mathrm{Tr}\!\left(W^T S_b W\right)}{\mu \, \mathrm{Tr}\!\left(W^T K L K^T W\right) + \beta \, \mathrm{Tr}\!\left(W^T S_w W\right) + \mathrm{Tr}\!\left(W^T W\right)} \quad (15)$$

We simplified (15) as:

$$\max_{W} \; \frac{\mathrm{Tr}\!\left(W^T \left(\alpha K_t H_t K_t^T + \beta S_b\right) W\right)}{\mathrm{Tr}\!\left(W^T \left(\mu K L K^T + \beta S_w + I\right) W\right)} \quad (16)$$

where α, β, and µ are trade-off parameters that balance the importance of each term. By the Lagrange method, we deformed the optimization function into:

$$J = \mathrm{Tr}\!\left(W^T \left(\alpha K_t H_t K_t^T + \beta S_b\right) W\right) + \mathrm{Tr}\!\left(\left(I - W^T \left(\mu K L K^T + \beta S_w + I\right) W\right) \Phi\right) \quad (17)$$

By setting $\partial J / \partial W = 0$, we obtained the generalized eigen-problem:

$$\left(\alpha K_t H_t K_t^T + \beta S_b\right) W = \left(\mu K L K^T + \beta S_w + I\right) W \Phi \quad (18)$$

where $\Phi$ is the diagonal matrix of Lagrange multipliers. The optimal $W^*$ is given by the $k$ leading eigenvectors of the generalized eigen-decomposition (18).

Here $H_t = I_{N_t} - \frac{1}{N_t} \mathbf{1}_{N_t} \mathbf{1}_{N_t}^T$ is the centering matrix, $I_{N_t}$ is the $N_t \times N_t$ identity matrix, and $\mathbf{1}_{N_t}$ is the column vector of all ones. Given a new instance $P_t \in \mathrm{Sym}_d^+$ from the target, its projection $z_t$ into the discriminant subspace is obtained by $z_t = W^{*T} K_{tt}$, where $K_{tt} = [\kappa(P_1, P_t), \ldots, \kappa(P_N, P_t)]^T$ and $N = N_t + N_s$. Classification is performed by the classifier trained on the source data.
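Under our reconstruction of (18), which splits the objective into a matrix A of terms to maximize and a positive definite matrix B of terms to minimize, solving for $W^*$ and projecting a new instance can be sketched with SciPy's generalized symmetric eigensolver (names ours):

```python
import numpy as np
from scipy.linalg import eigh

def solve_transformation(A, B, k):
    """Solve the generalized eigen-problem A w = lambda B w of Eq. (18) and
    return the k leading eigenvectors as W*; B must be positive definite,
    which the identity regularizer in (18) guarantees."""
    vals, vecs = eigh(A, B)                    # ascending generalized eigenvalues
    return vecs[:, np.argsort(vals)[::-1][:k]]

def project(W, K_tt):
    """z_t = W*^T K_tt, with K_tt = [k(P_1, P_t), ..., k(P_N, P_t)]^T."""
    return W.T @ K_tt
```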
The pseudo-code of the KMDA algorithm is given in Algorithm 2.

Algorithm 2. Kernel-Based Manifold Domain Adaptation.
Input: source and target EEG trials, and source labels $Y_s$
Output: transformation matrix $W^*$; target labels $Y_t$
1. Calculate the covariance matrices $P^s$ and $P^t$ of the EEG trials by (1), or calculate the covariance matrices of the 2D frames.
2. Calculate the Riemannian means of the source and target by Algorithm 1.
3. Align the source matrices with the target by (8) and (9).
4. Compute the kernel matrices by (10).
5. Initialize the pseudo-labels of the target domain $Y_t$ using the minimum distance to Riemannian mean algorithm [26].
6. Repeat:
7. Solve the generalized eigen-decomposition problem in (18) and select the $k$ leading eigenvectors as the transformation $W^*$.
8. Obtain the embedded features $z$.
9. Train a classifier $f$ on $\{(z_i^s, y_i)\}_{i=1}^{N_s}$ to update the pseudo-labels of the target domain.
10. Update $L$.
11. Until convergence.
12. Obtain the transformation matrix $W^*$ and the target labels $Y_t$.

The joint geometrical and statistical alignment (JGSA) [48] algorithm is the study most similar to KMDA. JGSA concentrates on finding two coupled projections that embed the source and target data into low-dimensional subspaces, where the domain shift is reduced while simultaneously preserving the target domain properties and the discriminative information of the source data. KMDA improves on it in two respects. First, with the help of Riemannian alignment, KMDA transforms the source and target data into a common space, so it suffices to solve for a single embedded subspace. Second, the features in JGSA must take the form of flattened vectors, whereas KMDA operates on SPD matrices.

(5) Converting Multichannel EEG Signals into 2D Frames
We assumed that a set of EEG signals can be divided into M segments by a sliding window, denoted by $x_i \in \mathbb{R}^{c \times m}$ $(i = 1, 2, \ldots, M)$.
In each segment, we calculated the power of each channel in the 8–30 Hz frequency band. The power spectral density was first estimated with MATLAB's built-in pwelch function, and the bandpower function was then used to extract the power of the alpha and beta rhythms. As a result, a $c \times m$ EEG segment is flattened into a $1 \times c$ vector, with each element corresponding to a power value. To preserve the spatial information among adjacent channels, we further converted the vector $x_i$ into a 2D frame according to the electrode distribution map. Figure 2a illustrates the schematic diagram for 2D EEG frames in the 22-electrode scenario, where the electrodes circled in bold are the selected ones. The constructed frame, $f_i$, must be a square matrix. We filled the central row of the matrix $f_i$ with the electrodes located in the central functional area (marked C1, C2, . . . , Cn), partitioning the matrix into upper and lower parts, each associated with the physical installation positions. We then completed the matrix with the task-related electrodes, while the unused electrodes and the positions without electrodes were represented by the average power of all electrodes, $v$. In this way, the chain-like EEG signals are converted into 2D frame sequences, $[f_1, f_2, \ldots, f_M]$, and each frame $f_i$ embodies the task-related power features and spatial information. The size of a 2D frame is much smaller than that of an EEG signal, which is of great significance for the computation speed on the Riemannian manifold. Taking the 22-electrode setup as an example, the covariance matrix of an EEG trial is 22 × 22, whereas it is 5 × 5 for a 2D frame.
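A hedged sketch of the frame construction is given below. The exact electrode-to-cell layout is dataset-specific and not fully specified here, so `layout` is a hypothetical mapping; the paper uses MATLAB's pwelch and bandpower, for which we substitute scipy.signal.welch:

```python
import numpy as np
from scipy.signal import welch

def segment_to_frame(x, fs, layout, band=(8.0, 30.0)):
    """Sketch of the E-frame construction. `x` is a c x m segment; `layout`
    is a hypothetical dict mapping channel index -> (row, col) grid cell.
    Unused grid cells are filled with the average power v, as in the paper."""
    c, m = x.shape
    f, psd = welch(x, fs=fs, nperseg=min(256, m), axis=1)
    mask = (f >= band[0]) & (f <= band[1])
    power = psd[:, mask].sum(axis=1) * (f[1] - f[0])   # 8-30 Hz band power per channel
    side = int(np.ceil(np.sqrt(c)))                    # e.g., 22 electrodes -> 5 x 5 frame
    frame = np.full((side, side), power.mean())        # default fill: average power v
    for ch, (row, col) in layout.items():
        frame[row, col] = power[ch]
    return frame
```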

Dataset Description
BCI Competition IV Dataset IIa (Dataset IIa) consists of 9 subjects, 'A01', 'A02', . . . , 'A09'. The 4-class cued motor imagery data were recorded from 22 EEG channels at a 250 Hz sampling rate. At t = 2 s, an arrow indicating one of the four classes prompted the subject to perform the desired mental task until the cue disappeared at t = 6 s. Each subject recorded two sessions on different days, one for calibration and the other for evaluation. Each session comprises 6 runs, and each run consists of 48 trials (12 trials per class), yielding 288 trials per session. In our experiment, we removed the trials containing an artifact label, marked with 1 in the h.ArtifactSelection list.
BCI Competition III Dataset IVa (Dataset IVa) contains 2-class EEG signals recorded from 118 channels at a 1000 Hz sampling rate (down-sampled to 100 Hz in this paper) from 5 subjects, named 'AA', 'AY', 'AW', 'AL', and 'AV'. For each subject, a total of 280 cue-based trials are available. In each trial, a cue was displayed for 3.5 s, during which the subject performed one of two MI tasks: right hand or right foot. The cues were separated by rest periods of random length, 1.75 to 2.25 s, in which the subject could relax.
BCI Competition III Dataset IIIa (Dataset IIIa) is a 4-class EEG dataset (left hand, right hand, foot, tongue) from 3 subjects ('K3', 'K6', 'L1'), recorded by 60 channels, sampled at 250 Hz. The dataset consists of several runs, with 40 trials for each run. After the beginning of each trial, the subject rested in the first 2 s, then performed the indicated mental task from t = 3 s to t = 7 s. Each of the 4 cues appeared 10 times per run in a random order.
In our experiments, after the removal of the EEG baseline drift, all datasets were filtered by a 6th-order 8–30 Hz bandpass filter. The calibration and evaluation trials of Dataset IIa were extracted from the 2.5 to 4.5 s time interval recommended by the competition winner, while the trials of Dataset IVa and Dataset IIIa were extracted using a 3 s window starting 0.5 s after the cue onset.
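The preprocessing can be reproduced with SciPy; the sketch below assumes a zero-phase Butterworth design (an assumption on our part, since the paper only states a 6th-order 8–30 Hz bandpass):

```python
import numpy as np
from scipy.signal import butter, filtfilt, detrend

def preprocess_trial(x, fs, band=(8.0, 30.0), order=3):
    """Remove baseline drift and apply an 8-30 Hz bandpass, as in the paper.
    filtfilt squares the magnitude response, so order=3 gives a 6th-order
    overall response; the zero-phase choice is our assumption."""
    x = detrend(x, axis=1)                                   # crude baseline-drift removal
    b, a = butter(order, [band[0], band[1]], btype='bandpass', fs=fs)
    return filtfilt(b, a, x, axis=1)
```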

Experiment Design
We verified the merits of the proposed KMDA using three datasets, and compared it with the state-of-the-art domain adaptation algorithms. Table 1 presents the descriptions of the concerned methods. Except for MEKT, none of the other control algorithms were originally designed for EEG analysis, and we adapted them slightly to fit the experimental situation.
Feature Extraction: MEKT maps the covariance matrices into the tangent space at the identity matrix, yielding a collection of corresponding tangent vectors, while the other algorithms concatenate the covariance matrices into flattened vectors. For BCI Competition III Dataset IVa, for instance, the size of each trial was 118 × 300 and the size of its covariance matrix was 118 × 118, so the corresponding concatenated vector was 1 × 13,924, and the tangent-space vector was 1 × 7021. Such high-dimensional vectors place high demands on the size of the training set; otherwise, the model will overfit. Therefore, we proposed converting multichannel EEG signals into 2D EEG frames to reduce the dimension. In the follow-up experiments, all the control methods took the covariance matrices of the 2D frames as input, so for Dataset IVa, the dimension of the corresponding tangent vector was 1 × 66, and the dimension of the concatenated vector was 1 × 121. For KMDA, both the covariance matrices of the 2D frames and those of the EEG signals are discussed. For ease of differentiation, KMDA refers to the model taking the covariance matrices of the 2D frames as input, while e-KMDA takes the covariance matrices of the EEG signals as input. The matrix dimensions before and after the conversion for the three datasets are shown in Table 2.

Table 1. Compared algorithms and parameters in the experiment.

SA: a linear transformation on the principal components [49]; parameters: none.
CORAL: aligning the second-order statistics of the features [47]; parameters: none.
GFK: the principal components of the source and the target are regarded as two points on the Grassmann manifold, and a geodesic flow kernel (GFK) is obtained by integrating the geodesics between the two points [50].
TCA: minimizing the marginal probability distribution difference in RKHS [13]; parameters: none.
JDA: minimizing the joint distribution difference of marginal and conditional probability in RKHS [14]; parameters: λ = 0.1.
JGSA: seeking two coupled projections that embed the source and target data into low-dimensional subspaces, where the domain shift is reduced while preserving the target domain properties and the discriminative information of the source data simultaneously [48].
MEKT: whitening the covariance matrices of source and target in the Riemannian manifold, and learning two subspaces to reduce the domain divergences [33].
KMDA: our algorithm; parameters: α = µ = 1, β = 0.1.

Hyper-parameters: Since the target data are assumed to be unlabeled, cross-validation is not applicable to parameter selection. We set the parameters of (17) to α = µ = 1 and β = 0.1, the number of iterations to T = 15, and the dimension of the subspace to k = d, where d is the dimension of the SPD matrix. The hyper-parameters of the other algorithms were set according to the recommendations in the corresponding literature.
Classifier: The k-Nearest Neighbor Classifier (KNN) was used for all methods. To facilitate the calculation, we fixed k = 3 for all the experiments.
Data setting: All our experiments were carried out on the calibration data from three datasets. Since the feature distributions of 'A08', 'AL', and 'K3' were distinguishable, they were treated as the source data for the corresponding dataset, and the rest of the subjects were taken as the target. Therefore, we had 8 + 4 + 2 = 14 transfer scenarios.
Measurement: For the classification evaluation of the four tasks (Dataset IIa and Dataset IIIa), we opted for the Kappa coefficient recognized by BCI competition, while for binary classification (Dataset IVa), we used accuracy as the evaluation index.

Results
Validation of Riemannian Alignment: Among the compared methods (as described in Table 1), except for JDA and JGSA, all the methods contained unsupervised distribution alignment of the source and target domains in Euclidean space or manifold space. Figure 3 visualizes the distributions after unsupervised alignment by the investigated methods in transferring subject 'AL' to subject 'AA' from DIVa by t-SNE [51]. The results indicated that the proposed RA not only aligns the marginal distributions of the source and target domain well, but also minimizes the distance between features of the two domains while preserving the characteristic of target distribution. Specifically, (a) to rectify the mismatch in distribution, TCA capitalizes on subspace learning [13], while GFK resorts to a shared space [50] to match the data distributions of different domains; however, both methods ignore the distribution characteristics of the target. (b) Both CORAL and SA align source data in the direction of the target domain. SA reconstructs the source data with the principal components of the target [49], and CORAL restructures the source data with all the eigenvectors of the target covariance matrix [47]. However, they fail to take into account the particularity of the covariance matrix as a feature, and the geometry of the SPD manifold. (c) The alignment approaches of MEKT and KMDA perform a parallel transport on the cone manifold of the SPD matrices to align the source with the target domain. However, MEKT whitens the covariance matrices of the source and target, resulting in an identical and uniform distribution [33], which completely destroys the characteristics of the target. By contrast, KMDA aligns the source covariance matrices with the target, yielding a set of covariance matrices formally similar to those of the target and consistent with the principal axis of the target, thus minimizing the domain shift while preserving the distribution characteristics of the target.
Validation of Subspace Learning: JDA, JGSA, MEKT, and KMDA aim to learn a discriminative subspace by leveraging labeled source data. Figure 4 depicts the results of transferring subject 'AL' to subject 'AA' using the four domain adaptation approaches. As shown in Figure 4, the raw source and target domains are distributed differently, and both their marginal and conditional distributions diverge widely. JDA and JGSA minimize the discrepancy of the marginal and conditional distributions between the source and target, rather than the distance between features. MEKT and KMDA not only minimize the distribution divergence, but also bring the features of the same class in the two domains maximally close. Compared with MEKT, however, KMDA preserves more of the target distribution characteristics.

Classification accuracy: We evaluated KMDA and the other methods (listed in Table 1) in different cross-domain scenarios. The baseline refers to the result of classifying the target data directly with a classifier trained on the source. Table 3 reports the Kappa values for the four mental tasks, and Table 4 shows the accuracies on Dataset IVa. We observed that the domain adaptation methods improved transfer performance to varying degrees. Overall, KMDA achieved the best performance: its average Kappa values were 0.56 (Dataset IIa) and 0.75 (Dataset IIIa), and its average accuracy was 81.56% (Dataset IVa), which are 0.08, 0.05, and 5.28% higher than those of e-KMDA, respectively, indicating that the 2D frame framework helps to improve performance. The results of a Wilcoxon signed-rank test further confirmed the significant superiority of KMDA over the other methods.
Parameter Sensitivity: We analyzed the parameter sensitivity of KMDA in the 'A08->A03' scenario. The objective function of KMDA (17) contains three trade-off parameters: α, µ, and β balance the principal components of the target domain, the discrepancy of the conditional probability distributions between the source and target, and the within- and between-class variance of the source, respectively. Since α only involves the target domain, while µ and β involve the source domain, the evaluation can be reduced to evaluating µ and β with α = 1. Figure 5a demonstrates that the optimal values of µ and β are not unique, and a large range of µ (µ ∈ [0.4, 1]) and β (β ∈ [0.001, 0.4]) yields satisfactory performance. This is partly explained by the fact that, when β exceeds 0.4, the model overfits due to excessive attention to the discriminative information of the source.

Computation Complexity: We validated the convergence of KMDA and checked the computation costs of the JDA, JGSA, MEKT, and KMDA/e-KMDA methods. Figure 5b,c demonstrate the results in the 'A08->A03', 'AL->AA', and 'K3->L1' scenarios. As can be seen from Figure 5b,c, the classification accuracy of KMDA improved with the number of iterations, and the distribution distance gradually decreased and converged within 5 iterations. Figure 6 reports the corresponding computation costs.

Covariance Matrix as a Feature
The results in Tables 3 and 4 not only prove the effectiveness of the KMDA (e-KMDA) method, but also imply that it is feasible to use the covariance matrix as a feature. Moreover, the results of KMDA (the covariance matrix of the 2D frame as the feature) are superior to those of e-KMDA (the covariance matrix of the EEG signal as the feature), which demonstrates that the framework for converting multichannel EEG signals into 2D frames improves the accuracy, owing to the fact that it considers both the electrode positions and the power spectrum characteristics of the signal. To further investigate the advantages of the 2D frame feature, we conducted a comparative experiment with CSP-based variants, comparing KMDA (e-KMDA) with CCSP [25], SGRM [11], and CSP. Note that CSP is a supervised feature extraction method for binary data; for simplicity, only the left- and right-hand motor imagery data of Dataset IIa and Dataset IIIa were considered in this experiment. The source data were set according to the description in Section 4, and 20 labeled trials per class were randomly selected from the target. Note that the KMDA (e-KMDA) prototype was designed for unlabeled target data; when using the labels of the target data, we simply executed steps 7-9 of Algorithm 2 once. We opted for an RBF-based SVM (LibSVM Toolbox) as the classifier, whose parameters (-g gamma, -c cost) were determined by the built-in k-fold cross-validation function.
The results are depicted in Table 5. The classification accuracy was improved to varying degrees compared with the baseline CSP algorithm with the help of source data. The results of methods characterized by the SPD matrix were superior to the compared CSP variants with variance-based features. Specifically, the proposed KMDA method achieved the highest average accuracy of 78.54%, and e-KMDA came second with 75.79%. We conducted a Wilcoxon signed rank test on the accuracies to investigate the significance of the difference between KMDA and the other methods (p < 0.05). The results confirmed the significant superiority of KMDA (e-KMDA) over CSP in small target training sets.

Classification on Different Training Sets
Generally, the proposed KMDA is a transductive method, as are the methods listed in Table 1. However, KMDA is also applicable to the classification of unseen test data. Herein, we explored the effectiveness of KMDA in an inductive setting with varying numbers of labeled training samples from the target. We implemented this experiment on Dataset IVa and Dataset IIa, with the source data set as stated in Section 4. For the target setting, we considered only the left- vs. right-hand task of Dataset IIa, with the calibration data for training and the evaluation data for testing. Dataset IVa was divided in half, one half for training and the other for testing. A given number of labeled samples were randomly selected from the training set, and the average accuracy over 5 repetitions was taken as the final output for the current subject.
SVM in tangent space (TSVM) [27,33,35] is a pervasive classifier for SPD matrices' classification. To investigate the effectiveness of KMDA in the small-sample training set, we compared the results of TSVM with those of KMDA using SVM (LibSVM toolbox) as a classifier. Tables 6 and 7 depict the accuracies of KMDA with different numbers of labeled training samples. It was observed that most results of KMDA outperformed those of TSVM, and the KMDA scores tended to be higher for the target subjects who performed well on TSVM, which demonstrated that KMDA could boost the classifier performance with knowledge from the source. However, we also found that KMDA domain adaptation was not always effective for all subjects in any scenario, and we expound the reason in the Limitations Section below.

Limitations
We further observed the results in Tables 6 and 7 and found that, for subjects A03, A08, A09, and AL, when the training samples of the target exceeded 20 trials per class, the results of TSVM were higher than those of KMDA, i.e., the source data impaired the classification performance on the target, resulting in negative transfer (NT). The reason may be that, for subjects who are good at motor imagery, the feature distributions are already discriminative, and the source data instead become feature noise that hinders the generalization of the model. Due to pervasive individual differences, generalization from the source to the target is often limited by NT, unless the distributions of the source and target are close and the tasks are similar [52]. To cope with NT, existing research can be summarized into four main categories: source data quality, target data quality, domain divergence, and integrated algorithms. For a more detailed survey on NT, please refer to [52]. Since the essence of the proposed KMDA is to learn a cross-domain feature representation, we consider improving the performance of KMDA from two aspects: improving the quality of the source data and reducing the domain distribution discrepancy.
Source instance selection/weighting attempts to make the features of the source closer to those of the target by selecting similar instances or adjusting their weights. TrAdaBoost [53] is a typical instance-based boosting approach that increases the weight of a source instance if the corresponding instance is correctly classified, and vice versa. In addition, the similarity measures commonly used in the literature include the Kullback-Leibler divergence [20], cosine similarity [52], the MMD distance [13,14,17], and domain transferability estimation [33]. In KMDA, we leveraged the pseudo-labels to calculate the conditional distribution distance of the embedded features. However, pseudo-labels and embedded features are two common unstable factors during transfer [52]. Therefore, it is of great significance to design a transfer model robust to feature noise (caused by the embedded features) and class noise (induced by the pseudo-labels) in the subspace learning process. For KMDA, we can introduce a sparsity-regularized term on the projection matrix into the objective function (15) to model the feature noise.

Conclusions
This paper proposed a kernel-based Riemannian manifold domain adaptation approach for motor imagery-based BCI classification. Compared with existing cross-subject EEG transfer works, KMDA (1) describes the EEG trials with their covariance matrices, (2) aligns the SPD matrices of the sources and the target in the Riemannian manifold, (3) exploits the Gaussian kernel based on the log-Euclidean metric to map the SPD matrices into a high-dimensional Reproducing Kernel Hilbert Space, and then (4) performs domain adaptation by minimizing the probability distribution distance between the source and the target while preserving the target's distinct information and the discriminative information of the sources. An optional descriptor of the EEG trial is presented to convert the chain-like EEG signal into 2D frames while preserving the spatial distribution. Extensive experiments on three motor imagery BCI datasets validated the effectiveness of KMDA in cross-subject adaptation. In brief, this paper presented a domain adaptation method that transfers knowledge from auxiliary EEG databases to the target subject, overcomes the subject dependence of the BCI system, and shortens the training time of the model. However, we note that not all source data produced positive effects. When the target data were more discriminative than the source data, the effectiveness of KMDA could not be guaranteed, resulting in negative transfer. Under the KMDA framework, the effect of domain adaptation can be further improved in two ways: (1) by selecting similar instances or adjusting weights so that the source features become closer to the target features, thereby improving the quality of the source data, and (2) by improving the generalization of classifiers trained on the source data, thereby reducing the discrepancies between the domain distributions.