Regularized RKHS-Based Subspace Learning for Motor Imagery Classification

Brain–computer interface (BCI) technology allows people with disabilities to communicate with the physical environment. One of the most promising signals is the non-invasive electroencephalogram (EEG) signal. However, due to the non-stationary nature of EEGs, a subject's signal may change over time, which poses a challenge for models that work across time. Recently, domain adaptive learning (DAL) has shown superior performance in various classification tasks. In this paper, we propose a regularized reproducing kernel Hilbert space (RKHS) subspace learning algorithm with the K-nearest neighbor (KNN) classifier for the task of motor imagery signal classification. First, we reformulate the framework of RKHS subspace learning with a rigorous mathematical derivation. Second, since the commonly used maximum mean discrepancy (MMD) criterion measures the distribution difference based on the mean value only and ignores the local information of the distribution, a regularization term of source domain linear discriminant analysis (SLDA) is proposed for the first time; it reduces the variance of same-class data and increases the variance between different classes to optimize the distribution of the source domain data. Finally, the RKHS subspace framework is constructed sparsely, considering the sensitivity of the BCI data. We test the proposed algorithm first on four standard datasets, and the experimental results show that the baseline algorithms improve their average accuracy by 2–9% after adding SLDA. In the motor imagery classification experiments, the average accuracy of our algorithm is 3% higher than that of the other algorithms, demonstrating the adaptability and effectiveness of the proposed algorithm.


Introduction
Non-invasive BCIs enable people to communicate with electronic devices by analyzing the electrical or magnetic signals generated by the brain's nervous system. Due to its non-invasiveness, low cost, portability and high temporal resolution among brain activity monitoring modalities, electroencephalography (EEG) has been used in many non-invasive BCI studies [1]. Depending on the strategy used to control the device, BCI systems can be classified as endogenous or exogenous [2]. Exogenous BCI systems are based on evoked activities that require external stimuli, such as visual evoked potentials. In contrast, endogenous BCIs are based on spontaneous activities, such as motor imagery (MI), in which the subject needs to focus on a specific mental task [3].
Motor imagery signals arise when a subject imagines moving a body part in the absence of actual movement. Different MI tasks lead to oscillatory activity observed in different areas of the sensorimotor cortex of the brain [4]. Various MI-based BCI applications have been used for rehabilitation, such as wheelchair and prosthetic control for disabled patients [5][6][7][8], and for recreation in healthy individuals [9,10].
EEG signals are prone to be affected by individual mental states, such as mood and attention. In BCI MI experiments, subjects are asked to repeat the motor imagery tasks over multiple sessions. Li et al. [26] added a regularization term to minimize the variance of the target domain data in the subspace. Our experiments show that their algorithm improves the classification accuracy to some extent, but it ignores the optimization of the source domain data and its labels. In domain adaptation, labeled source domain data is an important source of information, and how to use the label information is always the focus of domain adaptation algorithms. Lei et al. [27] applied dictionary learning to the source domain, whereas we borrow the idea of LDA.
In this paper, we develop a new approach based on RKHS subspace learning and apply it to motor imagery recognition. It learns the coefficients of an RKHS subspace so that the difference in data distribution across domains is reduced when the data are projected onto that subspace. Machine learning approaches, such as classification and regression models, can then be used in this subspace. Additionally, to make full use of the source domain information, we propose a source linear discriminant analysis (SLDA) regularization term. Moreover, considering the sensitivity of BCI data, we sparsely construct the RKHS subspace framework using the L2,1 criterion. The primary contributions of this paper are summarized as follows: (1) We reformulate the RKHS subspace learning framework (RKHS-DA) and propose the SLDA regularization term to remedy the deficiency of MMD in domain adaptation. (2) To address the complexity and instability of EEG signals, we select features in the low-dimensional subspace by constraining the coefficient matrix with the L2,1 criterion. (3) Experimental results show that the average accuracy of our algorithm is 3% higher than that of the other algorithms.
The remainder of this paper is organized as follows. Section 2 presents a general description of our approach. Section 3 describes the proposed framework and the SLDA regularization terms in detail. We validate our SLDA regularization and RKHS subspace learning framework, and the experimental results are presented in Section 4.

Notations
In this paper, we use a combination of letters and numbers to represent data. A sample is denoted as a vector; e.g., the ith sample of x in a set is denoted as x_i. We also use the subscripts s and t to indicate the source domain and the target domain, respectively. For a matrix M, the trace of M is denoted by tr(M). For clarity, the frequently used notations and corresponding descriptions are shown in Table 1.

Definition (inner product space [28]): let H be a linear space over the real number field R and ⟨·, ·⟩ : H × H → R a map with the following properties: (1) Positive definiteness: for all x ∈ H, ⟨x, x⟩ ≥ 0, and ⟨x, x⟩ = 0 ⇔ x = 0; (2) Symmetry: for all x, y ∈ H, ⟨x, y⟩ = ⟨y, x⟩; (3) Bilinearity: for all x, y, z ∈ H and α, β ∈ R, ⟨αx + βy, z⟩ = α⟨x, z⟩ + β⟨y, z⟩. Then ⟨·, ·⟩ is called an inner product on H, and (H, ⟨·, ·⟩) is an inner product space.
Let x be an element of an inner product space (H, ⟨·, ·⟩). The norm on H is induced by the inner product: ||x|| = √⟨x, x⟩, which is well defined by the positive definiteness of the inner product. If every Cauchy sequence converges in this inner product space, then the space is called a Hilbert space.

Definition of Reproducing Kernel Hilbert Space (RKHS)
Let H = { f | f : Ω → R, ∫_Ω |f(x)|² dx < +∞ } be the space of square-integrable functions; clearly H is a linear space. We define ⟨·, ·⟩ : H × H → R by ⟨f, g⟩ = ∫_Ω f(x)g(x)dx for any f, g ∈ H. It can be shown that ⟨·, ·⟩ is an inner product and that (H, ⟨·, ·⟩) is a Hilbert space. Further, if there is a k : Ω × Ω → R satisfying (1) for any x ∈ Ω, k_x = k(·, x) ∈ H; and (2) for any x ∈ Ω and f ∈ H, ⟨f, k_x⟩ = f(x) (the reproducing property), then H is called a reproducing kernel Hilbert space (RKHS), and k is the reproducing kernel of H. Using the reproducing kernel k, we can define the mapping φ : Ω → H: for any x ∈ Ω, φ(x) = k(·, x) = k_x ∈ H. From the reproducing property, it can be proved that ⟨φ(x), φ(y)⟩ = k(x, y) for any x, y ∈ Ω.
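As an illustrative NumPy sketch (the paper's experiments are implemented in MATLAB; the Gaussian RBF kernel is used here only as a standard example of a reproducing kernel), the Gram matrix of a valid reproducing kernel is always symmetric and positive semi-definite:

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gram matrix k(x, y) = exp(-gamma * ||x - y||^2) for row-sample arrays."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
K = rbf_kernel(X, X)
assert np.allclose(K, K.T)                     # symmetry
assert np.min(np.linalg.eigvalsh(K)) > -1e-10  # positive semi-definiteness
assert np.allclose(np.diag(K), 1.0)            # k(x, x) = 1 for the RBF kernel
```

The two asserted properties are exactly what makes the Gram matrix K usable in the subspace construction of Section 3.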

Hilbert Subspace Projection Theorem
Definition (projection): let (H, ⟨·, ·⟩) be an inner product space and A a subspace of H. For x₀ ∈ H, if x₀ can be decomposed as x₀ = x̄₀ + z with x̄₀ ∈ A and z orthogonal to A, then x̄₀ is called the projection of x₀ onto A. Projection theorem: let (H, ⟨·, ·⟩) be an inner product space, A a finite-dimensional subspace of H and {e₁, ..., e_d} an orthonormal basis of A. For any x₀ ∈ H, the projection x̄₀ of x₀ onto A is x̄₀ = Σ_{i=1}^{d} ⟨x₀, e_i⟩ e_i. Remark 1. A is a finite-dimensional subspace of H; therefore, A is complete, i.e., A is a Hilbert subspace, so the projection of any point in H onto A exists.
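The projection theorem can be illustrated numerically for the finite-dimensional case H = R^n (a hypothetical NumPy sketch; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 3
# Columns of E form an orthonormal basis {e_1, ..., e_d} of a subspace A of R^n.
E, _ = np.linalg.qr(rng.normal(size=(n, d)))
x0 = rng.normal(size=n)

# Projection theorem: proj(x0) = sum_i <x0, e_i> e_i.
proj = E @ (E.T @ x0)

# The residual x0 - proj(x0) is orthogonal to every basis vector of A.
assert np.allclose(E.T @ (x0 - proj), 0.0)
# proj(x0) is the closest point of A to x0 (checked against a perturbation in A).
assert np.linalg.norm(x0 - proj) <= np.linalg.norm(x0 - (proj + 0.1 * E[:, 0]))
```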

Domain Adaptation Learning and MMD
There are two datasets in the data space Ω: the labeled source domain data X_s = {x_s1, ..., x_sNs} ⊆ Ω and the unlabeled target domain data X_t = {x_t1, ..., x_tNt} ⊆ Ω, whose distributions in the data space are different. We need to classify X_t based on X_s; this problem is domain adaptation learning. In our work, we resort to MMD [22], a nonparametric metric of the distance between distributions. The source domain data X_s and target domain data X_t are transformed onto the RKHS H generated by the reproducing kernel k, i.e., Φ(X_s) = {φ(x_s1), ..., φ(x_sNs)} and Φ(X_t) = {φ(x_t1), ..., φ(x_tNt)}. In this way, the distributions of Φ(X_s) and Φ(X_t) in the RKHS H can be made as similar as possible, and the similarity can be measured by MMD: MMD(X_s, X_t) = || (1/N_s) Σ_{i=1}^{N_s} φ(x_si) − (1/N_t) Σ_{j=1}^{N_t} φ(x_tj) ||_H, where φ(·) is the mapping defined by the reproducing kernel k.
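A biased empirical estimate of the squared MMD can be computed entirely from kernel evaluations. The NumPy sketch below is illustrative (an assumed RBF kernel; `mmd2` is our own helper name):

```python
import numpy as np

def rbf(X, Y, gamma=1.0):
    """RBF kernel Gram matrix between two row-sample arrays."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def mmd2(Xs, Xt, gamma=1.0):
    """Biased estimate of the squared MMD between two samples in the RKHS."""
    return (rbf(Xs, Xs, gamma).mean()
            + rbf(Xt, Xt, gamma).mean()
            - 2.0 * rbf(Xs, Xt, gamma).mean())

rng = np.random.default_rng(0)
same = mmd2(rng.normal(size=(200, 2)), rng.normal(size=(200, 2)))
shifted = mmd2(rng.normal(size=(200, 2)), rng.normal(size=(200, 2)) + 3.0)
assert shifted > same   # a distribution shift increases the MMD distance
assert same > -1e-12    # the biased squared-MMD estimate is nonnegative
```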
In practice, it is not easy to learn an optimal RKHS H based on MMD. Most MMD-based methods instead learn a linear subspace spanΘ of the RKHS H, so the MMD distance can be expressed as MMD(X_s, X_t) = || (1/N_s) Σ_{i=1}^{N_s} Φ_spanΘ(x_si) − (1/N_t) Σ_{j=1}^{N_t} Φ_spanΘ(x_tj) ||, where Φ_spanΘ(X_s) and Φ_spanΘ(X_t) denote the projections of Φ(X_s) and Φ(X_t) onto the subspace. The reproducing kernel of the RKHS should be used to construct the transformation from the original data space to the RKHS, rather than first defining a transformation and then using that transformation and the inner product of the RKHS to define a so-called "kernel function", which is not actually the reproducing kernel of the RKHS. However, many studies have used the reproducing kernel to define the transformations from the original data space to the RKHS while ignoring the connection between the original data space and the RKHS. Therefore, we reformulate the mathematical framework of RKHS subspace learning in this section.

Domain Adaptation
Let (H, ⟨·, ·⟩) be the RKHS on the data space Ω, and use the reproducing kernel k of H to define the transformation from the data space Ω to H, φ : Ω → H: for any x ∈ Ω, we define φ(x) = k(·, x) ∈ H, so for any x, y ∈ Ω, we have ⟨φ(x), φ(y)⟩ = k(x, y). Now, given a set of data X = {x₁, ..., x_N} on the data space Ω, the feature map φ(·) is used to transform X to H: Φ(X) = {φ(x₁), ..., φ(x_N)}. The kernel matrix K is represented as K = [k(x_i, x_j)]_{N×N}, where k(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩, and K_iCol is the ith column vector of K, i = 1, ..., N.
3.1.2. The Construction and Restraint of the RKHS Subspace

Φ(X) is used to construct a basis of a subspace of H. We define θ_i = Φ(X)W_iCol, where W_iCol is the ith column vector of W ∈ R^{N×d}, i = 1, ..., d. We use Θ = {θ₁, ..., θ_d} to span a subspace of H: spanΘ. For Θ to constitute an orthonormal basis of the subspace spanΘ, Θ must satisfy the pairwise orthonormality condition ⟨θ_i, θ_j⟩ = δ_ij, which in matrix form is W^T K W = I_d. The subspace spanΘ is a d-dimensional subspace determined by the transformed data and the combination coefficient matrix W, which satisfies the above constraint.

Representation of Data in the RKHS Subspace
Since the RKHS is an infinite-dimensional space, machine learning algorithms cannot be applied to it directly, so the data in the RKHS need to be projected into a subspace of the RKHS. According to the projection theorem, if {θ₁, ..., θ_d} is the orthonormal basis of the subspace spanΘ, then the jth coordinate of the projection of φ(x_i) onto spanΘ is ⟨φ(x_i), θ_j⟩, j = 1, ..., d; in matrix form, y_i = W^T K_iCol. By constructing the subspace of the RKHS, we implement the transformation of the data from the original data space Ω to the Euclidean space R^d. The working space is the Euclidean space R^d, and W is determined according to the specific machine learning task. The orthonormal basis of the subspace is constructed from a linear combination of the transformed samples, subject to the orthonormality requirement above.

The source domain data X_s is labeled while the target domain data X_t is unlabeled. We combine them as X = X_s ∪ X_t with N = N_s + N_t. Using the RKHS subspace learning framework proposed in Section 3.1, we have Y_s = {y_s1, ..., y_sNs} and Y_t = {y_t1, ..., y_tNt}, where y_si = W^T K_iCol, i = 1, ..., N_s, and y_ti = W^T K_(N_s+i)Col, i = 1, ..., N_t. In the expressions of Y_s and Y_t, the matrix W is unknown and represents the subspace of the RKHS; the desired distribution of Y_s and Y_t in the Euclidean space R^d can be achieved by learning W. To measure the difference between the two distributions, the MMD between X_s and X_t can be calculated as MMD = || (1/N_s) Σ_{i=1}^{N_s} y_si − (1/N_t) Σ_{j=1}^{N_t} y_tj ||. With y_si = W^T K_iCol and y_tj = W^T K_(N_s+j)Col, this MMD distance can be rewritten in the trace form tr(W^T L W), where L is the MMD matrix built from K and the sample counts N_s and N_t. MMD is an approximate criterion rather than an exact one; therefore, it is common practice to add regularization terms to compensate for the deficiency of MMD. In transfer learning, KNN is a commonly used classifier. To improve the classification efficiency of KNN, we consider reducing the within-class scatter of the source domain data while increasing the between-class scatter.
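The representation y_i = W^T K_iCol and the orthonormality constraint W^T K W = I_d can be checked numerically. In this hypothetical NumPy sketch, W is built from a whitened eigenbasis of K merely to obtain one feasible choice, not the W learned by the algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
Ns, Nt, d = 8, 6, 3
X = rng.normal(size=(Ns + Nt, 4))  # stacked source and target samples
# RBF Gram matrix K with K_ij = k(x_i, x_j).
K = np.exp(-0.5 * np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))

# One feasible W satisfying W^T K W = I_d, built from the eigen-decomposition
# K = U Sigma U^T (an illustrative choice, not the learned solution).
vals, U = np.linalg.eigh(K)
W = U[:, -d:] / np.sqrt(vals[-d:])
assert np.allclose(W.T @ K @ W, np.eye(d), atol=1e-8)

# Subspace coordinates of every sample: y_i = W^T K_iCol.
Y = K @ W               # row i equals (W^T K_iCol)^T, since K is symmetric
Ys, Yt = Y[:Ns], Y[Ns:] # source and target representations in R^d
assert Ys.shape == (Ns, d) and Yt.shape == (Nt, d)
```

Once Y_s and Y_t live in R^d, any standard classifier (here KNN) can be applied.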
Since the target domain is usually unlabeled, the SLDA proposed in this section only applies to the source domain data.
During the distribution matching process, it is helpful to keep samples of the same class close to each other while samples of different classes are kept far from each other. For this purpose, we partition the transformed source domain data into C categories, the cth of which has N_c samples: {y₁, ..., y_N} = {y_s11, ..., y_s1N₁, ..., y_sC1, ..., y_sCN_C}, where y_sci denotes the ith sample of class c, c = 1, ..., C, i = 1, ..., N_c. The center of the cth class is m_c = (1/N_c) Σ_{i=1}^{N_c} y_sci, and the center of all samples is m = (1/N_s) Σ_{i=1}^{N_s} y_si. (1) To increase the distance between the different classes of source domain data, the between-class scatter is defined as S_b = Σ_{c=1}^{C} N_c (m_c − m)(m_c − m)^T, which can be rewritten in the trace form tr(W^T Ψ W). (2) To improve the discriminability of same-class data in the subspace, the within-class scatter is S_w = Σ_{c=1}^{C} Σ_{i=1}^{N_c} (y_sci − m_c)(y_sci − m_c)^T, rewritten as tr(W^T Φ W). Minimizing the within-class scatter reduces the distance between same-class samples in the source domain, so that same-class data become more concentrated.
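The two scatter matrices of the SLDA regularizer can be sketched as follows (hypothetical NumPy helper; `slda_scatters` is our own name):

```python
import numpy as np

def slda_scatters(Y, labels):
    """Between-class (Sb) and within-class (Sw) scatter matrices of the
    projected source-domain data, as used in the SLDA regularizer."""
    mu = Y.mean(axis=0)
    d = Y.shape[1]
    Sb, Sw = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(labels):
        Yc = Y[labels == c]
        mc = Yc.mean(axis=0)                        # center of the cth class
        Sb += len(Yc) * np.outer(mc - mu, mc - mu)  # weighted center spread
        Sw += (Yc - mc).T @ (Yc - mc)               # spread around each center
    return Sb, Sw

rng = np.random.default_rng(0)
# Two tight, well-separated classes: Sb should dominate Sw.
Y = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)
Sb, Sw = slda_scatters(Y, labels)
assert np.trace(Sb) > np.trace(Sw)
```

Maximizing tr(Sb) while minimizing tr(Sw) is exactly the behavior the Ψ and Φ terms induce on the learned subspace.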

Solution
Since identifying the target domain data in the subspace with KNN relies on the source domain data, optimizing only the distribution of the target domain data in the subspace is not effective. By adding the SLDA regularization term, the overall objective function of our proposed SLDARKHS-DA can be formulated as follows: min_W tr(W^T N W), s.t. W^T K W = I_d, where N = L + λ(Φ − Ψ) + µI. This model can be solved via the properties of the generalized Rayleigh quotient. Since K is symmetric positive definite, it can be decomposed as K = UΣU^T, where UU^T = I and Σ is the diagonal matrix of eigenvalues of K. Substituting W = UΣ^{−1/2}V turns the problem into min_V tr(V^T M V), s.t. V^T V = I_d, where M = Σ^{−1/2}U^T(L + λ(Φ − Ψ) + µI)UΣ^{−1/2}. The columns of V are the eigenvectors corresponding to the d smallest eigenvalues of the matrix M.

Computational Complexity
The computational complexity of the SLDARKHS-DA algorithm (Algorithm 1) consists of three main components: (1) the optimization of the eigen-problem, (2) the computation of Φ and Ψ and (3) the computation of K and L. Expressed in big-O notation, the eigen-decomposition costs O(dn²) (d is the dimension of the subspace and n the number of samples), computing Φ and Ψ costs O(dn²), and computing K and L costs O(n²). Therefore, the total complexity of the algorithm is O((4d + 1)n²).

Algorithm 1: SLDARKHS-DA
Input: source domain data set X s and target domain data set X t , label information of X s ; parameters λ, µ and subspace dimension d. Output: projection matrix W and the label information of X t .

1. Combine the source domain data set and the target domain data set: X = X_s ∪ X_t;
2. Compute the matrices K, L, Φ and Ψ;
3. Eigendecompose the matrix M and select the d eigenvectors corresponding to the smallest eigenvalues to construct the projection matrix W;
4. Project both X_s and X_t to obtain the data in the subspace, y_si = W^T K_iCol and y_ti = W^T K_(N_s+i)Col. Classify y_ti in the subspace by KNN, with y_si used as the reference.
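The whole pipeline can be sketched in NumPy as follows. This is a hedged reading of Algorithm 1 (the function names `mmd_matrix`, `slda_matrices` and `sldarkhs_da` are ours, and the kernel-weighted matrix forms of L, Φ and Ψ are an assumption, since the paper leaves their explicit construction implicit):

```python
import numpy as np

def mmd_matrix(Ns, Nt):
    """Coefficient matrix L0: tr(W^T K L0 K W) is the squared MMD in the subspace."""
    e = np.concatenate([np.full(Ns, 1.0 / Ns), np.full(Nt, -1.0 / Nt)])
    return np.outer(e, e)

def slda_matrices(labels, N):
    """Within-class (Phi0) and between-class (Psi0) coefficient matrices for the
    labeled source rows, zero-padded to N x N (unlabeled target rows contribute 0)."""
    Ns = len(labels)
    Phi0, Psi0 = np.zeros((N, N)), np.zeros((N, N))
    Phi0[:Ns, :Ns] = np.eye(Ns)
    Psi0[:Ns, :Ns] = -1.0 / Ns
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        Phi0[np.ix_(idx, idx)] -= 1.0 / len(idx)
        Psi0[np.ix_(idx, idx)] += 1.0 / len(idx)
    return Phi0, Psi0

def sldarkhs_da(K, labels, Nt, d, lam=1.0, mu=1e-3):
    """Minimize tr(W^T N W) with N = K L0 K + lam * K (Phi0 - Psi0) K + mu * I,
    subject to W^T K W = I_d, via whitening and a standard eigen-decomposition."""
    Ns, n = len(labels), len(K)
    L0 = mmd_matrix(Ns, Nt)
    Phi0, Psi0 = slda_matrices(labels, n)
    Nmat = K @ L0 @ K + lam * K @ (Phi0 - Psi0) @ K + mu * np.eye(n)
    vals, U = np.linalg.eigh(K)
    S = U / np.sqrt(np.maximum(vals, 1e-12))  # S = U Sigma^{-1/2}, so S^T K S = I
    _, V = np.linalg.eigh(S.T @ Nmat @ S)     # eigenvalues in ascending order
    return S @ V[:, :d]                       # W from the d smallest eigenvectors

# Toy two-class source and a shifted target.
rng = np.random.default_rng(0)
Xs = np.vstack([rng.normal(0.0, 0.5, (10, 3)), rng.normal(3.0, 0.5, (10, 3))])
Xt = rng.normal(1.5, 0.5, (8, 3))
X = np.vstack([Xs, Xt])
K = np.exp(-0.5 * np.sum((X[:, None] - X[None, :]) ** 2, axis=-1))
W = sldarkhs_da(K, np.array([0] * 10 + [1] * 10), Nt=8, d=4)
assert np.allclose(W.T @ K @ W, np.eye(4), atol=1e-5)  # orthonormality constraint
```

Step 4 would then classify the rows of K @ W belonging to the target domain with KNN, using the source rows as the reference set.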

Description of BCI IV 2a Data
Nowadays, many BCI data recognition tasks are handled by domain adaptation methods. Previous studies [20,21] have shown the effectiveness of domain adaptation approaches in reducing the differences in data distribution between subjects or sessions.
The BCI competition IV dataset 2a is commonly used as a benchmark in the BCI domain. It contains EEG recordings from nine subjects [29], who were asked to imagine moving four parts of their body: the left hand, right hand, foot and tongue. Addressing multi-class problems is an important challenge for BCI systems.

Domain Adaptation Subspace Learning Based on Sparse Regularized RKHS
Considering the complexity of BCI data, when the transformed data are projected into the subspace, dimensionality reduction is performed and some irrelevant data features should be discarded. We select the most favorable features to improve the recognition performance, construct the subspace with a row-sparse projection matrix and minimize the geometric offset of the data.
We use the L2,1 norm to constrain the matrix W so that its rows are sparse. The L2,1 norm of a matrix A is defined as ||A||_{2,1} = Σ_i ||A_i||_2, where A_i is the ith row of A. Minimizing the L2,1 norm makes the L2 norm of each row as small as possible, driving as many rows as possible to zero to achieve row sparsity.
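For concreteness, a minimal NumPy sketch of the L2,1 norm (the function name is ours):

```python
import numpy as np

def l21_norm(A):
    """||A||_{2,1}: the sum of the Euclidean norms of the rows of A."""
    return float(np.sum(np.linalg.norm(A, axis=1)))

A = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [1.0, 0.0]])
# Row norms are 5, 0 and 1; zero rows cost nothing, which is what
# drives whole rows of W to zero and yields row sparsity.
assert l21_norm(A) == 6.0
```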

Solution
By adding the L2,1 norm of the matrix W as a sparse regularization term, the overall objective function of our proposed SLDARKHS-DA with sparse regularization is formulated as follows and solved as shown in Algorithm 2.
where tr(W^T L W) represents the MMD distance between the source domain samples and the target domain samples in the subspace. The purpose of tr(W^T (Φ − Ψ) W) is to increase the inter-class divergence and reduce the intra-class divergence of the data in the subspace. ||W||_{2,1} is the L2,1 norm of the matrix W, which sparsifies the rows that make up the basis of the subspace. The regularization term tr(W^T W) avoids over-fitting of the model. The constraint W^T K W = I_d serves two purposes: (1) to make the basis of the subspace orthonormal, and (2) to avoid trivial solutions and ensure that W is not 0. λ, µ and γ are the coefficients of the regularization terms.
To solve the optimization problem, we introduce a Lagrange multiplier Λ and form the Lagrange function of the model. Taking the derivative of the Lagrange function with respect to W and setting it to zero yields the stationarity condition (L + λ(Φ − Ψ) + µI + γG)W = KWΛ. Note that ||W||_{2,1} is not smooth, so we use its subgradient, expressed through the diagonal matrix G whose ith diagonal element equals 1/(2||W_i||_2), where W_i denotes the ith row of W. Thus, W can be obtained by computing the eigenvectors corresponding to the d smallest eigenvalues of this generalized eigen-problem, after which G is updated from the new W; the two steps alternate until convergence.
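The diagonal subgradient matrix G can be sketched as follows (hypothetical NumPy helper; the small `eps` guards against division by zero for all-zero rows):

```python
import numpy as np

def l21_subgradient(W, eps=1e-12):
    """Diagonal matrix G with G_ii = 1 / (2 ||W_i||_2), so that the
    (sub)gradient of ||W||_{2,1} with respect to W is 2 G W."""
    row_norms = np.linalg.norm(W, axis=1)
    return np.diag(1.0 / (2.0 * np.maximum(row_norms, eps)))

W = np.array([[3.0, 4.0],
              [1.0, 0.0]])
G = l21_subgradient(W)
assert np.allclose(np.diag(G), [0.1, 0.5])
# 2 G W recovers the row-wise unit directions W_i / ||W_i||_2.
assert np.allclose(2 * G @ W, [[0.6, 0.8], [1.0, 0.0]])
```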

Algorithm 2: SLDARKHS-DA (Sparse)
Input: source domain data set X s and target domain data set X t , label information of X s ; parameters γ, λ, µ and subspace dimension d. Output: projection matrix W and the label information of X t .

1. Combine the source domain data set and the target domain data set: X = X_s ∪ X_t;
2. Compute the matrices K, L, Φ, Ψ and initialize G = I;
3. Solve the eigen-problem with the current G and select the d eigenvectors corresponding to the smallest eigenvalues to construct W;
4. Update G from the rows of W;
Repeat steps 3 and 4 until convergence or the maximum number of iterations is reached;
5. Project both X_s and X_t to obtain the data in the subspace, y_si = W^T K_iCol and y_ti = W^T K_(N_s+i)Col. Classify y_ti in the subspace by KNN, with y_si used as the reference.

Experiments
To verify the fitness of the SLDA regularization term, we first conducted experiments on four commonly used standard benchmark datasets (faces, objects, handwritten digits and text), adding the SLDA regularization term to each comparison algorithm; for example, we added SLDA to TCA and compared the result with the original TCA. Second, we tested the performance of our SLDARKHS-DA on the BCI competition IV dataset 2a and compared it with classical baselines published in recent years. All the methods were programmed in MATLAB 2019 and executed on a PC (CPU: Intel i9, 3.50 GHz; memory: 16 GB). The source programs of the baseline methods used for the comparison can be downloaded from GitHub: https://github.com/viggin/domainadaptation-toolbox (TCA, downloaded in April 2021), https://github.com/minjiang/iglda (IGLDA, downloaded in April 2021) and https://github.com/lijin118/tit (TIT, downloaded in May 2021).

Baseline and Parameter Settings
We compared the proposed method with typical subspace learning methods in domain adaptation: TCA [24], IGLDA [25] and TIT [26]. The details of each baseline method are summarized below: (1) TCA [24] is a typical example of RKHS subspace learning in domain adaptation. TCA converts the original data into an RKHS, finds a subspace of this space to reduce the dimensionality of the data, and uses MMD to measure the distance between the two domains. Its objective function is as follows: where tr(W^T W) prevents over-fitting of the data and the constraint W^T KHKW = I_m maintains the data characteristics of the source domain and target domain data.
(2) Jiang et al. proposed the integration of global and local metrics for domain adaptation learning (IGLDA) [25], which adds a regularization term to TCA. IGLDA introduces data label information to keep the source domain and target domain data as close as possible while preserving the geometric properties of the source domain data. The objective function of this method is as follows: where L_w represents the within-class divergence matrix.
(3) Li et al. proposed another approach in 2019 [26] (TIT), which combines the manifold regularization term used in SSTCA, a regularization term that selects features, and a regularization term that minimizes the variance in the target domain. In the experiments, sample selection is performed iteratively, and the final objective function is as follows.
where tr(W^T KCKW) can be used to minimize the variance of the target domain data in the subspace, and ||W||_{2,1} selects the features from the data.
In our experiments, we use the K-nearest neighbor (KNN) classifier to evaluate the performance of the proposed method. We obtained the parameter setting with the best classification accuracy through a grid search over the range [10^{−15}, 10^2] and applied the same parameter selection process to the baseline methods. For simplicity and clarity, we chose an acceptable common set of parameters, as shown in Table 2.
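The grid search itself is straightforward. The sketch below is a toy stand-in (the `evaluate` scoring function is fabricated for illustration and would be replaced by fitting the model with the candidate coefficients and measuring KNN accuracy):

```python
import math
from itertools import product

# Log-spaced candidate values, mirroring the search range [1e-15, 1e2]
# used in the experiments (coarsened here for brevity).
grid = [10.0 ** p for p in range(-15, 3, 4)]

def evaluate(lam, mu):
    """Stand-in for: fit SLDARKHS-DA with (lam, mu) and return the KNN accuracy."""
    return -abs(math.log10(lam) + 3) - abs(math.log10(mu) - 1)  # toy score

best_lam, best_mu = max(product(grid, grid), key=lambda p: evaluate(*p))
assert math.isclose(best_lam, 1e-3) and math.isclose(best_mu, 10.0)
```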
In the experiments of the first four data sets, we used the linear kernel as kernel function, while in the experiments of the BCI data set, we used the radial basis function (RBF).

Face Recognition
In this section, we evaluate the effectiveness of the proposed algorithm on face recognition tasks. The AR dataset [30] is widely used in experiments in the field of face recognition. We select a subset of the AR data set with a total of 2600 face images. This dataset consists of 100 people, 50 men and 50 women, and each subject has 26 images. The AR face images were captured twice, with an interval of two weeks between the two shots. Each shot collected 13 pictures of different modes, with different light brightness, light angle, facial expression and occlusion (sunglasses or scarf). In this experiment, each face image was normalized to a gray-level image of a fixed number of pixels. The training set and test set directly used the gray values and vectorization of the images as the input. According to the different shooting times and states, the 26 face images of each subject corresponded to 26 patterns, numbered 1a to 1m and 2a to 2m. Figure 1 shows a sample of the AR data set with 26 face images from the same subject: 1a-1m belong to one group, while 2a-2m are from the other group, taken under the same conditions two weeks later. We used the notation C1.a and C2.a to represent the collections of natural-expression face images. In this section, C1.a and C2.a are combined as the source domain data set X_S.

From the other 24 patterns except C1.a and C2.a, the first 18 patterns were selected, including C1.b to C1.j and C2.b to C2.j, and the data of these patterns were taken as 18 target domain data sets, giving 18 classification tasks.

Table 3. Data distribution in the original space and subspace for face recognition.
In the first experiment, we studied how our proposed SLDARKHS-DA affected the distribution of the source and target domains. We took C1.f and C2.f as the target domain X_T, respectively, and calculated the distance between the geometric centers of the source domain and the target domain, as well as the variance of the source domain data, in both the original space and the subspace, to demonstrate the effectiveness of domain adaptation. As shown in Table 3, in the experiment with C1.f as the target domain, after the data of the source and target domains are transformed from the original space to the subspace, not only do the geometric centers of the two almost coincide, but the variance of the data is also greater. As the distance between the classes in the source domain becomes larger, the classification efficiency of the KNN algorithm improves. The geometric distribution of the data with C2.f as the target domain shows a similar change.

In the second experiment, we regard the data in the source domain as labeled and the data in the target domain as unlabeled. As before, we combined C1.a and C2.a as the source domain and set the target domains as C1.b to C1.j and C2.b to C2.j. A total of 30% of the images from the target domain were randomly selected as training data, and the transformation function was obtained from X_T and X_S by the domain adaptation methods. We set KNN as the default classifier, and the subspace dimensionality was fixed at 90 for the classification experiments. The experimental results are shown in Table A1 in Appendix A.
We noted that the direct classification of the X_T data with KNN was worse than the KNN classification of the mapped Γ(X_T) data obtained by RKHS-DA. Moreover, the classification accuracy of the SLDARKHS-DA algorithm, which combines the SLDA regularization term with RKHS-DA, improved by 4% on average. Furthermore, we combined the proposed SLDA regularization term with each baseline algorithm to form a new domain adaptation algorithm and compared it with the original baseline. As shown in Table A3, in terms of average classification accuracy, SLDA improves TCA by 3.1%, IGLDA by 1.2% and TIT by 2.8%, respectively. This agrees with our idea of RKHS subspace learning: the baseline algorithms are similar to our RKHS-DA algorithm in terms of domain adaptation, so SLDA also improves their performance.
In the third experiment, to investigate how the dimensionality of the subspace of the feature map affects the final performance of our algorithm, we combined C1.a and C2.a as the source domain and took C1.f as the target domain for the classification experiment. We mapped the data into subspaces of different dimensionalities, from 10 to 100 with a step size of 10; the other parameters were set to the same values as in the second experiment. The experimental results are shown in Figure 2. We observed that the larger the subspace dimension, the higher the classification accuracy. However, the accuracy curve tends to flatten out as the subspace dimension keeps increasing. Compared with the original baseline algorithms, the baselines combined with the SLDA regularization term achieved higher accuracy across subspace dimensions, which indicates that the SLDA regularization term proposed in this paper is robust and stable.

Object Recognition
Caltech-256 (C, collected by the California Institute of Technology), Amazon (A, images downloaded from amazon.com in October 2020), webcam (W, low-resolution images captured by a web camera) and DSLR (D, high-resolution images captured by a digital SLR camera) together form the four-domain adaptation (4DA) benchmark, the most popular in domain adaptation. The four domains share 10 common categories, so the 4DA dataset has 10 categories. Each category in each domain has 8 to 151 samples, for a total of 2533 images. Figure 3 shows some samples selected from the 4DA.
For all datasets, we followed [31] to preprocess the data using a similar feature extraction and experimentation protocol. By randomly selecting two different domains as the source and target domains, a total of 4 × 3 = 12 cross-domain object recognition tasks were constructed. In each task, we randomly selected a certain number of samples from each category as the source domain data for the training set.
When D was the source domain, we drew 8 samples from each category; when A, C and W were the source domains, we drew 20 samples from each category. Then, the source domain samples were used as the training set data and the target domain samples were used as the test set.

The results of the first experiment are shown in Appendix A, Table A2. Compared with the original space, the distance between the geometric centers of the source and target domains is greatly reduced in the subspace, while the variances of the source and target domain data are greatly increased.
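The two quantities reported here can be computed as follows. We take D(S, T) as the Euclidean distance between the geometric centers (means) of the two domains and Var(·) as the mean squared distance of a domain's samples to its own center; these exact definitions are our assumption, since the paper does not spell them out:

```python
import numpy as np

def center_distance(S, T):
    """D(S, T): distance between the geometric centers of source S and target T."""
    return float(np.linalg.norm(S.mean(axis=0) - T.mean(axis=0)))

def domain_variance(X):
    """Var(X): mean squared distance of the samples to their geometric center."""
    return float(((X - X.mean(axis=0)) ** 2).sum(axis=1).mean())
```

A smaller D(S, T) together with larger Var(S) and Var(T) indicates that the subspace aligns the domains while preserving the spread of each domain's data.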
The results of the classification experiments are shown in Appendix A, Table A3. The classification accuracy of the SLDARKHS-DA algorithm with the addition of the SLDA regularization term is about 2% higher than that of the KNN and RKHS-DA algorithms.

Handwritten Numeral Classification
In this section, the USPS+MNIST dataset is used for handwritten digit classification experiments. The USPS dataset consists of 7291 training images and 2007 test images of size 16 × 16. The MNIST dataset has a training set of 60,000 examples and a test set of 10,000 examples of size 28 × 28.
Both the MNIST and USPS datasets contain grayscale images of the 10 handwritten Arabic numerals. These images were rescaled to a size of 16 × 16, so that the digits are centered in the image and all images have the same size. Figure 4 shows examples from the MNIST and USPS datasets.

The experiment in this section was conducted on a subset of the MNIST+USPS dataset, which consisted of two parts: the first part was 2000 images randomly selected from the MNIST dataset, and the second part was 1800 images randomly selected from the USPS dataset.
Similarly, all the images in the subset were uniformly resized to 16 × 16 pixels, and the gray values of the pixels were used as a feature vector to represent each image. Thus, the samples of MNIST and USPS lie in the same 256-dimensional feature space. To speed up the experiments, we constructed a dataset MNIST vs. USPS: we randomly selected 50 sets of digit images from MNIST, 500 images in total, to form the source data, and used all the images in USPS to form the target data.
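Turning each 16 × 16 grayscale image into a 256-dimensional pixel-intensity feature vector is a simple flattening step (a sketch; the resizing itself would be done beforehand with an image library such as Pillow):

```python
import numpy as np

def images_to_features(images):
    """Flatten a batch of 16x16 grayscale images of shape (n, 16, 16)
    into 256-dimensional pixel-intensity feature vectors (n, 256)."""
    images = np.asarray(images, dtype=np.float64)
    assert images.shape[1:] == (16, 16), "expects pre-resized 16x16 images"
    return images.reshape(len(images), -1)
```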
As on the other datasets, in the first experiment we fixed the dimension of the subspace at 150. After our algorithm's transformation, D(S, T) was reduced from 2.96 to 1.15, Var(S) changed from 3.6 to 3.2, and Var(T) changed from 4 to 6. Although Var(S) became smaller, which is not what we expected, the ratio Var(S)/D(S, T) became larger, so the result still verifies the effectiveness of our algorithm.
In the second experiment, we repeated the classification experiment 100 times with a KNN classifier and used a linear kernel function. The subspace dimensions were set from 30 to 150 with a step size of 20. Figure 5 shows the experimental results. The SLDARKHS-DA algorithm with the SLDA regularization term improves the classification accuracy of the RKHS-DA algorithm by about 3%, which is much higher than the accuracy of applying KNN directly (51.18%). Similar results were found for the other baseline methods: the accuracy of each baseline algorithm with the SLDA regularization term was higher than that of the original baseline algorithm. In addition, the variation of the subspace dimension had little effect on the classification accuracy of each algorithm.
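The repeated-classification protocol can be sketched with scikit-learn's KNeighborsClassifier. This is a sketch under assumptions: we take the 100 repetitions to come from re-drawing the random source subsample each run, and the neighbor count `k` is our choice, as the paper does not state it:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def repeated_knn_accuracy(Xs, ys, Xt, yt, n_runs=100, n_train=500, k=1, seed=0):
    """Average target-domain accuracy of KNN over n_runs random
    subsamples of the source domain used as the training set."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_runs):
        idx = rng.choice(len(Xs), size=min(n_train, len(Xs)), replace=False)
        clf = KNeighborsClassifier(n_neighbors=k).fit(Xs[idx], ys[idx])
        accs.append(clf.score(Xt, yt))
    return float(np.mean(accs))
```

In the actual experiments the features would first be projected into the learned subspace before fitting the classifier.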

Text Categorization
The Reuters-21578 dataset (Dai et al., 2007) contains three cross-domain document categorization tasks: Orgs vs. People, Orgs vs. Places and People vs. Places. The notation "Orgs vs. Places" indicates that the Orgs subtype serves as the source domain data and the Places subtype as the target domain. There are 1237 source documents and 1208 target documents for Orgs vs. People, 1016 source documents and 1043 target documents for Orgs vs. Places, and 1077 source documents and 1077 target documents for People vs. Places. We randomly selected 50% of the source domain data as the training set and used all the target domain data as the test set.
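The 50% random split of the source domain can be sketched with scikit-learn (a sketch; the stratification by class label is our assumption, not stated in the paper):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_source(Xs, ys, seed=0):
    """Randomly keep 50% of the source-domain documents as the training
    set; the full target domain serves as the test set elsewhere."""
    X_train, _, y_train, _ = train_test_split(
        Xs, ys, train_size=0.5, stratify=ys, random_state=seed)
    return X_train, y_train
```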
In the first experiment, we set the subspace dimensions from 10 to 50 with a step size of 10, and calculated the variance and the distance between the source domain and the geometric center of the target domain. The experimental results are shown in Appendix A, Table A4.
In the second experiment, we used the KNN classifier to verify the effect of the methods.
The experimental results are shown in Appendix A, Table A5. In almost all the dimensions and all the experiments, the recognition rate of RKHS-DA was improved to some extent by the SLDA regularization term. In addition, SLDA also improved the classification accuracy of the other baseline methods used for comparison.


Motor Imagery Classification
As described in Section 3, we used the 2a dataset from the BCI competition IV, which consists of nine subjects [32]. The subjects were sitting in an armchair in front of a computer screen. As shown in Figure 6, at the beginning of the trial (t = 0 s), a fixation cross appeared on the black screen. In addition, a short acoustic warning tone was presented. After two seconds (t = 2 s), a cue appeared and stayed on the screen for 1.25 s. This prompted the subjects to perform the desired motor imagery task (left hand, right hand, both feet and tongue). No feedback was provided. The subjects were asked to carry out the motor imagery task until the fixation cross disappeared from the screen at t = 6 s.
For each subject, two sessions of data were recorded on two different days, with 288 trials per session and 72 trials per category. We captured data from 1.5 to 6.5 s for each trial. The recorded EEG signals were sampled at 250 Hz and filtered by a fifth-order Butterworth filter in the 8-30 Hz frequency band.
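The preprocessing described above, a fifth-order Butterworth band-pass at 8-30 Hz on 250 Hz recordings with a 1.5-6.5 s trial window, can be sketched with SciPy (a minimal sketch; `raw` is an assumed (channels, samples) array and `onset` the trial-start sample index):

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 250  # sampling rate in Hz

def preprocess_trial(raw, onset, fs=FS, t_start=1.5, t_end=6.5,
                     band=(8.0, 30.0), order=5):
    """Band-pass filter a recording (channels x samples) with a
    fifth-order Butterworth filter and cut the 1.5-6.5 s trial window."""
    nyq = fs / 2.0
    b, a = butter(order, [band[0] / nyq, band[1] / nyq], btype="band")
    filtered = filtfilt(b, a, raw, axis=-1)  # zero-phase filtering
    i0 = onset + int(t_start * fs)
    i1 = onset + int(t_end * fs)
    return filtered[:, i0:i1]
```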
We took A01T to A09T as the source domains and A01E to A09E as the target domains, respectively, for a total of 9 experiments. In this experiment, the PCA algorithm was added as a baseline method for the dimensionality reduction of the original spatial data, and KNN was used as the default classifier for all algorithms. We set the parameter γ of the sparse regularization term to 10⁻², and the other parameter settings are shown in Section 4.1.
Since our ultimate goal was to compare the performance of our method with the other baseline methods on the BCI 2a dataset, in the first experiment, we fixed the dimension of our subspace to 25 and performed the classification on different subjects from A01 to A09 for comparison. Figure 7 shows that our method outperforms the baseline algorithm in all experiments, except for the result recorded in A04.
In the second experiment, we compared our method with the baseline methods in terms of dimensionality reduction. We used A01T as the source domain and A01E as the target domain, and varied the dimensionality of the subspace from 10 to 110. As shown in Figure 8, our SLDARKHS-DA (Sparse) outperforms the other baseline methods.
In the third experiment, we investigated the impact of our algorithm on the source domain data distribution. For visualization purposes, we applied t-SNE to both the original data and the transformed data. Figure 9a shows a two-dimensional representation of the original data vectors, i.e., each point in the figure represents a trial. Moreover, Figure 9b shows the representation of the transformed data vectors obtained by our SLDARKHS-DA (Sparse). In Figure 9a,b, the points are colored according to the mental task. We observe that the source domain data are chaotic in the original space, while our algorithm separates the four classes of data, which improves the accuracy of the KNN classifier.
In the fourth experiment, Table A6 in Appendix A shows the classification accuracy of the various original baseline algorithms and the baseline algorithms after adding the SLDA regularization term.
Firstly, from the perspective of the domain adaptation framework, RKHS-DA and the other domain adaptation baseline algorithms outperform PCA+KNN, and SLDARKHS-DA (Sparse) outperforms the other algorithms. In addition, the SLDA regularization term yields certain improvements over the other baseline algorithms used for comparison.
In the fifth experiment, we conducted experiments on subject A01, with A01T as the source domain and A01E as the target domain. We set the range of the subspace dimension from 10 to 110. The average classification results are shown in Figure 10. Based on these results, we observe that the classification performance of the algorithm with the SLDA regularization is better than that of the original baseline algorithm in all dimensions.
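The t-SNE visualization used in the third experiment can be sketched with scikit-learn (the perplexity and initialization are our assumptions; the paper only states that t-SNE was applied to the original and transformed data):

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_2d(X, seed=0):
    """Project trial feature vectors to 2-D with t-SNE for plotting,
    e.g., one scatter point per trial, colored by mental task."""
    X = np.asarray(X, dtype=np.float64)
    return TSNE(n_components=2, perplexity=30, init="pca",
                random_state=seed).fit_transform(X)
```

Running this once on the original feature vectors and once on the subspace-transformed vectors produces the two panels of Figure 9.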

Conclusions
In this paper, we reorganized the RKHS subspace learning framework based on the theory of RKHS, which consists of functions defined on the original data space instead of a Hilbert space that is independent of the original data space. We first proposed an SLDA regularization term based on the discriminant analysis of the source domain data. The regularization term increases the inter-class distance and decreases the intra-class distance. Based on the SLDA and the RKHS subspace learning framework, we proposed a domain adaptation algorithm. For the BCI application, we selected the most desired data to form the basis of the subspace by adding a sparse constraint, i.e., the L2,1 norm. Extensive experiments validated the effectiveness of our algorithm.
In the future, we plan to continue our work by pursuing several avenues. First, SLDARKHS-DA uses parametric kernels for the MMD, and we plan to develop an efficient algorithm for kernel choice in SLDARKHS-DA. Second, to improve the sensitivity of the MI data, we will use the frequency domain features of the MI data. Moreover, we plan to extend SLDARKHS-DA to other BCI experiments with cross-subject settings.


Conflicts of Interest:
The authors declare no conflict of interest.
Appendix A

Table A1. Classification accuracy (in %) of face recognition in different tasks (1(a) and 2(a) for source domain).

Table A2. Data distribution in the original space and subspace for object recognition.

Table A4. Data distribution in the original space and subspace for text categorization (D(S, T), Var(S) and Var(T) in the original space and in the subspace, for tasks including People vs. Places).

Table A5. Classification accuracy (in %) of text categorization in different tasks.