Deep and Wide Transfer Learning with Kernel Matching for Pooling Data from Electroencephalography and Psychological Questionnaires

Motor imagery (MI) promotes motor learning and encourages brain–computer interface systems that entail electroencephalogram (EEG) decoding. However, a long period of training is required to master brain rhythms’ self-regulation, resulting in users with MI inefficiency. We introduce a parameter-based approach of cross-subject transfer-learning to improve the performances of poor-performing individuals in MI-based BCI systems, pooling data from labeled EEG measurements and psychological questionnaires via kernel-embedding. To this end, a Deep and Wide neural network for MI classification is implemented to pre-train the network from the source domain. Then, the parameter layers are transferred to initialize the target network within a fine-tuning procedure to recompute the Multilayer Perceptron-based accuracy. To perform data-fusion combining categorical features with the real-valued features, we implement stepwise kernel-matching via Gaussian-embedding. Finally, the paired source–target sets are selected for evaluation purposes according to the inefficiency-based clustering by subjects to consider their influence on BCI motor skills, exploring two choosing strategies of the best-performing subjects (source space): single-subject and multiple-subjects. Validation results achieved for discriminant MI tasks demonstrate that the introduced Deep and Wide neural network presents competitive performance of accuracy even after the inclusion of questionnaire data.


Introduction
Motor imagery (MI) is related to the process of mentally generating a quasi-perceptual experience in the absence of any appropriate external stimuli [1]. MI practice promotes children's motor learning and has been suggested to provide benefits in enhancing the musicality of untrained children [2,3], in evaluating screen-time and cognitive development [4], and in improving attentional focus and rehabilitation [5-7], among others. MI-based brain-computer interface (BCI) systems often entail electroencephalogram (EEG) decoding because of its ease of use, safety, high portability, relatively low cost, and, most importantly, high temporal resolution [8]. EEG is a non-invasive and portable neuroimaging technique that records brain electrical signals over the scalp, reflecting the synchronized oscillatory activity originating from the pyramidal cells of the sensorimotor cortex. However, evoked responses in frequency bands depend not only on the eliciting stimuli but also on every individual. In addition, in MI-based cognitive tasks, the evoked event-related de/synchronization of the sensorimotor area is perturbed by other background brain processes or even artifacts, seriously reducing the signal-to-noise ratio [9]. Hence, to generate steady evoked control patterns, long training is required to master the self-regulation of brain rhythms. As a result, the percentage of users with MI inefficiency (or BCI-illiteracy) is high enough to limit this technology to lab environments, even though MI research has been under way for many years [10].
In practice, the MI ability can be assessed to determine to what extent a user engages in a mental representation of movements, mainly through self-report questionnaires developed explicitly for this purpose [11]. Yet, there is very little evidence of a reliable correlation between the classification accuracy and the questionnaire scores. Several reasons may account for this [12,13]: weak and ambiguous self-interpretation in understanding the questionnaire instructions, laboratory paradigms restricted to a narrow class of motor activity, timeline limitations in guaranteeing consistent mental states, and difficulty in learning features from subjects with BCI-illiteracy, among others. Hence, although psychological assessment and questionnaires are probably the most accepted and validated methods in medical contexts [14], their inclusion in the automated prediction of BCI skills remains very rare due to their disputed reliability and reproducibility [15]. To enhance the predictive utility, a joint analysis of different imaging modalities is performed, which may explain the discovered relationships between anatomical, functional, and electrophysiological properties of the brain [16,17]. Nonetheless, besides the issues that may arise from the questionnaire implementation, multimodal analysis poses a challenging problem in terms of combining categorical data with imaging measurements, facing the following restrictions [18,19]: different spatial and temporal sampling rates, noninstantaneous and nonlinear coupling, low signal-to-noise ratios, a lack of interpretable results, a still undetermined optimal combination of individual modalities, and the need for effective dimensionality reduction to enhance the discriminability of the extracted multi-view features [20].
Another approach to improve BCI skills is to perform several training sessions in which participants learn how to modulate their sensorimotor rhythms appropriately, relying on the spatial specificity of MI-induced brain plasticity [21]. However, collecting extensive data is time-consuming and mentally exhausting during a prolonged recording session, deteriorating the measurement quality. To overcome this lack of subject-specific data, transfer learning-based approaches are increasingly integrated into MI systems, using pre-existing information from other subjects (source domain) to facilitate the calibration for a new subject (target domain) through a set of shared features among individuals, under the assumption of a unique data acquisition paradigm [22-24]. Therefore, to exploit the advantages of transfer learning in EEG signal analysis, strategies for individual difference matching and data requirement reduction are needed to fine-tune the model for the target subject [25]. For example, in [26], the authors use pre-trained models (e.g., VGG16 and AlexNet) as the starting point for model fitting. This strategy limits the amount of training data required to support the MI classification task. In this case, they compute the continuous wavelet transform of the EEG signals to convert the time-series data into an equivalent image representation that can be used to train deep networks. Similarly, Zhang et al. in [27] proposed five schemes for the adaptation of a deep convolutional neural network-based EEG-BCI system for decoding MI. Specifically, each procedure fine-tunes a pre-trained model to enhance the evaluation performed on a target subject. Recently, approaches based on weighted instances [28] and domain adaptation [29] have been studied. In the first case, instance-based transfer learning is used to select the source domain data that is most similar to the target domain to assist the training of the target domain classification model. In the second case, researchers extend deep transfer learning techniques to the EEG multi-subject training case. In particular, they explore the possibility of applying the maximum-mean discrepancy to better align the distributions of features from individual feature extractors in an MI-based BCI system. Nonetheless, to extract sets of shared features among subjects with a similar distribution, there is a need to adequately handle two main limitations of subject-dependent and subject-independent training strategies: small-scale datasets and a significant difference in signals across subjects [30]. In fact, several issues remain as challenges to obtaining adequate consistency of the feature space and probability distribution of training and test data while avoiding negative transfer effects [31,32]: extracting features from the available multimodal data that are effective enough to discriminate between MI tasks, and choosing the transferable objects and transferability measures along with the assignment of their weights [33].
Here, we introduce a parameter-based approach of cross-subject transfer learning for improving poor-performing individuals in MI-based BCI systems, pooling data from labeled EEG measurements and psychological questionnaires via kernel-embedding. For sharing the discovered model parameters, as presented in [34], an end-to-end Deep and Wide neural network for MI classification is implemented that is first fed with data from the whole trial set to pre-train the network on the source domain. Then, the parameter layers are transferred to initialize the target network within a fine-tuning procedure to recompute the Multilayer Perceptron-based accuracy. To perform data fusion combining the categorical features with the real-valued features, we implement stepwise kernel-matching via Gaussian embedding, resulting in similarity matrices that hold a relationship with the BCI inefficiency clusters. For evaluation purposes, the paired source-target sets are selected according to the inefficiency-based clustering of subjects to consider their influence on BCI motor skills, exploring two strategies for choosing the best-performing subjects (source space): single-subject and multiple-subjects, as delivered in [35]. The validation results for discriminating MI tasks show that the proposed Deep and Wide neural network gives promising accuracy performance, even after including questionnaire data. Therefore, this deep learning framework with cross-subject transfer learning is a promising way to address small-scale data limitations by drawing on the best-performing subjects.
The remainder of this paper is organized as follows: Section 2 presents the materials and methods, and Section 3 describes the experiments and the corresponding results, with emphasis on their interpretation. Lastly, Section 4 highlights the conclusions and recommendations.

2D Feature Representation of EEG Data
From the EEG database collected by a $C$-channel montage, we build a single matrix for the $n$-th trial, $\{X_n \in \mathbb{R}^{C \times T}, \lambda_n \in \{0, 1\}^{\Lambda}\}_{n=1}^{N}$, that contains $T$ time points at the sampling rate $F_s$. Along with the EEG data, we also create the one-hot output vector $\lambda_n$ over $\Lambda \in \mathbb{N}$ labels. For evaluation in discriminating MI tasks, the proposed transfer learning model is assessed on a trial basis. That is, we extract the feature sets per trial $\{X_n^r \in \mathbb{R}^{C}\}_{r=1}^{R}$, incorporating a pair of EEG-based feature representation approaches ($R = 2$): Continuous Wavelet Transform (CWT) and Common Spatial Patterns (CSP), as recommended for Deep and Wide learning frameworks in [36].
Further, the extracted multi-channel features (using the CSP and CWT methods) are converted into a two-dimensional topographic interpolation $\mathbb{R}^{C} \to \mathbb{R}^{W \times H}$ to preserve their spatial interpretation, mapping into a two-dimensional circular view for every extracted trial feature set. As a result, we obtain the labeled 2D data $\{Y_n^z \in \mathbb{R}^{W \times H}, \lambda_n : n \in N\}$, where $Y_n^z$ is a single-trial bi-domain t-f feature array, termed topogram, extracted from every $z$-th set. Of note, the triplet $z = \{r, \Delta t, \Delta f\}$ (with $z \in Z$) indexes a topogram estimated for each included domain principle $r \in R$ at the time-segment $\Delta t \in T$, and within the frequency-band $\Delta f \in F$.
Besides, we estimate the local spatial patterns of relationships from the input topographic set through the square-shaped layer kernel arrangement $\{K_{i,l}^{z} \in \mathbb{R}^{P \times P} : i \in I_l, z \in Z\}$ (as in straightforward convolutional networks), where $P$ holds the kernel size and the number of kernels $I_l$ varies at each layer. The stepwise 2D-convolutional operation is then performed over the input topogram, $Y^z$, as follows:

$$\hat{Y}_{L}^{z} = (\phi_{L}^{z} \circ \cdots \circ \phi_{1}^{z})(Y^{z}),$$

where $\phi_{l}^{z}(\hat{Y}_{l-1}^{z}) = \gamma_l(K_{i,l}^{z} \otimes \hat{Y}_{l-1}^{z} + B_{i,l}^{z})$ is the convolutional layer, followed by a non-linear activation function $\gamma_l : \mathbb{R}^{W_l^z \times H_l^z} \to \mathbb{R}^{W_l^z \times H_l^z}$, $\hat{Y}_{l}^{z} \in \mathbb{R}^{W_l^z \times H_l^z}$ is the resulting 2D feature map of the $l$-th layer (adjusting $\hat{Y}_{0}^{z} = Y^{z}$), and the arrangement $B_{i,l}^{z} \in \mathbb{R}^{W_l^z \times H_l^z}$ denotes the bias matrix. Notations $\circ$ and $\otimes$ stand, respectively, for the function composition and the convolution operator.
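As an illustration only, the layer composition described above can be sketched in NumPy for a single topogram and one kernel per layer. The ReLU activation, the "valid" border handling, the array sizes, and all function names are assumptions made for this sketch and are not taken from the authors' implementation; as in most deep learning toolboxes, the convolution is computed as a cross-correlation.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """'Valid' 2D cross-correlation of one feature map with one P x P kernel."""
    p = kernel.shape[0]
    out_h = image.shape[0] - p + 1
    out_w = image.shape[1] - p + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + p, j:j + p] * kernel)
    return out


def conv_layer(feature_map, kernel, bias):
    """One layer: gamma(K (x) Y + B); ReLU is an assumed choice for gamma."""
    return np.maximum(conv2d_valid(feature_map, kernel) + bias, 0.0)


def conv_stack(topogram, kernels, biases):
    """Compose the layers in order, starting from Y_0 = topogram."""
    feature_map = topogram
    for kernel, bias in zip(kernels, biases):
        feature_map = conv_layer(feature_map, kernel, bias)
    return feature_map


rng = np.random.default_rng(0)
topogram = rng.normal(size=(8, 8))                      # hypothetical W x H = 8 x 8
kernels = [rng.normal(size=(3, 3)) for _ in range(2)]   # P = 3, two layers
biases = [np.zeros((6, 6)), np.zeros((4, 4))]           # one bias matrix per layer
feature_map = conv_stack(topogram, kernels, biases)     # final map has shape (4, 4)
```

Each "valid" layer shrinks the map by P - 1 pixels per dimension, which is one way to obtain final maps smaller than the input topogram.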

Multi-Layer Perceptron Classifier Using 2D Feature Representation
In this stage, we employ the deep learning-based classifier function $\phi : \mathbb{R}^{W \times H} \to \Lambda$, developed through a Multilayer Perceptron (MLP) Neural Network, that predicts the label probability vector $\tilde{v} \in [0, 1]^{\Lambda}$, as below [37]:

$$\tilde{v} = (\varphi_{D} \circ \cdots \circ \varphi_{1})(u_0), \quad (2a)$$

$$\Theta^{*} = \arg\min_{\Theta} L(\tilde{v}_n, \lambda_n \mid \Theta), \quad (2b)$$

where $D$ denotes the number of MLP layers and $\Theta$ is the training parameter set. For computation at each layer, the hidden layer vector is iteratively updated by the rule (composition function-based approach of deep learning methods) $u_d = \varphi_d(u_{d-1})$, for which the initial state vector is flattened by concatenating all matrix rows across the $z$ and $I_l$ domains as $u_0 = [\mathrm{vec}(\hat{Y}_{L}^{z}) : \forall z \in Z]$. The input vector $u_0$ has size $G = W' H' Z \sum_{l \in L} I_l$, holding $W' < W$, $H' < H$. Besides, the training parameter set is estimated by minimizing the loss function $L : \mathbb{R}^{\Lambda} \times \mathbb{R}^{\Lambda} \to \mathbb{R}$, whose gradients are employed to update the weights and biases of the proposed Deep and Wide neural network over a certain number of training epochs. Remarkably, we refer to our method as Deep and Wide because of the inclusion of a set of different topograms (along the time and frequency domains) from the multi-channel features extracted using the CSP and CWT algorithms. The solution is implemented by mini-batch gradient descent, as commonly used in deep learning methods, equipped with automatic differentiation and back-propagation [38].

Transfer Learning with Added Questionnaire Data
In EEG analysis based on Deep Learning, transfer learning is a common approach for enhancing the classifier performance: a pre-trained neural network model, equipped with the label probability vector $\tilde{v}$, is adjusted so that a domain distance measurement $\delta(\cdot, \cdot) \in \mathbb{R}^{+}$ between the paired domains stays lower than a given value $\epsilon \in \mathbb{R}^{+}$, approximating the source $Y^{(s)}$ to the target $Y^{(t)}$ [24], as follows:

$$\delta(Y^{(s)}, Y^{(t)}) < \epsilon.$$

Here, we propose to conduct the transfer learning procedure to learn a target prediction function that is enhanced by the addition of the categorical assessments of a psychological questionnaire data matrix, $S$, along with the stepwise multi-space kernel-embedding, including EEG-based features, to perform the whole network parameter optimization in Equation (2b). Besides, for interpretation purposes, the paired source-target sets are selected according to the inefficiency-based clustering of subjects.
Therefore, to combine the categorical data, $S$, with the real-valued feature map set extracted from EEG as described in Sections 2.1 and 2.2, $Y$, we compute the tensor product space between the corresponding kernel-matching representations, $\kappa_{\hat{U}}$ and $\kappa_S$, as suggested in [39]:

$$\kappa = \kappa_{\hat{U}} \otimes \kappa_S,$$

where $J = \sum_{m=1}^{M} N_m$ ($N_m$ holds the trials for the $m$-th subject), $\kappa_S \in \mathbb{R}^{J \times J}$ is the kernel matrix directly extracted from the questionnaire data $S \in \mathbb{R}^{J \times N_Q}$ ($N_Q$ is the questionnaire vector length), and $\kappa_{\hat{U}} \in \mathbb{R}^{J \times J}$ is the kernel topographic matrix estimated from the projected version $\hat{U} = U \Upsilon^{*}$, with $\hat{U} \in \mathbb{R}^{J \times G'}$ (holding that $G' < G$). Here, $U \in \mathbb{R}^{J \times G}$ is the initial data matrix built by concatenating across the trial and subject sets all flattened vectors $u_0^{*}$, which are computed by adjusting the optimized parameters $\Theta^{*} = \{K_{i,l}^{q*}, B_{i,l}^{q*}\}$, and $\Upsilon^{*} \in \mathbb{R}^{G \times G'}$ is the projection matrix introduced to maximize the similarity between both estimated kernel-embeddings derived from the labeled EEG measurements of MI responses, namely, one from the one-hot label vectors, $\kappa_V \in \mathbb{R}^{J \times J}$, and another from the topographic features, $\kappa_U \in \mathbb{R}^{J \times J}$.
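A minimal sketch of this fusion step follows, assuming that the product is taken over paired trials, so that the Gram matrix of the tensor-product kernel reduces to the element-wise (Schur) product of the two J x J kernel matrices. The toy data, the fixed length scales, and the function names are hypothetical.

```python
import numpy as np


def gaussian_gram(x, sigma):
    """Gaussian Gram matrix of the rows of x for a given length scale."""
    sq_norms = np.sum(x ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * x @ x.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))


def product_kernel(k_eeg, k_quest):
    """Gram matrix of the product kernel over paired samples (element-wise product)."""
    if k_eeg.shape != k_quest.shape:
        raise ValueError("both kernel matrices must be J x J")
    return k_eeg * k_quest


rng = np.random.default_rng(0)
u_hat = rng.normal(size=(6, 4))                       # toy projected EEG features, J = 6
s = rng.integers(1, 6, size=(6, 3)).astype(float)     # toy questionnaire scores, N_Q = 3
kappa = product_kernel(gaussian_gram(u_hat, 1.0), gaussian_gram(s, 2.0))
```

By the Schur product theorem, the element-wise product of two positive semidefinite Gram matrices is again positive semidefinite, so the fused matrix remains a valid kernel.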
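In particular, we match both estimated kernel-embeddings through the centered kernel alignment (CKA), as detailed in [40]:

$$\Upsilon^{*} = \arg\max_{\Upsilon} \mathrm{CKA}(\kappa_{\hat{U}}, \kappa_V), \qquad \mathrm{CKA}(\kappa_{\hat{U}}, \kappa_V) = \frac{\langle \bar{\kappa}_{\hat{U}}, \bar{\kappa}_V \rangle_F}{\|\bar{\kappa}_{\hat{U}}\|_F \, \|\bar{\kappa}_V\|_F},$$

where $\bar{\kappa}$ denotes the centered version of a kernel matrix, $\langle \cdot, \cdot \rangle_F$ is the Frobenius inner product, and the kernel $\kappa_V$ is obtained from the matrix of predicted label probabilities $V \in \mathbb{R}^{J \times \Lambda}$, built by concatenating across the trial and subject sets all label probability vectors $\tilde{v}_{mn}$.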
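The alignment score itself can be sketched as follows. This only evaluates the standard CKA between two given Gram matrices; it does not perform the optimization over the projection matrix. The linear kernels and the toy data are assumptions chosen to keep the example short.

```python
import numpy as np


def center_kernel(k):
    """Double-center a Gram matrix: H K H with H = I - (1/n) 1 1^T."""
    n = k.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    return h @ k @ h


def cka(k_a, k_b):
    """Centered kernel alignment between two Gram matrices of equal size."""
    ka, kb = center_kernel(k_a), center_kernel(k_b)
    return np.sum(ka * kb) / (np.linalg.norm(ka) * np.linalg.norm(kb))


rng = np.random.default_rng(0)
u = rng.normal(size=(8, 5))                   # toy feature vectors, J = 8
labels = np.array([0, 1, 0, 1, 1, 0, 0, 1])
v = np.eye(2)[labels]                         # one-hot label matrix
k_u = u @ u.T                                 # linear kernels (assumption for brevity)
k_v = v @ v.T
alignment = cka(k_u, k_v)                     # lies in [0, 1] for PSD kernels
```

The score is invariant to rescaling either kernel and equals one when the two centered kernels are proportional.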

Experimental Set-Up
Training of the proposed Deep and Wide neural network model for transfer learning to improve the classification of MI responses, including EEG and questionnaire data, encompasses the following stages (see Figure 1): (i) preprocessing and spatial filtering of EEG signals, followed by the extraction of 2D features from the input topogram set using the convolutional network (see Section 2.1); (ii) MLP classification applying the extracted 2D feature maps (see Section 2.2); (iii) cross-subject transfer learning, including stepwise multi-space kernel-embedding of the real-valued and categorical variables (see Section 2.3). The paired source-target sets are selected according to the inefficiency-based clustering of subjects to consider their influence on BCI motor skills.
Nonetheless, the classifier performance can decrease since the extracted representation sets may still involve irrelevant and/or similar features. Therefore, for reducing the data complexity, we accomplish dimensionality reduction by evaluating a widely-used unsupervised feature extractor of Kernel PCA (KPCA) that provides a representation of data points' global structure [41].
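For reference, a minimal NumPy sketch of KPCA on a precomputed Gram matrix is given below. The Gaussian kernel with unit length scale, the number of components, and the toy data are assumptions; the sketch only projects the training points.

```python
import numpy as np


def kernel_pca(k, n_components):
    """Project training points onto the leading components of a centered Gram matrix."""
    n = k.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    k_centered = h @ k @ h
    eigvals, eigvecs = np.linalg.eigh(k_centered)        # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:n_components]
    scales = np.sqrt(np.maximum(eigvals[top], 0.0))      # clip round-off negatives
    return eigvecs[:, top] * scales, k_centered


rng = np.random.default_rng(0)
x = rng.normal(size=(10, 3))                             # toy feature set
sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
k = np.exp(-sq_dists / 2.0)                              # Gaussian kernel, sigma = 1
z, k_centered = kernel_pca(k, n_components=2)            # reduced representation
z_full, _ = kernel_pca(k, n_components=10)               # all components
```

Keeping all components reproduces the centered Gram matrix as the inner products of the projected points, which is a convenient sanity check.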

Database Description and Preprocessing
GigaScience (publicly available at http://gigadb.org/dataset/100295 (accessed on 9 July 2021)): This acquisition holds EEG data recorded by a BCI experimental paradigm of MI movement collected from 52 subjects (though only 50 are available). Data were acquired with a 10-10 system of C = 64 electrodes at a 512 Hz sampling rate, collecting 100 individual trials (each one lasting 7 s) for either task (left or right hand). The MI paradigm begins with a fixation cross presented on a black screen for 2 s. Next, a cue instruction appeared randomly on the screen for 3 s to ask each subject to imagine moving the fingers, starting from the forefinger and proceeding to the little finger, touching each to their thumb. A blank screen was then shown at the beginning of a break period, lasting randomly between 4.1 and 4.8 s. For each MI class, these procedures were repeated 20 times within a single testing run.
GigaScience also collected subjective answers to physiological and psychological questionnaires (categorical data), intending to investigate the evidence on performance variations to work out strategies of subject-to-subject transfer in response to intersubject variability. To this end, all subjects were invited to fill out a questionnaire during three different phases of the MI paradigm timeline: before beginning the experiment (each subject answered N Q = 15 questions); after every run within the experiment (N Q = 10 questions were answered); and at the experiment's termination (N Q = 4 answered questions, {Q i : i = 1, 2, 3, 4}).
As preprocessing, we filtered each raw channel $x_n^c \in \mathbb{R}^{T}$ within the frequency range of interest using a fifth-order Butterworth band-pass filter. Further, we carry out a bi-domain short-time feature extraction (i.e., CWT and CSP, see Section 2.1), as performed in [42]. In the former extraction, the wavelet coefficients are assumed to provide a compact representation pinpointing the EEG data energy distribution, yielding a time-frequency map in which the amplitudes of individual frequencies (rather than frequency bands) are represented. In the latter extraction, the goal of CSP is to employ a linear relationship to transfer a multichannel EEG dataset into a subspace with a lower dimension (i.e., latent source space), aiming to enhance the class separability by maximizing the labeled covariance in the latent space. In both extraction cases, we fix the sliding short-time window length parameter $\tau \in \mathbb{R}^{+}$ according to the accuracy achieved by the baseline Filter Bank CSP algorithm, which is performed using the whole range of considered frequency bands. The sliding window is adjusted to $\tau = 2$ s with a step size of 1 s as an appropriate choice to extract $N_\tau = 5$ EEG segments, as performed in [43]. Since electrical brain activities provoked by MI tasks are commonly related to the $\mu$ and $\beta$ rhythms [44], the spectral range is split into the following bandwidths of interest: $\Delta f \in \{\mu \in [8, 12], \beta \in [12, 30]\}$ Hz. The CWT feature set is computed by the Complex Morlet function frequently applied in spectral EEG analysis, fixing the scaling value to 32. Additionally, we set the number of CSP components to $3\Lambda$ ($\Lambda \in \mathbb{N}$ holds the number of MI tasks), utilizing a regularized sample covariance estimation.

MLP Classifier Performance Fed by 2D Features
At this stage, we carry out the extraction of 2D feature maps from the input topogram set using the convolutional network. Further, the extracted 2D features feed the MLP-based classifier with the parameter tuning shown in Table 1, and the resulting layer-by-layer model architecture is illustrated in Figure 2. For implementation purposes, we apply the Adam optimization algorithm with fixed parameters: a learning rate of $1 \times 10^{-3}$, 200 training epochs, and a batch size of 256 samples. Additionally, the mean squared error (MSE) is chosen as the loss function $L(\cdot)$ in Equation (2b), that is, $L(\tilde{v}_n, \lambda_n \mid \Theta) = E\{(\tilde{v}_n - \lambda_n)^2\}$. To speed up the learning procedure, the Deep and Wide neural network framework is written in Python (TensorFlow toolbox and Keras API) and trained on multiple GPU devices at Google Colaboratory. The codes are made available at a public GitHub repository (https://github.com/dfcollazosh/DWCNN_TL (accessed on 9 July 2021)).
As the performance measure, the classifier accuracy $A_c \in [0, 1]$ is computed by the expression $A_c = (T_P + T_N)/(T_P + T_N + F_P + F_N)$, where $T_P$, $T_N$, $F_P$, and $F_N$ are true-positives, true-negatives, false-positives, and false-negatives, respectively. In this case, we split each subject's dataset, using 90% of the trials to build the training set and the remaining 10% for the test set. Further, the individual training trial set is randomly partitioned by a stratified 10-fold cross-validation to generate a validation trial partition.

Table 1. Detailed Deep and Wide architecture of transfer learning. Layer FC8 accomplishes the regularization procedure using the Elastic-Net configuration, while layers FC8 and OU10 apply a kernel constraint adjusted to max_norm(1.). Notation: $O = R N_\Delta N_\tau$, where $N_\Delta$ denotes the number of filter banks; $P$ is the number of hidden units (neurons); $C$ is the number of classes; and $I_L$ stands for the number of kernel filters at layer $L$. Notation $\| \cdot \|$ stands for the concatenation operator.

[Table 1 columns: Layer, Assignment, Output Dimension, Activation, Mode]

Version July 3, 2021, submitted to Sensors.

[Figure 3. Horizontal axis: Subject ID]

For the tested subject set, Figure 3 displays the accuracy that the MLP-based classifier produces when fed by just the 2D feature set extracted before. From the obtained accuracy values, we identify the performance considered inadequate in brain-computer interface systems, as detailed in [45]. Namely, we cluster the individual set into the following three groups with distinctive BCI skills: (i) A group of individuals achieving the highest accuracy with very low variability of neural responses (colored in green). (ii) A group that reaches superior classifier performance but with some response fluctuations (yellow color). (iii) A group that produces modest performance along with a high unevenness of responses (red color).

Performed Stepwise Multi-Space Kernel Matching
Algorithm 1 presents the procedures to complete the validation of the suggested transfer learning with multi-space kernel-embedding. We implement the Gaussian kernel to represent the available data because of its universal approximating ability and mathematical tractability. The length scale hyperparameter $\sigma \in \mathbb{R}^{+}$, ruling the variance of the described data, is adjusted to their median estimate. The following steps (3 and 4) accomplish the pairwise kernel matching, firstly between the sets of EEG measurements $U$ and label probabilities $V$. To this end, the CKA matching estimator is fed by the concatenated EEG features together with the predicted label probabilities to perform alignment across the whole subject set, empirically fixing the parameter $G'$ to 50 according to the number of subjects in this experiment. In the second matching, we encode all the available categorical information about the psychological and physiological evaluation with the relevant feature set resulting from CKA, by projecting them onto a common matrix space representation using the kernel/tensor product. Note that the data $\hat{U}$ projected by CKA are also embedded. We also perform dimensionality reduction of the feature sets generated after stepwise-matching using Kernel Principal Component Analysis (KPCA) to evaluate the representational ability.
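The Gaussian embedding with a median-based length scale can be sketched as below. We assume here that the "median estimate" refers to the median of the pairwise Euclidean distances between samples (the common median heuristic); the function names and the toy data are illustrative.

```python
import numpy as np


def pairwise_distances(x):
    """Euclidean distances between all rows of x."""
    sq_norms = np.sum(x ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * x @ x.T
    return np.sqrt(np.maximum(sq_dists, 0.0))


def median_length_scale(x):
    """Median of the off-diagonal pairwise distances (assumed median heuristic)."""
    dists = pairwise_distances(x)
    upper = np.triu_indices_from(dists, k=1)
    return float(np.median(dists[upper]))


def gaussian_kernel(x, sigma=None):
    """Gaussian Gram matrix; sigma defaults to the median length scale."""
    if sigma is None:
        sigma = median_length_scale(x)
    return np.exp(-pairwise_distances(x) ** 2 / (2.0 * sigma ** 2))


x = np.array([[0.0], [1.0], [3.0]])     # toy data with pairwise distances 1, 3, 2
sigma = median_length_scale(x)          # median of {1, 2, 3}
k = gaussian_kernel(x)
```

Tying the length scale to the data in this way avoids a separate grid search for each modality.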
Further, we estimate the subject similarity matrix from the extracted feature sets, aiming to assess the domain distance between the source-target pairs, which are to be selected from different clusters of BCI inefficiency. Since the clustering of individuals relies on the ordered accuracy vector, we introduce the following neighboring similarity matrix $\mathbf{\Delta}_{\xi}$ with pairwise metric elements computed from the matrices $\xi = \{\kappa, \kappa_{\mathrm{KPCA}}\}$, as follows:

$$\Delta_{\xi}(m, m') = \mathrm{cov}\big(\mathrm{seq}(\Delta_{\xi}(m, \forall m')), \mathrm{seq}(\Delta_{\xi}(m', \forall m))\big), \quad \Delta_{\xi}(m, m') \in \mathbf{\Delta}_{\xi} \in \mathbb{R}^{M \times M}, \quad (7)$$

where notations $\mathrm{cov}(\cdot, \cdot)$ and $\mathrm{seq}(\Delta(m, \forall m')) \in \mathbb{R}^{M}$ stand for, respectively, the covariance operator and the sequence composed of all $\forall m' \in M$ elements of row $m$ ranked in decreasing order of the achieved MLP-based accuracy. The rationale for applying the covariance over the ranked row vectors of $\mathbf{\Delta}_{\xi}$ is to preserve the similarity information between neighboring subjects.
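One possible reading of Equation (7) is sketched below: the rows and columns of a subject-level similarity matrix are reordered by decreasing accuracy, and the sample covariance between every pair of ranked rows is computed. The M x M input matrix, the use of the unbiased sample covariance, and the toy values are assumptions, not details confirmed by the paper.

```python
import numpy as np


def neighboring_similarity(similarity, accuracy):
    """Covariance between accuracy-ranked rows of a subject similarity matrix."""
    order = np.argsort(-np.asarray(accuracy))       # best-performing subject first
    ranked = similarity[np.ix_(order, order)]       # reorder rows and columns
    return np.cov(ranked), order                    # rows are treated as variables


rng = np.random.default_rng(0)
a = rng.random((4, 4))
similarity = (a + a.T) / 2.0                        # toy symmetric similarity, M = 4
accuracy = [0.7, 0.9, 0.6, 0.8]                     # toy MLP-based accuracies
delta, order = neighboring_similarity(similarity, accuracy)
```

The returned matrix is indexed in ranked order, so its first row corresponds to the subject with the highest accuracy.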
Algorithm 1 Validation procedure of the proposed approach for transfer learning with stepwise, multi-space kernel matching. † Dimensionality reduction is an optional procedure performed for comparison purposes.
Input data: EEG measurement $U$, predicted label probabilities $V$, questionnaire data $S$, $\forall m \in M$.

Figure 4 displays the similarity matrix computed by the tensor product $\mathbf{\Delta}_{\xi}$ (left column), evidencing some of the relations between the clustered subjects, though depending on the evaluated questionnaire data. Thus, the collection $Q_1$ yields two groups, while $Q_4$ exhibits three partitions. Instead, $Q_2$ and $Q_3$ do not cluster the individuals precisely. After KPCA dimensionality reduction, however, the proximity assessments $\Delta_{\mathrm{KPCA}}$ tend to make the neighboring association more solid, resulting in clusters of subjects with more distinct feature representations, as shown in the middle column for each questionnaire.
Under the assumption that the closer the association between the paired source-target couples, the more effectively their cross-subject transfer learning is implemented, we estimate the marginal distance $\delta_{\xi}(m) \in \mathbb{R}^{+}$ from either version, $\mathbf{\Delta}_{\xi}$ or $\Delta_{\mathrm{KPCA}}$, by averaging the neighboring similarity of each subject over the whole set, as follows:

$$\delta_{\xi}(m) = E\{\Delta_{\xi}(m, m') : \forall m' \in M\}, \quad (8)$$

where the notation $E\{z : \forall \zeta\}$ stands for the expectation operator computed across the whole set $\{\zeta\}$. The right column displays the marginal values $\delta_{\xi}(m)$, showing that each individual is differently influenced by the stepwise multi-space kernel matching of electroencephalography to psychological questionnaires $Q_i$. These results are in agreement with the subject cluster properties evaluated above. Thus, $Q_1$ and $Q_4$, having more discernible partitions, yield feature representations that are more even across the subject set, while $Q_2$ and $Q_3$ provide irregular representations. One more aspect is the effect of dimensionality reduction, which improves the representation of the $Q_1$ and $Q_2$ cases. In contrast, the use of KPCA tends to worsen the global similarity level of individuals.

Estimation of Pre-Trained Weights for Cross-Subject Transfer Learning
The following step is to pair the representation learned on a source subject with the target subject to which it is transferred. Starting from the subject partitions according to their BCI skills obtained above in Section 3.2, we select the candidate sources (i.e., the source space $Y^{(s)}$) within the best-performing subjects (Group I), while the target space $Y^{(t)}$ comprises the worst-performing participants (Group III). Here, we validate two strategies for choosing subjects from the source space (Group I):

(a) Single source-single-target, when we select the subject of Group I achieving the highest value of the domain distance measurement, computed as follows:

$$\max_{m \in \text{Group I}} \delta_{\xi}(m \mid Q_i). \quad (9)$$

Once the source-target pairs are selected, the pre-trained weights are computed from each designated source subject to initialize the Deep and Wide neural network, rather than introducing a zero-valued starting iterate, thus enabling a better convergence of the training algorithm. Note that the condition in Equation (9) depends on $Q_i$, meaning that distinct sources are selected for each questionnaire data set.

(b) Multiple sources-single-target, when the selected subjects of Group I achieve the four highest domain distance values. In this case, the Deep and Wide initialization procedure applies the pre-trained weights estimated from the concatenation of the source topograms.

Figure 5 details the classification performance reached using the proposed transfer learning approach for either strategy of selecting the candidate sources, through a radar diagram that includes all target subjects (axes). For comparison's sake, the graphical representation depicts, with a line colored in black, the MLP-based accuracy (see Figure 3) as a reference for assessing the classifier performance gain due to the applied transfer learning approach, together with the accuracy achieved by the features extracted by the tensor product (blue line) and KPCA (magenta line), $\kappa$ and $\kappa_{\mathrm{KPCA}}$, respectively.
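The two source-selection strategies described above can be sketched as follows, assuming that the marginal distance of each subject is the row mean of the neighboring similarity matrix and that Group I is given as a list of subject indices. The matrix values, the indices, and the function names are hypothetical; the toy multi-source call picks two sources instead of the four used in the paper.

```python
import numpy as np


def marginal_distance(delta):
    """Average neighboring similarity of each subject over the whole set."""
    return np.asarray(delta).mean(axis=1)


def select_sources(delta, group_one, n_sources=1):
    """Pick the Group I subjects with the highest marginal values."""
    marginal = marginal_distance(delta)
    ranked = sorted(group_one, key=lambda m: marginal[m], reverse=True)
    return ranked[:n_sources]


# Toy neighboring similarity matrix for M = 4 subjects (illustrative values).
delta = np.array([
    [1.0, 0.2, 0.4, 0.1],
    [0.2, 1.0, 0.6, 0.3],
    [0.4, 0.6, 1.0, 0.5],
    [0.1, 0.3, 0.5, 1.0],
])
group_one = [0, 1, 3]                                          # hypothetical best performers
single_source = select_sources(delta, group_one)               # single source-single-target
multi_source = select_sources(delta, group_one, n_sources=2)   # multiple sources-single-target
```

Because the similarity matrix depends on the fused questionnaire, the same call may return different sources for each questionnaire data set.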
The odd columns (first and third) present the Single source-Single-target diagrams, while the even ones are for the Multiple sources-Single-target strategy. In all cases of questionnaire data $Q_i$, the transfer learning with stepwise, multi-space kernel matching increases, on average, the baseline classifier performance of the subjects belonging to Group III, with modest accuracy and high unevenness of responses. Nevertheless, there are still some issues to be clarified. The accuracy gain achieved by the Single source-Single-target strategy is lower than the one achieved by the Multiple sources-Single-target strategy, but the number of subjects that benefit from the transfer learning approach is higher. On the contrary, the presence of multiple sources halves the number of poor-performing subjects that are improved, though they produce accuracy gain values of up to 25% (see subject #45). The next aspect to address is the contribution of categorical data in terms of classifier performance. The first two radars in the bottom row (labeled as EEG) present the accuracy improvement achieved by the features extracted from the EEG measurements after CKA alignment ($\mathrm{CKA}(\kappa_U, \kappa_V)$), which underperforms the transfer learning that adds questionnaires.
Topographic maps of representative subjects (computed with and without transfer learning) using just the feature map information, presenting the learned weights with assumed meaningful activity.
Regarding the dimensionality reduction additionally considered, its delivered accuracy (outlined in magenta) strongly depends on the specific case of fused data $Q_i$. Thus, while $Q_1$ and $Q_2$ benefit from the KPCA procedure, $Q_4$ reduces the performance achieved. This result becomes evident in the two bottom radars (third and fourth) that depict the effect of transfer learning averaged across the data $Q_i$, showing that the classifier performance of almost every target individual can be enhanced by the proposed transfer learning approach for either strategy of selecting the candidate sources. However, there are a couple of subjects (#38 and #20) for whom transfer learning had no positive impact.
Individual gain Lastly, the topographic maps shown in Figure 6 give a visual interpretation of the proposed transfer learning, which are reconstructed from the learned network weights according to the algorithm introduced in [37]. We compare the estimated approaches under the assumption that the discriminating power is directly proportional to the reconstructed weight value. Thus, the top row shows the topograms of the single-source strategy built from both bandwidths (β and µ) within different intervals of the neural response. As seen, the source selected (subject #3) performs a weight set with the spatial distribution related to the sensorimotor area, focusing their neural responses within the MI segment correctly. Next to S #3, we present the target's topograms that benefit the most from the transfer learning, holding weights with a spatial distribution that is a bit blurred. The effect of the single-source transfer learning approach is the reduction of the weight variability, as shown in the adjacent topograms. However, the source effectiveness to reduce the variability is limited in the case of the low-skilled target #38 that presents many contributing weights spread all over the scalp area. Moreover, the weights appear inside the two intervals (before cue-onset and the ending segment) at which the responses elicited by MI tasks are believed to vanish. As a result, the single-source strategy yields a negative accuracy gain of Target #38 (it drops from 70% to 65%). Similar behavior is also observed in the second row, displaying the topograms of the multi-source strategy performed by the most benefitting (T#11) and the worst-achieving target (T#22), respectively. However, the inclusion of multiple sources leads to weights with a sparse distribution, as observed in the topograms of the selected subjects (S#3, 14,41,28). This effect may explain the small number of targets improved by the multi-source strategy. 
In order to clarify this point, the bottom row displays the corresponding spatial distribution performed by the multi-source strategy when including the whole subject set of Group I, resulting in weights that are very weak and scattered. Moreover, compared with the first two rows, the all-subjects source approach of the bottom row makes the related transfer learning deliver the worst performance averaged across the target subject set.

Discussion and Concluding Remarks
Here, we introduce a cross-subject transfer learning approach for improving the classification accuracy of elicited neural responses, pooling data from labeled EEG measurements and psychological questionnaires through a stepwise multi-space kernel-embedding. For validation purposes, the transfer learning is implemented in a Deep and Wide framework, for which the source-target sets are paired according to BCI inefficiency, showing that the classifier performance of almost every target individual can be enhanced using single or multiple sources.
From the evaluation results, the following aspects are to be highlighted:
Evaluated NN framework: The Deep and Wide learning framework is fed with the extracted 2D feature maps that support the MLP-based classifier. Table 2 compares the bi-class accuracy on the GigaScience database achieved by several recently published approaches, all of which are outperformed by the learning algorithm with the proposed transfer learning method. Of note, the MSNN algorithm presented in [46] achieves a competitive average classification accuracy, 81.0 (MSNN) vs. 82.6 (ours), but with a higher standard deviation than our proposal, 12.0 vs. 8.4. Besides, our method can include categorical data from questionnaires within the MI paradigm, which favors interpretability concerning the studied subject by coupling spatial, time, and frequency patterns from EEG data with categorical physiological and psychological data.

Table 2. Bi-class accuracy on the GigaScience database achieved by recently published approaches and by the proposal with and without transfer learning (TL).

| Approach | A_c | Interpretability |
|---|---|---|
| CSP + FLDA [47] | 67.60 | - |
| LSTM + Optical [48] | 68.2 ± 9.0 | - |
| SFBCSP [49] | 72.60 | - |
| DCJNN [50] | 76.50 | |
| MINE + EEGnet [51] | 76.6 ± 12.48 | |
| MSNN [46] | 81.0 ± 12.00 | |
| Proposal | 79.5 ± 10.80 | |
| Proposal + TL * | 82.6 ± 8.40 | |

Feature representation challenges and computational requirements: The bi-domain feature extraction (CWT and CSP) is presented to deal with the substantial intra-subject variability of patterns across trials. However, to improve their combination with categorical data, more compact feature representations can be explored, for instance, using connectivity metrics as in [52]. Besides, neural network architectures that capture the temporal dynamics and local structures of the EEG time series associated with the elicited MI responses could help upgrade our approach [53]. Moreover, it is well known that deep learning approaches require considerable computational time when training the model. For clarity, a computational time experiment is carried out. Specifically, for the parameter setting of the FC8 layer, with regularization values l1 and l2 tuned by a grid search around [0.0005, 0.001, 0.005] and the number of neurons fixed through a grid search within P = [100, 200, 300], the fitting time with and without our transfer learning approach is summarized in Table 3. As seen, the multi-source scheme requires more computation time per fold. Still, real-time BCI requirements can be satisfied once the model is trained and a new instance must be predicted. In short, for a new subject, the following stages must be carried out: (i) Store the EEG and questionnaire information of the new and training subjects. (ii) Apply our transfer learning approach, as presented in Figure 1, to couple EEG and psychological questionnaire data for the new subject. (iii) Once the model is trained, predict new instances of the studied subject as in standard deep learning methods (in this stage, only the EEG data is required).
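The FC8 grid and the per-configuration fitting time described above can be sketched as follows. This is only an illustration under stated assumptions: l1 and l2 are assumed to share the same three candidate values and to be searched jointly with the number of neurons, and `dummy_fit` is a hypothetical placeholder for the actual training routine, which is not specified here.

```python
import itertools
import time

# Candidate values quoted in the text; sharing one list for l1 and l2 is an assumption.
L1_VALUES = [0.0005, 0.001, 0.005]
L2_VALUES = [0.0005, 0.001, 0.005]
NEURONS = [100, 200, 300]  # P in the text


def time_grid(fit_fn):
    """Call fit_fn once per grid point and return a list of (config, seconds)."""
    timings = []
    for l1, l2, units in itertools.product(L1_VALUES, L2_VALUES, NEURONS):
        config = {"l1": l1, "l2": l2, "units": units}
        start = time.perf_counter()
        fit_fn(**config)
        timings.append((config, time.perf_counter() - start))
    return timings


def dummy_fit(l1, l2, units):
    """Hypothetical stand-in for fitting the FC8 layer; it only burns a few cycles."""
    return sum(i * l1 * l2 for i in range(units))


timings = time_grid(dummy_fit)
slowest_config, slowest_seconds = max(timings, key=lambda item: item[1])
print(len(timings), "configurations timed")
print("slowest configuration:", slowest_config)
```

Running the same loop once with and once without the transferred initialization would give the kind of per-fold comparison that the text attributes to Table 3.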
Multi-space kernel matching: To overcome the difficulties of data fusion that combines categorical with real-valued features, we implement stepwise kernel matching via Gaussian embedding. As a consequence, the obtained similarity matrices reveal a relationship with the BCI inefficiency clusters of subjects. Even though the association is highly influenced by each evaluated questionnaire data set, this result is essential in light of previous reports stating that no statistically significant differences can be detected between questionnaire scores and EEG-based performance [54]. Another aspect is the effect of dimensionality reduction through kernel PCA, which improves the representation only to a certain extent (in the Q_1 and Q_2 cases). To tackle the differences in subjective criteria for predicting MI performance, however, two main issues need to be addressed: the use of more appropriate kernel embeddings for categorical scores [55], and dimensionality reduction approaches that can represent data points with a wide range of structures, such as t-Distributed Stochastic Neighbor Embedding [56].
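As an illustration of the Gaussian embedding step, the sketch below builds a subject-by-subject similarity matrix from questionnaire answers. It assumes that the categorical answers are coded as ordinal integers and that the bandwidth is set by the median pairwise distance; both choices, like the toy answers, are assumptions made for this example and may differ from the configuration used in the study.

```python
import math
import statistics


def pairwise_sq_dists(X):
    """Squared Euclidean distance between every pair of rows of X."""
    return [[sum((a - b) ** 2 for a, b in zip(x, y)) for y in X] for x in X]


def gaussian_kernel(X, sigma=None):
    """Gaussian (RBF) Gram matrix; sigma defaults to the median pairwise distance."""
    d2 = pairwise_sq_dists(X)
    n = len(X)
    if sigma is None:
        dists = [math.sqrt(d2[i][j]) for i in range(n) for j in range(i + 1, n)]
        sigma = statistics.median(dists) or 1.0  # fall back to 1.0 if all rows coincide
    return [[math.exp(-d / (2.0 * sigma ** 2)) for d in row] for row in d2]


# Hypothetical ordinal-coded answers: one row per subject, one column per question.
answers = [
    [1, 4, 2],
    [1, 5, 2],
    [5, 1, 4],
    [4, 1, 5],
]

K_q = gaussian_kernel(answers)
for row in K_q:
    print([round(value, 3) for value in row])
```

Treating ordinal codes as real values is a simplification, which is in line with the remark above that more appropriate kernel embeddings for categorical scores remain to be explored.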
Cross-subject transfer learning: We conduct the transfer learning to infer a target prediction function from the previously embedded kernel spaces, selecting the paired source-target sets according to the inefficiency-based clustering of subjects. Overall, transfer learning with feature representations combined with questionnaire data increases the baseline classifier accuracy of the worst-performing subjects. Nevertheless, the source selection method impacts the classifier performance: the Multiple-source-Single-target strategy tends to produce larger accuracy improvements than the Single-source-Single-target strategy, but the number of benefited targets declines. This result points to future exploration of more effective transfer learning for BCI inefficiency, devoted to bringing the source domain as close as possible to each target space. This task also implies improving the similarity metric in Equation (7), proposed for comparing ordered-by-accuracy vectors of different BCI inefficiency clusters.
As future work, the authors plan to validate the cross-subject transfer learning approach in applications that jointly incorporate two or more databases (cross-database), significantly increasing the number of tested individuals. For instance, we plan to consider the dataset collected by the Department of Brain and Cognitive Engineering, Korea University, in [57], since this set holds questionnaire data about the physiological and psychological condition of subjects. As a result, we will obtain classification performances based on transfer learning at intra-subject and inter-dataset levels.

Funding: This research was supported by "Convocatoria Doctorados Nacionales COLCIENCIAS 727 de 2015" and "Convocatoria Doctorados Nacionales COLCIENCIAS 785 de 2017" (Minciencias). Additionally, A.M. Álvarez-Meza thanks the project "Prototipo de interfaz cerebro-computador multimodal para la detección de patrones relevantes relacionados con trastornos de impulsividad", Hermes 50835, funded by Universidad Nacional de Colombia.
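The trade-off between the size of the accuracy gain and the number of benefited targets can be summarized with a few lines of Python. The subject labels and accuracy values below are hypothetical and serve only to show the bookkeeping; they are not results from this study.

```python
def summarize_gain(baseline, transferred):
    """Per-target accuracy gain (percentage points) and aggregate statistics."""
    gains = {subject: transferred[subject] - baseline[subject] for subject in baseline}
    return {
        "gains": gains,
        "n_improved": sum(1 for gain in gains.values() if gain > 0),
        "mean_gain": sum(gains.values()) / len(gains),
        "max_gain": max(gains.values()),
    }


# Hypothetical accuracies (%) for four target subjects.
baseline = {"T_a": 60, "T_b": 70, "T_c": 55, "T_d": 65}
single_source = {"T_a": 64, "T_b": 65, "T_c": 58, "T_d": 67}
multi_source = {"T_a": 80, "T_b": 68, "T_c": 55, "T_d": 63}

single = summarize_gain(baseline, single_source)
multi = summarize_gain(baseline, multi_source)

for name, summary in (("single-source", single), ("multi-source", multi)):
    print(name, "improved targets:", summary["n_improved"],
          "mean gain:", summary["mean_gain"], "max gain:", summary["max_gain"])
```

In this toy setting, the multi-source strategy reaches a larger maximum and mean gain while improving fewer targets, which mirrors the qualitative pattern described above.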

Informed Consent Statement:
Not applicable, since this study uses duly anonymized public databases.

Data Availability Statement:
The databases used in this study are public and can be found at the following links: GigaScience: http://gigadb.org/dataset/100295, accessed on 10 March 2021.

Conflicts of Interest:
The authors declare no conflict of interest.