Subject-Independent EEG Classification of Motor Imagery Based on Dual-Branch Feature Fusion

A brain computer interface (BCI) system helps people with motor dysfunction interact with the external environment. With the advancement of technology, BCI systems have been applied in practice, but their practicability and usability are still greatly challenged. A large amount of calibration time is often required before BCI systems are used, which can consume the patient’s energy and easily lead to anxiety. This paper proposes a novel motion-assisted method based on a novel dual-branch multiscale auto encoder network (MSAENet) to decode human brain motion imagery intentions, while introducing a central loss function to compensate for the shortcomings of traditional classifiers that only consider inter-class differences and ignore intra-class coupling. The effectiveness of the method is validated on three datasets, namely BCIIV2a, SMR-BCI and OpenBMI, to achieve zero calibration of the MI-BCI system. The results show that our proposed network displays good results on all three datasets. In the case of subject-independence, the MSAENet outperformed the other four comparison methods on the BCIIV2a and SMR-BCI datasets, while achieving F1_score values as high as 69.34% on the OpenBMI dataset. Our method maintains better classification accuracy with a small number of parameters and short prediction times, and the method achieves zero calibration of the MI-BCI system.


Introduction
A BCI system is a technology that establishes a connection between the human or animal brain and an external environment for information exchange and access control without the use of peripheral nerves and muscles, providing convenience for patients with limited motor function [1][2][3]. Due to the advantages of scalp electroencephalography (EEG) with high temporal resolution, simple operation, low cost and safety, BCI systems for EEG have received increasing attention from researchers [4,5]. Some common EEG paradigms include the event-related potential (ERP), P300, the steady-state visual evoked potential (SSVEP) and motor imagery (MI). MI means that the user imagines a specific part of the body in motion without external stimulation and self-induces to make the brain generate specific control signals [2,6]. MI-BCI converts EEG signals of subjects into control signals of external devices, which directly affects the clinical application of BCI and has great research value and potential. MI-BCI not only has a promising future in assisting disabled and medical rehabilitation, but also plays a role in entertainment, education and other fields [4].
The utility of the BCI system is inversely proportional to patient specificity, and a significant amount of calibration time is spent to capture the specificity of the current subject. The calibration time is generally about 20-30 min, which is time consuming in the actual application of BCI systems, depletes patient patience, increases patient anxiety and hinders the application of BCI in practice. Therefore, reducing the calibration time or even 2), feet (class 3) and tongue (class 4). At the beginning of each trial (t = 0), a fixed cross appeared on the computer screen and was accompanied by a brief siren sound. At t = 2 s, a left, right, down or up arrow (corresponding to the left hand, right hand, feet and tongue) appeared and stayed for 1.25 s, and the subject was asked to imagine the movement of the corresponding limb in the direction of the arrow until the arrow disappeared from the screen at t = 6 s, when the subject finished the imagery task. The entire experiment was performed using 22Ag/AgCl electrodes to acquire EEG signals at a sampling frequency of 250 Hz, and each subject performed a total of 288 trials. In this paper, the left-and right-handed EEG signals were used to achieve dichotomous task evaluation. Twenty EEG channels in motor cortical regions (FC3, FC1, C1, FCz, FC2, FC4, C5, C3, C1, Cz, C2, C4, C6, CP3, CP1, CPz, CP4, P1, Pz, and P2) were selected from the raw EEG signals, and the 3 s-6 s time period was intercepted as EEG-MI data. Subjects continued to perform the motor imagery task for a selected period of 3 s.
The SMR-BCI dataset is derived from 14 healthy subjects, and the experiment includes right hand and feet motor imagery binary classification tasks. A sampling frequency of 512 Hz was used to acquire 15 channels of EEG signals. In this paper, the sampling frequency of 512 Hz was downsampled to 250 Hz using the first 4 s of continuous motion imagery data.
The OpenBMI dataset comes from 54 healthy subjects, and the experiment records dichotomous tasks for left and right hand motor imagery. From 0 to 3 s of each trial, a black fixed cross appeared in the center of the monitor to prepare the subjects for the MI task. When an arrow to the left or right appeared, the subjects performed the corresponding imagery task with a continuous grasp for 4 s. The experiment consisted of 4 phases, each with 100 trials, including 50 trials each for the imaginary left-handed and right-handed tasks. A sampling frequency of 1000 Hz was used to acquire 62 channels of EEG signals. In this paper, we selected 20 electrode channels from the motor cortex region. EEG signals were processed, the 1000 Hz sampling frequency was down-sampled to 250 Hz, and the 4 s EEG signals for motor imagery grasping were intercepted.

Spatial-Spectral Feature Pre-Extraction
All three datasets used in this study were EEG signals recorded according to a time series, and the time-domain signals that varied over time during cross-session and crosssubject motor imagery classification made the features unstable [27]. Therefore, we adopted a feature pre-extraction method to solve this problem. Motor imagery showed obvious performance in the µ and β rhythms, and the selected characteristic frequency bands were mainly distributed between 8-30 Hz [26]. We define X N = {x i } (i = 1, 2, ..., n) ∈ R T×C as the experimentally obtained raw EEG signal; Y N = {y i } (i = 1, 2, ..., n) is the corresponding category, where N denotes the number of EEG trials, T denotes the number of sampling points, and C denotes the number of EEG channels. The application of bandpass filtering is as follows: E N is the data obtained after bandpass filtering from 8-30 Hz, where X N is the original EEG signal, ⊗ is defined as a fifth-order Butterworth bandpass filter operation, and B N is a frequency band at a specific frequency. The spatial-spectral features are pre-extracted from the perspective of spectral and spatial optimization. After bandpass filtering based on the 8-30 Hz band, the CSP algorithm is used for processing [28][29][30][31], where W N ∈ R U×U is the feature matrix containing spatial feature information obtained from the E N resolution, and U is the number of spatial filters obtained by the CSP algorithm. E N contains the spectral information extracted from a specific frequency band, and the spatial-spectral features are obtained using the covariance matrix. The covariance matrix is as follows:

Classification
Our proposed MSAENet structure is shown in Figure 1, which consists of three modules: the autoencoder branch (AE branch), the multiscale branch and the loss function fusion (LFF). We conduct subject-independent and subject-dependent experiments on three publicly available datasets to demonstrate the strong generalization of the MSAENet and consider the performance of all methods using two evaluation metrics: accuracy (ACC) and F1_score. The subject-leave-one-cross validation method is used for the subjectindependent case, and, for a particular dataset containing S subjects, the session data of a subject are used as the test set. The data of the remaining S-1 subjects are used as the training set. The method uses the Keras framework (TensorFlowv2.2.0 as the backend), and the training process is implemented using NVIDIA GeForce RTX 2080 Ti with 22 GB of memory. During the training iterations, the Adam optimizer is used to optimize the loss function, and the early stop method is used for network training. The learning rate decay is 0.5 when there is no improvement in validation loss for 5 consecutive epochs, and the network stops the iteration process when there is no reduction in the validation loss for 20 consecutive epochs.
ing spatial feature information obtained from the resolution, and is the number of spatial filters obtained by the CSP algorithm.
contains the spectral information extracted from a specific frequency band, and the spatial-spectral features are obtained using the covariance matrix. The covariance matrix is as follows:

Classification
Our proposed MSAENet structure is shown in Figure 1, which consists of three modules: the autoencoder branch (AE branch), the multiscale branch and the loss function fusion (LFF). We conduct subject-independent and subject-dependent experiments on three publicly available datasets to demonstrate the strong generalization of the MSAENet and consider the performance of all methods using two evaluation metrics: accuracy (ACC) and F1_score. The subject-leave-one-cross validation method is used for the subject-independent case, and, for a particular dataset containing S subjects, the session data of a subject are used as the test set. The data of the remaining S-1 subjects are used as the training set. The method uses the Keras framework (TensorFlowv2.2.0 as the backend), and the training process is implemented using NVIDIA GeForce RTX 2080 Ti with 22 GB of memory. During the training iterations, the Adam optimizer is used to optimize the loss function, and the early stop method is used for network training. The learning rate decay is 0.5 when there is no improvement in validation loss for 5 consecutive epochs, and the network stops the iteration process when there is no reduction in the validation loss for 20 consecutive epochs.  MSAENet structure. RawData refers to EEG data downsampled to 250 Hz. Input refers to the spatial-spectral features obtained through the common spatial mode and the 8-30 Hz frequency band. Multi scale branch refers to feature extraction through convolution kernels of different sizes. The AE branch is composed of two parts: the encoder and the decoder, and the intermediate vector z is adjusted by training the error between the input data and the reconstructed data together with other loss functions. Loss function fusion refers to the combination of the central loss function, the mean square error loss function and the cross entropy loss function to backpropagate the training network.

AE Branch
The AE branch is mainly composed of two components: encoder E = p(Input) and decoder D = q(z). The encoder encodes high-dimensional data into a one-dimensional Brain Sci. 2023, 13, 1109 5 of 14 latent vector z that can maintain the robustness characteristics of the EEG-MI signal by convolutional pooling, and the whole encoder is considered as a data dimensionality reduction process; the detailed structure is shown in Table 1. Average pooling achieves weight sharing, effectively reducing the number of parameters. The BN layer speeds up network training, convergence and prevents network overfitting. The decoder and the encoder have a symmetric structure. First, the latent vector z is amplified to the appropriate size by a dense layer operation, and the one-dimensional vector is decoded into reconstruction which has the same data size as input from two transposed convolutional layers. Transposed convolution can extract effective features from the data, reduce useless features and facilitate the reconstruction of latent vectors.

Multiscale Branch
The multiscale branch uses different sub-branches to extract spatial-spectral features at different granularities and uses three different convolution kernels of 3 × 3, 5 × 5 and 7 × 7 to extract local features from the spatial-spectral data. Different granularity of convolution kernels is used to extract local features and to ensure that richer and more representative features are extracted. Then, the convolution is carried out through a 3 × 3 convolution kernel. With the deepening of the network, the extracted local features are also richer so as to achieve a wider EEG receptive field. See Table 2 for details.

Loss Function Fusion
In this study, three loss functions are used to jointly constrain the training of the whole network. The first loss function is the mean squared error (MSE) loss function. The AE branch trains this branch by minimizing the reconstruction error between the encoder input data input and the decoder output data reconstruction. The intermediate latent vector z can maintain the features in the input data. Given an input signal , its loss function is expressed as follows: where x ij is the value of the matrix in the reconstructed data.
In the MSAENet classification module, Softmax is used as a classifier, and the spliced feature vector is activated by Softmax and input to the fully connected layer. The weight coefficient of each class is obtained, which is expressed as follows: where W and b are the weight matrix and bias vector, respectively. Then, the model is trained using the Adam optimizer to minimize the cross-entropy loss, which makes the spacing between two categories larger and can effectively distinguish two different categories to achieve a good classification effect. The calculation method is as follows: L cross−entropy y, y = − ∑ M k=1 y k logy k (5) where y andŷ are the true label and classification probability, respectively. M indicates the number of categories. Then, the central loss function is used to effectively make up for the defects of the Softmax layer in the traditional CNN, not only considering the dispersion between different categories but also the cohesion of the same category, thus making the intra-class distance smaller and maintaining intra-class compactness and central loss. The function is expressed as follows: Among them, c y represents the feature center of the yth category, m i represents the feature before the fully connected layer of the multi scale branch, z i represents the feature of the fully connected layer of the AE branch network, and b represents the number of batches. It is hoped that the smaller the sum of the squares of the distance between the feature and the feature center of each sample in each batch, the better.
The MSAENet was optimized by combining the mean square error loss function in Equation (3), the cross entropy loss function in Equation (5) and the central loss function in Equation (6). The final loss function is expressed as follows: β 1 , β 2 and β 3 are used to denote the weighted hyperparameters of each loss function, and the network is trained by back-propagating the error through the integration of the three loss functions. We found that the accuracy is higher when the loss weight of the cross loss function is at its maximum in reference [24]. In this paper, we choose (0.1, 1.0, 0.1) (0.5, 1.0, 0.5) (0.5, 0.5, 0.5) (1.0, 1.0, 1.0) (1.0, 0.5, 1.0) groups of hyperparameters of loss function weights for testing, and we find that the accuracy is higher when β 1 = 0.5, β 2 = 1.0 and β 3 = 0.5. We observe that a larger classification loss weight for Softmax results in slightly higher classification accuracy, a result that is more consistent with those in the relevant study [24].
The EEGNet is a lightweight CNN framework for classification in different BCI paradigms. First, a 1 × 200 convolution kernel is used to extract the temporal information from the original data EEG. Then, the spatial features are extracted from the n channels using an n × 1 convolution kernel. Finally, the separable convolution summarizes each feature map separately and optimally combines the outputs, which can effectively reduce the number of fitting parameters [32].

DeepConvNet
The DeepConvNet proved to be effective in handling EEG-MI classification. The network consists of 4 blocks: the first block contains two convolutional layers and Batch-Normalization, using the maximum pooling layer to reduce the number of parameters; the final dropout layer prevents overfitting. The last three blocks have the same structure, with the only difference from the first block being that there is only one convolutional layer. Band-pass filtering (fifth-order Butterworth band-pass filter) is performed on the original EEG signal between 8 and 30 Hz [33].

MIN2Net
The MIN2Net adopts an autoencoder network combined with a momentum learning mechanism and demonstrates advanced performance in subject independence. The framework consists of an encoder and a decoder. Different linear convolution kernels are used in the encoder to extract spatial-temporal features, and the autoencoder is backpropagated through the momentum learning mechanism and the error of the reconstruction function [24].

SSCNN
Compared with subject dependence, the SSCNN shows stronger advantages in subject independence. The network obtains spatial-spectral features by extracting the spatial features of 30 frequency bands, selects the spatial-spectral features of the first 20 frequency bands through mutual information, sends them to a three-layer convolutional network, and, finally, splices the classification through the fully connected layer [5].

Ablation Method
To prove the validity of the dual branch, we only keep the first branch of the dual branch, which extracts spatial-spectral features at different granularities through different receptive fields, and we define the network as the MultiScaleNet; the second branch is the AE, which learns the features of the input data as much as possible through the potential vector z in between. We define it as the AENet.
To test the effectiveness of multiple scales in the network structure, 3 × 3 convolutional kernels are used instead of different scale convolutions in all the proposed networks, which are defined as the SingleScaleNet.
The central loss function can increase the compactness within the class. This study introduces the central loss function to make up for the deficiencies in the Softmax loss function. It removes the central loss function to verify its validity. The network structure is called the WithoutLossNet.
To demonstrate the effectiveness of feature pre-extraction, the raw data are directly used as the input data of the MSAENet, which is called the RawDataNet.

Visisual
Deep learning is a black-box model, and information about the intermediate processes is not directly observable to us. To visualize the features of deep learning, we use the t-SNE method [34] to embed the fully connected layer of the deep learning model into a two-dimensional space for visualization and compare the differences in the features learned by various methods.

Results
The experimental results of the subject-independent comparison experiments on the three publicly available datasets are shown in Table 3. On both the BCIIV2a and SMR_BCI datasets, the MSAENet is significantly higher than the other four advanced comparison methods in two evaluation metrics, i.e., accuracy and F1_score, with an accuracy rate of 72.50% and an F1_score of 72.23% on SMR-BCI. However, in the OpenBMI dataset, MIN2Net has the best performance with the highest values on both the ACC and F1_score metrics, while the EEGNet, the DeepConvNet, the SSCNN and the MSAENet have comparable results. The F1-Score, as a comprehensive indicator, balances the impact of precision and recall. Using just one of the two indicators of the value of a large indicator is not useful; it must be the value of the two indicators at the same time. The corresponding value of the F1 Score indicator will be very large, which can be a more comprehensive evaluation of a classifier. The EEGNet network has the worst results on the SMR-BCI dataset with an F1_score value of only 34.43%. The DeepConvNet has a low ACC on the BCIIV2a dataset; the F1_score value, in particular, only reaches 30.62%. This situation is similar to that of the EEGNet, with neither of them being generalizable. The MIN2Net network in the BCIIV2a dataset has an F1_score value of 49.09%. The results of SSCNN and MSAENet on the three datasets show robustness, but the results of the MSAENet are higher than those of the SSCNN. Table 3. Results of subject-independent comparison experiments (β 1 = 0.5, β 2 = 1.0, β 3 = 0.5). The largest value is marked with bold font. MSAENet embodies great advantages over the other four methods compared, and the ablation experiment is used to determine the degree of influence of various factors on the network. The results of the subject independence in the ablation experiments are shown in Table 4. It can be seen that the MSAENet outperforms the other ablation network results in both accuracy and F1-score in the subject-independent case. In particular, compared with the RawDataNet, our method shows great advantages on two datasets, i.e., BCIIV2a and SMR-BCI, with significant results in F1_score and ACC. In the OpenBMI dataset, the performance of the AENet network is poor, but the overall trend of several ablation experiments has not changed significantly. On the BCIIVa dataset, the SingleScaleNet differs the least from the MSAENet in terms of ACC, and the WithoutLossNet is closest to the MSAENet in terms of the F1_score. On the SMR-BCI dataset, the results of MultiScaleNet are close to those of MSAENet; this was the smallest difference.

Comparative Test Analysis
From the results of the subject-independent comparison experiments in Table 3, it can be seen that the MSAENet shows the best performance on both the BCIIV2a and the SMR-BCI datasets. On the OpenBMI dataset, the MIN2Net shows significantly higher results than the other four methods, but there is no significant difference among the results of the other four methods. The EEGNet is a lightweight and compact framework for learning potential features by deep convolution and separable convolution, and the EEGNet contains temporal convolution kernels in the first block, which are half the length of the sampling rate, for extracting time-domain information. However, the input data distribution of EEG-based BCI systems may change during intra-session transitions [11], and the timedomain features are variable, which cannot be easily captured as stable features when considering feature extraction between different sessions; thus, the EEGNet may not show good generalizability when applied to multiple datasets because of the non-stationary characteristics of the EEG signals. Compared with the EEGNet, the MSAENet can learn features more comprehensively in the spectral and spatial domains, capture features at different scales and extract differences between sessions, which is more suitable for subject independence and reduces the calibration time of subjects. The DeepConvNet network shows a trend consistent with EEGNet, with a low F1_score and poor generalization on the BCIIV2a dataset. It has been mentioned in the literature that F1_score is more valuable than ACC [35].The DeepConvNet learns potential features of subjects from deep level by four blocks; due to the paradigm of different datasets, it may involve different time-domain, frequency domain and spatial domain information, but the DeepConvNet has not considered multi-scale and multi-domain fusion up to this point. Therefore, the DeepConvNet has some advantages for a single dataset but cannot show stable reliability on every dataset when multiple datasets are considered. Compared to the DeepConvNet, the MSAENet performs multi-domain fusion from spectral and spatial domain aspects to capture the unique features of different domains. The MIN2Net shows an advantage on the OpenBMI dataset but does not show a great advantage on the other two datasets, especially the dataset BCIIV2a with an F1_score of only 49.09%; this differs from the MSAENet by 19.81%. Although the MSAENet does not show the best performance on the OpenBMI dataset as it only differs from the optimal result by 3.28%, which means that it has strong generalizability. Our analysis of the MIN2Net's strong generalizability is that it is due to the integration of deep metric learning into the autoencoder to learn compact and discriminative potential features from the EEG and to classify them simultaneously. While the MSAENet can maintain better generalization with the MIN2Net, in addition to extracting features from multiple scales and domains, we also improve Softmax's problem that it only considers inter-class distance and ignores intra-class compactness to learn richer latent features from EEG. The SSCNN also shows better stability on the SSCNN based on spectral and spatial domain feature fusion, but it fails to refine the feature granularity from multiple scales. The MSAENet integrates multi-domain fusion, refines feature extraction granularity and learns intra-class compact features and richer feature representations from different sessions, thus achieving satisfactory results in subject independence with strong generalizability.

Analysis of Ablation Experiments
Through comparative experiments, we found that the MSAENet showed great superiority compared with the other four methods. The advantages of the MSAENet were further analyzed using ablation experiments. From the results of the subject-independent ablation experiments in Table 4, we found that the RawDataNet network without spatial-spectral feature pre-extraction had the greatest difference with the MSAENet, and its results were much lower than those of the MSAENet, especially on the two datasets, i.e., BCIIV2a and SMR-BCI; the maximum difference in accuracy is about 20%. The results show that the spatial-spectral feature pre-extraction can abstract the spatial-spectral signals of multiple subjects into one dimension and reduce the network's training time and its difficulty. It can not only solve the differences between subjects and different datasets but also retain the original data characteristics as much as possible and eliminate invalid information. Spatial-spectral features combined with CSP and frequency bands can be used as a feature extraction method for motor imagery learning and classification [14]. Both the Multi-ScaleNet and the AENet can prove that multi-branch network models can learn features from different angles; that different convolution methods have different feature extraction capabilities; and that the multi-branch can realize feature extraction diversification [17]. The WithoutLossNet proves the effectiveness of the central loss function on the BCIIV2a and SMR-BCI datasets. The central loss function can well compensate for the shortcomings of Softmax. The combination of the central loss function and the Softmax classifier can consider the relationship between the class and different classes at the same time, so that the same type of data converges to the feature center, and the distance between the feature centers between different classes expands [36,37]. The SingleScaleNet replaces different scale convolution kernels with 3 × 3 convolution kernels, which verifies that the features extracted from different scales of sensory fields are diverse. The feature representations learned by multi-scale are more comprehensive and diverse. Overall, our network integrates the advantages of different components, and the learned features are more diverse and comprehensive.

Parameter Performance Analysis
The results for the number of network parameters and prediction time for the comparison experiments are shown in Table 5. Although the number of parameters in the EEGNet and the DeepConvNet is much lower than that of the MIN2Net, the SSCNN and the MSAENet, these two methods do not have applicability. The number of training parameters of the SSCNN is as high as 77,577,714, and the average prediction time is as long as 1.0934 s on OpenBMI. While MSAENet keeps the two evaluation indicators of ACC and F1_score higher than SSCNN, its trainable parameters and prediction time are much lower than the SSCNN. The MSAENet greatly reduces the number of parameters and shortens the prediction time while maintaining a high accuracy rate.

Visualization Analysis
The visualization results of the comparison experiment with the subject-independent setting are shown in Figure 2. The one-dimensional vector of the last layer of the MSAENet is embedded into a two-dimensional space for visualization by the t-SNE method. Different shapes indicate that different parts of the motor imagery are performed: triangles indicate the left hand, circles indicate the feet, and crosses indicate the right hand. Different colors are used to distinguish the visualization results in order to represent them more intuitively. The MSAENet's visualization results in better classification than other methods, with a more compact intra-class distribution and larger inter-class spacing. The visualization distribution of the MSAENet outperforms the other networks on the two datasets BCIIV2a and SMR-BCI, which is consistent with the results of the data in Table 3.  The visualization results of the subject-independent ablation experiment ar in Figure 3. It can be seen from the figure that in the RawDataNet network the v tion results of the three datasets are messy and cannot effectively distinguish th nary tasks. Compared with other ablation experiments, the MSAENet not only h The visualization results of the subject-independent ablation experiment are shown in Figure 3. It can be seen from the figure that in the RawDataNet network the visualization results of the three datasets are messy and cannot effectively distinguish the imaginary tasks. Compared with other ablation experiments, the MSAENet not only has high classification accuracy but also more compact intra-class distance and good cohesion,  The visualization results of the subject-independent ablation experiment are shown in Figure 3. It can be seen from the figure that in the RawDataNet network the visualization results of the three datasets are messy and cannot effectively distinguish the imaginary tasks. Compared with other ablation experiments, the MSAENet not only has high classification accuracy but also more compact intra-class distance and good cohesion, The visualization results of the subject-independent ablation experiment are shown in Figure 3. It can be seen from the figure that in the RawDataNet network the visualization results of the three datasets are messy and cannot effectively distinguish the imaginary tasks. Compared with other ablation experiments, the MSAENet not only has high classification accuracy but also more compact intra-class distance and good cohesion, which further confirms the data in Table 4.

Future Directions
Despite the good results of the MSAENet on subject independence, there some shortcomings that need further improvement. First, the attention mechanis ule can automatically extract the features of interest. The integration of the attentio anism module into the MSAENet module to further improve classification perfo should be considered. Second, the MSAENet extracted valid discriminative f which were considered for application in other EEG measurements, such as SS nally, combining the MSAENet with transfer learning to improve the generalizatio BCI system can be considered. The visualization results of the subject-independent ablation experiment are shown in Figure 3. It can be seen from the figure that in the RawDataNet network the visualization results of the three datasets are messy and cannot effectively distinguish the imaginary tasks. Compared with other ablation experiments, the MSAENet not only has high classification accuracy but also more compact intra-class distance and good cohesion, which further confirms the data in Table 4.

Future Directions
Despite the good results of the MSAENet on subject independence, there are still some shortcomings that need further improvement. First, the attention mechanism module can automatically extract the features of interest. The integration of the attention mechanism module into the MSAENet module to further improve classification performance should be considered. Second, the MSAENet extracted valid discriminative features, which were considered for application in other EEG measurements, such as SSVEP. Finally, combining the MSAENet with transfer learning to improve the generalization of the BCI system can be considered.

Conclusions
In this paper, a two-branch network structure based on feature fusion is proposed. It was verified that AE and multi-scale feature fusion can extract richer features from EEG signals. A feature pre-extraction approach for data based on both the spatial and frequency domains is expected to extract common information of subjects from different par- The visualization results of the subject-independent ablation experiment are shown in Figure 3. It can be seen from the figure that in the RawDataNet network the visualization results of the three datasets are messy and cannot effectively distinguish the imaginary tasks. Compared with other ablation experiments, the MSAENet not only has high classification accuracy but also more compact intra-class distance and good cohesion, which further confirms the data in Table 4. represent feet MI.

Future Directions
Despite the good results of the MSAENet on subject independence, there are still some shortcomings that need further improvement. First, the attention mechanism module can automatically extract the features of interest. The integration of the attention mechanism module into the MSAENet module to further improve classification performance should be considered. Second, the MSAENet extracted valid discriminative features, which were considered for application in other EEG measurements, such as SSVEP. Finally, combining the MSAENet with transfer learning to improve the generalization of the BCI system can be considered.

Conclusions
In this paper, a two-branch network structure based on feature fusion is proposed. It was verified that AE and multi-scale feature fusion can extract richer features from EEG signals. A feature pre-extraction approach for data based on both the spatial and frequency domains is expected to extract common information of subjects from different paradigms. The introduction of the centroid function can compensate for the shortcoming of traditional Softmax, which ignores intra-class cohesiveness. Our proposed network shows superior performance on all three datasets, demonstrating that the network is highly generalizable, achieves zero calibration and facilitates the development of applications for the MI-BCI system.