Electroencephalogram-Based Motor Imagery Signals Classification Using a Multi-Branch Convolutional Neural Network Model with Attention Blocks

Brain signals can be captured via electroencephalogram (EEG) and used in various brain–computer interface (BCI) applications. Classifying motor imagery (MI) from EEG signals is one of the important applications that can help stroke patients rehabilitate or perform certain tasks. Dealing with EEG-MI signals is challenging because the signals are weak, may contain artefacts, depend on the patient's mood and posture, and have a low signal-to-noise ratio. This paper proposes a multi-branch convolutional neural network model, the Multi-Branch EEGNet with Convolutional Block Attention Module (MBEEGCBAM), which uses an attention mechanism and fusion techniques to classify EEG-MI signals. The attention mechanism is applied both channel-wise and spatial-wise. The proposed model is lightweight, with fewer parameters and higher accuracy compared to other state-of-the-art models. The accuracy of the proposed model is 82.85% and 95.45% on the BCI-IV2a motor imagery dataset and the high gamma dataset, respectively. Additionally, when using the fusion approach (FMBEEGCBAM), it achieves 83.68% and 95.74% accuracy, respectively.


Introduction
Brain-computer interface (BCI) systems enable humans and machines to interact without physical contact. Recent progress in this area has enabled devices to be controlled by brain signals [1]. The most commonly used brain signals are electroencephalography (EEG) signals since they are non-invasive (measured from the scalp), have a high time resolution, and are relatively inexpensive [2][3][4]. Dealing with EEG signals is challenging because the signals are weak, may contain artefacts, depend on the patient's mood and posture, and have a low signal-to-noise ratio [5].
To measure these signals, researchers use an elastic cap worn on the head in which the EEG electrodes are fitted. Such an arrangement ensures that each experimental session's data are collected from the same areas on the scalp [6]. An EEG signal is a combination of numerous frequency components of the brain signal. The majority of studies [7] employ a frequency range of 0–35 Hz. However, we use the whole frequency band without focusing on band-limited signals.
This paper focuses on EEG signals based on motor imagery (MI), which is the act of envisioning limb movement. A subject's MI data are generated when he or she imagines moving a particular limb. In the early 2000s, researchers found common spatial patterns (CSP) to be among the most effective techniques for identifying EEG-based MI (EEG-MI) signals. In this approach, a collection of linear transformations, also known as spatial filters or distance optimizers, is sought over a variety of classes. The energy of the filtered signals constitutes the feature set, which is fed to a support vector machine (SVM) [8].

The main contributions of this paper are as follows:
• Develops a lightweight deep learning-based multi-branch model to classify EEG-MI signals.
• Applies an attention mechanism to the proposed model to improve the accuracy.
• Develops a general model that can perform well with fixed hyperparameters.
• Investigates the effect of the fusion technique in the proposed model.
• Validates the efficiency and strength of the model under data variations by using multiple datasets.
The paper is organized as follows. The literature review is provided in Section 2. The suggested Multi-Branch EEGNet with Convolutional Block Attention Module (MBEEGCBAM) is presented in Section 3. Section 4 provides experimental results and discussion, while Section 5 concludes the paper.

Related Work
In the handcrafted approach, feature extraction and classification are performed separately [14,15], while in deep learning they can be performed in a single processing block, which gives deep learning an advantage, especially for medical signals [16,17]. The most often utilized model in EEG-MI-related tasks is the convolutional neural network (CNN) [18][19][20][21][22], but deep belief networks (DBNs) [19], stacked autoencoders (SAEs) [20], and recurrent neural networks (RNNs) [19,23] have also been used. In the processing of EEG-MI data, CNNs offer several benefits, including the capacity to acquire temporal and spatial information concurrently, the ability to exploit the hierarchical structure of certain signals, and excellent accuracy on large datasets.
CNN models are now applied in a variety of domains, including EEG-MI. The majority of articles that use deep learning to identify EEG-MI fall into one of four categories, depending on the input structure. Different features, spectral representation, raw signals, or topological maps can all be used as input formulations [7]. In determining the input formulation to employ, the design of the model is crucial. Some researchers have done some preprocessing of EEG signals before feeding them into a CNN. Sakhavi et al. presented one such approach in [24]. In the EEG recordings, the authors applied the filter-bank CSP (FBCSP) [25], then retrieved temporal information and applied channel-wise convolution. The authors used the BCI-IV2a dataset to test their approach, which yielded an average accuracy of 74.46%.
Inspired by the FBCSP, a ConvNet was proposed to classify EEG-MI signals; the input to the ConvNet is raw EEG data [26]. In [18], two models were presented: the first was the ShallowConvNet, and the second was the DeepConvNet. The ShallowConvNet has fewer layers, while the DeepConvNet is a deeper version of the ShallowConvNet with extra convolution and pooling layers. In [27], EEGNet was proposed as a compact form of prior techniques. It involves a depthwise convolution and a separable convolution, which reduce the number of network parameters. Riyad et al. [28] proposed a structure that has the EEGNet followed by an inception block. In [29], the authors combined temporal convolutional networks (TCNs) with the EEGNet. All of these models attempt to address EEGNet's limitations, namely its restricted network capacity, which leads to overfitting; because of these limitations, performance remains subpar even with a larger network. As a consequence, a multi-branch model, which aggregates attributes from different branches, is recommended.
Amin et al. developed a multilayer fusion approach to EEG-MI classification in [30]. The features from different layers of a CNN are fused using several fusion strategies. They tested two classification approaches, subject-specific and cross-subject, on two datasets: BCI-IV2a and the high gamma dataset (HGD). In both datasets, the multilayer CNNs with MLP (MCNN) model produced higher accuracy than the other state-of-the-art (SOTA) models in subject-specific classification. Furthermore, the multilayer CNNs with cross-encoding autoencoders (CCNN) model showed a significant accuracy gain in cross-subject classification. The same group proposed a two-attention-block inception model in [31], which produced decent accuracy on the BCI-IV2a dataset (74.7%) and HGD (94%).
Recently, a multi-branch 3D CNN that preserves spectro-temporal characteristics was proposed in [22]. The authors represented the first two dimensions as a series of two-dimensional representations based on the sensors' locations and used the temporal information as the third dimension. To increase the number of training samples, the authors utilized a cropping strategy. Their findings showed that the proposed 3D CNN outperformed the three single networks in terms of accuracy. In another work, 3D filters were used for a 3D CNN-based EEG-MI classification model [32]. In practice, a 3D filter is harder to construct, whereas a 1D filter is simpler. According to the researchers in [33], a network with three one-dimensional filters covering all three dimensions may outperform traditional convolutional networks while requiring much less computation. In our proposed model, a 2D CNN with two 1D filters is applied along time and space; compared to 3D filters, this can lessen computation while increasing the model's ability to cope with subject-specific difficulties.
The authors in [34] introduced a CP-MixedNet structure, where each of the convolution layers collects EEG temporal information at different scales. Using the self-attention process, the authors of [35] developed a spatial-temporal representation of raw EEG data, applying a spatial self-attention module when coding EEG-MI channels. The authors of [36] filtered the raw signal into various band ranges to produce three band-limited signals, each of which is passed through three parallel branches with varied filter sizes. This resulted in a massive number of parameters (more than 1215 K for the whole system), which limits the system's use in many applications. Furthermore, because the filter size did not vary, the influence of changing localities in channels was not accounted for in the model. In [37], the authors proposed a more sophisticated approach based on a temporal-spectral-based squeeze-and-excitation (SE) feature fusion network (TS-SEFFNet), which is a computationally expensive network with a huge number of parameters.
A combination of multi-scale processing and attention was proposed in [38], where the authors developed a multi-scale convolutional neural network using an attention mechanism for fusion (MS-AMF). On the BCI-IV2a dataset, the experimental findings demonstrated that the network achieved superior classification compared to the baseline technique, with 79.9% average accuracy. However, this model requires a data-preparation stage before the data are input into the network. Jia et al. [39] proposed a large model with several branches at each scale, which increases the computational complexity. It contains five parallel branches, each having an inception block followed by a residual block and an SE block. The EEG Inception block (EIB) has four parts: three 1D convolutions (with different kernel sizes that gradually increase among all EIBs) and a pooling operation. The authors performed experiments on two public BCI competition datasets, BCI-IV2a and BCI-IV2b, achieving 81.4% and 84.4% accuracy, respectively.
In our previous work [40], we examined a multi-branch CNN for classifying raw EEG-MI signals with fewer parameters. In this study, we investigate the effect of adding attention blocks to the multi-branch EEGNet.

Datasets
In this research, we evaluated our proposed model on two frequently used public EEG-MI datasets. In the BCI Competition IV-2a dataset (BCI-IV2a), data from nine subjects were gathered using 22 EEG electrodes at a sampling rate of 250 Hz [41]. In addition, eye-movement data were collected using three additional electrooculography (EOG) channels. There are four MI classes: left hand, right hand, feet, and tongue.
To validate the proposed model's robustness against data variations, we also evaluated it on the HGD. The HGD has more trials than the BCI-IV2a and has four classes: left hand, right hand, both feet, and rest. The HGD was collected in a controlled setting from 14 volunteers [18]. The data were collected using 128 channels, of which only 44 are related to MI, at a sampling frequency of 500 Hz.

EEG Data
For the BCI-IV2a dataset, from the onset of the pre-cue to the completion of each trial, we extracted 4.5 s of data at a sampling frequency of 250 Hz (250 × 4.5 = 1125 samples). Each trial therefore produced a data matrix of dimension (22 × 1125).
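For concreteness, this windowing step can be sketched in a few lines of NumPy. The recording and the cue position below are synthetic placeholders, not actual BCI-IV2a data:

```python
import numpy as np

# Hypothetical continuous recording for one run: 22 EEG channels at 250 Hz.
fs = 250                             # sampling frequency (Hz)
raw = np.random.randn(22, 60 * fs)   # 60 s of synthetic data

# Extract a 4.5 s trial window starting at a (hypothetical) cue onset.
cue_onset = 10 * fs                  # cue at t = 10 s, in samples
n_samples = int(4.5 * fs)            # 250 * 4.5 = 1125 samples
trial = raw[:, cue_onset:cue_onset + n_samples]

print(trial.shape)                   # (22, 1125)
```

Stacking such windows over all trials yields the (trials × 22 × 1125) input tensor used for training.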
Downsampling the HGD from 500 Hz to 250 Hz resulted in an improvement in the data quality. Additionally, the channels were reduced from 128 to 44 to eliminate redundant information: we excluded the electrodes not covering the motor imagery area and picked only the 44 sensors with C in their name (according to the database description), as they cover the motor cortex. To be consistent with the BCI-IV2a dataset, we used 4.5 s of each trial (0.5 s before the cue to the end of the trial) to produce 1125 samples per trial with a data matrix of dimension (44 × 1125) [35]. No further filters were used, and all channels were treated uniformly. The accuracy was calculated across trials for the same subject (within subject).
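A minimal sketch of this channel selection and downsampling is shown below, assuming hypothetical channel names and synthetic data (in the real HGD the names come from the recording montage):

```python
import numpy as np
from scipy.signal import decimate

# Synthetic stand-in for one HGD trial: 128 channels, 4.5 s at 500 Hz.
raw = np.random.randn(128, 2250)

# Hypothetical channel-name list; we keep the 44 motor-cortex sensors
# whose name contains 'C' (e.g., 'C3', 'FCC5h', ...).
names = [f"C{i}" for i in range(44)] + [f"P{i}" for i in range(84)]
keep = [i for i, n in enumerate(names) if "C" in n]
motor = raw[keep, :]                    # (44, 2250)

# Downsample 500 Hz -> 250 Hz (factor 2) with an anti-aliasing filter.
motor_250 = decimate(motor, 2, axis=1)
print(motor_250.shape)                  # (44, 1125)
```

`scipy.signal.decimate` applies a low-pass filter before subsampling, avoiding aliasing artefacts that plain slicing would introduce.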

EEGNet Block
Local connectivity, invariance to location, and invariance to local transition are three fundamental properties of the cerebral cortex. The primary concept of a CNN is to use a filter to examine the influence of adjacent neurons [42,43]. The filter size is determined by the data type and the feature map we wish to create. The first block in our proposed model is the EEGNet, which was introduced in [27]. The EEGNet block contains three convolution operations with varied window sizes, which are defined by the kernel size.
The first convolution layer uses 2D filters followed by a batch normalization. Batch normalization aids in the acceleration of training and the regularization of the model [44]. The second convolutional layer uses depthwise convolution followed by batch normalization and activation function in the form of an exponential linear unit (ELU), average pooling, and dropout. The third convolutional layer uses separable convolution. A simplified architecture of the EEGNet is shown in Figure 1.
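A simplified Keras sketch of such an EEGNet-style block is given below. The filter counts and kernel lengths are illustrative values, not necessarily the exact per-branch hyperparameters from Table 1:

```python
import tensorflow as tf
from tensorflow.keras import layers

def eegnet_block(n_channels=22, n_samples=1125,
                 F1=8, D=2, kern_len=64, dropout=0.5):
    """One EEGNet-style block (hyperparameter values are illustrative)."""
    inp = layers.Input(shape=(n_channels, n_samples, 1))
    # Temporal convolution + batch normalization
    x = layers.Conv2D(F1, (1, kern_len), padding='same', use_bias=False)(inp)
    x = layers.BatchNormalization()(x)
    # Depthwise (spatial) convolution across all electrodes
    x = layers.DepthwiseConv2D((n_channels, 1), depth_multiplier=D,
                               use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('elu')(x)
    x = layers.AveragePooling2D((1, 4))(x)
    x = layers.Dropout(dropout)(x)
    # Separable convolution
    x = layers.SeparableConv2D(F1 * D, (1, 16), padding='same',
                               use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('elu')(x)
    x = layers.AveragePooling2D((1, 8))(x)
    x = layers.Dropout(dropout)(x)
    return tf.keras.Model(inp, x)

model = eegnet_block()
print(model.output_shape)   # (None, 1, 35, 16)
```

The depthwise convolution spans all 22 electrodes at once (acting as a learned spatial filter), while the separable convolution keeps the parameter count low.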

CBAM Attention Block
Attention is well established to play a significant role in human perception. A human uses a sequence of limited sights and categorically focuses on significant sections of the image to apprehend the visual meaning [45]. From this idea comes the attention mechanism in deep learning. It is a module that can be added to a model to focus on relevant attributes and ignore others.
One of the attention modules is the Convolutional Block Attention Module (CBAM) described in [46], where the authors built a module to emphasize significant characteristics along the channel and spatial axes. Each branch may learn 'what' and 'where' to pay attention in the channel and spatial axes by using the sequence of attention modules (as shown in Figure 2). Because the module learns which information to highlight or hide, it efficiently helps the flow of information across the network. CBAM has two submodules: the channel attention submodule and the spatial attention submodule. In the channel attention submodule, the input features from the preceding block are concurrently transmitted to average-pooling and max-pooling layers. The feature maps generated by both pooling layers are then transmitted to a shared network, which is made up of an MLP with one hidden layer. In this hidden layer, a reduction ratio is used to reduce the number of activation maps, which reduces the parameter overhead. After applying the shared network to each pooled feature map, element-wise summation is used to merge the output feature maps. Then, to generate the feature vectors that will be the input for the spatial attention submodule, element-wise multiplication is applied between the output feature map from the channel attention submodule and the input feature map of the attention module. When calculating spatial attention, average-pooling and max-pooling operations are applied along the channel axis, and a convolution layer is then used to build an efficient feature descriptor. Both submodules are presented in Figures 3 and 4.
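The two CBAM submodules can be sketched in plain NumPy as follows. The weights are random stand-ins for learned parameters, and for brevity the spatial submodule uses a 1 × 1 combination of the two pooled maps instead of the 7 × 7 convolution used in [46]:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cbam(F, r=2, seed=0):
    """Minimal sketch of CBAM on a (C, H, W) feature map (random weights)."""
    rng = np.random.default_rng(seed)
    C, H, W = F.shape
    # --- channel attention submodule ---
    avg = F.mean(axis=(1, 2))                  # global average pooling, (C,)
    mx  = F.max(axis=(1, 2))                   # global max pooling, (C,)
    W1 = rng.standard_normal((C // r, C))      # shared MLP, hidden size C / r
    W2 = rng.standard_normal((C, C // r))
    mlp = lambda v: W2 @ np.maximum(W1 @ v, 0)        # ReLU hidden layer
    Mc = sigmoid(mlp(avg) + mlp(mx))           # channel weights in (0, 1)
    Fc = Mc[:, None, None] * F                 # channel-refined features
    # --- spatial attention submodule ---
    avg_s = Fc.mean(axis=0)                    # channel-axis avg pool, (H, W)
    max_s = Fc.max(axis=0)                     # channel-axis max pool, (H, W)
    w = rng.standard_normal(2)                 # 1x1-conv stand-in weights
    Ms = sigmoid(w[0] * avg_s + w[1] * max_s)  # spatial weights in (0, 1)
    return Ms[None, :, :] * Fc                 # spatially refined features

F = np.random.default_rng(1).standard_normal((16, 4, 35))
out = cbam(F)
print(out.shape)   # (16, 4, 35)
```

Because both attention maps lie in (0, 1), the output is an element-wise reweighting of the input: informative channels and locations are emphasized, the rest attenuated.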

Proposed Models
We propose a multi-branch EEG-MI classification system, where each branch has its own set of parameters to deal with the subject-specific problem. More specifically, we use three branches in the proposed system. With this technique, the convolution size, number of filters, dropout probability, and attention parameters can be fixed for all subjects, tailoring the model to each subject while increasing its general applicability. In the first convolutional layer, based on local and global modulations, the model learns temporal properties and spatial attributes through spatially distributed unmixing filters.
The proposed method, Multi-Branch EEGNet with Convolutional Block Attention Module (MBEEGCBAM), can be divided into two parts: the EEGNet block and the Convolutional Block Attention Module (CBAM). These basic blocks, EEGNet and CBAM, contain layers as described in [27] and [46], respectively.
The architecture of the MBEEGCBAM is shown in Figure 5. It has three different branches; each branch has an EEGNet block, a channel attention block, and a spatial attention block, followed by a fully connected layer. Each branch has a varied number of parameters to capture different features. Moreover, motivated by the improvement shown by fusion in medical signals and images [47][48][49], we investigate the effect of fusing the output feature maps from the EEGNet blocks with the output from the EEG-CBAM blocks to reduce feature loss and construct a comprehensive feature map. For that, we propose the FMBEEGCBAM model (Figure 6), which has the same blocks and connections as the MBEEGCBAM model with an extra step. In this model, we add two concatenate layers, one after the EEGNet blocks and the other after the CBAM blocks; we then flatten and fuse both concatenate layers before using the fused layer as input to the softmax layer for classification. We test our models on BCI-IV2a and HGD, two benchmark datasets in MI-EEG classification.
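A structural sketch of the multi-branch idea in Keras is shown below. The branches here are simplified stand-ins (without the CBAM blocks) and the filter counts and kernel lengths are illustrative; the full per-branch layers follow [27,46]:

```python
import tensorflow as tf
from tensorflow.keras import layers

def branch(inp, F1, kern_len):
    """Simplified stand-in for one EEGNet+CBAM branch (structural sketch)."""
    x = layers.Conv2D(F1, (1, kern_len), padding='same', use_bias=False)(inp)
    x = layers.BatchNormalization()(x)
    x = layers.DepthwiseConv2D((inp.shape[1], 1), depth_multiplier=2,
                               use_bias=False)(x)
    x = layers.Activation('elu')(x)
    x = layers.AveragePooling2D((1, 32))(x)
    return layers.Flatten()(x)

inp = layers.Input(shape=(22, 1125, 1))
# Three branches with different capacities (illustrative F1/kernel values).
feats = [branch(inp, F1, k) for F1, k in [(4, 16), (8, 32), (16, 64)]]
merged = layers.Concatenate()(feats)                  # multi-branch fusion
out = layers.Dense(4, activation='softmax')(merged)   # 4 MI classes
model = tf.keras.Model(inp, out)
print(model.output_shape)   # (None, 4)
```

The concatenation layer is where features from the three branches are merged; FMBEEGCBAM adds a second concatenation over the pre-attention feature maps before the softmax.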

Training Procedure
In the realm of EEG-MI research, the emotional and physical state of research volunteers can vary greatly. For that, we employed the within-subject approach to classifying the data in this research [30]. For both datasets, one session was used for training and the other was used for testing. Global hyperparameters, which were obtained in our previous work [40], were employed for all subjects, as shown in Table 1. The learning rate was 0.0009, batch size was 64, and the number of epochs was 1000. The Adam optimizer was used, and the cost function was the cross-entropy error function.
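The stated training configuration can be written in Keras as follows; `model`, `X_train`, and `y_train` are placeholders for the compiled model and the prepared data, not objects defined here:

```python
import tensorflow as tf

# Training configuration from the Training Procedure section / Table 1.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.0009)
loss = tf.keras.losses.CategoricalCrossentropy()
batch_size, epochs = 64, 1000

# model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
# model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs)
```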

Experiments
The Tensorflow deep learning library with Keras API was used in all experiments in Google's Colab environment.

Performance Metrics
To analyze our models, we used the following performance metrics: accuracy (%), precision, recall, F1 score, and Cohen's Kappa test. Table 2 shows the performance comparison between the proposed models and other SOTA models. In particular, the average classification accuracies, Kappa values, and F1 scores obtained by the FBCSP [25], ShallowConvNet [18], DeepConvNet [18], EEGNet [27], CP-MixedNet [34], TS-SEFFNet [37], MBEEGNet [40], and MBShallowCovNet [40] from the BCI-IV2a and HGD datasets are summarized in Table 2. Our methods have the highest average accuracy, Kappa, and F1 score as can be observed. Moreover, we compared our results with those of our previous work [40] which contains lightweight multi-branch models without attention blocks. We found that the attention block improves the accuracy by around 1%.
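These metrics can be computed with scikit-learn; the label vectors below are hypothetical toy predictions for a 4-class MI task, not results from the paper:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, cohen_kappa_score)

# Toy ground truth and predictions (0=LH, 1=RH, 2=Feet, 3=Tongue).
y_true = np.array([0, 1, 2, 3, 0, 1, 2, 3, 0, 1])
y_pred = np.array([0, 1, 2, 3, 0, 1, 2, 2, 3, 1])

print(f"accuracy : {accuracy_score(y_true, y_pred):.3f}")
print(f"precision: {precision_score(y_true, y_pred, average='macro'):.3f}")
print(f"recall   : {recall_score(y_true, y_pred, average='macro'):.3f}")
print(f"F1 score : {f1_score(y_true, y_pred, average='macro'):.3f}")
print(f"kappa    : {cohen_kappa_score(y_true, y_pred):.3f}")
```

Cohen's Kappa corrects the raw accuracy for chance agreement, which matters for multi-class MI tasks with unbalanced confusions.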

Overall Comparison
Using the two public datasets, we evaluate the performance of our models. Figures 7 and 8 show how our methods performed against the SOTA models on the BCI-IV2a and HGD. From the figures, we can see that the proposed models achieve at least 8.14% higher accuracy than the other baseline models on the BCI-IV2a, whereas on the HGD the improvement was around 2% for both MBEEGCBAM and FMBEEGCBAM. The methods compared are ShallowConvNet [18], DeepConvNet [18], FBCSP [25], EEGNet [27], CP-MixedNet [34], TS-SEFFNet [37], MBEEGNet [40], MBShallowConvNet [40], the proposed MBEEGCBAM, and the proposed FMBEEGCBAM.

Results of MBEEGCBAM
The proposed model was trained on session "T" from the BCI-IV2a dataset, while in the HGD, the proposed model was trained in all sessions in the dataset except the last two sessions which were kept for the testing. In the experiments, the within-subject or subject-specific approach was used.
One of the main focuses of this study was to find the optimal hyperparameters that can advance the accuracy with less complication. Therefore, we first found the best hyperparameters in the EEGNet block by performing multiple experiments. Then, we carried out other experiments to choose the best reduction ratio and kernel size in the CBAM block. Figure 9 shows the accuracy comparison between different kernel sizes and ratios in the CBAM blocks on the different EEGNet blocks (Figure 9a–c correspond to EEGNet Block 1, Block 2, and Block 3, respectively). As we can see from Figure 9, in EEGNet Block 1 (Figure 9a), the maximum accuracy (74.83%) was attained at ratio 2 and kernel size 2 × 2. Ratios 2 and 4 normally give better accuracy in EEGNet Block 1; however, ratio 8 gave better accuracy in EEGNet Block 2 and Block 3. On the other hand, kernel sizes 8 × 8 and 4 × 4 gave better accuracy in EEGNet Block 1 and Block 2, while kernel size 2 × 2 provided the best accuracy in EEGNet Block 3. We chose ratio 2 for EEGNet Block 1 and ratio 8 for EEGNet Block 2 and Block 3; for the kernel size, we chose 2 × 2 for EEGNet Block 1 and Block 3 and 4 × 4 for EEGNet Block 2. The hyperparameters used in the CBAM block in each branch of our proposed models are listed in Table 1. Table 3 shows the performance of each branch separately, as well as the multi-branch model without attention blocks, compared with our proposed model using the BCI-IV2a dataset. From the table, we can see that the proposed method, which is a combination of different EEGCBAM branches, enhances the performance of EEG-MI classification.
The comprehensive findings of MBEEGCBAM on the BCI-IV2a dataset and HGD are shown in Tables 4 and 5. In the tables, LH, RH, F, and Tou represent the left hand, right hand, feet, and tongue MI classes, respectively. Using the Wilcoxon signed-rank test, there is a significant increase (p < 0.05) in the average accuracy and Kappa value using MBEEGCBAM compared to the other SOTA models described in [18,29,34,37].

Results of FMBEEGCBAM
To study the effect of the fusion of the multi-branches, we added a connection between the output feature maps from the EEGNet blocks and the output from the EEG-CBAM blocks. Tables 6 and 7 show the detailed results on both datasets, the BCI-IV2a and the HGD. From the tables, we can see that the proposed fusion model improves the classification accuracy in six out of nine subjects in the BCI-IV2a dataset, while in the HGD, eight subjects show an improvement in accuracy. The drawback of this model is the increase in the number of parameters: the fusion model has 3808 more parameters than the MBEEGCBAM model, with around a 1% increase in classification accuracy.

Feature Discrimination Discussion
Using confusion matrices, we demonstrate the competence of the features obtained by the proposed MBEEGCBAM for the different MI classes. Figure 10 shows the confusion matrices of the proposed model and the SOTA models on both datasets. We see that the proposed MBEEGCBAM significantly improved accuracy across the four MI tasks in both datasets, especially the "Foot" task, which reached an average increase of 12.8% in the BCI-IV2a and 11.4% in the HGD. The rest of the tasks improved by around 9.13% in the BCI-IV2a and 3.5% in the HGD. To study the discriminative nature of the features obtained by the MBEEGCBAM, t-SNE was used to visualize the features [50] (see Figure 11). Compared to ShallowConvNet [18], DeepConvNet [18], and EEGNet [27], the proposed MBEEGCBAM model extracted more separable features from EEG-MI signals.
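The t-SNE projection used for such a visualization can be reproduced in outline with scikit-learn; the feature matrix below is synthetic, standing in for penultimate-layer activations of the trained model:

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for learned features: 80 trials, 64-D vectors,
# 4 MI classes (labels are only needed for coloring the plot).
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(loc=c * 3.0, size=(20, 64)) for c in range(4)])
labels = np.repeat(np.arange(4), 20)

# Project to 2-D for visualization.
emb = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(feats)
print(emb.shape)   # (80, 2)

# A scatter of `emb` colored by `labels` (e.g., with matplotlib:
# plt.scatter(emb[:, 0], emb[:, 1], c=labels)) then shows how
# separable the learned features are.
```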


Conclusions
In this paper, we propose lightweight multi-branch models with attention, which improve the performance of EEG-MI classification with fewer parameters. The multi-branch model concatenates different features from three branches. Compared to other SOTA models, our models exhibit promising results in terms of accuracy, Kappa value, and F1 score. Our results were more accurate than those of other multi-branch models and required less human intervention. The study used the BCI-IV2a dataset and the HGD, both of which are freely available. The experiments used a within-subject method, with global hyperparameters applied to all subjects in both datasets. The proposed MBEEGCBAM had an average classification accuracy of 82.85% on the BCI-IV2a dataset, while that of the proposed FMBEEGCBAM was 83.68%. The average accuracy on the HGD for MBEEGCBAM and FMBEEGCBAM was 95.45% and 95.64%, respectively. In the future, we want to apply different fusion strategies to the proposed models.