1. Introduction
Motor impairment is a critical disability in which the limbs lose their normal movement function for various reasons [1,2]. At any stage of life, people can be affected by short-term or long-term motor disabilities that prevent them from carrying out normal daily routines and activities. This has raised the need for communication methods that enable people with motor disabilities to interact easily with their surrounding environment and enhance their quality of life. The brain–computer interface (BCI) is one such communication method: it enables users to interact with and control the external environment without muscle activity by decoding messages and commands directly from brain signals [3,4,5]. A commonly used non-invasive technique for capturing brain activity is electroencephalography (EEG) [6]. Its high temporal resolution, affordable price, and ease of use make EEG a favorable choice for BCI applications [7]. EEG-based motor imagery (MI) has become prominent in EEG-based BCI applications and research. Many studies have taken advantage of EEG-based MI to address problems related to control and rehabilitation [8,9].
The main problem in MI detection is decoding messages and commands from an EEG signal, and various traditional machine learning [10,11] and deep learning [12,13,14] techniques have been proposed to solve it. Notable traditional machine learning techniques are based on the common spatial pattern (CSP) in combination with local activity estimation (LAE), the Mahalanobis distance, and support vector machines (SVMs) [15,16]. The most representative deep learning models widely used for decoding MI tasks are EEGNet [17], Shallow Net, and DeepNet [18]. Most works based on traditional machine learning and deep learning focus on a small number of MI tasks, ranging from two to four, involving bilateral, contralateral, and unilateral movements of the upper limbs. The problem becomes challenging, however, when the number of MI tasks exceeds four and involves unilateral motor imagery, such as the left hand, right hand, left leg, and right leg. Few research works have addressed more than four MI tasks [10,11,13], and those that have build mainly on existing traditional machine learning techniques and deep learning models; this necessitates the development of custom-designed techniques. In view of this, a custom-designed solution is proposed for this problem that leverages the spectral and multiscale nature of EEG trials; specifically, this paper introduces a deep learning model that combines attention and sequential modeling techniques with CNN layers.
An EEG signal is composed of various brain waves (rhythms or frequency bands), such as delta (0.5–4 Hz), theta (4–7 Hz), alpha (8–12 Hz), beta (12–30 Hz), and gamma (30–50 Hz) [19]. These brain waves play a key role in decoding motor imagery events from EEG brain signals [7,19]. MI triggers changes in the brain waves corresponding to a subject's intended movements. These changes help to discriminate various MI tasks, which ultimately enables an MI BCI system to detect MI tasks and interpret them to assist disabled people in interacting with the environment [3,4,5]. However, existing MI detection methods do not adequately focus on analyzing brain waves and extracting spectral information from EEG trials. Furthermore, multibranch temporal analysis of an EEG signal and long-term temporal dependencies also play an important role in discriminating MI tasks [14]. Based on these observations, SSTMNet is proposed to detect MI tasks. Specifically, the main contributions of this work are as follows:
A spectral–spatio-temporal multiscale deep network, SSTMNet, which first performs a spectral analysis of an input EEG trial by decomposing each channel into brain waves, applies attention to each brain wave according to its relevance to an MI task, then models long-term temporal dependencies and performs a multiscale analysis; finally, it carries out a deep analysis to extract high-level spectral–spatio-temporal multiscale features and classifies the trial.
A data augmentation technique to deal with the small dataset problem for each MI task. It generates new EEG trials from EEG signals belonging to the same class in the frequency domain, based on the idea that the coefficients of the same frequencies must be fused, ensuring label-preserving trials.
The effect of the data augmentation technique and the performance of SSTMNet for unilateral MI tasks on the benchmark public-domain dataset are thoroughly evaluated and compared with similar state-of-the-art methods.
A qualitative analysis of the discriminatory potential of the features learned by the SSTMNet and its decision-making mechanism is presented.
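The decomposition of an EEG channel into the brain waves listed above (delta through gamma) can be sketched with a simple FFT mask per band. This is a simplified, dependency-free stand-in for the filter banks typically used in practice, not SSTMNet's actual spectral front end:

```python
import numpy as np

# Frequency bands (Hz) as cited in the text
BANDS = {"delta": (0.5, 4.0), "theta": (4.0, 7.0), "alpha": (8.0, 12.0),
         "beta": (12.0, 30.0), "gamma": (30.0, 50.0)}

def decompose_into_bands(signal, fs):
    """Split a 1-D EEG channel into brain-wave components by FFT masking."""
    n = signal.size
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    components = {}
    for name, (lo, hi) in BANDS.items():
        mask = (freqs >= lo) & (freqs < hi)   # keep only this band's bins
        components[name] = np.fft.irfft(spectrum * mask, n=n)
    return components

# Example: a 10 Hz sine (alpha band) sampled at 200 Hz for 2.5 s
fs = 200
t = np.arange(0, 2.5, 1.0 / fs)
x = np.sin(2 * np.pi * 10 * t)
bands = decompose_into_bands(x, fs)
# Nearly all of the signal's energy lands in the alpha component
```

A per-band time series like this is what a spectral front end hands to the downstream blocks.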
The remaining sections are organized as follows: Section 2 presents an overview of the most relevant state-of-the-art works on EEG-based MI detection. The proposed method is explained in Section 3. The details of the evaluation protocol, the description of the dataset, and data augmentation are provided in Section 4. The experimental results and discussion are described in Section 5 and Section 6, respectively. Finally, Section 7 concludes the paper.
2. Related Works
The problem of decoding MI tasks from EEG signals has been under consideration since the 1990s [20]. The detection of motor imagery patterns in EEG signals is progressing rapidly with conventional machine learning and deep learning techniques. This section reviews representative works on MI task detection using EEG signals. A summary comparison of the various methods is presented in Table 1.
Most published works have addressed two to four motor imagery classes and were evaluated on the many benchmark datasets developed for these tasks [21,22,23]. For more than four MI tasks, both the studies and the benchmark datasets are few [24,25,26,27]. Ofner et al. [25] focused on six MI tasks associated with the right upper limb (elbow, forearm, and hand) and a rest state. Yie et al. [26] developed a dataset that involves six simple and compound MI tasks related to the hands and legs. Jeong et al. [27] developed a dataset that contains eleven MI tasks specific to the same upper limb, including arm-reaching, hand-grasping, and wrist-twisting. Kaya et al. [24] built a large MI dataset that involves five fingers, hands, legs, tongue, and a passive state. It is divided into the following three paradigms: five fingers (5F); hands, legs, tongue, and passive (HaLT); and hands and passive state (CLA). Most studies focus on unilateral and contralateral MI tasks for the upper body parts. However, the dataset of Kaya et al. [24] addressed unilateral MI tasks for the hands and legs and also introduced challenging MI tasks involving the use of one finger at a time.
Table 1.
Summary comparison of the reviewed literature: dataset, number of subjects (#Sub), method, number of channels (#Ch), number of MI tasks (#MI Tasks), and accuracy (Acc.).
| References | Dataset | #Sub | Method | #Ch | #MI Tasks | Acc. |
|---|---|---|---|---|---|---|
| Lian et al. (2024) [12] | Private | 10 | CNN + GRU + Attention | 34 | Six tasks (RH, LH, both hands, both feet, right arm + left leg, and left arm + right leg) | 64.40% |
| Wirawan et al. (2024) [28] | MIMED | 30 | SVM + baseline reduction | 14 | Raising RH, lowering RH, raising LH, lowering LH, standing, and sitting | 83.23% |
| George et al. (2022) [13] | HaLT [24] | 12 | DeepNet | 19 | LH, RH, passive state, LL, tongue, RL | 81.74–83.01% |
| George et al. (2022) [14] | HaLT [24] | 12 | DeepNet | – | LH, RH, passive state, LL, tongue, RL | 81.49% |
| George et al. (2022) [29] | HaLT [24] | 12 | Multi-Shallow Net | – | LH, RH, passive state, LL, tongue, RL | 81.92% |
| Mwata-Velu et al. (2023) [30] | HaLT [24] | 12 | EEGNet | 8–6 | LH, RH, passive state, LL, tongue, RL | 83.79% |
| Yan et al. (2022) [31] | HaLT [24] | 12 | Graph CNN + Attention | 19 | LH, RH, passive state, LL, tongue, RL | 81.61% |
| Yang et al. (2024) [10] | 5F [24] | 8 | Feature selection + feature fusion + ensemble learning (SVM, RF, NB) | 22 | 5 fingers | 50.64% |
| Degirmenci et al. (2024) [11] | NoMT + 5F [24] | 13 | Intrinsic time-scale decomposition (ITD) + ensemble learning | 19 | 5 fingers + passive state | 55% |
A few studies focused on six MI tasks. Lian et al. [12] proposed a hybrid DL model to extract spectral and temporal features. It consists of four stacked CNN layers and two GRU layers, followed by an attention technique to reweight the extracted features. The evaluation was carried out on a private dataset that involved six MI tasks performed by ten healthy subjects, yielding an accuracy of 64.40%. Wirawan et al. [28] developed a motor execution (ME) and motor imagery (MI) dataset for six motor tasks related to different hand positions, as well as standing and sitting movements. They preprocessed each EEG trial using different methods, such as separating a baseline signal from the trial, normalization, and frequency decomposition. They then extracted features using differential entropy and used baseline reduction to overcome interference in a trial. Finally, they used decision trees and SVM to assess the baseline reduction process in recognizing motor patterns. The results showed that SVM with the baseline reduction approach achieved an accuracy of 83.23%.
George et al. [13] employed DeepNet and examined different data augmentation methods on the HaLT paradigm [24]. The accuracy ranged between 81.74% and 83.01%, compared to 80.73% without data augmentation. George et al. [14] employed Bi-GRU, DeepNet, and a multi-branch Shallow Net to classify MI tasks in the HaLT paradigm; they used within- and cross-subject transfer learning approaches, in addition to a subject-specific approach, for evaluation. The subject-specific evaluation resulted in the better performance (Bi-GRU 77.70 ± 11.06, DeepNet 81.49 ± 11.43, and Multi-Shallow Net 78.32 ± 7.53). George et al. [29] also conducted a study comparing the performance of traditional methods, existing deep learning models, and their combinations in classifying MI tasks in the HaLT paradigm. Deep learning models showed superior results over conventional methods; the highest accuracy across all experiments was 81.92 ± 6.48, achieved with Multi-Shallow Net. Mwata-Velu et al. [30] used EEGNet and introduced two channel selection methods based on channel-wise accuracy and mutual information. They evaluated their methods on the HaLT paradigm using a subject-independent protocol with 10-fold cross-validation, reaching an accuracy of 83.79%. Yan et al. [31] employed a graph CNN with an attention mechanism, reshaping the EEG signal into two graph representations, spatial–temporal and spatial–spectral. The two representations were fed into the model in parallel and their features were then fused. A subject-specific protocol with 5-fold CV was used to classify the HaLT paradigm, achieving an accuracy of 81.61%.
Some studies focused on classifying five MI tasks [10,11]. Yang et al. [10] proposed a two-stage method for evaluating and selecting EEG signal sub-bands. First, bands with high correlation coefficients between features and labels were identified, and their effectiveness was then verified by computing their average accuracy using SVM, Random Forest, and Naive Bayes classifiers. Fourier transform amplitudes and Riemannian geometry were then used to extract features, followed by a feature selection process. Finally, ensemble learning was applied through weighted voting. They achieved an average accuracy of 50.64 ± 10.88 in classifying five fingers (the 5F paradigm). Degirmenci et al. [11] proposed a novel feature extraction method that decomposes EEG signals using the intrinsic time-scale method and selects the first three high-frequency proper rotation components (PRCs) for further analysis. Various features, such as power, mean, and sample entropy, were extracted from individual PRCs and their combinations and analyzed using a feature selection method. Well-known machine learning classifiers were used to classify the selected features, with an ensemble learning classifier performing best. The study combined the NoMT and 5F paradigms in [24] to classify six classes, and the accuracy ranged from 35.83% to 55% using a subject-dependent protocol.
The studies in [12,28] focused on a large number of MI tasks, most of which were bilateral and contralateral MI movements; unilateral MI was evaluated only for the hands. In [12], trials were recorded through 34 channels distributed over the skull, leading to interference and redundancy in the recorded signals. In [28], the reported accuracy was for a motor execution dataset rather than motor imagery. Some studies employed existing deep learning models such as DeepNet and EEGNet [13,14,29,30]. Additionally, the studies in [13,29] used many preprocessing techniques, such as bandpass filtering, baseline correction, artifact correction, data re-referencing, and a trial rejection process.
A few studies have focused on more than four MI tasks, and they employed existing deep learning models or traditional machine learning algorithms. Because these studies relied on existing techniques, they did not deeply analyze the spectral information and multiscale content of an EEG signal and yielded poor performance. Further research is needed to overcome these limitations.
4. Evaluation Procedure
This section presents the evaluation procedure that was followed to assess the performance of SSTMNet. It also describes the public-domain benchmark EEG MI dataset [24] and the performance metrics used. Finally, the data augmentation method applied to the dataset is introduced in detail.
SSTMNet was implemented using Python 3.10.12, and the experiments were conducted on Google Colab Pro+ equipped with a Tesla T4 GPU and 15 GB of RAM. A detailed description of the training procedure is also provided in this section.
4.1. Dataset Description and Preparation
SSTMNet was tested on a public-domain benchmark EEG MI dataset [24] that involves three paradigms. This work focuses on the HaLT paradigm, which stands for hands, legs, and tongue; it involves six MI tasks, i.e., left- and right-hand, left- and right-leg, passive, and tongue imagined movements. Twelve healthy subjects participated in this paradigm, labeled from 'A' to 'M' excluding the letter 'D'. The number of sessions varied from one to three per subject, resulting in 29 files in total. The duration of a session ranged from 50 to 55 min. Each session was divided into interaction periods (IPs); it started with 2.5 min of initial relaxation, and the IPs were separated by 2.5 min breaks (as depicted in Figure 5a). In each IP segment, a subject performed 300 trials; each trial took 1 s for the MI task and 1.5 to 2 s as a pause, as shown in Figure 5b. The EEG-1200 system with Neurofax software was used to record the 19-channel EEG signals at a sampling rate of 200 Hz; two ground channels and a bipolar channel for synchronization purposes were also recorded. The 10/20 international system was adopted to position the electrodes over the scalp. Owing to the EEG-1200 system and Neurofax software, two hardware filters were applied to the recorded data: a notch filter (50 Hz) and a bandpass filter (0.53–70 Hz).
For the experiments, EEG trials of 2.5 s were extracted, as shown in Figure 5c. Each trial comprises 1 s of the MI task and 1.5 s of the pause period, ensuring that the entire motor imagery signal is captured, since the thought process continues after the first second.
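Trial extraction at these timings can be sketched as follows. At 200 Hz, a 2.5 s trial is 500 samples; the onset indices below are hypothetical placeholders for the dataset's actual event markers:

```python
import numpy as np

FS = 200            # sampling rate (Hz), per the dataset description
TRIAL_SEC = 2.5     # 1 s MI task + 1.5 s pause

def extract_trials(recording, onsets, fs=FS, trial_sec=TRIAL_SEC):
    """Slice fixed-length trials from a continuous (channels x time)
    recording at the given onset sample indices."""
    length = int(round(trial_sec * fs))          # 500 samples per trial
    trials = [recording[:, s:s + length] for s in onsets
              if s + length <= recording.shape[1]]
    return np.stack(trials)                      # (n_trials, channels, 500)

# Toy example: a 19-channel, 60 s recording with three hypothetical onsets
rec = np.random.randn(19, 60 * FS)
trials = extract_trials(rec, onsets=[1000, 3000, 5000])
```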
4.2. Dataset Augmentation
On average, each subject and class had a set of 159 EEG trials, which is insufficient to train even a lightweight deep network. To tackle this small dataset problem, a data augmentation method is proposed that creates new EEG trials from the existing ones in the frequency domain. The idea is that trials belonging to the same class share similar patterns in the frequency domain, so the coefficients of the same frequencies must be interpolated to ensure that the newly created trials are label-preserving.
A trial $x(n)$ in the time domain is transformed into the frequency domain $X(k)$ using the DFT [44] as follows:
$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}, \quad k = 0, 1, \ldots, N-1,$$
where $X(k)$ are the frequency coefficients and $N$ is the trial length.
Two different trials from the same class, $x_i$ and $x_j$, are selected, where $i \ne j$ and $i, j \in \{1, \ldots, T\}$, with $T$ being the total number of trials of the class. They are transformed to $X_i$ and $X_j$ using the Fast Fourier Transform (FFT). The new trial $X_{new}$ is created by interpolation in the frequency domain, as follows:
$$X_{new}(k) = (1 - \rho)\, X_i(k) + \rho\, X_j(k),$$
where $0 < \rho < 1$ is the interpolation factor. The new trial $X_{new}$ is in the frequency domain. It is transformed to the time domain $x_{new}$ using the inverse Fast Fourier Transform (IFFT), as follows:
$$x_{new}(n) = \frac{1}{N} \sum_{k=0}^{N-1} X_{new}(k)\, e^{j 2\pi k n / N}.$$
Please note that the new trial is created by interpolating the corresponding frequency coefficients. This interpolation ensures that the new trial $x_{new}$ has the same label as $x_i$ and $x_j$, i.e., the data augmentation technique is label-preserving. The details for creating new trials are described in Algorithm 1.
Algorithm 1: Data Augmentation using Fourier Transform.
Input:
- The set of trials $\{x_1, \ldots, x_T\}$ corresponding to the class c, where c ∈ {left hand, right hand, passive, left leg, tongue, right leg}.
- G, the number of trials to be generated.
Output:
- The generated trials.
Processing:
Do
- Randomly select two trials $x_i$ and $x_j$, $i \ne j$.
- Compute the Fourier transforms $X_i$ and $X_j$ of $x_i$ and $x_j$.
- For each frequency k: $X_{new}(k) = (1 - \rho)\, X_i(k) + \rho\, X_j(k)$.
- Compute the inverse Fourier transform of $X_{new}$ to obtain the new trial $x_{new}$.
While (fewer than G trials have been generated)
Furthermore, a set of trials was generated with noise through extrapolation by setting ρ to a negative value. This study ended up with 75% of the trials being created by interpolation and only 25% by extrapolation (ρ = −0.25).
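Under the assumption that the interpolation is a per-frequency linear mix with factor ρ, as sketched above, Algorithm 1 can be written in a few lines of NumPy. The function names and the 50/50 default mix are illustrative choices, not the authors' exact implementation:

```python
import numpy as np

def augment_pair(x1, x2, rho=0.5):
    """Create a label-preserving trial by interpolating (0 < rho < 1) or
    extrapolating (rho < 0) the Fourier coefficients of two same-class
    trials x1, x2 of shape (channels, samples)."""
    X1, X2 = np.fft.fft(x1, axis=-1), np.fft.fft(x2, axis=-1)
    X_new = (1.0 - rho) * X1 + rho * X2      # per-frequency linear mix
    return np.real(np.fft.ifft(X_new, axis=-1))

def augment_class(trials, n_new, rho=0.5, seed=0):
    """Generate n_new trials from a (n, channels, samples) stack of
    trials that all carry the same class label."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i, j = rng.choice(len(trials), size=2, replace=False)
        out.append(augment_pair(trials[i], trials[j], rho))
    return np.stack(out)
```

Because the Fourier transform is linear, mixing coefficients in the frequency domain is equivalent to mixing the trials sample-by-sample in the time domain, which makes the label-preserving property easy to verify.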
4.3. Hyperparameter Tuning
Two main architectural hyperparameters affecting the performance of SSTMNet are normalization layers and activation functions. There are various choices for them, as shown in
Table 3. In addition, the training of the model is affected by the optimizer, batch size, and learning rate. Different choices are possible for them, as shown in
Table 3. SSTMNet was created and trained using Pytorch and the Pytorch lightening framework. The Optuna [
42] package was employed to find the best options for the learning rate, optimizer, dropout value, activation functions, and normalization layers using Subject A1. For finding the best hyperparameters, grid search was used as a background procedure to explore the search space, as presented in
Table 3. Additionally, the maximization criteria were set to maximum validation accuracy for 100 trials with 30 epochs per trial.
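As a dependency-free illustration of this tuning loop, the snippet below mimics exhaustive grid search over a hypothetical search space; the option lists and the toy objective are placeholders, not the actual Table 3 values or the real validation routine:

```python
import itertools

# Hypothetical search space standing in for Table 3's options
SPACE = {
    "optimizer": ["adam", "nadam", "sgd"],
    "batch_size": [32, 64, 96],
    "lr": [1e-2, 1e-3, 1e-4],
    "activation": ["relu", "gelu", "elu"],
}

def grid_search(evaluate, space):
    """Exhaustive grid search returning the configuration with the
    highest validation accuracy."""
    best_cfg, best_acc = None, float("-inf")
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        cfg = dict(zip(keys, values))
        acc = evaluate(cfg)                 # e.g. train 30 epochs, validate
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg, best_acc

# Toy objective that happens to prefer nadam + gelu at lr=1e-3, batch 96
def toy_eval(cfg):
    return (cfg["optimizer"] == "nadam") + (cfg["activation"] == "gelu") \
         + (cfg["lr"] == 1e-3) + (cfg["batch_size"] == 96) * 0.5

best, _ = grid_search(toy_eval, SPACE)
```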
The results revealed that the best optimizer was Nadam, with a batch size of 96. For the activation and normalization layers, GELU and lazy batch normalization were found to be the best choices. For the onward experiments, these hyperparameters were set to their best options, and the remaining architectural parameters of SSTMNet are explored in Section 5. For training the model, the Nadam algorithm was used with the selected learning rate. The cross-entropy loss function, shown below, was used for training:
$$\mathcal{L} = -\sum_{j=1}^{6} y_j \log(\hat{y}_j),$$
where $y$ and $\hat{y}$ are the actual label in one-hot encoding and the predicted probability vector of an EEG trial, respectively.
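The loss above can be checked numerically; the snippet below is a minimal NumPy version of the six-class cross-entropy, averaged over a batch of trials (a sanity check, not the PyTorch implementation used in training):

```python
import numpy as np

def cross_entropy(y_onehot, y_prob, eps=1e-12):
    """Multiclass cross-entropy  L = -sum_j y_j * log(p_j), averaged
    over a batch of EEG trials."""
    y_prob = np.clip(y_prob, eps, 1.0)       # avoid log(0)
    return float(-np.mean(np.sum(y_onehot * np.log(y_prob), axis=1)))

# Six MI classes: a confident correct prediction gives a small loss,
# a uniform guess gives exactly log(6)
y = np.eye(6)[[0]]                           # one-hot label for class 0
p_good = np.full((1, 6), 0.02); p_good[0, 0] = 0.90
p_bad = np.full((1, 6), 1 / 6)
```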
4.4. Evaluation Method
The SSTMNet model was evaluated using a subject-specific approach and stratified 10-fold cross-validation, following the state-of-the-art methods [29,30,31]: the data of each subject corresponding to the six imagery tasks were divided into 10 folds; one fold was held out for testing, one for validation, and the remaining eight folds were used as the training set, i.e., 80% of the data for training, 10% for validation, and 10% for testing. In this way, 10 experiments were performed, considering each fold in turn as the test set, and the average performance over all 10 folds is reported in the results. This approach allowed the generalization of the model to be tested over various training and test sets. For each fold, the model was trained for 200 epochs.
The metrics chosen to assess the model's performance were accuracy, F1-score, precision, recall, and the ROC curve. They are based on counting true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs). The problem is a multiclass one, with classes j = 1, 2, 3, 4, 5, 6 corresponding to {left hand, right hand, rest, left leg, tongue, right leg}. Accordingly, the micro approach was followed for computing accuracy, while the macro approach was used for the remaining metrics [45,46].
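A minimal sketch of the stratified fold assignment described above, in plain NumPy (the class and trial counts below are illustrative; libraries such as scikit-learn provide an equivalent `StratifiedKFold`):

```python
import numpy as np

def stratified_folds(labels, k=10, seed=0):
    """Assign each trial to one of k folds so that every class is
    spread evenly across folds."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    folds = np.empty(len(labels), dtype=int)
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        folds[idx] = np.arange(len(idx)) % k   # round-robin within class
    return folds

# 6 MI classes x 60 trials each -> every fold gets 6 trials per class
labels = np.repeat(np.arange(6), 60)
folds = stratified_folds(labels, k=10)
```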
5. Experimental Results
To validate the performance of the SSTMNet model and explore the effects of the hyperparameters in each block, several experiments were performed; the details are given in the following subsections.
The effects of the hyperparameters were analyzed on five subjects chosen to span the performance of SSTMNet from poorest to best. Different values of the hyperparameters were explored in this ablation study, and the values that produced the highest accuracy were selected and used when exploring the subsequent blocks for further improvements. SSTMNet was also compared with similar state-of-the-art methods on the same motor imagery problem and dataset.
5.1. Performance of SSTMNet
This section presents the performance of SSTMNet with the best hyperparameters selected from the Optuna study results (see Section 4.3) and the initial hyperparameters shown in Table 4. It was evaluated using the evaluation protocol and performance metrics described in Section 4.4. The results for all subjects are shown in Figure 6. The average F1-score, precision, and recall of SSTMNet were 56.19%, 58.69%, and 55.68%, respectively, and the average accuracy across all sessions was 77.52%. The model performed best for subjects C (session 1), J (session 1), L (sessions 1 and 2), and M (session 1); in all these cases, the accuracy was above 93%, and the F1-score, recall, and precision were above 80%. For some subjects, the accuracy was below the average of 77.52%, namely B (sessions 2 and 3), E (session 3), F (session 3), H and I (all sessions), and K (session 2); this was probably due to label noise in the annotation of these cases.
As far as the parameter complexity of the model is concerned, it involves 1,047,981 learnable parameters, which provide enough capacity for the model to learn the intricate EEG patterns that discriminate the various MI tasks.
5.2. Ablation Study on Hyperparameters of Blocks
Various hyperparameters are involved in the SSTMNet blocks, as shown in Table 4. Several experiments were performed to find the best value of each hyperparameter; details are given in the following subsections.
5.2.1. Attention Block
This block contains one hyperparameter, the compression ratio r. Different ratio values were examined to analyze their effect on the performance measures and the model's complexity. In terms of all measures, r = 10 and r = 32 gave almost the same results, but r = 32 resulted in a smaller standard deviation across all measures, as shown in Figure 7. In addition, the complexity of the attention block decreased significantly when the ratio was increased to 32, as can be seen in Figure 8. It can be concluded that the best ratio for the attention block is 32 in terms of model performance, stability, and block complexity.
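The attention block's internals are not fully reproduced in this excerpt; the sketch below assumes a squeeze-and-excitation-style channel attention whose bottleneck width is controlled by the compression ratio r (the weights here are random placeholders, not trained parameters). Since the bottleneck holds roughly 2C²/r weights, increasing r directly shrinks the block, which matches the complexity trend reported above:

```python
import numpy as np

def channel_attention(x, w1, w2):
    """Squeeze-and-excitation-style reweighting of (channels, time)
    features; w1: (C, C//r) and w2: (C//r, C) form the bottleneck."""
    squeeze = x.mean(axis=1)                         # (C,) global pooling
    hidden = np.maximum(squeeze @ w1, 0.0)           # bottleneck + ReLU
    scores = 1.0 / (1.0 + np.exp(-(hidden @ w2)))    # sigmoid gates in (0, 1)
    return x * scores[:, None], scores

C, T, r = 32, 500, 32                                # r = 32 -> 1-unit bottleneck
rng = np.random.default_rng(0)
x = rng.standard_normal((C, T))
w1 = rng.standard_normal((C, C // r))
w2 = rng.standard_normal((C // r, C))
y, s = channel_attention(x, w1, w2)
```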
5.2.2. Temporal Dependency Block
This block involves two hyperparameters, the number of neurons N and the number of layers L. Two sets of experiments were conducted sequentially. First, different numbers of neurons were tested to explore their impact on model performance and block complexity; the results are shown in Figure 9. Two of the tested values yielded the best results, with only slight differences between them; the model seemed more stable with the smaller of the two, while the larger increased the block complexity significantly (by about 277%) and caused overfitting (see Figure 10). As a result, the smaller value was preferred, owing to the significant increase in the number of learnable parameters otherwise. Overall, the results show a direct relationship between all performance measures and the number of neurons.
Afterwards, the effect of the number of layers was evaluated using L = 2, 4, 8, and 16. The number of layers in the temporal dependency block shows an inverse relationship with all performance measures, and the best performance was obtained with the smallest number of layers (see Figure 11). In this case, the learnable parameter complexity of the block was also reduced by half (see Figure 12).
5.2.3. Multiscale Block
This block was employed to extract high-level features at different scales with a view to handling inter- and intra-subject variability [36]. The scales analyzed were selected to be divisible by two, powers of two, or the sampling rate divided by a power of two, as shown in Table 5. Three of the kernel-size combinations yielded almost the same results on all performance metrics, as shown in Figure 13. Of these three combinations, one showed a lower standard deviation and a lower parameter complexity of the block, as indicated in Figure 14, and was therefore inferred to be the best choice.
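As an illustration of the multiscale idea, the sketch below convolves a feature sequence with kernels of several sizes and stacks the outputs side by side; moving-average kernels stand in for learned filters, and the kernel sizes are illustrative rather than SSTMNet's tuned values:

```python
import numpy as np

def multiscale_features(x, kernel_sizes=(8, 16, 32)):
    """Convolve a 1-D feature sequence with kernels of several sizes
    and stack the results, so short and long temporal patterns are
    captured in parallel branches."""
    outs = []
    for k in kernel_sizes:
        kernel = np.ones(k) / k                  # stand-in for learned weights
        outs.append(np.convolve(x, kernel, mode="same"))
    return np.stack(outs)                        # (n_scales, len(x))

x = np.random.default_rng(1).standard_normal(500)
feats = multiscale_features(x)
```

Larger kernels average over longer windows, so each branch responds to structure at a different temporal scale.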
5.2.4. Sequential Block
This block was utilized to detect high-level spectral–spatio-temporal features and to reduce the model complexity by compressing the features along the channel and temporal dimensions. Two hyperparameters were analyzed independently: the depth-wise kernel size and the standard CNN kernel size.
Four depth-wise kernel sizes were tested: 8, 16, 32, and 64. The results shown in Figure 15 indicate that a depth-wise kernel size of 32 gave the best performance on all metrics. A size smaller than 32 did not allow the discriminative features to be learned, and a size greater than 32 caused redundant features to be learned, which degraded the model's performance. Furthermore, this size offered the best compromise with respect to the complexity of the sequential block, as depicted in Figure 16.
The kernel size of the standard CNN has a significant effect on the complexity of the block, as it is followed by the fully connected layer. The aim was to select a kernel size offering the best compromise between model performance, generalization, and block complexity. Four choices were considered: 8, 16, 32, and 64. The results in Figure 17 show that, here too, a size of 32 was the best on all metrics. However, the results for size 16 were close to those for 32 and more stable, with a smaller standard deviation, and the block complexity decreased by 71.99% compared to the kernel size of 32, as shown in Figure 18. In view of this significant decrease in block complexity, the best choice was 16.
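The complexity advantage of depth-wise kernels can be made concrete with a parameter count. The channel and kernel numbers below are illustrative, not SSTMNet's actual configuration:

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard 1-D convolution (bias omitted):
    every output map mixes every input map with a k-tap filter."""
    return c_in * c_out * k

def depthwise_separable_params(c_in, c_out, k):
    """Depth-wise (one k-tap filter per input channel) followed by a
    pointwise 1x1 convolution that mixes channels."""
    return c_in * k + c_in * c_out

# e.g. 64 -> 64 feature maps with a 32-tap kernel
std = standard_conv_params(64, 64, 32)        # 64 * 64 * 32
sep = depthwise_separable_params(64, 64, 32)  # 64 * 32 + 64 * 64
```

With these numbers the separable variant uses roughly 5% of the standard convolution's weights, which is why depth-wise kernels keep the sequential block light.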
5.3. Impact of Data Augmentation
The proposed data augmentation method was used to handle the small dataset problem and to train the model effectively by exposing it to many augmented trials. The number of training samples was quadrupled to overcome overfitting and improve the model's generalization to unseen samples. To verify the effectiveness of the augmented trials, two experiments were conducted, with and without data augmentation. The model trained on only the original dataset showed a decrease across all performance measures, as is evident from Figure 19.
Examining the per-subject accuracy with and without data augmentation shows that performance improved significantly with data augmentation for all subjects except A3, which showed a small decrease. The trials created by both interpolation and extrapolation contributed to the model's effective training and its good results on unseen samples.
5.4. Comparison to the State-of-the-Art Methods
To validate the effectiveness of SSTMNet, it was compared with state-of-the-art methods on the same dataset, i.e., the HaLT paradigm. The focus was on methods that used all six classes, all subjects, and all sessions. SSTMNet was compared with methods that followed the subject-dependent evaluation protocol and reported session-wise results. The comparison is presented in Table 6.
George et al. [13] evaluated different data augmentation methods using DeepNet. George et al. [14] employed within- and cross-subject transfer learning, in addition to a subject-dependent (SD) protocol, using Bi-GRU, DeepNet, and a multi-branch Shallow network [18]; the SD protocol yielded accuracies of 77.70 ± 11.06, 81.49 ± 11.43, and 78.32 ± 7.53 with these models, respectively. The studies in [13,14] reported only an average accuracy, with no information about session-wise accuracy. George et al. [29] conducted a comparative study of the SOTA classifiers, DL models with trial-wise and cropping methods, and SOTA feature extraction methods with DL models as classifiers. In [13,29], the reported accuracies were 83.01% and 81.92%, respectively; however, these were obtained after applying many preprocessing techniques, such as bandpass filtering, baseline correction, artifact correction, and data re-referencing. In addition, the studies in [13,14,29] rejected trials without revealing the rejection procedure or the number of rejected trials.
Yan et al. [31] utilized a graph CNN with an attention mechanism and reshaped the raw signal into two graph representations, spatial–temporal and spatial–spectral, achieving an average accuracy of 81.61%; removing the attention block, however, degraded the performance by 8.10%. Mwata-Velu et al. [30] proposed two preprocessing approaches to select the six or eight most discriminative channels, using EEGNet for preprocessing and evaluation. Although the average accuracy reached 83.79%, the number and identity of the channels varied across subjects, which is impractical for real-world applications; additionally, the model was trained for 1500 epochs over 10 folds, which is time-consuming. Both studies combined all sessions per subject to train, validate, and test the DL models.
Owing to the similar evaluation protocol, the accuracy achieved in this work is compared to those of [24,29]. It is worth mentioning that no trial rejection was employed, so as to evaluate the proposed model fairly in classifying MI tasks. SSTMNet achieved an accuracy of 77.52 ± 4.00% as the average over 10 folds and 83.76% for the best fold, which outperforms the work in [29]. Furthermore, the majority of subjects performed better in this work than in [29] (see Table 6), and the difference between the maximum values of SSTMNet and those in [29] was not significant at the 5% significance level. The proposed model yielded an average accuracy improvement of 19.13% over the method by Kaya et al. [24], and the mean performance of SSTMNet was significantly better than that of Kaya et al. [24] at the 5% significance level (p-value = 7 × 10⁻¹¹). Finally, it should be pointed out that the subject-wise results of Kaya et al. [24] were taken from the study in [29].
5.5. Feature Visualization Based on t-SNE
To assess whether SSTMNet learned discriminative features, three subjects, C1, G3, and L2, were selected, and the features learned by the model were plotted using t-SNE, as shown in Figure 20, Figure 21 and Figure 22. It is clear from these figures that the features of all trials are well clustered and separated. However, passive-state features were classified as left leg and left hand in subjects C1 and L2, respectively. Figure 21 depicts the misclassification of only two tongue cases and two cases of opposite-side leg movements in subject G3. These misclassifications were due to the similarity of the features extracted from tongue, right-hand, and left-leg movements in subject G3, and from rest and left-hand movements in subject L2. Other misclassified cases could be caused by label noise or by subjects imagining different movements. Overall, the proposed model was able to extract discriminative features after processing the signals through the attention, temporal dependency, multiscale, and sequential blocks.
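Such feature inspections can be reproduced with a few lines of code. In the dependency-free sketch below, PCA stands in for t-SNE as the 2-D projection, and the synthetic features are placeholders for the trial features actually learned by the model:

```python
import numpy as np

def project_2d(features):
    """Project high-dimensional trial features to 2-D with PCA, a
    lightweight stand-in for t-SNE when eyeballing class separation."""
    X = features - features.mean(axis=0)
    # top-2 right singular vectors give the principal directions
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:2].T

# Two synthetic, well-separated "classes" of trial features
rng = np.random.default_rng(0)
a = rng.standard_normal((50, 128)) + 5.0
b = rng.standard_normal((50, 128)) - 5.0
emb = project_2d(np.vstack([a, b]))
# the two classes end up far apart along the first component
```

For real trial features, t-SNE (e.g., scikit-learn's `TSNE`) would replace `project_2d`, since it preserves local neighborhood structure better than PCA.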
5.6. Confusion Matrix
A confusion matrix is an interpretation method that provides insight into a model’s decision making and helps to identify correctly classified and misclassified trials and their relationships. By investigating the confusion matrix and t-SNE plots at the same time, we can determine tasks that are difficult to discriminate due to either similarity in the features extracted by the model or label noise.
Figure 23 shows the confusion matrices for the six imagery tasks of three subjects, C1, G3, and L2. The presented confusion matrices are consistent with the t-SNE feature plots. Only one rest case was misclassified in each of subjects C1 and L2, as left leg and left hand movements, respectively, as presented in
Figure 23a,c. Furthermore, two cases of tongue movements and one case each of right hand and left leg movements were misclassified in subject G3.
5.7. ROC Curve for Multiclass Classification
The Receiver Operating Characteristic (ROC) curves are presented in
Figure 24 for subjects C1, G3, and L2. They show the true positive rate (TPR) against the false positive rate (FPR) for each MI task at different threshold values, illustrating the effectiveness of the classifier in predicting MI tasks correctly. The area under the curve (AUC) is high (above 0.90) for all three subjects and all six MI tasks, which indicates the robustness of SSTMNet in discriminating each MI task accurately.
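Multiclass ROC curves of this kind are typically computed one-vs-rest, binarizing the labels and tracing one curve per task. A minimal sketch with scikit-learn, using synthetic scores rather than the model's actual outputs:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(1)
n_classes = 6
y_true = rng.integers(0, n_classes, size=300)

# Synthetic class scores biased toward the true class.
scores = rng.normal(size=(300, n_classes))
scores[np.arange(300), y_true] += 2.5

# One-vs-rest: binarize labels, then one ROC curve per MI task.
y_bin = label_binarize(y_true, classes=range(n_classes))
for c in range(n_classes):
    fpr, tpr, _ = roc_curve(y_bin[:, c], scores[:, c])
    print(f"task {c}: AUC = {auc(fpr, tpr):.3f}")
```

Plotting each `(fpr, tpr)` pair reproduces the per-task curves shown in Figure 24.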
5.8. The Most Effective Channels in Each MI Task Based on SHAP Values
A commonly used interpretation technique, namely Shapley Additive Explanations (SHAP) [
47], has been employed to measure the contribution of EEG channels to model performance. In order to interpret the model's predictions and understand which channels help the model reach the correct decision, the DeepLIFT SHAP technique [
48] is used.
Three subjects, C1, G3, and L2, are chosen to interpret the model predictions.
Figure 25,
Figure 26 and
Figure 27 display the channels that have a positive impact on predicting each MI task, with the positive average SHAP values on the
x-axis and the corresponding channels on the
y-axis.
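The per-channel importance described above can be obtained by keeping only positive SHAP attributions and averaging them over trials and time for each channel. A minimal NumPy sketch of that aggregation step, assuming `shap_vals` has shape (trials, channels, time) for one MI task; the explainer call itself (DeepLIFT SHAP via the `shap` library) is omitted, and the channel names and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
channels = ["Fp1", "Fp2", "F3", "F4", "C3", "C4", "T3", "T4"]

# Stand-in SHAP attributions for one MI task: (trials, channels, time).
shap_vals = rng.normal(size=(40, len(channels), 200))
shap_vals[:, 4, :] += 0.5  # make C3 clearly important in this toy example

# Keep only positive contributions, then average over trials and time.
positive = np.clip(shap_vals, 0.0, None)
channel_importance = positive.mean(axis=(0, 2))

ranked = sorted(zip(channels, channel_importance), key=lambda x: -x[1])
for name, val in ranked[:3]:
    print(f"{name}: {val:.3f}")
```

Plotting `channel_importance` as a horizontal bar chart yields figures of the form shown for the three subjects.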
It is evident that the performance of the model is strongly influenced by channels over the frontal and temporal lobes in all three subjects. Additionally, different MI tasks are affected by different groups of channels. The most common channels that obtained a positive SHAP value across the three subjects in different MI tasks are as follows:
6. Discussion
The SSTMNet model achieved an accuracy of 83.76%, and 68.70%, 70.61%, and 69% in F1 score, precision, and recall, respectively, by computing the average of the best fold. Furthermore, almost 50% of the subjects achieved an accuracy above 90%, and 10 subjects achieved above 80% on the remaining performance metrics. The SSTMNet model was also able to achieve 100% and 99% accuracy in several cases, such as C (session 1), J (session 1), L (session 2), and M (session 1). By comparing the average results obtained with the state-of-the-art models presented in
Table 6, the SSTMNet model produced a higher accuracy in more than half of the subjects.
These reported results demonstrate the significance of performing spectral analysis and magnifying brain waves relevant to MI tasks. In addition, analyzing the long-term temporal dependencies among brain waves and inspecting multiscale and high-level features assists in adequately detecting motor imagery tasks. The proposed data augmentation was also able to expand the size of the dataset while remaining label-preserving. This is reflected in the performance of the SSTMNet model, which mitigates overfitting and produces a higher accuracy compared to training without data augmentation.
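The paper's specific augmentation scheme is not detailed in this section; as a generic illustration of a label-preserving EEG augmentation, the sketch below adds low-amplitude Gaussian noise to each trial, which enlarges the training set without altering the class labels (an illustrative stand-in, not the authors' method; the function name and parameters are hypothetical):

```python
import numpy as np

def augment_trials(trials, labels, n_copies=2, noise_std=0.05, seed=0):
    """Return the original trials plus noisy copies; labels are duplicated
    unchanged, so the augmentation is label-preserving by construction."""
    rng = np.random.default_rng(seed)
    aug_x, aug_y = [trials], [labels]
    for _ in range(n_copies):
        noise = rng.normal(scale=noise_std, size=trials.shape)
        aug_x.append(trials + noise)
        aug_y.append(labels)
    return np.concatenate(aug_x), np.concatenate(aug_y)

# Toy trials: (n_trials, channels, time)
x = np.zeros((10, 8, 100))
y = np.arange(10) % 5
x_aug, y_aug = augment_trials(x, y)
print(x_aug.shape, y_aug.shape)  # (30, 8, 100) (30,)
```

Keeping the noise amplitude small relative to the signal is what preserves the task-relevant structure of each trial.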
The SSTMNet model achieved a good accuracy in terms of average and subject-wise results without using intensive preprocessing or trial rejection techniques, as presented in studies [
13,
14,
29]. This indicates the robustness of the SSTMNet model to noise and artifacts. Most of the related works utilized deep learning models such as EEGNet, DeepNet, and ShallowNet. However, in the SSTMNet model, the effectiveness of analyzing the frequency bands was explored by decomposing and reshaping signals so that they could be meaningfully recalibrated using the attention mechanism. Afterward, the impact of integrating the temporal dependency block and the sequential and parallel CNN layers in detecting salient features related to the MI tasks was also studied. The t-SNE plots and confusion matrices presented in
Section 5.5 and
Section 5.6 showed the ability of SSTMNet to target meaningful features. They also helped, to some extent, in identifying the reasons for misclassified cases, such as the similarity between extracted MI features or label noise.
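The spectral decomposition step mentioned above can be sketched with standard band-pass filtering into the classical EEG bands. A minimal SciPy example on a single channel; the sampling rate, band edges, and filter order are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 200.0  # sampling rate (Hz) -- illustrative
bands = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 50)}

def decompose(signal, fs, bands):
    """Band-pass filter one EEG channel into the classical frequency bands."""
    out = {}
    for name, (lo, hi) in bands.items():
        sos = butter(4, [lo / (fs / 2), hi / (fs / 2)],
                     btype="band", output="sos")
        out[name] = sosfiltfilt(sos, signal)  # zero-phase filtering
    return out

# Test signal: a 10 Hz oscillation (alpha range) plus noise.
t = np.arange(0, 2, 1 / fs)
sig = np.sin(2 * np.pi * 10 * t) + 0.3 * np.random.default_rng(3).normal(size=t.size)
sub = decompose(sig, fs, bands)
powers = {k: np.mean(v ** 2) for k, v in sub.items()}
print(max(powers, key=powers.get))  # alpha
```

Each band-limited sub-signal can then be reshaped and fed to the attention block for recalibration, as the discussion describes.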
Although the SSTMNet achieved a high accuracy in many subjects, it still performed poorly in some cases, such as B (sessions 2, 3), E (session 3), F (session 3), H and I (all sessions), and K (session 2). These cases require a deeper analysis of their EEG signals to determine the reasons behind the poor results and to establish whether label noise is present. Furthermore, the SSTMNet model uses a deep block architectural design that results in a complex model, which requires various techniques to handle its complexity. In future work, channel reduction and Low-Rank Adaptation (LoRA) methods will be used to reduce the model's complexity.
To sum up, the SSTMNet model and the proposed data augmentation achieved comparable average accuracy results, and a higher accuracy in almost half of the subjects, compared with state-of-the-art methods, while using only local channel standardization and spectral decomposition as preprocessing.