Article

Spatio-Temporal Deep Learning with Adaptive Attention for EEG and sEMG Decoding in Human–Machine Interaction

Tianhao Fu, Zhiyong Zhou and Wenyu Yuan
1 Mechanical College, Shanghai DianJi University, Shanghai 201308, China
2 School of Art and Design, Shanghai DianJi University, Shanghai 201308, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(13), 2670; https://doi.org/10.3390/electronics14132670
Submission received: 28 May 2025 / Revised: 25 June 2025 / Accepted: 28 June 2025 / Published: 1 July 2025

Abstract

Electroencephalography (EEG) and surface electromyography (sEMG) signals are widely used in human–machine interaction (HMI) systems due to their non-invasive acquisition and real-time responsiveness, particularly in neurorehabilitation and prosthetic control. However, existing deep learning approaches often struggle to capture both fine-grained local patterns and long-range spatio-temporal dependencies within these signals, which limits classification performance. To address these challenges, we propose a lightweight deep learning framework that integrates adaptive spatial attention with multi-scale temporal feature extraction for end-to-end EEG and sEMG signal decoding. The architecture includes two core components: (1) an adaptive attention mechanism that dynamically reweights multi-channel time-series features based on spatial relevance, and (2) a multi-scale convolutional module that captures diverse temporal patterns through parallel convolutional filters. The proposed method achieves classification accuracies of 79.47% on the BCI-IV 2a EEG dataset (9 subjects, 22 channels) for motor intent decoding and 85.87% on the NinaPro DB2 sEMG dataset (40 subjects, 12 channels) for gesture recognition. Ablation studies confirm the effectiveness of each module, while comparative evaluations demonstrate that the proposed framework outperforms existing state-of-the-art methods across all tested scenarios. Together, these results demonstrate that our model not only achieves strong performance but also maintains a lightweight and resource-efficient design for EEG and sEMG decoding.

1. Introduction

A human–machine interface (HMI) is a framework designed to mitigate patient disability by facilitating connectivity and control between users and external rehabilitation devices [1]. Two widely used HMI paradigms are brain–computer interfaces (BCIs) and electromyography-based HMIs (EMG-HMIs) [2,3]. In both paradigms, non-invasive BCI and surface EMG-HMI (sEMG-HMI) systems have garnered significant attention due to their convenience, low risk, and relative simplicity compared to invasive methods in medical applications [4,5,6]. Non-invasive BCI techniques control robotic devices to facilitate motor recovery in patients with stroke or other neural injuries by measuring changes in cortical neuronal activity during motor imagery (MI) tasks [7]. The sEMG-HMI system, in turn, employs gesture recognition based on surface EMG signals to assist individuals with mobility impairments in controlling prosthetic devices by distinguishing the signal patterns produced by muscle activations across different gestures [8]. However, the susceptibility of EEG and sEMG signals to various sources of interference poses significant challenges for accurate signal recognition. Contaminants such as motion artifacts, ambient noise, and electromagnetic interference adversely affect the amplitude, temporal, and spectral features commonly used in assessment and control applications [9,10]. Additionally, physiological variability among individuals and constraints on computational resources further complicate the development of robust HMI systems [11,12].
Recent advances in deep learning have prompted researchers to explore diverse neural network architectures to enhance the classification accuracy of EEG and sEMG signals. Convolutional neural networks (CNNs) [13] excel at extracting spatially localized features, but their receptive fields are inherently limited, making it difficult to model long-range temporal dependencies that are essential for sequential biosignals. Recurrent neural networks (RNNs) [14], long short-term memory networks (LSTMs) [15], and bidirectional LSTMs (Bi-LSTMs) [16] are designed for sequence modeling, yet they often suffer from vanishing gradients, limited parallelization, and difficulty in learning long-term dependencies over noisy and variable-length input sequences. Transformer models, with their superior sequence-modelling capabilities, have also attracted attention in biosignal processing. However, conventional Transformer architectures [17] exhibit high computational complexity and typically demand extensive pre-training data before being adapted to recognition tasks. Given the prevalent data scarcity in the EEG and sEMG HMI paradigms, the direct application of Transformers in this domain remains constrained.
To address these challenges, this paper proposes SCGTNet, a lightweight spatio-temporal fusion Transformer model that integrates an adaptive spatial attention mechanism with a multi-scale temporal feature extraction strategy to enhance both classification accuracy and computational efficiency for EEG and sEMG signals. The main contributions of this study are as follows:
  • We introduce a novel CNN–Transformer architecture, SCGTNet, for EEG and sEMG classification in human–computer interaction. By capturing short-term local features and long-term dependencies, SCGTNet improves overall classification accuracy.
  • We design an adaptive spatial attention (ASA) module that dynamically re-weights channel features to enhance signal discriminability, focusing the model on salient channels while suppressing redundant information.
  • We propose a multi-scale temporal feature extraction module (MSC), consisting of parallel one-dimensional convolutions combined with gated recurrent units (GRUs), to improve the model's ability to learn complex temporal patterns.
We evaluate SCGTNet on the BCI IV 2a (EEG) and NinaPro DB2 (sEMG) datasets. Experimental results demonstrate that SCGTNet outperforms state-of-the-art methods in both classification accuracy and computational efficiency.
The remainder of this paper is organized as follows. Section 2 reviews related work on deep learning approaches for EEG and sEMG signal classification. Section 3 details the proposed adaptive spatial attention structure, multi-scale convolution module, and overall SCGTNet design. Section 4 describes the datasets, preprocessing procedures, and experimental setup, followed by ablation studies and comparative analyses. Finally, Section 5 and Section 6 discuss the results, conclude the paper, and outline directions for future work.

2. Related Works

In recent years, deep learning has facilitated significant advances in recognition applications within electroencephalography (EEG) [18] and surface electromyography (sEMG) [3]. Dose et al. [19] proposed a method for an EEG-based motor imagery (MI) brain–computer interface (BCI) system, employing a convolutional neural network (CNN) layer for generalized feature learning and dimensionality reduction, followed by a traditional fully connected (FC) layer for classification. Hajian et al. [20] introduced a CNN for gesture recognition that extracts spatial features by stacking multiple convolutional layers, using EMG spectrograms as input. However, the reliance on a fixed convolutional kernel hinders the method’s ability to dynamically adapt to rapid changes in neural signals across different motion states.
To overcome this limitation, Li et al. [21] proposed a neural network feature fusion algorithm for motor imagery brain–computer interfaces (MI-BCIs), in which convolutional neural networks (CNNs) and long short-term memory (LSTM) networks operate in parallel. In their design, the CNN branch employs convolutional layers followed by a flattening layer to extract spatial features, while the LSTM branch captures temporal dependencies. All extracted features are then merged within a fully connected layer to enhance classification accuracy. Bao et al. [22] similarly introduced a CNN–LSTM architecture; a CNN first derives deep features from sEMG spectrograms, and an LSTM-based sequential regression module subsequently estimates muscle kinematics. However, both approaches still incur information loss during feature fusion, and in practical scenarios, such as prolonged, multimodal rehabilitation training and exercise monitoring, interindividual variability and diversity of movement patterns place even greater demands on model robustness.
To address these challenges, Lee et al. [23] employed the short-time Fourier transform to efficiently extract time–frequency representations of EEG signals and incorporated a self-attention mechanism to selectively emphasize salient features. Wan et al. [24] developed EEGformer, a hierarchical dilated convolutional architecture that captures exponentially expanding receptive fields (dilation rates of 1, 3, and 9), exploiting the short-time (<200 ms) transient components of event-related potentials alongside their long-term correlations with rhythmic oscillations. However, both methods still rely on static parameterization for spatial channel weight assignment, limiting their ability to adaptively respond to real-time inputs.
For signals exhibiting time-varying characteristics, multi-scale convolution has become a focal point of recent research. Zhang et al. [25] introduced a multi-scale Inception module within the EEG-Inception model—an important non-invasive approach to motor imagery (MI) classification in brain–computer interface (BCI) studies—that employs convolutional kernels of varying sizes to extract local details and global patterns in parallel across time–frequency features. Ding et al. [26] subsequently proposed a parallel multi-scale convolutional architecture to enhance the classification accuracy of sEMG-based limb motion recognition by incorporating filters of different scales while accounting for muscle independence. However, these networks still depend on fixed kernel sizes in their underlying CNN architectures, which limits their capacity to dynamically adjust feature extraction weights.
Furthermore, the Transformer model has revolutionized the landscape of natural language processing (NLP) [27]. Its ability to process all data points in parallel significantly reduces training time. Transformers are highly scalable, and large-scale models can capture complex patterns in massive datasets, achieving state-of-the-art performance across a variety of tasks [28]. However, prevalent data scarcity in EEG- and sEMG-based human–machine interface (HMI) paradigms constrains the direct and independent application of Transformers in this domain [29].
Compared with existing approaches, the adaptive spatial attention–parallel convolution architecture proposed in this work dynamically integrates channel attention weights with multi-scale temporal features, thereby enhancing the fine-grained classification accuracy of EEG and sEMG signals while maintaining a lightweight design. This novel framework offers a promising direction for future EEG- and sEMG-based HMI research.

3. Materials and Methods

In this section, we present the main architecture of SCGTNet, followed by a detailed description of the datasets and preprocessing procedures.
SCGTNet is a lightweight neural network model inspired by the Temporal Fusion Transformer (TFT) architecture [30]. It integrates several deep learning components, including an adaptive spatial attention module, a multi-scale convolution module, GRU, Transformer multi-head attention, and a feed-forward network. By design, these modules enable the effective extraction of both local features and long-range temporal dependencies from time-series signals, leading to improved classification accuracy for electroencephalographic (EEG) and surface electromyographic (sEMG) signals. The overall architecture of SCGTNet is illustrated in Figure 1. The core structure comprises the following main modules:

3.1. Adaptive Spatial Attention Module

The variability in the contribution of different sensor channels to the classification task is a key aspect of this study. To address this, the adaptive spatial attention module employs a channel attention mechanism to dynamically adjust the feature importance of multi-channel time-series data. Initially, Global Average Pooling (GAP) and Global Maximum Pooling (GMP) are performed along the time dimension to generate two feature vectors. These vectors are then concatenated and fed into a fully connected layer, where the channel attention weights are produced via a Sigmoid activation function. The resulting weights are subsequently applied to the original input through broadcast multiplication, thereby enhancing or suppressing specific feature channels.
$\mu_c$ denotes the global average pooling result, which reflects the overall response strength of channel $c$. The input tensor is $X \in \mathbb{R}^{B \times T \times C}$, where $B$ is the batch size, $T$ is the number of time steps, and $C$ is the number of channels:

$$\mu_c = \frac{1}{T} \sum_{t=1}^{T} X_{b,t,c}, \quad b \in [1, B],\; c \in [1, C]$$

$\nu_c$ denotes the global maximum pooling result, which captures salient local features:

$$\nu_c = \max_{1 \le t \le T} X_{b,t,c}, \quad \forall b, c$$

$Z$ is the vector obtained by concatenating the global average pooling and global maximum pooling results along the channel dimension:

$$Z = [\mu; \nu] \in \mathbb{R}^{B \times 2C}$$

The attention weights $W$ are obtained from the fully connected layers, where $W_1 \in \mathbb{R}^{2C \times r}$, $W_2 \in \mathbb{R}^{r \times C}$, $\sigma$ is the Sigmoid function, $\delta$ is the ReLU activation, and the compression ratio $r$ is implicitly set to 1 (no explicit dimensionality reduction is applied):

$$W = \sigma\big(W_2 \cdot \delta(W_1 \cdot Z + b_1) + b_2\big)$$

The output of this module is the channel-weighted feature map:

$$\hat{X}_{b,t,c} = X_{b,t,c} \cdot W_{b,c}$$
Figure 2 presents the design of the Adaptive Spatial Attention Module.
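To make the mechanism concrete, a minimal tf.keras sketch of such a channel-attention layer is shown below. The class name, the hidden width of the first dense layer (2C, i.e., compression ratio r = 1), and the layer choices are assumptions based on the description above, not the authors' released code.

```python
import tensorflow as tf
from tensorflow.keras import layers

class AdaptiveSpatialAttention(layers.Layer):
    """Channel attention over (B, T, C) time series, as described in Section 3.1."""
    def __init__(self, num_channels, **kwargs):
        super().__init__(**kwargs)
        # Two fully connected layers producing one weight per channel (W in the text).
        self.fc1 = layers.Dense(2 * num_channels, activation="relu")
        self.fc2 = layers.Dense(num_channels, activation="sigmoid")

    def call(self, x):                      # x: (B, T, C)
        mu = tf.reduce_mean(x, axis=1)      # global average pooling over time -> (B, C)
        nu = tf.reduce_max(x, axis=1)       # global max pooling over time     -> (B, C)
        z = tf.concat([mu, nu], axis=-1)    # concatenated descriptor Z        -> (B, 2C)
        w = self.fc2(self.fc1(z))           # channel weights in (0, 1)        -> (B, C)
        return x * w[:, tf.newaxis, :]      # broadcast multiply over the time axis
```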

3.2. Multi-Scale Convolution Module

To capture features at various time scales, the multi-scale convolution module utilizes a parallel convolution structure that deploys three one-dimensional convolution kernels of different widths—specifically, 3, 5, and 7. This design enables the extraction of short-term (e.g., transient events), medium-term (e.g., periodic patterns), and long-term (e.g., trend changes) features along the time axis. Each branch yields the same number of feature maps, which are then concatenated to form a comprehensive multi-scale feature representation.
$Y^{(k)}$ denotes the output of each branch, where the input tensor is $X \in \mathbb{R}^{B \times T \times C \times 1}$ (after reshaping) and $W^{(k)} \in \mathbb{R}^{k \times 1 \times 1 \times F}$ are the learnable parameters. The set of convolution kernel widths is $K = \{3, 5, 7\}$:

$$Y^{(k)} = \mathrm{Conv2D}\big(X; W^{(k)}, b^{(k)}\big) \in \mathbb{R}^{B \times T \times C \times F}, \quad k \in K$$

$Y_{\mathrm{concat}}$ is the concatenated output:

$$Y_{\mathrm{concat}} = \big[Y^{(3)}; Y^{(5)}; Y^{(7)}\big] \in \mathbb{R}^{B \times T \times C \times 3F}$$
Figure 3 illustrates the architecture of the Multi-Scale Convolution Module.
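As an illustration, the parallel branches can be sketched in tf.keras as follows; the filter count, padding mode, and ReLU activation are assumptions rather than the authors' exact settings, and the reshape mirrors the (B, T, C, 1) input described above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def multi_scale_conv(x, filters=16):
    """Parallel Conv2D branches with kernel widths 3, 5, and 7 along the time axis."""
    x = tf.expand_dims(x, axis=-1)                      # (B, T, C) -> (B, T, C, 1)
    branches = [
        layers.Conv2D(filters, kernel_size=(k, 1), padding="same", activation="relu")(x)
        for k in (3, 5, 7)                              # short-, medium-, long-term kernels
    ]
    return layers.Concatenate(axis=-1)(branches)        # (B, T, C, 3F)
```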

3.3. GRU Encoder

GRUs (Gated Recurrent Units) are a recurrent neural network structure commonly used in sequence modelling to effectively capture temporal dependencies in sequences. Their gating mechanism consists of a reset gate and an update gate, enabling GRUs to manage long-term dependencies in data by selectively controlling the flow of information [31]. The reset gate governs the extent to which past information is discarded, while the update gate regulates the amount of new information that is integrated and carried forward, thereby ensuring effective temporal pattern learning [32].
The GRU update process is shown below, where $\{h_t\} \in \mathbb{R}^{T \times d}$ is the input from the previous module, $z_t$ (the update gate) controls the proportion of historical state that is retained, and $r_t$ (the reset gate) determines how strongly the previous state influences the candidate state:

$$z_t = \sigma\big(W_z \cdot [h_{t-1}, x_t]\big)$$

$$r_t = \sigma\big(W_r \cdot [h_{t-1}, x_t]\big)$$

$$\tilde{h}_t = \tanh\big(W \cdot [r_t \odot h_{t-1}, x_t]\big)$$

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

The GRU output $H \in \mathbb{R}^{B \times T \times d}$ is then layer-normalized, with $\mu$ and $\sigma$ computed along the feature dimension and $\gamma$, $\beta$ as learnable parameters:

$$H_{\mathrm{norm}} = \gamma \odot \frac{H - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$
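A minimal sketch of this encoder, assuming tf.keras layers and an arbitrary hidden size d, is given below.

```python
from tensorflow.keras import layers

def gru_encoder(x, units=64):
    """GRU over the time axis followed by layer normalization (Section 3.3).
    The hidden size `units` (d in the text) is an assumed value."""
    h = layers.GRU(units, return_sequences=True)(x)    # (B, T, d)
    return layers.LayerNormalization(epsilon=1e-6)(h)  # normalize along the feature axis
```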

3.4. Multi-Head Attention

The Multi-Head Attention (MHA) mechanism is a powerful component in contemporary deep learning architectures, particularly when handling long sequential data. Its primary objective is to enable the model to adaptively focus on the significance of different parts of the input sequence by computing the interrelationships among its elements. In contrast to traditional single-head self-attention mechanisms, the multi-head approach computes several attention heads in parallel, thereby enhancing the model’s representational capacity and enabling it to disentangle complex dependencies [33].
This module is designed to enhance the model’s expressive capacity by capturing long-range dependencies between different segments of the input sequence. In particular, when processing time-series data, various positions within the sequence may contain critical contextual information that can be efficiently extracted through the application of multi-head attention.
The model employs a single Multi-Head Attention layer with 4 heads, and the key and value dimensions are set equal, $d_k = d_v$. Given the query, key, and value matrices $Q$, $K$, $V$, a single attention head is computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

The outputs of all heads are then concatenated and linearly transformed:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}$$
We apply a residual connection from the preceding GRU output to the attention output, followed by Layer Normalization and Dropout. In implementation, we use TensorFlow’s built-in layers. All forward and backward computations—including the softmax nonlinearity and LayerNorm normalization—are handled by TensorFlow’s automatic differentiation. No manual gradient derivation or linear-gradient approximation is performed. The residual connection conceptually facilitates gradient flow, but the detailed Jacobian contributions from attention and normalization are computed by the framework.
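Consistent with the description above, the block can be sketched with TensorFlow's built-in layers as follows; the key dimension and dropout rate are illustrative assumptions.

```python
from tensorflow.keras import layers

def attention_block(h, num_heads=4, key_dim=16, dropout_rate=0.1):
    """Single multi-head self-attention layer with residual connection,
    Dropout, and LayerNorm (Section 3.4)."""
    attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)(h, h)
    attn = layers.Dropout(dropout_rate)(attn)
    return layers.LayerNormalization(epsilon=1e-6)(h + attn)  # residual + normalization
```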

3.5. Feed-Forward Network

Feedforward neural networks [34] (FFNs) not only improve feature abstraction but also provide the model with increased expressive degrees of freedom to learn intricate patterns through nonlinear transformations. The FFN layer typically comprises two fully connected layers utilizing swish activation functions for mapping and transforming the data. By employing residual connectivity and layer normalization, the FFN effectively mitigates training instability while augmenting the network’s overall representation and robustness.
Initially, a fully connected layer maps the input features to a higher-dimensional space, thus enhancing the nonlinear representational capability of the network. By expanding the network’s width, more complex features can be learned. To prevent overfitting, Dropout is applied after this layer, randomly discarding a portion of neurons to improve the model’s generalization ability. Subsequently, the features are processed by a second fully connected layer that reduces them back to their original dimensions, compressing the high-dimensional representations into a low-dimensional format suitable for classification. Finally, the output of the feedforward network is combined with the output of the multi-head attention module through residual connections and normalized. This operation facilitates smoother gradient propagation during training, thereby enhancing both the stability and efficiency of the training process. Here $W_1 \in \mathbb{R}^{d \times 4d}$ and $W_2 \in \mathbb{R}^{4d \times d}$, so the block first expands and then compresses the representation:

$$\mathrm{FFN}(x) = W_2 \cdot \mathrm{Swish}(W_1 \cdot x + b_1) + b_2$$
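A corresponding sketch of this block, assuming a model dimension d that matches the attention output and an illustrative dropout rate, is shown below.

```python
from tensorflow.keras import layers

def feed_forward(x, d_model=64, dropout_rate=0.1):
    """Position-wise feed-forward block with Swish activation, residual
    connection, and LayerNorm (Section 3.5); the expansion factor 4 follows
    W1 in R^(d x 4d)."""
    y = layers.Dense(4 * d_model, activation="swish")(x)    # expand to 4d
    y = layers.Dropout(dropout_rate)(y)
    y = layers.Dense(d_model)(y)                            # project back to d
    return layers.LayerNormalization(epsilon=1e-6)(x + y)   # residual + normalization
```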

3.6. Data and Preprocessing

3.6.1. BCI IV 2a Dataset

The BCI IV 2a database [35] is one of the most well-known and widely used publicly available datasets in the field of Brain–Computer Interface (BCI) research. It was released by a team of researchers from the Graz University of Technology in 2008 with the aim of providing a standardized benchmark for electroencephalographic (EEG) signal classification and facilitating the development of Motor Imagery (MI)-related algorithms. It contains EEG data from nine subjects, each of whom performed two experiments (sessions), each consisting of six runs. In each run, subjects were required to complete four categories of motor imagery tasks, namely left hand, right hand, feet, and tongue motor imagery. Each category was repeated 12 times per run, so each session contained a total of 288 trials (4 categories × 6 runs × 12 repetitions). Each trial proceeded through three phases: cue, task, and rest. In the first phase, a fixation cross was displayed on the screen for 2 s. In the second phase, an arrow was displayed on the screen instructing the subject to perform a specific type of motor imagery for 4 s. In the final rest phase, the screen was blank for a random period of 1.5 to 2.5 s. Data were acquired using 22 EEG electrodes and 3 EOG (electrooculogram) electrodes at a sampling frequency of 250 Hz. The EEG electrodes were arranged according to the international 10–20 system, covering the main motor areas of the brain.
EEG signal preprocessing aims to enhance the sensorimotor rhythm (SMR) and suppress artefacts. In this paper, only the 22 EEG electrodes were used and the data from the 3 EOG (electrooculogram) electrodes were excluded; preprocessing was divided into four steps. First, the sampling rate was reduced to 125 Hz using an FIR anti-aliasing filter [36] (cut-off frequency 62.5 Hz, order 128) to reduce the computational burden, where $h(k)$ denotes the filter coefficients and $N = 129$ is the filter length:

$$x_{\mathrm{down}}(n) = \sum_{k=0}^{N-1} h(k) \cdot x(2n - k)$$

The global potential offset was subsequently eliminated with the common average reference [37]:

$$x_{\mathrm{ref}}^{\,i} = x^{i} - \frac{1}{22}\sum_{j=1}^{22} x^{j}$$

The second stage used Common Spatial Pattern (CSP) filtering [38] to extract spatially discriminative features by maximizing the ratio of variance between the two task classes, where $X_i \in \mathbb{R}^{22 \times 563}$ is the data of the $i$-th trial and $N_c$ is the number of trials with label $c$. The first 3 pairs of spatial filters (6 in total) were selected, retaining 90% of the discriminative information:

$$\Sigma_c = \frac{1}{N_c} \sum_{i=1}^{N_c} X_i X_i^{T}$$

The Morlet wavelet transform [39,40] is then applied to the CSP-filtered signal, where $f_b = 2$ is the bandwidth parameter and $f_c$ is the center frequency. The wavelet is defined as:

$$\psi_{f,t}(\tau) = \frac{1}{\sqrt{\pi f_b}}\, e^{2 i \pi f_c \tau}\, e^{-\tau^2 / f_b}$$

Finally, to construct the model input, sliding-window segmentation is applied with a window length of 200 samples and an overlap of 50%, balancing temporal resolution against computational efficiency. The resulting output dimension is

$$N_{\mathrm{frames}} \times 200 \times 22$$

where $N_{\mathrm{frames}}$ is the total number of frames and 22 is the number of channels. Each channel is then z-score normalized [41], with $\epsilon$ preventing division by zero:

$$\mu_c = \mathbb{E}[x_c], \quad \sigma_c = \sqrt{\mathbb{E}\big[(x_c - \mu_c)^2\big]}, \quad \epsilon = 10^{-6}$$

$$\hat{x}_c = \frac{x_c - \mu_c}{\sigma_c + \epsilon}$$
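A minimal NumPy sketch of the final two steps (sliding-window segmentation and per-channel z-score normalization) is given below; whether the statistics are computed per window or over the whole recording is not fully specified above, so computing them per window is an assumption.

```python
import numpy as np

def segment_and_normalize(x, win=200, overlap=0.5, eps=1e-6):
    """Slide a window of `win` samples with 50% overlap over x of shape
    (n_samples, n_channels), then z-score each channel within each window."""
    step = int(win * (1 - overlap))
    frames = np.stack([x[s:s + win]
                       for s in range(0, len(x) - win + 1, step)])  # (N_frames, win, C)
    mu = frames.mean(axis=1, keepdims=True)
    sigma = frames.std(axis=1, keepdims=True)
    return (frames - mu) / (sigma + eps)
```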

3.6.2. NinaPro DB2 Dataset

The NinaPro DB2 [42] database is a standard dataset widely used in gesture recognition and electromyographic (EMG) signal research. The dataset was first released by Atzori et al. in 2012 and has been extended and improved in subsequent studies with the aim of facilitating the development of gesture recognition algorithms based on surface EMG signals. It contains EMG data from 40 healthy subjects, 29 males and 11 females, aged between 29 and 34 years. Each subject performed 50 different gestural movements, including finger, wrist, and arm movements. These gestures cover common hand movements in daily life, such as clenching the fist, extending the palm, and pinching and grasping. During data acquisition, 12 surface EMG electrodes were placed on the forearms of the subjects to capture EMG signals from different muscle groups. In addition, the dataset contains synchronously acquired accelerometer data, which provides spatial information about the gesture movements. The EMG signals were sampled at 2 kHz, and each gesture was repeated six times, each repetition lasting five seconds with a three-second rest period in between. This design gives the dataset a sample size sufficient to support the training and validation of deep learning models. The dataset also records basic information about the subjects, such as age, gender, and handedness, which facilitates studying the effect of individual differences on gesture recognition performance.
For preprocessing of this dataset, a fourth-order Butterworth band-pass filter [43] was used to eliminate low-frequency baseline drift (<20 Hz) and high-frequency noise (>450 Hz), with cut-off frequencies $f_c = [20, 450]$ Hz and filter order $n = 4$. The filtered signal retains the main energy of the muscle fiber action potentials (80% of the energy is concentrated at 50–150 Hz):

$$H(s) = \frac{1}{1 + \left(\frac{s}{2\pi f_c}\right)^{2n}}$$

Subsequently, to capture the transient features of muscle activity, sliding-window segmentation was applied with a window length of 200 samples (a 100 ms window, satisfying the condition that 95% of gesture activations last less than 150 ms) and an overlap of 50% (i.e., a step size of 100 samples), balancing temporal resolution against computational efficiency. The resulting output dimension is

$$N_{\mathrm{frames}} \times 200 \times 12$$

where $N_{\mathrm{frames}}$ is the total number of frames and 12 is the number of channels. Finally, to eliminate inter-individual differences in signal amplitude, z-score normalization is performed for each channel [41], with $\epsilon$ preventing division by zero:

$$\mu_c = \mathbb{E}[x_c], \quad \sigma_c = \sqrt{\mathbb{E}\big[(x_c - \mu_c)^2\big]}, \quad \epsilon = 10^{-6}$$

$$\hat{x}_c = \frac{x_c - \mu_c}{\sigma_c + \epsilon}$$
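For reference, the band-pass filtering step can be sketched with SciPy as follows; the use of zero-phase filtering (filtfilt) is an assumption, since the text specifies only the filter type, order, and cut-off frequencies.

```python
from scipy.signal import butter, filtfilt

def bandpass_semg(x, fs=2000, low=20.0, high=450.0, order=4):
    """Fourth-order Butterworth band-pass (20-450 Hz) applied channel-wise to
    an sEMG array x of shape (n_samples, n_channels) sampled at fs Hz."""
    b, a = butter(order, [low, high], btype="bandpass", fs=fs)
    return filtfilt(b, a, x, axis=0)
```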

4. Experiments and Results

In this section, we first provide details of the experimental setup and evaluation procedures, and then present the results of the experiments conducted to validate the effectiveness of the proposed method.

4.1. Training and Assessment Settings

In the BCI IV 2a dataset, the data from all nine subjects were randomly divided into a training set (80%) and a validation set (20%). In the NinaPro DB2 dataset, experiments were conducted and the performance of the proposed model evaluated by partitioning each subject’s data into a training set (75%) and a validation set (25%). For both datasets, the model was trained using the commonly adopted cross-entropy loss function and the Adam optimizer with a learning rate of 0.001. The maximum number of epochs was set to 200 with a mini-batch size of 256, and an early stopping mechanism was employed to prevent overfitting once the validation performance ceased to improve significantly.
The cross-entropy loss used in this work is given by:

$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{ic}\,\log(\hat{y}_{ic})$$

where $N$ is the number of samples, $C$ is the number of classes, $y_{ic}$ is the ground-truth indicator (1 if class $c$ is the correct label for sample $i$, 0 otherwise), and $\hat{y}_{ic}$ denotes the predicted probability of class $c$ for sample $i$.
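As a concrete illustration of the training setup described above (Adam with a learning rate of 0.001, up to 200 epochs, batch size 256, and early stopping on the validation loss), one possible tf.keras configuration is sketched below; the function name, early-stopping patience, and data arguments are assumptions rather than the authors' exact script.

```python
import tensorflow as tf

def compile_and_train(model, x_train, y_train, x_val, y_val):
    """Train a Keras model with the settings described in Section 4.1."""
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="categorical_crossentropy",   # cross-entropy over one-hot labels
                  metrics=["accuracy"])
    early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                                  patience=20,      # assumed patience
                                                  restore_best_weights=True)
    return model.fit(x_train, y_train,
                     validation_data=(x_val, y_val),
                     epochs=200, batch_size=256,
                     callbacks=[early_stop])
```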
To assess the performance of the proposed model, the evaluation metrics used include recall, accuracy, precision, and F1 score. In these metrics, TP (True Positive) represents the number of positive examples correctly identified by the model, TN (True Negative) denotes the number of negative examples correctly classified as negative, FP (False Positive) indicates the number of negative examples incorrectly classified as positive, and FN (False Negative) refers to the number of positive examples misclassified as negative:
$$\mathrm{Recall} = \frac{TP}{TP + FN} \times 100\%$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100\%$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \times 100\%$$

$$F1 = \frac{2 \times \mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}}$$
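For completeness, a small helper that computes these four metrics from raw confusion-matrix counts might look as follows (illustrative only):

```python
def classification_metrics(tp, tn, fp, fn):
    """Recall, accuracy, precision, and F1 from true/false positive/negative counts."""
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * recall * precision / (recall + precision)
    return {"recall": recall, "accuracy": accuracy,
            "precision": precision, "f1": f1}
```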
For some prior methods, however, the original code or detailed implementation specifications were not publicly available, so faithful reproduction could not be guaranteed. For these methods we therefore report only the published Accuracy values and mark Recall, Precision, and F1 with “–” to indicate that they were not reported or could not be reproduced under identical settings.

4.2. Ablation Study

To evaluate the contribution of each architectural component in SCGTNet, ablation experiments were carried out using BCI IV-2a as the primary benchmark and NinaPro DB2 for secondary validation. On BCI IV-2a, we systematically omitted the Adaptive Spatial Attention (ASA) block and the Multi-Scale Convolution (MSC) block in isolation. In addition, the MSC block was replaced with a conventional 3 × 1 convolution to assess the advantage of the multi-scale design. The same protocol was subsequently applied to the NinaPro DB2 dataset to confirm the generalizability of these findings.
Table 1 and Figure 4 summarize the results on the BCI IV-2a dataset. Removing the ASA block led to a 6.14 percentage point drop in overall accuracy and a 4.77 percentage point reduction in F1-score, highlighting the module's effectiveness in modeling spatial correlations between channels. By focusing the network’s capacity on the most informative electrode pairs, ASA appears to enhance discriminability, especially in complex MI patterns. Excluding the MSC block produced a 5.35 percentage point decrease in accuracy and a 4.36 percentage point loss in F1, indicating that capturing temporal features at multiple resolutions is crucial for distinguishing overlapping event-related desynchronizations. When the MSC was replaced by a standard 3 × 1 convolution, accuracy fell by 3.43 percentage points and F1 by 4.19 points, suggesting that while basic temporal filtering does help, the multi-scale design more effectively accommodates varying gesture durations and frequencies. These findings confirm that both the ASA and MSC modules significantly enhance SCGTNet’s discriminative power, with ASA exerting the largest effect on this dataset.
Figure 5 presents the t-SNE embeddings of the learned feature vectors under each ablation setting on BCI IV-2a. In the full model (Figure 5a), clusters corresponding to LEFT, RIGHT, FEET, and TONGUE appear well separated, suggesting strong inter-class discrimination. Upon removing the ASA block (Figure 5b), the LEFT and RIGHT clusters show noticeable overlap, aligning with the decrease in F1 observed in Table 1. Similarly, the exclusion of the MSC block (Figure 5c) increases intra-class dispersion and reduces inter-class separation, which is also reflected in the lower F1. When the MSC is replaced by a standard 3 × 1 convolution (Figure 5d), an intermediate level of overlap appears, consistent with its moderate F1 performance. These t-SNE visualizations are intended for qualitative inspection and, together with the quantitative results, highlight the roles of ASA and MSC in achieving robust and discriminative feature representations.
On the NinaPro DB2 dataset (Table 2, as shown in Figure 6), omission of the ASA block resulted in a 2.65 percentage point decline in accuracy and a 2.54 point drop in F1-score. Excluding the MSC block caused a 4.21 point reduction in accuracy and a 4.16 point decrease in F1, and substituting MSC with a 3 × 1 convolution led to a 4.78 point fall in accuracy and a 4.82 point loss in F1. The comparatively smaller impact of removing ASA versus removing MSC suggests that spatial attention is somewhat less critical for sEMG-based hand gesture recognition, whereas multi-scale temporal patterns remain key. These insights not only validate our architectural choices but also guide future refinements, e.g., exploring dynamic attention mechanisms or adaptive scale selection.

4.3. Hyperparameter Sensitivity Analysis

To investigate whether the fixed bandwidth parameter $f_b = 2$ of the Morlet wavelet is optimal for extracting ERP features from the CSP-filtered EEG signal, a series of controlled experiments was conducted. Specifically, we varied $f_b$ over $\{1.5, 2.0, 2.5\}$ and classified the resulting data with the model under otherwise identical conditions.
As shown in Table 3, the highest classification accuracy was achieved at $f_b = 2$, confirming its suitability for our dataset and model configuration.
To evaluate the effect of the number of attention heads on model performance, we varied num_heads ∈ {1, 2, 4, 8} for the single Multi-Head Attention layer, keeping all other hyperparameters identical. The models were trained and evaluated under the same protocol on the BCI IV 2a and Ninapro DB2 datasets; the results are summarized in Table 4.
Table 4 shows that on BCI IV 2a, increasing the head count from 1 to 4 yields consistent improvements, with only a marginal further gain at 8 heads; on Ninapro DB2, performance peaks at 4 heads and decreases with 8 heads. Therefore, num_heads = 4 is selected for the final model based on these results.

4.4. Comparison Study

To benchmark SCGTNet against existing methods, comparative experiments were conducted on BCI IV 2a (primary) and NinaPro DB2 (secondary) datasets. Classical CNN [44] and LSTM [45] architectures, as well as more advanced EEG models (EEG-Inception [25], MIN2Net [46], EEG-TCNet [47], TFT [30]) and sEMG methods (Wei et al. [48], Ding et al. [26], Zhang et al. [49], Hu et al. [50], TFT [30]) were included.
Table 5 reports that SCGTNet achieved 79.47% accuracy, 79.33% recall, 76.96% precision, and 78.13% F1-score—surpassing the best-performing baseline, MIN2Net [46], by 3.60 pp in accuracy and 3.51 pp in F1-score. These gains reflect SCGTNet’s holistic design, which integrates attention-based channel relationships, multi-resolution temporal filtering, and end-to-end joint optimization—unlike prior methods that treat spatial filtering, temporal dynamics, or inter-channel dependencies in isolation. In contrast, inception-style CNNs separate spectral and temporal paths, and transformer-based approaches focus primarily on sequential dependencies; neither fully exploits the complementary information present across channels and scales. As a result, SCGTNet more effectively captures the complex spatiotemporal patterns characteristic of EEG motor imagery, leading to superior classification across all four classes.
Figure 7 presents the confusion matrix for SCGTNet on the BCI IV-2a dataset. It shows that all classes achieve high recognition rates, with the LEFT class reaching 81%, RIGHT 78%, FEET 75%, and TONGUE 79%. Misclassifications are predominantly between LEFT and RIGHT, indicating that residual overlap remains in lateralized patterns, but at a substantially lower rate than in baseline models. This uniform class-wise performance underscores SCGTNet’s balanced discriminative capability across motor imagery categories.
To assess the cross-subject generalization ability of SCGTNet, we further conducted leave-one-subject-out (LOSO) experiments on the BCI IV 2a dataset. Figure 8 shows the performance of SCGTNet and the baseline methods across nine subjects. SCGTNet achieved consistently higher accuracy and exhibited lower inter-subject variability compared to the other methods, indicating its robustness across subjects.
As shown in Table 6, SCGTNet obtained 85.87% accuracy, 87.42% recall, 83.19% precision, and 85.42% F1-score—exceeding Wei et al. [48] (83.70%) and Hu et al. [50] (84.84%) by 2.17 pp and 1.03 pp in accuracy. The superior performance arises from SCGTNet’s unified framework that simultaneously models inter-electrode/muscle-site relations and captures both short-term transients and longer activations within a single network. Whereas traditional CNNs apply fixed-scale filters and LSTMs process sequential features without explicit channel interactions, and recent transformer- or inception-based sEMG methods decouple these concerns, SCGTNet’s end-to-end architecture learns to leverage all these aspects jointly. This design yields more robust gesture discrimination across diverse muscle patterns and subjects, as evidenced by uniformly high per-class recognition rates (>80%) in the confusion matrix (Figure 9).
Figure 9 presents the full 49-class confusion matrix for SCGTNet on NinaPro DB2. Despite the large gesture vocabulary, SCGTNet maintains an average per-class recognition rate above 80%, with the majority of gestures exceeding 85% accuracy. Remaining misclassifications occur mainly among anatomically or functionally similar movements (e.g., individual finger flexions), indicating that residual overlap persists in fine-grained muscle activations. Overall, the balanced class-wise performance across 49 gestures confirms SCGTNet’s robustness in large-scale sEMG classification tasks.

5. Discussion

The ablation experiments highlight the distinct contributions of the adaptive spatial attention and multi-scale temporal convolution modules. Notably, the multi-scale temporal convolution module yields greater improvements on NinaPro DB2, while the adaptive spatial attention module is more effective on BCI IV 2a. These results suggest that different biosignal datasets benefit from different model components, but overall, both modules enhance feature extraction and boost classification performance.
Compared to traditional CNN-, LSTM-, and Transformer-based models, SCGTNet achieves consistently higher accuracy on both datasets. This underscores its strength in capturing spatio-temporal patterns in fine-grained EEG and sEMG signals. Due to the lack of publicly available source code for some prior methods, certain comparisons are limited to published accuracies; we acknowledge that such comparisons may be influenced by differences in experimental setups and reporting standards.
Additionally, Table 7 and Table 8 show that SCGTNet requires fewer parameters and computations than traditional Transformer-based models, indicating its potential for deployment on resource-limited hardware. Its lightweight design and competitive accuracy make it a promising candidate for practical HMI scenarios such as rehabilitation training, wearable devices, and assistive control.
Despite these promising results, further work is needed to evaluate the model’s generalizability across diverse biosignals, including ECG, eye tracking, and particularly EEG–EMG fusion, as well as its scalability to longer or higher-frequency inputs. Future research will focus on rigorous real-time inference benchmarking under controlled hardware conditions, as well as hardware-aware optimization techniques (e.g., pruning, quantization) to bridge the gap between model design and efficient low-latency implementation. Integrating transfer learning, reinforcement learning, or other advanced techniques may also help extend its application scope and enhance adaptability.

6. Conclusions

In summary, this paper proposes SCGTNet, a lightweight multiscale spatio-temporal fusion Transformer for EEG-based motor imagery classification and sEMG-based gesture recognition. Experimental results on BCI IV 2a and NinaPro DB2 show that SCGTNet achieves state-of-the-art performance (79.47% and 85.87%, respectively), outperforming traditional CNN-, LSTM-, and Transformer-based methods. Future work will explore further optimization to enable its practical deployment in real-time human–machine interface scenarios.

Author Contributions

Conceptualization, T.F.; methodology, T.F. and Z.Z.; software, T.F.; validation, Z.Z. and W.Y.; formal analysis, T.F.; investigation, Z.Z.; resources, T.F.; data curation, T.F.; writing—original draft preparation, T.F.; writing—review and editing, Z.Z. and W.Y.; visualization, T.F.; supervision, Z.Z. and W.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

BCI IV 2a dataset source: https://www.bbci.de/competition/iv/ (accessed on 8 October 2024). NinaPro DB2 dataset source: http://ninaweb.hevs.ch/ (accessed on 26 October 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Andreoni, G.; Parini, S.; Maggi, L.; Piccini, L.; Panfili, G.; Torricelli, A. Human machine interface for healthcare and rehabilitation. In Advanced Computational Intelligence Paradigms in Healthcare-2; Springer: Berlin/Heidelberg, Germany, 2007; pp. 131–150. [Google Scholar] [CrossRef]
  2. Ajiboye, A.B.; Willett, F.R.; Young, D.R.; Memberg, W.D.; Murphy, B.A.; Miller, J.P.; Walter, B.L.; Sweet, J.A.; Hoyen, H.A.; Keith, M.W.; et al. Restoration of reaching and grasping movements through brain-controlled muscle stimulation in a person with tetraplegia: A proof-of-concept demonstration. Lancet 2017, 389, 1821–1830. [Google Scholar] [CrossRef] [PubMed]
  3. Xiong, D.; Zhang, D.; Zhao, X.; Zhao, Y. Deep learning for EMG-based human-machine interaction: A review. IEEE/CAA J. Autom. Sin. 2021, 8, 512–533. [Google Scholar] [CrossRef]
  4. Cincotti, F.; Mattia, D.; Aloise, F.; Bufalari, S.; Schalk, G.; Oriolo, G.; Cherubini, A.; Marciani, M.G.; Babiloni, F. Non-invasive brain–computer interface system: Towards its application as assistive technology. Brain Res. Bull. 2008, 75, 796–803. [Google Scholar] [CrossRef] [PubMed]
  5. Rissanen, S.M.; Koivu, M.; Hartikainen, P.; Pekkonen, E. Ambulatory surface electromyography with accelerometry for evaluating daily motor fluctuations in Parkinson’s disease. Clin. Neurophysiol. 2021, 132, 469–479. [Google Scholar] [CrossRef]
  6. Morón, J.; DiProva, T.; Cochrane, J.R.; Ahn, I.S.; Lu, Y. EMG-based hand gesture control system for robotics. In Proceedings of the 2018 IEEE 61st International Midwest Symposium on Circuits and Systems (MWSCAS), Windsor, ON, Canada, 5–8 August 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 664–667. [Google Scholar] [CrossRef]
  7. Edelman, B.J.; Zhang, S.; Schalk, G.; Brunner, P.; Müller-Putz, G.; Guan, C.; He, B. Non-invasive brain-computer interfaces: State of the art and trends. IEEE Rev. Biomed. Eng. 2024, 18, 26–49. [Google Scholar] [CrossRef]
  8. Li, W.; Ma, Y.; Shao, K.; Yi, Z.; Cao, W.; Yin, M.; Xu, T.; Wu, X. The human–machine interface design based on sEMG and motor imagery EEG for lower limb exoskeleton assistance system. IEEE Trans. Instrum. Meas. 2024, 73, 1–14. [Google Scholar] [CrossRef]
  9. Farago, E.; MacIsaac, D.; Suk, M.; Chan, A.D. A review of techniques for surface electromyography signal quality analysis. IEEE Rev. Biomed. Eng. 2022, 16, 472–486. [Google Scholar] [CrossRef]
  10. Sinderby, C.; Lindstrom, L.; Grassino, A. Automatic assessment of electromyogram quality. J. Appl. Physiol. 1995, 79, 1803–1815. [Google Scholar] [CrossRef]
  11. Kamrud, A.; Borghetti, B.; Schubert Kabban, C. The effects of individual differences, non-stationarity, and the importance of data partitioning decisions for training and testing of EEG cross-participant models. Sensors 2021, 21, 3225. [Google Scholar] [CrossRef]
  12. Lotte, F.; Bougrain, L.; Cichocki, A.; Clerc, M.; Congedo, M.; Rakotomamonjy, A.; Yger, F. A review of classification algorithms for EEG-based brain–computer interfaces: A 10 year update. J. Neural Eng. 2018, 15, 031005. [Google Scholar] [CrossRef]
  13. LeCun, Y.; Boser, B.; Denker, J.; Henderson, D.; Howard, R.; Hubbard, W.; Jackel, L. Handwritten digit recognition with a back-propagation network. Adv. Neural Inf. Process. Syst. 1989, 2, 396–404. [Google Scholar]
  14. Zhang, J.; Man, K.F. Time series prediction using RNN in multi-dimension embedding phase space. In Proceedings of the SMC’98 Conference Proceedings. 1998 IEEE International Conference on Systems, Man, and Cybernetics (cat. no. 98CH36218); IEEE: Piscataway, NJ, USA, 1998; Volume 2, pp. 1868–1873. [Google Scholar] [CrossRef]
  15. Greff, K.; Srivastava, R.K.; Koutník, J.; Steunebrink, B.R.; Schmidhuber, J. LSTM: A search space odyssey. IEEE Trans. Neural Netw. Learn. Syst. 2016, 28, 2222–2232. [Google Scholar] [CrossRef] [PubMed]
  16. Byeon, Y.H.; Kwak, K.C. Personal Identification Using Long Short-Term Memory with Efficient Features of Electromyogram Biomedical Signals. Electronics 2023, 12, 4192. [Google Scholar] [CrossRef]
  17. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  18. Altaheri, H.; Muhammad, G.; Alsulaiman, M.; Amin, S.U.; Altuwaijri, G.A.; Abdul, W.; Bencherif, M.A.; Faisal, M. Deep learning techniques for classification of electroencephalogram (EEG) motor imagery (MI) signals: A review. Neural Comput. Appl. 2023, 35, 14681–14722. [Google Scholar] [CrossRef]
  19. Dose, H.; Møller, J.S.; Iversen, H.K.; Puthusserypady, S. An end-to-end deep learning approach to MI-EEG signal classification for BCIs. Expert Syst. Appl. 2018, 114, 532–542. [Google Scholar] [CrossRef]
  20. Hajian, G.; Etemad, A.; Morin, E. Generalized EMG-based isometric contact force estimation using a deep learning approach. Biomed. Signal Process. Control 2021, 70, 103012. [Google Scholar] [CrossRef]
  21. Li, H.; Ding, M.; Zhang, R.; Xiu, C. Motor imagery EEG classification algorithm based on CNN-LSTM feature fusion network. Biomed. Signal Process. Control 2022, 72, 103342. [Google Scholar] [CrossRef]
  22. Bao, T.; Zaidi, S.A.R.; Xie, S.; Yang, P.; Zhang, Z.Q. A CNN-LSTM hybrid model for wrist kinematics estimation using surface electromyography. IEEE Trans. Instrum. Meas. 2020, 70, 2503809. [Google Scholar] [CrossRef]
  23. Lee, M.; Cha, M.; Woo, J. In-Depth Inception–Attention Time Model: An Application for Driver Drowsiness Detection. Electronics 2025, 14, 1069. [Google Scholar] [CrossRef]
  24. Wan, Z.; Li, M.; Liu, S.; Huang, J.; Tan, H.; Duan, W. EEGformer: A transformer–based brain activity classification method using EEG signal. Front. Neurosci. 2023, 17, 1148855. [Google Scholar] [CrossRef] [PubMed]
  25. Zhang, C.; Kim, Y.K.; Eskandarian, A. EEG-inception: An accurate and robust end-to-end neural network for EEG-based motor imagery classification. J. Neural Eng. 2021, 18, 046014. [Google Scholar] [CrossRef] [PubMed]
  26. Ding, Z.; Yang, C.; Tian, Z.; Yi, C.; Fu, Y.; Jiang, F. sEMG-based gesture recognition with convolution neural networks. Sustainability 2018, 10, 1865. [Google Scholar] [CrossRef]
  27. Choi, H.S. Feasibility of Transformer Model for User Authentication Using Electromyogram Signals. Electronics 2024, 13, 4134. [Google Scholar] [CrossRef]
  28. Abibullaev, B.; Keutayeva, A.; Zollanvari, A. Deep learning in EEG-based BCIs: A comprehensive review of transformer models, advantages, challenges, and applications. IEEE Access 2023, 11, 127271–127301. [Google Scholar] [CrossRef]
  29. Ali, O.; Saif-ur Rehman, M.; Glasmachers, T.; Iossifidis, I.; Klaes, C. ConTraNet: A hybrid network for improving the classification of EEG and EMG signals with limited training data. Comput. Biol. Med. 2024, 168, 107649. [Google Scholar] [CrossRef]
  30. Lim, B.; Arık, S.Ö.; Loeff, N.; Pfister, T. Temporal fusion transformers for interpretable multi-horizon time series forecasting. Int. J. Forecast. 2021, 37, 1748–1764. [Google Scholar] [CrossRef]
  31. Bouchane, M.; Guo, W.; Yang, S. Hybrid CNN-GRU Models for Improved EEG Motor Imagery Classification. Sensors 2025, 25, 1399. [Google Scholar] [CrossRef]
  32. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv 2014, arXiv:1412.3555. [Google Scholar] [CrossRef]
  33. Xiao, L.; Zhong, H.; Liu, J.; Zhang, K.; Xu, Q.; Chang, L. A Novel Source Code Representation Approach Based on Multi-Head Attention. Electronics 2024, 13, 2111. [Google Scholar] [CrossRef]
  34. Ketkar, N.; Moolayil, J.; Ketkar, N.; Moolayil, J. Feed-forward neural networks. In Deep Learning with Python: Learn Best Practices of Deep Learning Models with PyTorch; Springer: Berlin/Heidelberg, Germany, 2021; pp. 93–131. [Google Scholar] [CrossRef]
  35. Brunner, C.; Leeb, R.; Müller-Putz, G.; Schlögl, A.; Pfurtscheller, G. BCI Competition 2008–Graz Data Set A; Institute for Knowledge Discovery (Laboratory of Brain-Computer Interfaces), Graz University of Technology; IEEE Dataport: Piscataway, NJ, USA, 2008; Volume 16, p. 1. [Google Scholar]
  36. Alkhorshid, D.R.; Molaeezadeh, S.F.; Alkhorshid, M.R. Analysis: Electroencephalography acquisition system: Analog design. Biomed. Instrum. Technol. 2020, 54, 346–351. [Google Scholar] [CrossRef] [PubMed]
  37. Hu, S.; Yao, D.; Bringas-Vega, M.L.; Qin, Y.; Valdes-Sosa, P.A. The statistics of EEG unipolar references: Derivations and properties. Brain Topogr. 2019, 32, 696–703. [Google Scholar] [CrossRef] [PubMed]
  38. Blankertz, B.; Tomioka, R.; Lemm, S.; Kawanabe, M.; Muller, K.R. Optimizing spatial filters for robust EEG single-trial analysis. IEEE Signal Process. Mag. 2007, 25, 41–56. [Google Scholar] [CrossRef]
  39. Devnath, L.; Kumer, S.; Nath, D.; Kr, A.D.; Islam, R. Selection of Wavelet and Thresholding Rule for Denoising the ECG Signals. Bachelor’s Thesis, Khulna University, Khulna, Bangladesh, 2015. [Google Scholar] [CrossRef]
  40. Mondal, M.; Devnath, L.; Mazumder, M.; Islam, R. Comparison of Wavelets for Medical Image Compression Using MATLAB. Int. J. Innov. Appl. Stud. 2016, 18, 1023–1031. [Google Scholar] [CrossRef]
  41. Tanaka, T.; Nambu, I.; Maruyama, Y.; Wada, Y. Sliding-window normalization to improve the performance of machine-learning models for real-time motion prediction using electromyography. Sensors 2022, 22, 5005. [Google Scholar] [CrossRef]
  42. Atzori, M.; Gijsberts, A.; Castellini, C.; Caputo, B.; Hager, A.G.M.; Elsig, S.; Giatsidis, G.; Bassetto, F.; Müller, H. Electromyography data for non-invasive naturally-controlled robotic hand prostheses. Sci. Data 2014, 1, 140053. [Google Scholar] [CrossRef]
  43. Mello, R.G.; Oliveira, L.F.; Nadal, J. Digital Butterworth filter for subtracting noise from low magnitude surface electromyogram. Comput. Methods Programs Biomed. 2007, 87, 28–35. [Google Scholar] [CrossRef]
  44. Hou, Y.; Zhou, L.; Jia, S.; Lun, X. A novel approach of decoding EEG four-class motor imagery tasks via scout ESI and CNN. J. Neural Eng. 2020, 17, 016048. [Google Scholar] [CrossRef]
  45. Hou, Y.; Jia, S.; Lun, X.; Zhang, S.; Chen, T.; Wang, F.; Lv, J. Deep feature mining via the attention-based bidirectional long short term memory graph convolutional neural network for human motor imagery recognition. Front. Bioeng. Biotechnol. 2022, 9, 706229. [Google Scholar] [CrossRef]
  46. Autthasan, P.; Chaisaen, R.; Sudhawiyangkul, T.; Rangpong, P.; Kiatthaveephong, S.; Dilokthanakul, N.; Bhakdisongkhram, G.; Phan, H.; Guan, C.; Wilaiprasitporn, T. MIN2Net: End-to-end multi-task learning for subject-independent motor imagery EEG classification. IEEE Trans. Biomed. Eng. 2021, 69, 2105–2118. [Google Scholar] [CrossRef]
  47. Ingolfsson, T.M.; Hersche, M.; Wang, X.; Kobayashi, N.; Cavigelli, L.; Benini, L. EEG-TCNet: An accurate temporal convolutional network for embedded motor-imagery brain–machine interfaces. In Proceedings of the 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Toronto, ON, Canada, 11–14 October 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 2958–2965. [Google Scholar] [CrossRef]
  48. Wei, W.; Dai, Q.; Wong, Y.; Hu, Y.; Kankanhalli, M.; Geng, W. Surface-electromyography-based gesture recognition by multi-view deep learning. IEEE Trans. Biomed. Eng. 2019, 66, 2964–2973. [Google Scholar] [CrossRef] [PubMed]
  49. Zhang, Y.; Yang, F.; Fan, Q.; Yang, A.; Li, X. Research on sEMG-based gesture recognition by dual-view deep learning. IEEE Access 2022, 10, 32928–32937. [Google Scholar] [CrossRef]
  50. Hu, F.; He, K.; Qian, M.; Gouda, M.A. TFN-FICFM: SEMG-based gesture recognition using temporal fusion network and fuzzy integral-based classifier fusion. J. Bionic Eng. 2024, 21, 1878–1891. [Google Scholar] [CrossRef]
Figure 1. Overall architecture of the proposed SCGTNet. The input EEG or sEMG signals are first reweighted by the ASA module according to the significance of the different channels. The MSC module, connected to the GRUs, extracts local features, and the Transformer encoder captures temporal dependencies. B, T, C, F, and G are the batch size, time_steps, number of channels, number of Conv2D filters, and number of GRU units.
Figure 2. Architecture of the Adaptive Spatial Attention Module. B, T, C, and W are the batch size, time_steps, number of channels and channel weights.
Figure 3. Architecture of the Multi-Scale Convolution Module. B, T, C, and F are the batch size, time_steps, number of channels and filters in Conv2D.
Figure 4. Training loss and accuracy curves during the ablation experiments on the BCI IV-2a dataset. These plots demonstrate the contributions of each module to convergence speed and final performance on motor intent decoding.
Figure 5. t-SNE visualization for SCGTNet under different ablation settings on BCI IV 2a: (a) full model; (b) without ASA; (c) without MSC; (d) MSC replaced by 3 × 1 convolution. Points represent trial feature vectors in 2D space, color-coded by class (LEFT is blue, RIGHT is orange, FEET is green, TONGUE is red). The degree of cluster overlap indicates the model’s class separation capability under each ablation.
Figure 6. Training loss and accuracy curves during the ablation experiments on the NinaPro DB2 dataset. These plots demonstrate the contributions of each module to convergence speed and final performance on gesture recognition.
Figure 7. Confusion matrix for SCGTNet on the BCI IV-2a dataset. Columns denote the ground-truth motor imagery classes (LEFT, RIGHT, FEET, TONGUE) and rows denote the predicted classes. The color intensity indicates the proportion of trials assigned to each predicted class.
Figure 8. Leave-one-subject-out (LOSO) performance of SCGTNet and baseline methods on the BCI IV 2a dataset. The x-axis denotes subjects (Sub1–Sub9), and the y-axis denotes classification accuracy.
Figure 9. Confusion matrix for SCGTNet on the NinaPro DB2 dataset (49 classes). Columns denote ground-truth gesture classes and rows denote predicted classes. Color intensity corresponds to the proportion of correctly classified trials per gesture.
Table 1. Ablation study results for SCGTNet on the BCI IV-2a dataset. These results highlight the contribution of each component to overall performance.

Configuration | Accuracy | Recall | Precision | F1-Score
Without ASA | 73.33% | 72.45% | 74.31% | 73.36%
Without MSC | 74.12% | 76.47% | 71.26% | 73.77%
3 × 1 CNN | 76.04% | 73.26% | 74.63% | 73.94%
Full model | 79.47% | 79.33% | 80.96% | 78.13%
Table 2. Ablation study results for SCGTNet on the NinaPro DB2 dataset. These results highlight the contribution of each component to overall performance.

Configuration | Accuracy | Recall | Precision | F1-Score
Without ASA | 83.28% | 81.02% | 84.90% | 82.88%
Without MSC | 81.72% | 87.52% | 85.78% | 81.26%
3 × 1 CNN | 81.15% | 85.43% | 86.46% | 80.60%
Full model | 85.87% | 87.42% | 83.19% | 85.42%
Table 3. Impact of the Morlet wavelet bandwidth (fb) on BCI IV 2a classification performance. Accuracy, recall, precision, and F1-score were measured under different bandwidth settings (fb = 1.5, 2.0, 2.5).

fb | Accuracy | Recall | Precision | F1-Score
1.5 | 68.79% | 67.04% | 70.61% | 68.74%
2 | 79.47% | 79.33% | 80.96% | 78.13%
2.5 | 69.42% | 68.04% | 71.08% | 69.48%
Table 4. Impact of the number of multi-head attention heads on BCI IV 2a and Ninapro DB2 classification performance.

Dataset | num_heads | Accuracy | Recall | Precision | F1-Score
BCI IV 2a | 1 | 74.41% | 73.11% | 75.16% | 74.12%
BCI IV 2a | 2 | 77.47% | 78.24% | 76.27% | 77.24%
BCI IV 2a | 4 | 79.47% | 79.33% | 76.96% | 78.13%
BCI IV 2a | 8 | 79.51% | 80.09% | 77.62% | 78.84%
Ninapro DB2 | 1 | 80.23% | 82.14% | 79.16% | 80.62%
Ninapro DB2 | 2 | 81.88% | 83.37% | 80.21% | 81.80%
Ninapro DB2 | 4 | 85.87% | 87.42% | 83.19% | 85.42%
Ninapro DB2 | 8 | 84.95% | 86.02% | 82.17% | 84.05%
Table 5. Performance comparison of SCGTNet with other models on the BCI IV 2a dataset.

Model | Accuracy | Recall | Precision | F1-Score
CNN [44] | 70.65% | 70.31% | 71.26% | 70.78%
LSTM [45] | 71.24% | 74.33% | 74.21% | 74.27%
EEG-Inception [25] | 74.76% | 72.47% | 71.69% | 72.08%
MIN2Net [46] | 75.87% | 75.09% | 74.16% | 74.62%
EEG-TCNet [47] | 75.53% | 76.63% | 75.31% | 75.96%
TFT [30] | 74.49% | 75.54% | 73.36% | 74.43%
SCGT (Ours) | 79.47% | 79.33% | 76.96% | 78.13%
Table 6. Performance comparison of SCGTNet with other models on the NinaPro DB2 dataset. Recall, Precision, and F1 are marked ‘–’ for baselines whose original articles did not report them and which could not be reproduced under identical settings.

Model | Accuracy | Recall | Precision | F1-Score
CNN [44] | 70.28% | 68.58% | 72.03% | 70.21%
LSTM [45] | 73.13% | 72.22% | 74.57% | 73.40%
Wei et al. [48] | 83.70% | – | – | –
Ding et al. [26] | 78.86% | – | – | –
Zhang et al. [49] | 83.29% | – | – | –
Hu et al. [50] | 84.84% | – | – | –
TFT [30] | 83.64% | 83.85% | 80.06% | 82.40%
SCGT (Ours) | 85.87% | 87.42% | 83.19% | 85.42%
Table 7. Total, trainable, and non-trainable parameter counts and floating-point operations (FLOPs) of SCGTNet and comparison models on the BCI IV 2a dataset.

Model | Total Params | Trainable Params | Non-Trainable Params | FLOPs
CNN [44] | 11.5k | 11.3k | 0.2k | 0.83M
LSTM [45] | 18k | 16.6k | 1.4k | 0.56M
TFT [30] | 220k | 217.6k | 2.4k | 3.50M
SCGT (Ours) | 56k | 55.4k | 0.6k | 1.12M
Table 8. Total, trainable, and non-trainable parameter counts and floating-point operations (FLOPs) of SCGTNet and comparison models on the NinaPro DB2 dataset.

Model | Total Params | Trainable Params | Non-Trainable Params | FLOPs
CNN [44] | 9.8k | 9.6k | 0.2k | 1.25M
LSTM [45] | 14.5k | 14.1k | 0.4k | 2.87M
TFT [30] | 180k | 180k | 1.8k | 52M
SCGT (Ours) | 48k | 47.4k | 0.6k | 12M
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.


