Article

Dual-Branch Spatio-Temporal-Frequency Fusion Convolutional Network with Transformer for EEG-Based Motor Imagery Classification

1 School of Mechanical Engineering, Shanghai DianJi University, Shanghai 201306, China
2 School of Art and Design, Shanghai DianJi University, Shanghai 201306, China
3 School of Electronic Information Engineering, Shanghai DianJi University, Shanghai 201306, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(14), 2853; https://doi.org/10.3390/electronics14142853
Submission received: 27 May 2025 / Revised: 11 July 2025 / Accepted: 14 July 2025 / Published: 17 July 2025
(This article belongs to the Special Issue Artificial Intelligence Methods for Biomedical Data Processing)

Abstract

The decoding of motor imagery (MI) electroencephalogram (EEG) signals is crucial for motor control and rehabilitation. However, feature extraction is the core component of the decoding process, and traditional methods, which are often limited to single-feature domains or shallow time-frequency fusion, struggle to comprehensively capture the spatio-temporal-frequency characteristics of the signals, thereby limiting decoding accuracy. To address these limitations, this paper proposes a dual-branch neural network architecture with multi-domain feature fusion, the dual-branch spatio-temporal-frequency fusion convolutional network with Transformer (DB-STFFCNet). The DB-STFFCNet model consists of three modules: the spatiotemporal feature extraction module (STFE), the frequency feature extraction module (FFE), and the feature fusion and classification module. The STFE module employs a lightweight multi-dimensional attention network combined with a temporal Transformer encoder, capable of simultaneously modeling local fine-grained features and global spatiotemporal dependencies, effectively integrating spatiotemporal information and enhancing feature representation. The FFE module constructs a hierarchical feature refinement structure by leveraging the fast Fourier transform (FFT) and multi-scale frequency convolutions, while a frequency-domain Transformer encoder captures the global dependencies among frequency domain features, thus improving the model’s ability to represent key frequency information. Finally, the fusion module effectively consolidates the spatiotemporal and frequency features to achieve accurate classification. To evaluate the feasibility of the proposed method, experiments were conducted on the BCI Competition IV-2a and IV-2b public datasets, achieving accuracies of 83.13% and 89.54%, respectively, outperforming existing methods. This study provides a novel solution for joint time-frequency representation learning in EEG analysis.

1. Introduction

Brain–computer interface (BCI) is an emerging technology that enables human–computer interaction through direct decoding of brain signals [1]. BCI has shown extensive applications in various fields such as medical rehabilitation, assistive devices, and virtual reality [2], becoming a crucial tool for addressing the interaction difficulties between disabled patients and their environment. Electroencephalography (EEG) primarily captures electrical signals generated by neuronal activity in the brain, reflecting dynamic brain functional states. Due to its advantages such as low cost, non-invasiveness, and high temporal resolution, EEG has become the mainstream choice for non-invasive BCI research [3].
In EEG signals, motor imagery (MI) is a popular paradigm for BCI. In this paradigm, participants generate specific brain signals by merely imagining the movement of a particular limb without any actual motion. The core mechanism involves modulating the neural rhythms, such as the μ and β rhythms, in the sensorimotor cortex to elicit event-related desynchronization (ERD) and event-related synchronization (ERS) [4]. Figure 1 illustrates the overall workflow of an MI-BCI system. MI-BCI technology has found broad application across both medical and non-medical domains. In neurorehabilitation, for example, MI-BCIs decode EEG signals generated during the imagery phase to achieve precise control of prosthetic limbs, exoskeletons, or gait-assist devices. This approach offers stroke-induced hemiplegic patients a novel avenue for regaining activities of daily living, significantly enhancing their functional independence and quality of life [5]. Nonetheless, the decoding of MI-EEG remains a formidable challenge due to the low signal-to-noise ratio of EEG signals, substantial inter-individual variability, and the intricate nature of signal acquisition and processing.
In recent years, to address the challenges of decoding MI-EEG signals, both classical machine learning and deep learning methods have been extensively applied to MI classification tasks. Traditional machine learning approaches typically involve two phases: manual feature extraction and classification. During the feature extraction phase, researchers commonly utilize algorithms such as power spectral density analysis (PSD) [6], common spatial pattern (CSP) [7], and filter bank CSP (FBCSP) [8] to extract frequency, spatial frequency, or time-frequency features from EEG signals. In the classification phase, methods like support vector machines (SVM), linear discriminant analysis (LDA), and random forests (RF) are employed. However, these approaches often require extensive prior knowledge or are heavily reliant on manually designed features, making the implementation process considerably complex. Moreover, traditional methods often exhibit insufficient generalization ability when addressing cross-subject variability and adapting to new subjects. In contrast, deep learning approaches eliminate the need for manual feature extraction by automatically extracting high-level features from raw EEG signals through an end-to-end framework. For MI-EEG feature extraction, researchers have proposed various model architectures, including convolutional neural networks (CNN) [9], recurrent neural networks (RNN) [10], and deep belief networks (DBN) [11]. Notably, CNNs can effectively capture frequency, temporal, and spatial information using convolution operations and have been widely applied in MI-EEG classification tasks. Although deep learning demonstrates significant potential in MI-EEG decoding, its limitations should not be overlooked. Specifically, existing models generally suffer from the following limitations. First, most approaches rely predominantly on single-modality feature extraction, making it difficult to integrate rich temporal, spatial, and spectral information simultaneously, which in turn constrains improvements in decoding performance. Second, these models lack sufficient capacity to capture long-range dependencies—an especially critical shortcoming given that certain neural patterns in motor imagery, such as slow event-related desynchronization/synchronization (ERD/ERS), typically span extended time windows and thus demand stronger memory and contextual understanding. Furthermore, deep-learning-based models in MI-EEG face practical challenges including limited training data and poor interpretability, all of which remain significant bottlenecks for their broader application.
To address the aforementioned challenges, we propose a novel dual-branch neural network architecture (DB-STFFCNet). Unlike conventional strategies that rely on single-feature extraction, DB-STFFCNet adopts a parallel design with separate spatiotemporal and frequency domain branches, aiming to fully exploit the complementarity and deep interrelations among the multi-dimensional time-frequency features present in MI-EEG signals. Specifically, the spatiotemporal branch employs a lightweight multi-dimensional attention network combined with a Transformer encoder to effectively capture spatiotemporal features, while the frequency domain branch utilizes the fast Fourier transform (FFT) and multi-scale convolution to construct hierarchical spectral features, further enhancing global spectral information via a Transformer. The two branches work synergistically through a refined feature fusion strategy, which not only improves the precise representation of motor intention but also significantly enhances the model’s robustness and generalization capability. The contributions of this paper are summarized as follows:
  • A Transformer-based multi-domain feature learning framework was proposed, which is capable of extracting global contextual information and long-term dependencies from various feature domains, enhancing the model’s overall perception of multi-dimensional signal features.
  • A parallel frequency domain feature extraction module was constructed. This module integrates the FFT with a self-designed multi-scale frequency convolution to effectively mine spectral features, and it further consolidates global frequency domain information through a multi-head self-attention mechanism.
  • Deep fusion of spatiotemporal and frequency domain features was achieved, and efficient classification was performed using a fully connected layer, which significantly enhances the model’s discriminative performance. Experimental results demonstrate superior performance on the BCI Competition IV-2a and IV-2b datasets.
The remainder of this paper is organized as follows. Section 2 presents the literature review. Section 3 details the methodology adopted in this study. Section 4 describes the experiments and presents the results. Section 5 offers a discussion of the findings, and Section 6 concludes the paper.

2. Literature Review

In this section, we provide a detailed review and discussion of related research in EEG signal analysis, focusing on the application of deep learning, multi-branch architectures, and feature fusion techniques.
In recent years, CNNs have achieved remarkable progress in motor imagery classification, with numerous studies focusing on leveraging this technology to enhance decoding accuracy and classification performance [12]. Schirrmeister et al. [13] explored end-to-end learning with the design of ShallowNet, which applies both temporal and spatial convolutions directly to raw EEG data, achieving competitive performance compared to FBCSP. Lawhern et al. [14] innovatively introduced depthwise separable convolution into the design of lightweight CNN architectures, establishing the landmark EEGNet framework that has inspired subsequent research in this field. Building upon the EEGNet framework, Chen et al. [15] proposed the EEGNeX network, which enhances spatial feature extraction by substituting depthwise separable convolutions with two standard 2D convolutions and incorporating a dilation mechanism to expand the temporal receptive field. Recently, attention mechanisms have gained widespread attention in natural language processing and computer vision, and their successful applications have extended to MI-EEG signal decoding tasks. Researchers have effectively extracted multi-dimensional features by integrating channel attention with deep attention modules, leading to the development of a lightweight multi-dimensional attention network (LMDA-Net) [16]. Altaheri et al. [17] introduced ATCNet, which utilizes a multi-head self-attention mechanism alongside a temporal convolutional network to capture global signal features and high-level temporal features, respectively, significantly improving signal classification accuracy on the BCI Competition IV-2a dataset. Furthermore, researchers innovatively fused CNN and Transformer architectures, proposing the EEG Conformer network model [18]. This model employs a convolutional module to extract local spatiotemporal features and utilizes a self-attention module to capture long-range dependencies, ultimately achieving state-of-the-art (SOTA) performance across multiple EEG decoding tasks. Building upon ATCNet and EEG Conformer, Zhao et al. [19] further proposed a more efficient convolutional Transformer network (CTNet), which incorporates an efficient attention mechanism design that drastically reduces the number of parameters, while enhancing convolutional layers to automatically adapt to features from diverse brain regions. By achieving collaborative optimization between local perception and global dependencies, CTNet demonstrated superior performance on public datasets. However, these methods did not fully account for the correspondence between different frequency components and brain regions when integrating temporal and spatial features. To address this issue, Zhao et al. [20] introduced WaSF-ConvNet, a model that integrates wavelet kernels and spatial filters to jointly learn spatio-temporal-frequency features for motor imagery classification. Although the model achieved 68.1% classification accuracy on the BCI Competition IV 2a dataset, its performance did not surpass that of ShallowNet. FACT-Net [21] innovatively introduced a frequency adaptation module that dynamically calibrates frequency domain features using learnable Fourier coefficients, coupled with a temporal periodic reorganization module to capture rhythmic characteristics, which effectively captures and enhances the necessary frequency-domain information.
Additionally, IFBCLNet [22] proposed a multi-dimensional feature extraction framework that combines an interpretable filter bank (IFB), CNN, and long short-term memory networks (LSTM). This method adaptively learns frequency sub-bands through the IFB module, extracts spatial features while reducing computational complexity through the CNN module, and captures temporal dependencies using the LSTM module. Experiments demonstrated that IFBCLNet achieved superior decoding performance across multiple datasets, particularly in cross-subject tasks.
Despite the remarkable progress achieved by existing methods in decoding EEG signals, single-branch approaches relying on stacked spatiotemporal convolutions remained insufficient for comprehensive feature representation. First, temporal convolution networks struggled to effectively capture the dynamic coupling characteristics across frequency bands in MI-EEG signals, whereas frequency-domain decomposition methods frequently overlooked the temporal correlations inherent in time-domain features. Second, single-scale convolution operations were limited in their ability to capture features of MI-EEG signals at diverse scales, which restricted the model’s overall representational effectiveness. Recent studies have demonstrated that multi-branch architectures, by processing temporal, spatial, and frequency features in parallel, could substantially enhance the model’s representational capacity.
Actually, several studies [23,24,25,26] have recognized the significance of multi-branch architectures for more comprehensively extracting features from MI-EEG signals, with distinct branches dedicated to capturing diverse feature representations. In light of these findings, Zhi et al. [27] proposed a multi-domain convolutional neural network (TSFCNet) that leverages a multi-branch structure to extract spatiotemporal-spectral features, effectively decoding EEG signals and validating the effectiveness of multi-domain feature fusion in MI-BCI decoding. Unlike TSFCNet, Cai et al. [28] proposed the MT-MBCNN model, which innovatively combines dynamic frequency band optimization with a multi-task cooperative mechanism. This model employs a multi-branch, multi-scale architecture for feature extraction and addresses issues such as insufficient spatiotemporal feature representation and limited utilization of multidimensional information by single-task features through the joint optimization of learnable spectral decomposition and decoupled classification features.
In summary, existing work has demonstrated significant progress in the classification of MI-EEG signals using deep learning. Through meticulously designed network architectures, innovative attention mechanisms, and task-specific optimizations, these approaches have greatly enhanced the accuracy and robustness of BCI systems. Inspired by these advances, we propose the DB-STFFCNet framework, a feature extraction network designed with high computational efficiency and robustness, offering a new direction for future research.

3. Methods

In this section, we introduce the proposed dual-branch spatio-temporal-frequency fusion convolutional network with Transformer (DB-STFFCNet) in detail. DB-STFFCNet consists of four modules: preprocessing and data augmentation, spatiotemporal feature extraction module (STFE), frequency feature extraction module (FFE), feature fusion and classification. As shown in Figure 2, the DB-STFFCNet architecture accepts preprocessed EEG signals that undergo data augmentation through segmentation and reconstruction techniques, specifically designed to enhance temporal feature representation and improve model robustness against inter-subject variability. In the STFE module, two sub-modules are incorporated: the multi-dimensional attentional convolution block (MACB) and the Transformer encoder. The MACB combines a channel attention mechanism, temporal convolutions, and deep attention mechanisms to effectively extract spatiotemporal features. To further enhance feature extraction, we introduce a Transformer encoder that utilizes a multi-head self-attention mechanism (MSA) to globally model spatiotemporal features, to capture the interplay between long-range dependencies and local information. This approach enriches feature representations and improves classification accuracy. Additionally, the FFE module extracts frequency domain features via the FFT and further refines these features using our custom-designed frequency convolution module. Similar to the STFE module, a Transformer encoder is employed to extract global frequency features. These two modules complement each other, further enhancing the model’s comprehensive capability to capture the characteristics of MI-EEG signals. Finally, the feature fusion and classification module integrates the spatiotemporal and frequency domain features and performs classification via a fully connected layer.

3.1. Preprocessing and Data Augmentation

This study employed the BCI Competition IV dataset 2a [29] (BCI-IV-2a) and the BCI Competition IV dataset 2b [30] (BCI-IV-2b) as the experimental data foundation. Table 1 presents the basic information of the datasets. In BCI-IV-2a, EEG signals were recorded from nine healthy subjects performing four types of motor imagery tasks (i.e., left hand, right hand, both feet, and tongue movements). For each subject, the data were partitioned into training and testing sets, with each set comprising six blocks of experiments; each block contained 48 trials, resulting in 288 trials per set. The recordings include 22 EEG channels and 3 EOG channels, with EEG activity sampled at 250 Hz. In BCI-IV-2b, EEG signals were collected from nine healthy subjects performing two types of motor imagery tasks (left hand and right hand). The experiment utilized 3 EEG electrodes (C3, Cz, and C4) at a sampling rate of 250 Hz. The recording process was divided into five stages, with the first two stages collecting 120 samples per stage, and the subsequent three stages collecting 160 samples per stage. For each trial, we extracted EEG segments from a fixed time window after the cue onset: [1.5, 6] seconds for BCI-IV-2a and [3, 7] seconds for BCI-IV-2b. As a result, each trial was represented as a matrix of size (22, 1000) for BCI-IV-2a and (3, 1000) for BCI-IV-2b.
Prior to feature extraction and classification, the raw EEG signals were initially processed with a 200th-order Blackman-window bandpass filter to extract the 4–38 Hz frequency band. This design employed a bidirectional zero-phase filtering technique to eliminate high-frequency electromyographic (EMG) noise and low-frequency baseline drift, while preserving the motor-imagery-related μ (8–12 Hz) and β (13–30 Hz) rhythmic features. The filtered continuous signals were then segmented into independent trial epochs based on task event markers (e.g., the onset of a visual cue). We defined an EEG input sample as a tensor $x \in \mathbb{R}^{C \times T}$, where $C$ denotes the number of EEG channels and $T$ denotes the number of time samples. Each trial segment underwent the following normalization procedures sequentially:
(a) Amplitude normalization: For each trial, the channel signals were scaled by their maximum absolute value to eliminate amplitude scale differences.
$$x_i = \frac{x_i}{\max |x_i|}$$
where $|\cdot|$ denotes the matrix absolute value operator and $i$ indexes individual trials of $x$.
(b) Covariance alignment: Based on Euclidean alignment (EA) theory [31], a global reference covariance matrix was computed.
$$\bar{R} = \frac{1}{N} \sum_{i=1}^{N} x_i x_i^{T}$$
where N signifies the total number of trials. The spatial covariance distribution was unified through a whitening transformation:
$$\tilde{x}_i = \bar{R}^{-1/2} x_i$$
This achieved the calibration of signal distributions among subjects. Finally, a three-dimensional tensor was generated as the input to the deep network.
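As a concrete illustration of this preprocessing chain, the following minimal sketch (Python with NumPy/SciPy; function names such as preprocess_trials are illustrative and not taken from the authors' code) applies a Blackman-window FIR bandpass filter, per-trial amplitude normalization, and Euclidean-alignment whitening. For brevity the filter is applied to already-epoched trials, whereas the paper filters the continuous recordings before segmentation.

```python
import numpy as np
from scipy.signal import firwin, filtfilt

def bandpass(eeg, fs=250.0, low=4.0, high=38.0, order=200):
    """Zero-phase FIR bandpass filter (Blackman window, order + 1 taps)."""
    taps = firwin(order + 1, [low, high], window="blackman",
                  pass_zero=False, fs=fs)
    return filtfilt(taps, 1.0, eeg, axis=-1)  # forward-backward filtering

def preprocess_trials(trials, fs=250.0):
    """Filter, amplitude-normalize, and Euclidean-align epoched trials.

    trials : array of shape (n_trials, channels, samples)
    """
    x = bandpass(trials, fs=fs)
    # (a) Scale each trial by its maximum absolute value.
    x = x / np.max(np.abs(x), axis=(1, 2), keepdims=True)
    # (b) Reference covariance R_bar = mean_i x_i x_i^T, then whiten with R_bar^{-1/2}.
    r_bar = np.einsum("nct,ndt->ncd", x, x).mean(axis=0)
    vals, vecs = np.linalg.eigh(r_bar)
    r_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return np.einsum("cd,ndt->nct", r_inv_sqrt, x)

# Example on random data shaped like one BCI-IV-2a training session.
aligned = preprocess_trials(np.random.randn(288, 22, 1000))
print(aligned.shape)  # (288, 22, 1000)
```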
Due to the long acquisition period and high cost of EEG signal collection, the size of experimental datasets is limited, which significantly increases the risk of model overfitting. To address this issue, existing methods [18,32,33,34] have adopted data augmentation techniques to increase the sample size and improve the model’s decoding performance. Therefore, this study proposes a data augmentation scheme based on time-domain segmentation and recombination (S&R) [35]. Specifically, based on the temporal continuity assumption of neural signals, the training EEG samples of the same class are first evenly divided along the time axis into $N_S$ equal-length segments. Then, $N_S$ segments are randomly selected from different training samples and reassembled in the original temporal order to generate new artificial samples. In addition, to ensure that the distribution of the augmented data remains consistent with that of the training data, the scale of the augmented samples is determined based on the batch size and the number of classes, and new augmented data are dynamically generated during each training iteration. This method effectively expands the dataset, enhances the robustness of the model, and improves its ability to model the temporal features of EEG signals.
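The segmentation-and-recombination step can be sketched as follows. This is an assumption-level implementation: the donor-sampling policy and the number of generated samples per batch follow the description above but are not taken from the released code.

```python
import numpy as np

def segment_recombine(x, y, n_segments=8, n_aug=32, rng=None):
    """Segmentation-and-recombination (S&R) augmentation.

    x : (n_trials, channels, samples) training EEG of one mini-batch
    y : (n_trials,) integer class labels
    Returns n_aug artificial trials and their labels.
    """
    rng = np.random.default_rng(rng)
    seg_len = x.shape[-1] // n_segments
    aug_x, aug_y = [], []
    for _ in range(n_aug):
        label = rng.choice(np.unique(y))           # pick a class
        pool = np.flatnonzero(y == label)          # trials of that class
        pieces = []
        for s in range(n_segments):                # keep the original temporal order
            donor = rng.choice(pool)               # random donor trial per segment
            pieces.append(x[donor, :, s * seg_len:(s + 1) * seg_len])
        aug_x.append(np.concatenate(pieces, axis=-1))
        aug_y.append(label)
    return np.stack(aug_x), np.array(aug_y)

# Example: augment a mini-batch of 32 trials using 8 segments per trial.
xb, yb = np.random.randn(32, 22, 1000), np.random.randint(0, 4, 32)
xa, ya = segment_recombine(xb, yb, n_segments=8, n_aug=32)
print(xa.shape, ya.shape)  # (32, 22, 1000) (32,)
```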

3.2. Transformer Encoder

In our study, we introduced a Transformer encoder, which is embedded within both the STFE and FFE modules to specifically handle spatiotemporal and frequency-domain features, thereby overcoming the local receptive field limitations of traditional convolutional operations. As shown in Figure 3, the Transformer encoder consists of a multi-head self-attention mechanism, a feed-forward module, and an Add & Norm module.
The multi-head self-attention mechanism is the core component of the Transformer encoder. It is composed of multiple parallel self-attention mechanisms, with each one independently focusing on different parts of the input sequence. The multi-head self-attention mechanism takes the high-level feature tensor $P \in \mathbb{R}^{T_c \times d}$, which is the output of the feature extraction module, as its input, where $T_c$ denotes the feature stride of the module output. $P$ is mapped into sets of query ($Q$), key ($K$), and value ($V$) vectors by means of three sets of learnable projection matrices $W_i^{Q}, W_i^{K}, W_i^{V} \in \mathbb{R}^{d \times d_k}$ $(i = 1, \ldots, h)$:
$$Q_i = P W_i^{Q} \in \mathbb{R}^{T_c \times d_k}, \quad K_i = P W_i^{K} \in \mathbb{R}^{T_c \times d_k}, \quad V_i = P W_i^{V} \in \mathbb{R}^{T_c \times d_k}$$
where $d_k = d/h$ represents the feature dimension of a single head and $h$ is the number of attention heads. By partitioning the feature space into $h$ parallel subspaces, each subspace independently performs the scaled dot-product attention operation:
$$Z_i = \mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}}\right) V_i$$
The outputs from the multiple attention heads are concatenated and then linearly transformed to achieve global context fusion. Specifically, the features $Z_i \in \mathbb{R}^{T_c \times d_k}$ generated by each attention head are concatenated along the feature dimension and subsequently passed through a learnable matrix $W^{O} \in \mathbb{R}^{h d_k \times d}$ for linear transformation. This process not only integrates the semantic information from the various subspaces but also dynamically adjusts feature weights via parameterized mapping, ultimately producing the fused feature $\mathrm{MHA}(P) \in \mathbb{R}^{T_c \times d}$ that maintains the same dimensionality as the input, ensuring compatibility with subsequent residual connections.
$$\mathrm{MHA}(P) = \mathrm{Concat}(Z_1, \ldots, Z_h)\, W^{O}$$
Residual connections are applied by adding the original input P to the multi-head attention output, thereby preserving low-level feature information. Layer normalization is then performed along the feature dimension d on the summation result to enhance training stability:
$$O = \mathrm{LayerNorm}(P + \mathrm{MHA}(P))$$
The feed-forward network further boosts the non-linear representation capability by applying a fully connected layer followed by GELU activation, and its output is subsequently refined via a second residual normalization.
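A minimal PyTorch sketch of one such encoder block is shown below. The depth of six blocks and the five attention heads follow the configuration reported in Section 4.1, while the embedding width, feed-forward expansion factor, and dropout rate are illustrative assumptions rather than the paper's exact values.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One Transformer encoder block: MSA and feed-forward, each with Add & Norm.

    Post-norm variant as described in the text; inputs follow P in R^{T_c x d}.
    """
    def __init__(self, d_model=40, n_heads=5, ff_mult=4, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, ff_mult * d_model),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(ff_mult * d_model, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, p):                      # p: (batch, T_c, d)
        z, _ = self.attn(p, p, p)              # multi-head self-attention
        o = self.norm1(p + z)                  # residual + LayerNorm
        return self.norm2(o + self.ffn(o))     # feed-forward + residual + LayerNorm

# A depth-6 encoder applied to a dummy feature sequence of length 60, width 40.
encoder = nn.Sequential(*[EncoderBlock() for _ in range(6)])
out = encoder(torch.randn(32, 60, 40))
print(out.shape)  # torch.Size([32, 60, 40])
```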

3.3. Spatiotemporal Feature Extraction Module

The MACB in the STFE module plays a pivotal role in extracting temporal and spatial features from EEG signals. This module is constructed based on the feature extraction architecture of LMDA-Net [16], achieving feature extraction through hierarchical fusion of the channel attention module, temporal convolution layer, depthwise attention module, and spatial convolution layer. Consistent with the LMDA-Net framework, both temporal and spatial convolution layers adopt depthwise separable convolution designs to ensure an optimal balance between feature extraction capability and computational efficiency.
The channel attention module preserves the basic tensor multiplication operation to model spatial dependencies across channels. For the input EEG sample $x$, this module introduces a parameter tensor $c \in \mathbb{R}^{D \times C}$ following a normal distribution, where $D$ represents the instance dimension and numerically equals the number of convolutional kernels. Channel information is mapped to the depth dimension via tensor multiplication. This process can be formulated as:
$$x_{hct} = \sum_{d} x_{dct}\, c_{hdc}$$
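A minimal sketch of this tensor multiplication, written as an LMDA-Net-style einsum, is given below. The depth of 9 and the 22-channel input are illustrative values, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel-attention tensor product of the MACB (LMDA-Net style).

    Maps EEG channel information to a depth dimension via an einsum with a
    normally initialized parameter tensor; shapes here are illustrative.
    """
    def __init__(self, depth=9, n_chans=22):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(depth, 1, n_chans))

    def forward(self, x):                          # x: (batch, 1, C, T)
        # 'bdct,hdc->bhct': sum over the input depth d, producing depth h.
        return torch.einsum("bdct,hdc->bhct", x, self.weight)

attn = ChannelAttention()
y = attn(torch.randn(32, 1, 22, 1000))
print(y.shape)  # torch.Size([32, 9, 22, 1000])
```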
The deep attention module adheres to the LMDA-Net architecture, utilizing a synergistic mechanism integrating semi-global pooling and local cross-depth convolutions. Discriminative spatiotemporal coupling patterns within high-dimensional features are dynamically filtered, and the Hadamard product is employed to achieve adaptive alignment between feature magnitudes and probability weights, circumventing information blurring issues induced by conventional global pooling.
To further model global spatiotemporal dependencies, the feature tensor output from the MACB is fed into the Transformer encoder. Through the MSA mechanism, the encoder can adaptively allocate weights to different temporal nodes, capturing the temporal continuity of neural response patterns in motor imagery tasks. Specifically, for the feature tensor $P_{st} \in \mathbb{R}^{T_c \times d}$ output by the MACB, the self-attention matrix $Z_i$ quantifies the correlation strength between different temporal segments, to enhance feature focusing on task-critical phases. Residual connections and layer normalization ensure the stability of the network during long-sequence training, preventing issues such as gradient degradation or explosion.

3.4. Frequency Feature Extraction Module

In the field of bioelectrical signal processing, traditional Fourier transform analysis methods have been proven effective in achieving precise representation and feature analysis of neuroelectrophysiological data in the frequency domain [36]. To deeply explore the frequency-domain characteristics of MI-EEG signals, the FFE module proposed in this paper first employs the FFT to convert raw EEG signals from the time domain to the frequency domain, where its efficient divide-and-conquer strategy and butterfly operations significantly enhance the conversion efficiency.
After obtaining the frequency domain representation, the FFE module employs a frequency convolutional network to perform deep feature learning on the spectral information. Initially, a 1 × 1 convolution layer is used to expand the single-channel input into multi-channel feature maps, resulting in enhanced diversity and richness of the feature representation. During the experimental phase, we systematically explored and compared various channel depth configurations, ultimately selecting 18 channels as the optimal configuration after comprehensively balancing model performance and computational complexity. Subsequently, two groups of depthwise convolutions with kernel sizes of 1 × 5 and 1 × 15 are implemented to span frequency-domain receptive fields of differing extents. The 1 × 5 kernels focus on high-frequency and transient spectral patterns, while the 1 × 15 kernels cover low-frequency and broader-band signals. By combining these two scales, the network can simultaneously capture fine-grained, localized spectral details and more global frequency structures, without a large increase in parameters. We evaluated alternative kernel sizes and found that 1 × 5 and 1 × 15 offered the best balance of classification accuracy vs. computational cost. After the multi-scale convolution operations, an adaptive average pooling layer is utilized to compress the spatial dimension to 1, ensuring consistent output feature shapes and generating the spectral feature $P_f \in \mathbb{R}^{b \times d}$, where $b$ denotes the number of frequency bands.
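The following sketch outlines the front end of the FFE branch under these design choices (FFT magnitude spectrum, 1 × 1 expansion to 18 maps, depthwise 1 × 5 and 1 × 15 convolutions, adaptive average pooling). How the two kernel scales are fused and the pooled output width are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class FrequencyBranchSketch(nn.Module):
    """Illustrative sketch of the FFE front end: FFT magnitude spectrum,
    1x1 channel expansion, multi-scale depthwise convolutions (1x5 and 1x15),
    and adaptive average pooling. Exact layer hyperparameters are assumptions.
    """
    def __init__(self, expand=18):
        super().__init__()
        self.expand = nn.Conv2d(1, expand, kernel_size=1)
        self.dw5 = nn.Conv2d(expand, expand, kernel_size=(1, 5),
                             padding=(0, 2), groups=expand)
        self.dw15 = nn.Conv2d(expand, expand, kernel_size=(1, 15),
                              padding=(0, 7), groups=expand)
        self.pool = nn.AdaptiveAvgPool2d((1, 40))   # compress EEG channels, keep 40 bins

    def forward(self, x):                           # x: (batch, C, T) raw EEG
        spec = torch.fft.rfft(x, dim=-1).abs()      # magnitude spectrum per channel
        spec = spec.unsqueeze(1)                    # (batch, 1, C, F)
        feat = self.expand(spec)                    # (batch, 18, C, F)
        feat = self.dw5(feat) + self.dw15(feat)     # fuse both receptive fields
        return self.pool(feat)                      # (batch, 18, 1, 40)

out = FrequencyBranchSketch()(torch.randn(32, 22, 1000))
print(out.shape)  # torch.Size([32, 18, 1, 40])
```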
To further model the complex coupling relationships among frequency components, the FFE module innovatively introduces a Transformer encoder tailored for the frequency domain. Unlike the STFE module, which primarily focuses on temporal dynamic evolution characteristics, the Transformer encoder in the FFE module is specifically optimized for frequency feature interactions. Specifically, during the feature embedding stage, the model maps the spectral energy distribution to directly capture both the amplitude information of the spectral energy and its spatial topological information along the frequency axis, thereby enhancing the perception of frequency domain characteristics. By constructing a full-frequency association matrix via MSA, the model dynamically learns the cooperative and suppressive relationships between different frequencies, thus improving its ability to capture global frequency dependencies. Furthermore, the inter-layer residual connection strategy facilitates a gradual fusion of local spectral features with global dependencies, while the shared-architecture feed-forward network ensures that the dimensions of the frequency features are aligned with those of the spatiotemporal branch output.

3.5. Feature Fusion and Classification

The proposed feature fusion and classification module integrates spatiotemporal and frequency-domain representations to enhance the decoding performance of motor imagery EEG signals. The spatiotemporal feature matrix $S \in \mathbb{R}^{N \times D_1}$ from the STFE module and the frequency feature $F \in \mathbb{R}^{N \times D_2}$ from the FFE module are concatenated along the feature dimension via tensor concatenation to form a fused feature vector $G \in \mathbb{R}^{N \times (D_1 + D_2)}$. This concatenation strategy preserves the complementarity between temporal dynamics and spectral features, ensuring a comprehensive representation of the EEG signals. It is worth noting that while concatenating spatiotemporal and frequency features inevitably increases the dimensionality of the fused representation, the overall model remains lightweight, with approximately 29.5 K trainable parameters. Regularization techniques such as dropout and batch normalization are further employed to mitigate overfitting, ensuring that the higher dimensionality enhances discriminative capacity without introducing excessive complexity. The fused feature $G$ is then fed into a fully connected network and classified using cross-entropy loss [37], which is defined as follows:
$$\mathcal{L}_c = -\frac{1}{N} \sum_{i=1}^{N} \sum_{m=1}^{M} y_{i,m} \log(p_{i,m})$$
where $N$ denotes the number of training samples, $M$ is the total number of motor imagery categories, $y_{i,m}$ is the true label of the $i$th sample for category $m$, and $p_{i,m}$ is the probability that the model predicts the $i$th sample to belong to the $m$th category. By minimizing the discrepancy between the predicted distribution and the true label distribution, the model can be effectively guided to learn features with higher discrimination.
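A minimal sketch of this fusion-and-classification head is shown below; the feature widths $D_1$ and $D_2$ and the dropout rate are placeholders, while the concatenation and the cross-entropy objective follow the description above.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Concatenate spatiotemporal and frequency features, then classify."""
    def __init__(self, d1=480, d2=240, n_classes=4, p_drop=0.5):
        super().__init__()
        self.head = nn.Sequential(
            nn.Dropout(p_drop),
            nn.Linear(d1 + d2, n_classes),
        )

    def forward(self, s, f):                 # s: (N, D1), f: (N, D2)
        g = torch.cat([s, f], dim=1)         # fused feature G in R^{N x (D1 + D2)}
        return self.head(g)                  # class logits

model = FusionClassifier()
logits = model(torch.randn(32, 480), torch.randn(32, 240))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 4, (32,)))  # L_c above
print(loss.item())
```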

4. Experiment and Result

4.1. Experiment Settings

This study employed a subject-dependent strategy for training and testing on both the BCI-IV-2a and BCI-IV-2b datasets, ensuring that the training and testing data for each subject are strictly separated to avoid information leakage. In particular, the BCI-IV-2a dataset used the first session for training and the second for testing; for the BCI-IV-2b dataset, the first three sessions were used for training and the last two for testing. All experiments were implemented using the PyTorch (Version 1.13) framework and executed on an NVIDIA 4070 GPU (Manufactured by Nvidia Corporation, Santa Clara, CA, USA). To enhance data diversity, we adopted an eight-segment data augmentation strategy. The Transformer encoder was configured with a depth of six and five attention heads. We used the AdamW optimizer to train all models with the default parameters described by Loshchilov and Hutter [38] and a mini-batch size of 32. The training phase ran for a total of 350 epochs and was optimized using the mean cross-entropy loss function.
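The training configuration described above can be wired up roughly as follows (AdamW with default hyperparameters, mini-batch size 32, 350 epochs, cross-entropy loss); the dummy model and data loader are placeholders for the actual network and dataset pipeline.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

def train(model, train_loader, device="cuda", epochs=350):
    """Plain training loop matching the reported optimizer and loss settings."""
    model.to(device)
    optimizer = optim.AdamW(model.parameters())      # library defaults
    criterion = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()

# Example wiring with a dummy model and random data (CPU, 1 epoch for brevity).
dummy = nn.Sequential(nn.Flatten(), nn.Linear(22 * 1000, 4))
loader = DataLoader(TensorDataset(torch.randn(64, 22, 1000),
                                  torch.randint(0, 4, (64,))), batch_size=32)
train(dummy, loader, device="cpu", epochs=1)
```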
Performance evaluation was based on classification accuracy and the Kappa coefficient, which is calculated as follows:
$$k = \frac{p_o - p_e}{1 - p_e}$$
where $p_o$ denotes the observed accuracy and $p_e$ represents the expected accuracy by chance. The Kappa coefficient quantifies the agreement between the model’s predictions and the true labels beyond what would be expected from random guessing.
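For reference, the Kappa coefficient can be computed from the confusion matrix as in the following sketch; the label vectors are synthetic examples, not results from the experiments.

```python
import numpy as np

def cohen_kappa(y_true, y_pred, n_classes):
    """Cohen's kappa from predicted and true labels (formula above)."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    n = cm.sum()
    p_o = np.trace(cm) / n                          # observed agreement
    p_e = (cm.sum(0) * cm.sum(1)).sum() / n ** 2    # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Example on a small synthetic label vector.
y_true = np.array([0, 1, 2, 3, 0, 1, 2, 3])
y_pred = np.array([0, 1, 2, 2, 0, 1, 3, 3])
print(round(cohen_kappa(y_true, y_pred, 4), 3))
```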

4.2. Comparison of Classification Results

In this section, we validated the classification performance of the proposed model on the BCI-IV-2a and BCI-IV-2b datasets through comparative experiments. The classification performance of various models on the BCI Competition IV-2a and IV-2b datasets, along with the average results, are presented in Table 2 and Table 3, and Figure 4 and Figure 5, respectively. Our method demonstrates superior performance on both public datasets. To highlight the differences in classification performance among the various methods, we conducted a statistical analysis of the results. Specifically, we performed a one-way repeated measures analysis of variance (ANOVA) on the two public datasets, followed by Bonferroni correction.
In the BCI-IV-2a dataset, the results of the ANOVA indicated significant differences in decoding performance among the various methods ($F_{6,48} = 6.2442$, $p = 0.0001 < 0.001$). As shown in Table 2 and Figure 4, DB-STFFCNet achieved an average accuracy of 83.13%, which was significantly superior to all baseline models. Traditional CNN models such as EEGNet ($p = 0.003$), FBCNet [39] ($p = 0.004$), and LMDA-Net ($p = 0.021$) are constrained by their single-branch architectures, which limit their ability to extract extensive receptive fields. Although LMDA-Net employs a local multi-dimensional attention mechanism to achieve a lightweight design and task adaptability, it remains insufficient in capturing global correlations. In contrast, our approach integrates a dual-branch architecture and incorporates a Transformer encoder into a lightweight multi-dimensional attention network, leading to accuracy improvements of 8.2%, 7.02%, and 3.59% over these baselines. Notably, while TMSANet (82.45%) achieved a performance close to ours, its higher standard deviation and lower Kappa coefficient reflect limitations in modeling global feature interactions. Furthermore, although Conformer ($p = 0.365$) and ATCNet ($p = 0.352$) combine CNN and Transformer components, they do not explicitly separate frequency-domain features, resulting in insufficient extraction of discriminative spectral information. In contrast, DB-STFFCNet employs a dual-branch attention mechanism to achieve a refined complementarity between spatiotemporal and frequency-domain features, thereby significantly enhancing the extraction of discriminative spectral information.
As for BCI-IV-2b, the overall differences in classification performance among models were statistically significant ($F_{6,48} = 5.3019$, $p = 0.0003 < 0.001$). Further paired t-tests revealed that only EEGNet ($p = 0.0045$) and FBCNet ($p = 0.0013$) exhibited results that were significantly different from those of the other models. This reflects certain limitations in feature extraction inherent in traditional single-branch models. Moreover, as shown in Table 3 and Figure 5, although all models achieved reasonable average accuracy and Kappa values, our method achieved the best overall accuracy and classification consistency. Notably, while some models showed a slight advantage in terms of result stability, this did not diminish the outstanding superiority of our approach in enhancing overall decoding performance. Overall, the results confirm that employing a more refined feature fusion strategy in EEG signal decoding tasks can yield superior performance, providing strong support for future research.

4.3. Ablation Study

To rigorously evaluate the contribution of each module, we conducted repeated-measures ANOVA and Bonferroni-corrected paired t-tests based on the nine ablation conditions on the BCI-IV-2a and BCI-IV-2b datasets, as shown in Table 4 and Table 5.
For both BCI-IV-2a and BCI-IV-2b datasets, the ANOVA revealed a significant overall difference across conditions (2a: $F_{8,64} = 67.27$, $p < 0.001$; 2b: $F_{8,64} = 9.7$, $p < 0.001$). Further pairwise comparisons demonstrated that removing the STFE module alone led to large and statistically significant drops in accuracy (2a: $p = 0.352$; 2b: $p = 0.011$), underscoring its critical role in modeling temporal and spatial dependencies. In BCI-IV-2a, compared to removing only the FFE module, additionally removing the STFE encoder ($p = 0.0283$) resulted in further significant degradation, suggesting that the STFE encoder strengthens the STFE branch even when the FFE branch is absent. Likewise, compared to removing only the STFE module, additionally removing the FFE encoder ($p = 0.0001$) led to a significant performance decline, indicating that the FFE encoder enhances the FFE branch even when the STFE branch is removed. Similar patterns were observed in BCI-IV-2b, where additionally removing the FFE encoder after removing the STFE module ($p = 0.0052$) also significantly reduced accuracy. Additionally, the removal of data augmentation significantly lowered performance ($p = 0.0363$), highlighting its importance in improving generalization and robustness. Moreover, in both datasets, removing the STFE encoder ($p = 0.0136$) or the FFE encoder ($p = 0.027$) individually led to notable performance drops, while removing both encoders simultaneously (2a: $p = 0.0078$; 2b: $p = 0.0065$) caused an even greater decline. Together, these results from the ablation study consistently demonstrate that each component, including the STFE and FFE modules, the encoder submodules, and data augmentation, plays an important role in boosting classification accuracy. The findings highlight not only the individual benefits of each part but also their combined contribution to the overall model performance.
Furthermore, to analyze the trade-off between accuracy and computational efficiency, we compared training and inference times across ablation variants. Removing entire branches such as STFE or FFE notably reduced training time per epoch (e.g., on BCI-IV-2a from 1.19 s to 0.61 s and 1.03 s , respectively) and slightly lowered inference time, but led to clear accuracy degradation. Similarly, eliminating the encoder submodules within STFE or FFE reduced computational time further, yet also decreased classification performance, confirming their role in enhancing feature representations. The removal of data augmentation shortened training time but similarly harmed accuracy, indicating its importance for improving generalization. Overall, these results suggest that while each component—STFE and FFE branches, encoder submodules, and data augmentation—adds modest computational overhead, their combined contributions substantially boost accuracy, achieving a balanced design suited for practical applications.
To further evaluate the impact of each key module on feature representation, we performed t-SNE visualization and quantitative analysis of inter-class separability and intra-class compactness. As shown in Figure 6, the feature embeddings generated by our method exhibit the clearest boundaries among classes and the most compact distributions within each class. When the FFE module is removed, the separability slightly deteriorates, although class clusters remain relatively distinct. In contrast, removing data augmentation leads to further reduced compactness and increased overlap between classes, highlighting its role in improving generalization. The removal of both encoders (FFE encoder and STFE encoder) results in even greater inter-class mixing, while excluding the STFE module leads to the most severe overlap and indistinct class boundaries. To further confirm the generalizability of these findings, we also present t-SNE visualizations for Subjects 1, 3, and 9, which similarly show clearer inter-class separation and tighter intra-class clustering, supporting the robustness of our conclusions, as shown in Figure 7.
To quantitatively validate the observed feature separability, we conducted statistical tests on the t-SNE projected feature space. Specifically, pairwise two-sample t-tests were performed comparing inter-class and intra-class Euclidean distances. The results showed that the average inter-class distances were significantly greater than intra-class distances across multiple class pairs (e.g., Subject 7: right hand vs. foot, $t = 380.9$, $p < 0.001$; left hand vs. tongue, $t = 208.2$, $p < 0.001$). Similar significant differences were observed for other subjects (e.g., Subject 3: right hand vs. foot, $t = 333.7$, $p < 0.001$). These findings confirm that the proposed model effectively improves inter-class separability and intra-class compactness in the learned feature representations.
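A sketch of this separability check is given below, assuming scikit-learn and SciPy. The feature matrix and labels are random placeholders, and, unlike the per-class-pair comparisons reported above, the example pools all inter-class and intra-class distances into a single test.

```python
import numpy as np
from scipy.stats import ttest_ind
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import TSNE

# Placeholder learned features and labels (one subject, four classes).
features = np.random.randn(288, 64)
labels = np.random.randint(0, 4, 288)

emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
dist = squareform(pdist(emb))                 # pairwise Euclidean distances
same = labels[:, None] == labels[None, :]     # same-class mask
iu = np.triu_indices_from(dist, k=1)          # unique pairs only
intra = dist[iu][same[iu]]
inter = dist[iu][~same[iu]]
t, p = ttest_ind(inter, intra, equal_var=False)
print(f"t = {t:.1f}, p = {p:.3g}")
```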

4.4. Training Progress

In this paper, we propose the DB-STFFCNet, which significantly enhances motor imagery decoding performance by collaboratively modeling spatiotemporal and frequency-domain features. As shown in Figure 8, the training loss rapidly decreases during the initial stages of training and then stabilizes, indicating that the network effectively captures and learns the key features in the EEG signals while maintaining good convergence and stability. Meanwhile, the test accuracy shows a steady upward trend, eventually fluctuating around a high level, which reflects DB-STFFCNet’s excellent generalization ability across different test samples and strong feature discriminability. Notably, the curves do not exhibit obvious signs of overfitting in the later stages, suggesting that the network maintains robust resistance to noise and individual differences during the fusion of spatiotemporal and frequency-domain features. Comparative evaluation results (Table 2 and Table 3) demonstrate that DB-STFFCNet achieves statistically significant improvements in accuracy and Kappa coefficient compared to state-of-the-art baseline models. Ablation experiments (Table 4 and Table 5) quantitatively analyze the contributions of each core component, indicating that the FFE module and the time-domain encoder module make substantial contributions to the overall model. Together, these results indicate that the dual-branch design not only optimizes the localization of spatiotemporal features but also uncovers latent periodic patterns in the frequency domain, thereby achieving comprehensive discriminative performance through cross-modal feature fusion.

4.5. Detailed Classification Analysis

Confusion matrices are employed to quantify the consistency between the predicted labels and the ground truth, while also revealing misclassification patterns among tasks. As shown in Figure 9, the confusion matrices of DB-STFFCNet on the BCI-IV-2a and BCI-IV-2b datasets are presented. In the BCI-IV-2a dataset, the model achieved classification accuracies of 84.21% and 87.90% for the left-hand and right-hand tasks, respectively, whereas the foot task, due to its lower specificity in activation patterns, exhibited the highest misclassification rate at 21.1% (primarily confused with the tongue task). In the BCI-IV-2b dataset, the model demonstrated high accuracy in distinguishing between left-hand and right-hand motor imagery, with correct recognition rates of 86.3% for the left-hand class and 83.8% for the right-hand class, indicating strong discriminative ability in this binary motor imagery task. However, there remains a certain degree of mutual misclassification, which may be attributed to overlapping spatiotemporal or spectral features between the left-hand and right-hand tasks. This suggests that further feature enhancement or adaptive strategies could potentially reduce inter-class confusion.

4.6. Evaluation of Parameter Selection

To evaluate the impact of different configurations of the multi-scale frequency convolution layer on model performance, we conducted parameter experiments on the BCI-IV-2a dataset. The experimental results indicate that optimizing the frequency convolution parameters can further enhance the overall performance of the model in time-frequency analysis tasks.
We systematically assessed four specific configurations of the multi-scale frequency convolution layer: no convolution, a 1 × 5 convolution, a 1 × 15 convolution, and a combination of 1 × 15 and 1 × 5 convolutions. As shown in Figure 10, while the single-scale 1 × 15 and 1 × 5 convolutions are each capable of extracting local frequency-domain features to a certain extent, their ability to capture multi-scale information is relatively limited. Conversely, the combined configuration not only effectively fuses features from convolution kernels of different scales, but also significantly enhances the model’s decoding accuracy. In comparison, configurations without convolution exhibit clear deficiencies in feature extraction, resulting in lower overall performance. This comparative analysis provides clear empirical evidence for optimizing frequency-domain feature extraction strategies.

5. Discussion

This study aims to enhance motor imagery EEG decoding by jointly modeling temporal, spatial, and spectral characteristics. To this end, we propose DB-STFFCNet, which features a dedicated frequency branch integrating FFT to transform EEG signals into the frequency domain, multiscale convolutions to capture discriminative spectral patterns across different frequency bands, and a Transformer encoder to model global spectral dependencies. Parallel to this, a spatiotemporal branch equipped with channel attention and a Transformer encoder is used to learn local and long-range temporal dependencies as well as spatial relationships among channels.
Unlike prior works such as EEGNet, ATCNet, Conformer, LMDANet, and FBCNet, which process temporal and frequency features sequentially or embed frequency information implicitly, DB-STFFCNet explicitly models and refines spectral features through its dedicated frequency branch. The dual-branch parallel design enables simultaneous extraction and fusion of temporal, spatial, and spectral representations, leading to richer and more discriminative features and ultimately superior classification performance.
To evaluate the model, we analyze both classification performance and the statistical properties of EEG signals. Specifically, Section 5.2 examines the temporal consistency of motor imagery EEG signals using power spectral density (PSD) and sliding window–based energy heatmaps. This analysis helps justify the stationarity assumption behind FFT and supports the neurophysiological validity of our frequency-domain modeling approach.

5.1. Analysis of Model Complexity

To further analyze the trade-off between classification performance and computational complexity, we compared our method with several representative models in terms of classification accuracy, number of trainable parameters, and floating-point operations (FLOPs), as shown in Table 6. Our method achieves the highest accuracy of 83.13%, surpassing EEGNet (74.93%), Conformer (79.34%), LMDANet (79.54%), and ATCNet (80.34%), and slightly outperforming TMSANet (82.45%). Although our model has 29.5 K parameters and 66.33 M FLOPs—moderately higher than compact networks like EEGNet (3.91 K parameters, 13.25 M FLOPs) and LMDANet (5.4 K parameters, 64.87 M FLOPs)—it remains significantly lighter than larger architectures such as Conformer (156.6 K parameters) and ATCNet (113.73 K parameters). Notably, compared with TMSANet (20.9 K parameters, 33 M FLOPs), our model achieves an accuracy gain of 0.68% at the cost of slightly higher computational overhead. These results indicate that our model strikes a favorable balance between performance and efficiency: the dual-branch design and encoder modules increase discriminative power with a manageable increase in complexity. In practical scenarios, such a trade-off ensures that the model remains lightweight enough for real-time applications on consumer-grade GPUs, while delivering consistently higher accuracy on the BCI-IV-2a dataset. This highlights the practical value of the proposed architecture in EEG-based motor imagery decoding tasks.

5.2. EEG Signal Analysis

To assess the spectral characteristics and temporal stability of the EEG data, we first evaluated the stationarity of the signal across trial windows, as the fast Fourier transform (FFT) inherently assumes signal stationarity. Preliminary analyses showed that the EEG segments within the selected time windows (1.5 to 6 s) were sufficiently stationary, with consistent power spectral density (PSD) patterns across trials and corroborating results from the sliding-window spectrogram analysis. These findings confirmed signal stability, validating the use of FFT for spectral feature extraction. The spectral analysis was based on the BCI-IV-2a dataset, which includes 288 motor imagery trials per subject. For each subject, the power spectral density (PSD) was computed for each trial and then averaged across trials to obtain a representative frequency profile. We preprocessed the data by removing artifact channels, interpolating missing values, and applying a 4–38 Hz band-pass filter to eliminate low- and high-frequency noise. These preprocessing steps ensured high-quality input for subsequent analysis. All recordings were conducted under uniform conditions. Dataset details are provided in Section 3.1.
Figure 11 shows the mean PSD curve, which exhibits the canonical EEG power profile: elevated power at low frequencies that gradually declines with increasing frequency. Specifically, the mu (8–13 Hz) and beta (13–30 Hz) bands—key markers of motor imagery—are clearly delineated. Although low-frequency bands display more pronounced variability, higher frequencies (>5 Hz) still demonstrate an approximately one-order-of-magnitude reduction in power, consistent with established EEG spectra. The shaded regions (±1 SD) in Figure 11 further attest to the reproducibility of these spectral features across channels and trials.
Complementing the PSD analysis, we performed a sliding-window spectrogram analysis (Figure 11), which shows that the energy levels remain relatively stable over time, without large fluctuations that might suggest strong artifacts or non-stationary disturbances. This is particularly evident in channels located over the sensorimotor cortex, such as C3 (which is the most relevant to motor imagery tasks) and C5, which also contribute to discriminative information.
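The PSD and sliding-window analyses can be reproduced along the following lines; Welch's method, the window lengths, and the channel index used to stand in for C3 are assumptions for illustration, and the trial array is a random placeholder.

```python
import numpy as np
from scipy.signal import welch, spectrogram

fs = 250.0
trials = np.random.randn(288, 22, 1000)   # placeholder for one subject's epochs

# Trial-averaged power spectral density per channel (Welch's method is an
# assumption; the text only states that PSD was computed per trial).
freqs, psd = welch(trials, fs=fs, nperseg=250, axis=-1)
mean_psd = psd.mean(axis=0)               # (22 channels, n_freqs)

# Sliding-window energy over time for one sensorimotor channel (e.g., C3).
f, t, sxx = spectrogram(trials[:, 7, :], fs=fs, nperseg=125, noverlap=100)
energy = sxx.mean(axis=0)                 # trial-averaged time-frequency energy
print(mean_psd.shape, energy.shape)
```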
These results support the neurophysiological plausibility and temporal stability of the EEG signals, which is critical for ensuring the validity of FFT-based feature extraction used in our classification model. Therefore, the PSD and energy analyses together provide qualitative evidence that the extracted features are based on stable and meaningful brain activity, rather than noise or transient artifacts. Nevertheless, we acknowledge that EEG signals are inherently non-stationary and that fixed-window FFT may overlook brief spectral transients. To address this limitation, future work will incorporate time-frequency approaches such as the short-time Fourier transform (STFT) and continuous wavelet transform (CWT), which can provide enhanced temporal resolution and better capture transient spectral dynamics during cognitively demanding motor imagery tasks.

6. Conclusions

In this study, we proposed a novel dual-branch neural network architecture, DB-STFFCNet, that effectively addresses the limitations of insufficient feature extraction and reliance on single-feature modes in motor imagery EEG (MI-EEG) decoding tasks. By innovatively fusing spatiotemporal and frequency-domain features, our model achieves significant improvements in both feature representation and decoding accuracy. Specifically, the spatiotemporal branch combines local attention convolution with a Transformer encoder mechanism to hierarchically capture both local and global dependencies in EEG signals; the frequency branch effectively extracts spectral information using FFT and further employs frequency convolution and Transformer modules to respectively learn local frequency features and global contextual information. This dual-branch collaborative strategy enables DB-STFFCNet to significantly outperform current mainstream methods on two public motor imagery EEG datasets, thereby validating the superiority and effectiveness of our approach in EEG signal decoding. However, this study still has some limitations. For instance, the evaluation of cross-subject generalization is insufficient, and the complexity of the feature fusion module leads to higher computational costs. Future research will focus on incorporating transfer learning to improve generalization, optimizing the module structure to reduce complexity, and exploring the extension of the model to multi-task and multi-paradigm EEG decoding applications.

Author Contributions

Conceptualization, H.H. and Z.Z. (Zhiyong Zhou); methodology, H.H.; software, H.H.; validation, H.H.; formal analysis, H.H.; investigation, Z.Z. (Zihan Zhang); resources, Z.Z. (Zhiyong Zhou); data curation, H.H.; writing—original draft preparation, H.H.; writing—review and editing, W.Y. and Z.Z. (Zhiyong Zhou); visualization, H.H. and Z.Z. (Zihan Zhang); supervision, W.Y. and Z.Z. (Zhiyong Zhou); project administration, H.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The dataset used during this study is available on the website https://bbci.de/competition/iv/ (accessed on 12 February 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Workflow of an EEG-based MI-BCI system.
Figure 2. The framework of the dual-branch spatio-temporal-frequency fusion convolutional network (DB-STFFCNet). DB-STFFCNet contains four components: preprocessing and data augmentation, the frequency feature extraction module, the spatiotemporal feature extraction module, and feature fusion and classification.
Figure 3. (a) Scaled dot-product attention; (b) multi-head attention; (c) Transformer encoder.
Figure 4. Comparison of classification accuracy of the proposed model with other methods on the test set in BCI-IV-2a.
Figure 5. Comparison of classification accuracy of the proposed model with other methods on the test set in BCI-IV-2b.
Figure 6. t-SNE visualizations of feature representations for Subject 7 on the BCI-IV-2a dataset under different model configurations. The visualizations illustrate the impact of excluding key components: (a) full model, (b) without FFE module, (c) without data augmentation, (d) without both FFE and STFE encoders, and (e) without the STFE module. Distinct colors represent different motor imagery classes.
Figure 7. t-SNE visualizations of feature representations for Subjects 1, 3, and 9 on the BCI-IV-2a dataset using the full model: (a) Subject 1; (b) Subject 3; (c) Subject 9.
Figure 8. Training loss and test accuracy of the proposed DB-STFFCNet on the BCI-IV-2a dataset.
Figure 9. Subject-averaged confusion matrices of the DB-STFFCNet model on the BCI-IV-2a and BCI-IV-2b datasets.
Figure 10. Performance comparison of frequency-domain convolution strategies (no convolution, 1 × 5 kernel, 1 × 15 kernel, and hybrid 1 × 5 + 1 × 15 configuration) on the BCI-IV-2a dataset classification task.
Figure 11. Average power spectral density (PSD) and sliding-window-based spectrotemporal heatmaps of motor imagery EEG signals on the BCI-IV-2a dataset. The left subfigure shows the average PSD across trials, while the right subfigure illustrates spectrotemporal patterns using a sliding window approach.
Table 1. Summary of the BCI-IV-2a and IV-2b datasets used in this study.
Dataset    Subjects  Channels  Trials  Classes  Sampling Rate (Hz)  Trial Duration (s)
BCI-IV-2a  9         22        576     4        250                 4
BCI-IV-2b  9         3         720     2        250                 4
Table 2. Comparison of classification accuracy (%) for BCI-IV-2a.
Methods    A01    A02    A03    A04    A05    A06    A07    A08    A09    Average  Std    Kappa
EEGNet     85.10  64.24  84.72  68.40  60.42  57.64  84.38  83.33  86.11  74.93    11.99  0.66
FBCNet     83.53  57.64  85.76  78.27  73.81  56.25  84.13  82.64  82.99  76.11    11.45  0.69
Conformer  86.73  60.12  94.25  77.38  59.87  67.45  91.69  89.01  87.56  79.34    13.62  0.72
LMDA-Net   87.15  68.44  92.01  76.74  66.54  61.46  92.36  85.07  86.11  79.54    11.61  0.63
ATCNet     85.21  63.89  92.70  76.98  79.72  67.33  89.12  85.45  82.67  80.34    9.60   0.73
TMSA-Net   86.75  63.48  95.92  83.16  79.28  66.89  92.47  89.35  84.79  82.45    10.99  0.76
Ours       89.58  69.89  93.06  82.99  74.10  67.71  94.10  89.89  86.81  83.13    10.09  0.78
Note: The values in bold represent the best results.
Table 3. Comparison of classification accuracy (%) for BCI-IV-2b.
Methods    B01    B02    B03    B04    B05    B06    B07    B08    B09    Average  Std    Kappa
EEGNet     75.00  62.50  60.42  98.33  80.00  88.33  85.00  93.33  90.83  81.53    13.32  0.72
FBCNet     79.42  56.83  61.27  96.15  92.49  86.12  81.90  91.37  88.95  81.61    13.84  0.64
ATCNet     72.85  62.10  86.42  95.10  92.50  89.91  90.25  95.80  89.92  86.09    11.26  0.73
LMDA-Net   82.58  62.78  74.80  99.60  95.52  92.33  90.43  95.89  93.69  87.51    11.98  0.73
Conformer  78.43  71.92  84.17  96.84  96.55  88.62  91.78  93.41  91.66  88.15    8.46   0.76
TMSA-Net   82.17  70.58  87.25  97.82  97.95  90.11  93.04  94.17  86.95  88.89    8.63   0.77
Ours       83.83  68.56  78.77  99.80  95.76  94.89  92.56  96.68  94.97  89.54    10.31  0.78
Note: The values in bold represent the best results.
Table 4. Ablation studies on BCI-IV-2a.
Method                        Accuracy (%)  Training Time (s)  Inference Time (ms)
Ours                          83.13         1.19               0.479
Ours-w/o FFE                  80.19         1.03               0.476
Ours-w/o FFE + encoder(STFE)  79.54         0.87               0.472
Ours-w/o STFE                 60.81         0.61               0.465
Ours-w/o STFE + encoder(FFE)  58.79         0.53               0.486
Ours-w/o Augmentation         75.49         0.79               0.531
Ours-w/o encoder(STFE)        79.84         1.11               0.524
Ours-w/o encoder(FFE)         78.61         1.17               0.542
Ours-w/o encoder(STFE+FFE)    75.87         0.88               0.476
Table 5. Ablation studies on BCI-IV-2b.
Method                        Accuracy (%)  Training Time (s)  Inference Time (ms)
Ours                          89.54         0.83               0.243
Ours-w/o FFE                  88.22         0.54               0.206
Ours-w/o FFE + encoder(STFE)  87.51         0.22               0.125
Ours-w/o STFE                 71.47         0.30               0.163
Ours-w/o STFE + encoder(FFE)  68.58         0.16               0.144
Ours-w/o Augmentation         80.97         0.54               0.178
Ours-w/o encoder(STFE)        82.73         0.40               0.131
Ours-w/o encoder(FFE)         78.75         0.48               0.163
Ours-w/o encoder(STFE+FFE)    79.89         0.20               0.10
Table 6. Comparison of model complexity (parameters and FLOPs) and classification performance (accuracy) on the BCI-IV-2a dataset.
Method     Parameters  FLOPs    Accuracy (%)
Ours       29.5 k      66.33 M  83.13
EEGNet     3.91 k      13.25 M  74.93
Conformer  156.56 k    71.35 M  79.34
LMDA-Net   5.4 k       64.87 M  79.54
TMSA-Net   20.9 k      33 M     82.45
ATCNet     113.73 k    29.79 M  80.34