1. Introduction
Brain–computer interfaces (BCIs), which convert neural signals into device control, create real-world possibilities for rehabilitation and assistive mobility. Among non-invasive modalities, electroencephalography (EEG) is appealing because of its safety, portability, and high temporal resolution [1]. Combined with modern artificial intelligence, and deep learning in particular, EEG supports end-to-end neural decoding based on discriminative features learned directly from the data [2]. This combination is key for applications in which users must express intent without relying on residual muscle activity, e.g., operating a powered wheelchair.
Motor imagery (MI) provides a natural control paradigm for EEG BCIs. Imagining limb movements modulates oscillatory activity over sensorimotor cortices, with characteristic changes in the mu (about 8–12 Hz) and beta (about 13–30 Hz) bands. These sensorimotor rhythms exhibit event-related desynchronization and synchronization (ERD/ERS) that track imagined actions and their timing [3]. Because MI requires no overt movement, it is suitable for people with severe motor impairments and can be mapped to navigation primitives such as start, stop, and turns, or to higher-level behaviors orchestrated by a shared controller [1,4].
Although MI-based BCIs have advanced significantly, largely through deep learning, accurately decoding MI from EEG remains challenging. Signals typically have a low signal-to-noise ratio, are non-stationary across trials and sessions, and are highly variable between subjects. Consequently, models trained on one subject often fail when applied to another, and many current BCI systems still rely on a time-consuming calibration procedure for each new user. Achieving “plug-and-play” operation requires methods that learn subject-invariant representations while still capturing the relevant spatial and temporal structure of MI. Recent research in cross-subject decoding frames this challenge as a distribution shift problem and proposes domain generalization strategies to address it [5].
A large body of research has applied both traditional signal processing and deep learning to meet these requirements. Early BCI systems relied on handcrafted features (such as band power and spatial filters) and shallow classifiers, but they struggled with the non-stationarity of the data and the complexity of multi-class classification. Compact convolutional neural networks, notably EEGNet, then demonstrated that depthwise and separable convolutions can learn spatial–spectral structure directly from raw EEG and generalize well across paradigms with a low parameter count [6].
More recent research has proposed architectures that use attention mechanisms to model longer-term temporal dependencies. Hybrid CNN-Transformer architectures combine the local inductive biases of convolutional layers, which extract spatial and spectral features, with the global context integration of self-attention layers; in MI classification, these architectures have shown improvements over purely convolutional ones, in addition to better interpretability through attention maps [7]. However, much of the existing research still focuses on either spatial or temporal aspects of the data, requires subject-specific adaptation, and/or provides limited insight into how decisions depend on specific combinations of channels and time segments [2,5].
Taken together, recent advances in motor imagery decoding indicate that no single modeling strategy is sufficient to fully address the challenges of practical cross-subject BCIs. Multi-scale convolutional architectures such as TSception [8] and HS-CNN [9] improve sensitivity to EEG patterns evolving at different temporal resolutions, while recent transformer-based and hybrid sequence models, including EEG-Conformer [10], GRUGate-Trans [11], and t-CTrans [12], enhance contextual integration and long-range temporal modeling through attention and recurrent gating mechanisms. In parallel, domain adaptation and domain generalization approaches explicitly address inter-subject distribution shifts by promoting more invariant representations across users. Overall, these developments suggest that robust cross-subject decoding requires simultaneously handling multi-scale temporal structure, informative context aggregation, and heterogeneous subject distributions.
In this context, we target robust cross-subject MI decoding by combining multi-scale temporal feature extraction with explicit within-trial sequence modeling. Concretely, we:
Introduce a multi-branch 1D CNN–LSTM architecture that captures complementary temporal scales and integrates evidence over sub-windows via late fusion;
Adopt group normalization to improve training stability and reduce dependence on batch statistics in mixed-subject training;
Evaluate in strictly subject-independent settings (5-fold cross-subject CV and LOSO) on large EEGMMI cohorts (105–109 subjects) with controlled windowing/sub-windowing;
Quantify the benefit of rapid subject-specific transfer learning following Wang et al. [13], assessing both zero-shot and few-shot personalization.
2. Related Works
Classical pipelines for motor imagery (MI) EEG decoding relied on hand-crafted features and shallow classifiers. The common spatial pattern (CSP) algorithm was a cornerstone for two-class MI because it maximizes variance differences between classes, yet it typically requires supervised, subject-specific calibration to achieve reliable control [14]. Multi-subject learning strategies alleviate this burden by estimating spatial filters from groups of participants; for instance, Devlaminck et al. [15] showed that leveraging data from other users can reduce the amount of labeled trials needed for a new subject. In parallel, Riemannian geometry methods treat trial covariance matrices as points on the manifold of symmetric positive definite (SPD) matrices and use geodesic distances or tangent-space projections to obtain robust representations [16]. These approaches often outperform CSP and reduce calibration time in MI tasks [14], but they still depend on manually designed filter banks and shallow decision rules and, crucially, they struggle to capture the long-range temporal structure of non-stationary EEG.
The advent of deep convolutional neural networks (CNNs) has enabled end-to-end learning of spatial–spectral filters from raw EEG, paving the way for more capable decoders. Schirrmeister et al. [17] demonstrated that shallow and deep ConvNets can match or outperform filter-bank-based methods for MI, with the added benefit that the learned features can be visualized.
Lawhern et al. [6] proposed EEGNet, a compact architecture that employs depthwise and separable convolutions to disentangle spatial from temporal filtering, achieving competitive accuracy with few parameters and generalizing across multiple BCI paradigms. Building on these ideas, multi-scale temporal CNNs deploy parallel or hybrid kernel sizes to capture rhythms spanning alpha/beta bands; for example, HS-CNN integrates multiple temporal receptive fields within each layer and improves MI performance by aggregating information across scales [9]. Such models learn richer spatial–spectral features than classical pipelines, but their temporal receptive field remains bounded by kernel size and depth; without an explicit sequence module, they may under-exploit dependencies spread across sub-windows within a trial.
To address temporal integration, many works couple CNN front-ends with recurrent neural networks (RNNs). In these hybrid architectures, usually a CNN extracts features from short segments of EEG recordings and an RNN (e.g., LSTM/GRU) aggregates them over the 3–5 s MI trial. Zhang et al. [18] introduced a parallel CNN-RNN scheme for four-class MI that outperformed standalone CNNs, and Khademi et al. [19] showed that a transfer-learning-based CNN-LSTM markedly improves BCI Competition IV-2a performance after fine-tuning. Temporal convolutional networks (TCNs) with dilated convolutions provide a non-recurrent alternative to enlarge the temporal receptive field [9]. Overall, these hybrids consistently help when discriminative cues are distributed over time (e.g., early cue-locked responses vs. sustained ERD/ERS). However, their effectiveness depends on how segments are defined, which normalization is used to stabilize training across subjects, and how temporal features are fused.
Attention mechanisms and transformer-based models have been applied to MI EEG to weight informative cue/ERD–ERS portions of each trial and to model dependencies across sensorimotor channels. Representative examples include EEG-Conformer [10] and related local–global hybrids. Zhang et al. [7] proposed an approach which pairs a CNN local feature extractor with a transformer encoder that models global structure via multi-head self-attention; in MI classification, this design reported gains of a few percentage points over CNN baselines, alongside attention maps that highlight physiologically plausible channels and time ranges. Tao et al. [11] enhanced the standard Transformer by replacing residual connections with GRU-based gating mechanisms, applied to EEG sequences pre-processed through temporal and spatial 1D convolutions, to improve long-range dependency modeling for motor imagery and visual EEG classification. Xie et al. [12] designed a Transformer-based model that jointly exploits spatial (cross-channel) and temporal dependencies in raw EEG signals, demonstrating strong cross-subject generalization on motor imagery tasks from the PhysioNet dataset. More recently, Muna et al. [20] proposed SSTAF, combining a spectral and a spatial Transformer branch, fed with STFT-derived time-frequency features, to simultaneously attend to discriminative patterns across the spectral, spatial, and temporal domains for motor imagery classification. While attention improves modeling power and provides transparent saliency, transformer-heavy architectures can be data- and computation-hungry and require careful regularization to avoid overfitting noisy EEG [2].
Despite these advances, robust generalization to unseen subjects remains a major bottleneck for practical MI BCIs. Inter-subject variability affects both the spatial signatures of sensorimotor rhythms (due to anatomy, montage differences, and impedance) and the temporal profile of ERD/ERS within each trial (latency, duration, and strategy-dependent timing). As a result, models that emphasize a single temporal scale or that aggregate features without explicitly modeling within-trial dynamics may under-utilize informative time structure and degrade under subject shift. In addition, training stability can be sensitive to normalization choices when batches mix different subjects. For this reason, we focus on architectures that (i) capture complementary temporal scales, (ii) integrate evidence explicitly over a sequence of sub-windows, and (iii) employ normalization schemes that are less dependent on batch statistics.
A first approach to this problem is unsupervised domain adaptation, which aligns the representations of the source and target subjects during training: adversarial approaches in the spirit of DANN add a gradient-reversal branch that encourages subject-invariant features and have improved subject-independent MI decoding [21]. Riemannian alignment (for example, Euclidean/whitening alignment of covariance or tangent-space centering) is complementary and forms a strong baseline for cross-subject MI [16]. Recent surveys emphasize that deep decoders without explicit alignment or normalization can underperform on new users and that robust normalization is crucial for cross-subject stability [2]. Beyond adaptation, domain generalization (DG) trains on multiple sources to learn invariances transferable to unseen users; teacher–student distillation, correlation alignment, and data-centric augmentations (noise, time-warping, mixup) have all been explored, often in combination, to mitigate inter-subject shift [2,5]. Despite encouraging progress, reported gains vary with datasets, windowing choices, and evaluation protocols, underscoring the need for carefully controlled comparisons.
Even with domain adaptation or domain generalization, a residual “last-mile” gap often remains when moving to a new user or recording setup. In many cases, however, collecting a small amount of labeled data from the target user is feasible and can substantially improve reliability. Subject-specific transfer learning (SS-TL) is therefore a pragmatic strategy: a decoder is pretrained on a large multi-subject cohort and then quickly fine-tuned with a limited number of labeled trials from the new subject.
A common workflow pretrains a compact decoder on many subjects and then fine-tunes a small head or adapter using a few labeled trials from the new user. Wang et al. [13] showed that an EEGNet-based MI decoder can be efficiently adapted on a low-power edge device, with brief on-device updates yielding consistent gains without cloud retraining. Similar benefits were observed for CNN-LSTM hybrids [19]. SS-TL thus trades a short calibration session (minutes) for improved reliability, often the most practical option for deployment.
In summary, while prior work has made substantial progress in spatial filtering, temporal modeling, and subject-specific adaptation, robust generalization to unseen users remains challenging. This motivates architectures that jointly capture multi-scale temporal structure, explicitly model within-trial dynamics, and rely on normalization schemes that remain stable across heterogeneous subjects—objectives that guide the design of our proposed CNN–LSTM model.
3. Proposed Model
Our architecture, illustrated in Figure 1, is a multi-branch CNN-LSTM designed to learn complementary temporal features at different receptive-field scales and to integrate them over the sequence of sub-windows that compose each event. The network expects as input a 4D tensor of shape $B \times T \times C \times L$, where $B$ is the batch size, $T$ the number of sub-windows in the event, $C$ the number of input channels, and $L$ the number of time samples per sub-window. Throughout this section we keep $C$ generic, with the only constraint being that it must be divisible by eight due to the group normalization configuration described below.
The model is divided into multiple components: a feature extractor, followed by temporal feature analysis and classification. The first component is a parallel 1D-CNN feature extractor with four branches operating in parallel on the same input. Each branch uses a kernel size drawn from the ordered set $\{7, 13, 25, 31\}$, so that each branch specializes to a distinct temporal scale. Given the common resampling rate of 160 Hz adopted in all experiments, the kernel sizes 7, 13, 25, 31 correspond to temporal receptive fields of approximately 44–194 ms. This range was chosen to capture short-term transient modulations of sensorimotor rhythms while preserving neighboring temporal resolutions across branches. Rather than targeting a single frequency band explicitly, the proposed design enables the network to learn complementary temporal patterns related to mu/beta desynchronization and rebound phenomena [22]. Within every branch, temporal processing is performed by stacking in series two identical ConvBlocks. Each ConvBlock consists of two temporal convolutions that preserve the channel dimension, combined with group normalization, rectification, and max-pooling downsampling along time. Convolutions use unit stride and are bias-enabled; “same” padding keeps the temporal length invariant inside each convolution so that the only temporal downsampling comes from the max pooling. Because two ConvBlocks are applied in sequence, the temporal length is reduced by a factor of four within each branch, whereas the channel count remains $C$. Implementation-wise, the tensor is first reshaped to $(B \cdot T) \times C \times L$ to process the $T$ sub-windows independently; after the two blocks, each branch produces a tensor of shape $(B \cdot T) \times C \times L/4$, which is then flattened to $B \times T \times F$ with $F = C \cdot L/4$. This flattening makes the subsequent recurrent module agnostic to the spatial arrangement of channels while retaining all information preserved by the convolutions.
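The shape bookkeeping of a single branch can be checked with a few lines of plain Python. The following is a minimal sketch rather than the authors' implementation: the function names are hypothetical, the per-block pooling factor of 2 is inferred from the stated overall reduction by four, and integer (floor) division is assumed for the pooled length.

```python
# Minimal sketch (not the authors' code): kernel spans and tensor shapes
# for one branch. All function names are hypothetical.

SAMPLING_RATE_HZ = 160            # resampling rate stated in the text
KERNEL_SIZES = (7, 13, 25, 31)    # one kernel size per branch
NUM_CONV_BLOCKS = 2               # ConvBlocks stacked in each branch
POOL_FACTOR = 2                   # assumed per-block pooling factor (2 * 2 = 4 overall)


def kernel_span_ms(kernel_size, rate_hz=SAMPLING_RATE_HZ):
    """Duration covered by a single kernel, in milliseconds."""
    return 1000.0 * kernel_size / rate_hz


def branch_shapes(batch, sub_windows, channels, samples):
    """Shapes seen by one branch, from the 4D input to the LSTM input."""
    if channels % 8 != 0:
        raise ValueError("channel count must be divisible by eight")
    pooled = samples
    for _ in range(NUM_CONV_BLOCKS):
        pooled //= POOL_FACTOR    # 'same' padding: only pooling shortens time
    return {
        "input": (batch, sub_windows, channels, samples),
        "reshaped": (batch * sub_windows, channels, samples),
        "after_blocks": (batch * sub_windows, channels, pooled),
        "lstm_input": (batch, sub_windows, channels * pooled),
    }


spans = [kernel_span_ms(k) for k in KERNEL_SIZES]
shapes = branch_shapes(batch=16, sub_windows=4, channels=64, samples=160)
print(spans)                  # [43.75, 81.25, 156.25, 193.75]
print(shapes["lstm_input"])   # (16, 4, 2560)
```

With 64 channels and 1 s sub-windows at 160 Hz, each sub-window is therefore summarized by a 2560-dimensional vector before it reaches the recurrent module, and the four kernel spans reproduce the 44–194 ms range quoted above.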
Temporal processing across the $T$ sub-windows is performed per branch by a single-layer LSTM configured with batch_first=True, hidden size $H$, input size $F$ (the flattened feature dimension of each sub-window), and default PyTorch (v 2.7.1) gates and activations (no recurrent dropout, unidirectional). Feeding the flattened $B \times T \times F$ sequence yields a $B \times T \times H$ sequence of hidden states per branch. The four sequences are then concatenated along the feature axis, resulting in a $B \times T \times 4H$ tensor. This late fusion preserves the per-time-step alignment across branches while maximizing the representational capacity; in practice, it is equivalent to running four independent LSTMs and stacking their outputs channel-wise.
Classification is performed at the time-step level by a lightweight MLP applied identically to each row of the fused sequence. Concretely, the fused tensor is flattened across the batch and time axes, passed through a fully connected layer to 384 units, followed by three intermediate operations, and finally projected to logits by a second fully connected layer. The output is reshaped back to one logit vector per sub-window of each event. We deliberately keep the classifier shallow to concentrate capacity in the recurrent integration, which empirically carries the bulk of temporal modeling. During training, a standard cross-entropy loss is computed at every time step using the event label repeated $T$ times; this objective encourages consistent predictions across the sub-windows of the same event without imposing hard constraints on the internal dynamics. At inference time, if a single decision per event is desired, the model’s per-step logits can be averaged across the $T$ positions (or, alternatively, a majority vote can be taken on the per-step argmax, which usually but not always yields the same decision), but such aggregation is external to the architecture and does not affect training. The overall architecture, in the configuration with four branches and the subsequent temporal correlation step and classification, presents a computational complexity corresponding with GFLOPs.
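To make the event-level aggregation concrete, the sketch below implements both rules mentioned above, averaging per-step logits and majority voting on per-step argmax, in plain Python. The helper names and the logit values are made up for illustration; the example is chosen so that the two rules, which usually agree, give different decisions.

```python
# Illustrative sketch: two ways to turn per-step logits into one event decision.
# Helper names and logit values are hypothetical.
from collections import Counter


def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def decide_by_mean_logits(step_logits):
    """Average logits over the T sub-windows, then take the argmax."""
    num_steps = len(step_logits)
    num_classes = len(step_logits[0])
    mean = [sum(step[c] for step in step_logits) / num_steps
            for c in range(num_classes)]
    return argmax(mean)


def decide_by_majority_vote(step_logits):
    """Take the argmax per sub-window, then return the most frequent class."""
    votes = Counter(argmax(step) for step in step_logits)
    return votes.most_common(1)[0][0]


# Made-up logits for T = 4 sub-windows and 3 classes.
logits = [
    [0.0, 5.0, 0.0],   # one very confident step for class 1
    [1.0, 0.9, 0.0],   # three marginal steps for class 0
    [1.0, 0.9, 0.0],
    [1.0, 0.9, 0.0],
]
mean_decision = decide_by_mean_logits(logits)
vote_decision = decide_by_majority_vote(logits)
print(mean_decision, vote_decision)   # 1 0
```

Averaging lets a single confident sub-window outweigh several marginal ones, whereas voting weights every sub-window equally; which behavior is preferable depends on how evenly the discriminative evidence is spread over the trial.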
4. Datasets
We evaluate the proposed model on a large public EEG motor imagery dataset and on a smaller ISLab-MI dataset acquired with a wearable system, in order to assess both cross-subject robustness and portability across recording setups.
4.1. PhysioNet EEG Motor Movement/Imagery Dataset
We evaluate the proposed model on the EEG Motor Movement/Imagery (EEGMMI) dataset hosted on PhysioNet [23]. EEGMMI includes recordings from 109 subjects who performed 14 experimental runs spanning motor execution and motor imagery (MI) tasks. Signals were acquired from 64 scalp electrodes arranged according to the international 10–10 system. Following common practice, we focus on four MI classes (left fist, right fist, both fists, both feet) and a resting state (0). Unless otherwise specified, our subject pool includes 105 participants (excluding subjects 89, 92, 100, and 104 due to data inconsistencies). Comparisons that require all participants are reported on the full 109-subject set for completeness. Each recording is: (i) notch-filtered at 60 Hz to suppress power-line noise; (ii) resampled to a uniform 160 Hz; and (iii) z-score normalized per channel using statistics computed strictly on the training subjects within each cross-validation fold. Those statistics are then applied to both training and test sets for the fold. This step proved important for training stability and cross-subject generalization.
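The fold-wise normalization in step (iii) can be sketched as follows. This is an illustrative plain-Python version under stated assumptions: recordings are nested lists (channels by samples), statistics are pooled over all training recordings, the population standard deviation is used, and the function names and the small `eps` guard are hypothetical choices not specified in the text.

```python
# Hedged sketch of per-channel z-scoring with training-only statistics.
from statistics import fmean, pstdev


def fit_channel_stats(train_recordings):
    """Per-channel mean and std pooled over all training recordings.

    Each recording is a list of channels; each channel is a list of samples.
    """
    num_channels = len(train_recordings[0])
    stats = []
    for ch in range(num_channels):
        pooled = [x for rec in train_recordings for x in rec[ch]]
        stats.append((fmean(pooled), pstdev(pooled)))
    return stats


def apply_zscore(recording, stats, eps=1e-8):
    """Standardize one recording with previously fitted statistics."""
    return [[(x - mean) / (std + eps) for x in channel]
            for channel, (mean, std) in zip(recording, stats)]


# Toy data: two training recordings and one held-out recording, 2 channels each.
train = [[[1.0, 3.0], [10.0, 10.0]], [[5.0, 7.0], [30.0, 30.0]]]
test = [[4.0, 8.0], [20.0, 40.0]]
stats = fit_channel_stats(train)      # fitted on training subjects only
test_z = apply_zscore(test, stats)    # the same statistics reused for test data
print(stats[1])                       # (20.0, 10.0)
```

Fitting the statistics on training subjects only, and reusing them unchanged for the held-out subjects, keeps information about the test subjects out of the preprocessing.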
Each labeled event (macro-window) lasts 4 s and is partitioned into a sequence of $T$ non-overlapping sub-windows of duration $d$ seconds, yielding $T = 4/d$. With $C = 64$ channels and a sampling rate of 160 Hz, each sub-window contains $L = 160\,d$ time samples. The resulting input tensor for a batch of $B$ macro-windows has shape $B \times T \times C \times L$, matching the model contract introduced in Section 3. An illustration of the segmentation is provided in Figure 2.
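The segmentation can be illustrated with a short plain-Python sketch. The function name and the nested-list data layout are assumptions made for the example; only the 4 s event length, the 160 Hz rate, and the non-overlapping partition come from the text.

```python
# Hedged sketch: split one 4 s event into non-overlapping sub-windows.
SAMPLING_RATE_HZ = 160
EVENT_SECONDS = 4


def split_event(event, sub_window_seconds):
    """Split one event (channels x samples) into T sub-windows (T x channels x L)."""
    samples_per_window = round(SAMPLING_RATE_HZ * sub_window_seconds)
    total = SAMPLING_RATE_HZ * EVENT_SECONDS
    if any(len(ch) != total for ch in event):
        raise ValueError("each channel must hold exactly 4 s of samples")
    if total % samples_per_window != 0:
        raise ValueError("sub-window length must divide the 4 s event")
    num_windows = total // samples_per_window
    return [[ch[t * samples_per_window:(t + 1) * samples_per_window] for ch in event]
            for t in range(num_windows)]


# Toy event: 64 channels, 640 samples each (4 s at 160 Hz).
event = [list(range(640)) for _ in range(64)]
windows = split_event(event, sub_window_seconds=1.0)
print(len(windows), len(windows[0]), len(windows[0][0]))   # 4 64 160
```

Halving the sub-window duration to 0.5 s doubles the sequence length to eight sub-windows of 80 samples each, which is how shorter sub-windows trade per-step context for a longer sequence.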
4.2. The ISLab-MI Dataset
In order to test model generalization capabilities, we collected a motor imagery (MI) dataset, called ISLab-MI Dataset, with a wearable g.Nautilus PRO system. Recordings were performed on five healthy adult volunteers (20–40 years) recruited from the university community. Each participant completed one experimental visit in a quiet laboratory room, seated comfortably in front of a monitor and instructed to minimize head and body movements and to avoid eye blinks during the active phases. The visit was structured as two consecutive EEG sessions separated by a short rest break, yielding two internally consistent datasets per subject; across participants and visits this resulted in ten recording sessions in total. The amplifier was used with the full 32-electrode wet cap arranged according to an extended 10–20 layout, plus one auxiliary channel, for a total of 33 recorded channels. Raw EEG was acquired with g.Recorder at 500 Hz, 24-bit resolution and an input range of mV, so as to capture small-amplitude oscillatory activity while preserving headroom for artifacts. Event markers were delivered through the digital inputs of the base station to timestamp the onset and duration of each trial.
The MI protocol was based on a small set of discrete imagined movements designed for powered-wheelchair control. Each trial followed a structured sequence of phases: a fixation period in which the subject focused on a central cross; a cue phase in which a visual stimulus indicated the upcoming task; a perform phase during which the subject continuously imagined the required movement; and a blank period where the subject relaxed. The four MI classes corresponded to opening and closing the right hand, opening and closing the left hand, opening and closing both hands, and dorsiflexion of both feet, respectively. A no-movement baseline was also recorded and later treated as a separate rest class (0). Within each session, stimuli were presented in pseudo-random order with an equal number of repetitions per class so that all commands were equally represented.
To align training and inference with the model presented in Section 3, we converted all recordings to a uniform 32-channel montage (10–10 subset) and a sampling rate of 160 Hz. The 32 channels were derived from the original 64-channel EEG montage by selecting a consistent subset of electrodes within the 10–10/10–20 system, so that all sessions shared the same spatial layout. In practice, we chose one electrode from each pair of closely spaced 10–10 sites and kept the main midline positions (Fpz, Fz, Cz, Pz, Oz). This produced a balanced set of frontal, central, parietal, temporal, and occipital channels, preserving the overall structure of the original montage while ensuring a uniform 32-channel configuration. Signal conversion and preprocessing followed a fixed pipeline: a mild high-pass at 0.2 Hz at the native sampling rate, polyphase resampling to 160 Hz, a 50 Hz notch (EU mains), and a 0.5–40 Hz band-pass. Signals were scaled to microvolts and channels were reordered to the 32-target layout used throughout this work. Events were reconstructed from the trigger channel using the “PERFORM” schedule adopted in our protocol. Each occurrence of the perform cue was mapped to one of four labels: left hand (L), right hand (R), both hands (F), or both feet (B), while “blank” segments were split into two 4 s rest windows (label 0). The final archive for each session follows a compact HDF5 layout containing raw data, an event table with onset, duration, and label, and a convenience group with per-channel statistics computed once for analysis reproducibility.
Unless stated otherwise, we adopt the same temporal framing used for EEGMMI: each labeled event is a 4 s macro-window decomposed into non-overlapping sub-windows of 1 s (160 samples at 160 Hz), yielding inputs of shape $B \times 4 \times 32 \times 160$. For zero-shot evaluation, we standardize channels using the statistics associated with the pre-trained EEGMMI models, ensuring strict consistency between training and deployment. We report three classification settings that mirror our public-data experiments: L/R/0 (3-class), L/R/0/F (4-class), and L/R/0/F/B (5-class).
4.3. Experimental Setup
We employ two distinct training and evaluation setups: one for cross-subject modeling on the public EEGMMI dataset and one for transfer learning and personalization on the ISLab-MI dataset.
4.3.1. Training on EEGMMI
Unless otherwise noted, results are obtained with 5-fold subject-independent cross-validation on 105 subjects. In each split, one fold is held out for testing and the remaining folds are used for training. The training portion is then further split to obtain a validation set for each fold. Models are trained for up to 40 epochs with AdamW (learning rate , weight decay ) and batch size 16. At the end of every epoch we evaluate accuracy and loss on the validation set. Early stopping halts training if validation accuracy does not improve for 10 consecutive epochs, and the score reported for the fold is obtained with the model weights at the stopping epoch. Cross-entropy loss is computed over the predictions of all sub-windows.
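The two protocol elements that are fully specified above, the subject-wise folds and the patience-based stopping rule, can be sketched in plain Python. The function names, the shuffling seed, and the round-robin fold assignment are assumptions for illustration; the validation split is omitted because its ratio is not given here.

```python
# Hedged sketch: subject-independent folds and patience-based early stopping.
import random


def subject_folds(subject_ids, num_folds=5, seed=0):
    """Shuffle subjects and deal them into disjoint folds."""
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::num_folds] for i in range(num_folds)]


def stopping_epoch(val_accuracies, patience=10, max_epochs=40):
    """1-based epoch at which training stops under the patience rule."""
    best = float("-inf")
    epochs_without_gain = 0
    for epoch, acc in enumerate(val_accuracies[:max_epochs], start=1):
        if acc > best:
            best, epochs_without_gain = acc, 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                return epoch
    return min(len(val_accuracies), max_epochs)


excluded = {89, 92, 100, 104}                      # subjects excluded in the text
subjects = [s for s in range(1, 110) if s not in excluded]
folds = subject_folds(subjects)
test_subjects = folds[0]
train_subjects = [s for fold in folds[1:] for s in fold]
history = [0.50, 0.60] + [0.55] * 12               # made-up validation accuracies
print(len(test_subjects), len(train_subjects), stopping_epoch(history))   # 21 84 12
```

Because folds are built from subject identifiers rather than from trials, no subject contributes data to both the training and the test side of a split.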
4.3.2. Pre-Training and Adaptation to the ISLab-MI Dataset
We retrained the proposed CNN-LSTM on EEGMMI restricted to the same 32-channel subset and imagery runs, using a leave-one-subject-out (LOSO) protocol over all 109 subjects. For each held-out subject we tracked the best epoch; the fixed training horizon for the final weights was set to the median of these best epochs, computed per task: eight epochs for the five-class configuration and seven epochs for the three- and four-class configurations. We then trained on the full 109-subject cohort at the corresponding horizon to obtain the final zero-shot models used below.
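As a small illustration of how the fixed training horizon can be derived, the sketch below takes the median of per-subject best epochs. The function name and the example values are hypothetical; an odd number of held-out subjects (as with 109) is assumed so that the median is an observed epoch.

```python
# Hedged sketch: median best epoch across LOSO folds as a fixed epoch budget.
from statistics import median


def training_horizon(best_epochs):
    """Median best epoch across held-out subjects."""
    if len(best_epochs) % 2 == 0:
        raise ValueError("expects an odd number of folds so the median is an observed epoch")
    return int(median(best_epochs))


best_epochs = [6, 7, 7, 8, 9, 8, 12]   # made-up best epochs, one per held-out subject
horizon = training_horizon(best_epochs)
print(horizon)   # 8
```

Using the median rather than the mean keeps the horizon insensitive to a few subjects whose best epoch occurs unusually late.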
We evaluated both zero-shot transfer and rapid subject-specific adaptation on our dataset. For adaptation, we adopted a simple and reproducible split: the first 20% of events in each session were used for fine-tuning and the remaining 80% for testing. Fine-tuning ran for 10 epochs with Adam (classifier head learning rate , backbone ), computing the loss on the per event decision obtained by averaging the model’s per sub-window logits. This protocol matches the SS-TL spirit of prior work while keeping the amount of subject data and training time minimal.
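The adaptation split and the per-event loss can be sketched as follows. This is a plain-Python illustration with hypothetical function names; truncating the 20% cut to an integer is an assumption, since the rounding rule is not stated.

```python
# Hedged sketch: chronological 20/80 split and event-level cross-entropy.
import math


def split_session(events, fine_tune_fraction=0.2):
    """First 20% of events (in recording order) for fine-tuning, rest for testing."""
    cut = int(len(events) * fine_tune_fraction)   # assumed rounding: truncate
    return events[:cut], events[cut:]


def event_cross_entropy(step_logits, label):
    """Cross-entropy on the logits averaged over the sub-windows of one event."""
    num_steps = len(step_logits)
    mean = [sum(col) / num_steps for col in zip(*step_logits)]
    peak = max(mean)                               # stabilizes the log-sum-exp
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in mean))
    return log_norm - mean[label]


events = list(range(50))                           # toy session with 50 events
fine_tune, held_out = split_session(events)
loss = event_cross_entropy([[0.0, 0.0, 0.0]] * 4, label=0)
print(len(fine_tune), len(held_out), round(loss, 4))   # 10 40 1.0986
```

Splitting by recording order rather than at random mimics deployment, where calibration trials are collected before the decoder is used.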
6. Conclusions
Motor imagery BCIs promise hands-free control for assistive mobility, yet practical deployment is still limited by the low signal-to-noise ratio of EEG and, most critically, by inter-subject variability that breaks models trained on previous users. In this work, we addressed this bottleneck by designing a subject-robust decoder that explicitly targets the cross-subject variability that typically disrupts MI-BCI performance.
We introduced a multi-branch CNN-LSTM model for cross-subject motor imagery classification. The proposed architecture combines (i) multi-scale temporal feature extraction through parallel 1D convolutional branches and (ii) within-trial sequence modeling by per-branch LSTMs with late fusion. Together with group normalization and time-step supervision over sub-windows, this design stabilizes training in mixed-subject batches and encourages consistent predictions across the temporal evolution of each event. On the large EEGMMI cohort and under strictly subject-independent protocols, the proposed model consistently outperformed an EEGNet-style compact baseline across binary and multi-class settings, achieving high accuracy with 4 s trials and retaining strong performance even with shorter sub-windows. Under the more stringent LOSO evaluation, the model maintained competitive generalization and improved over recent transformer-based reports.
We also tested portability beyond the source dataset. When we transferred zero-shot to recordings acquired with different hardware and a different electrode montage, the EEGMMI-pre-trained model provided competitive accuracy, and brief subject-specific fine-tuning using only a small fraction of each session further closed the remaining gap. These results support a pragmatic deployment pathway: a strong cross-subject backbone for plug-and-play initialization, complemented by a short calibration stage that yields reliable control without lengthy data collection.
The encouraging zero-shot transfer results obtained on the ISLab-MI dataset suggest that the proposed architecture already captures partially transferable representations despite not relying on explicit domain adaptation objectives. A promising future direction consists of integrating the proposed backbone with domain-adversarial or invariant-learning strategies to further improve robustness across unseen subjects and acquisition setups.
Although the present analysis supports the relevance of the multi-branch temporal design, a fully exhaustive component-wise ablation remains an important direction for future work. In particular, replacing the LSTM with alternative temporal aggregation modules, such as temporal convolution, attention pooling, or transformer blocks, would allow for a more systematic assessment of the trade-off between sequence modeling capacity, computational cost, and cross-subject robustness. Similarly, further comparisons among normalization strategies could better quantify their impact under heterogeneous subject-independent training.