1. Introduction
Engineering has emerged as a cornerstone in addressing pressing global health challenges, as emphasized in UNESCO’s Engineering for Sustainable Development report (2021), which recognizes the discipline as a central pillar of the 2030 Agenda. Among the Sustainable Development Goals, SDG 3—“ensure healthy lives and promote well-being for all at all ages”—underscores the need for accessible and affordable technologies to strengthen medical diagnostics and healthcare delivery [
1]. Within this framework, brain–computer interfaces (BCIs) have gained increasing international attention for their potential to revolutionize human–machine interaction in clinical, rehabilitative, and assistive domains. Beyond their scientific and societal impact, BCIs are also economically significant: the global market is projected to reach USD 2.21 billion by 2025 and expand further to USD 3.60 billion by 2030, with a compound annual growth rate (CAGR) of 10.29 [
2]. This convergence of societal need, technological innovation, and market growth highlights BCIs as a key enabler for sustainable health solutions.
Within this landscape, electroencephalography (EEG) has become a foundational technology for BCI implementation, owing to its non-invasive nature, low cost, portability, and high temporal resolution [
3]. EEG captures the brain’s electrical activity through scalp-mounted electrodes and enables the use of advanced signal processing techniques such as event-related potentials (ERPs), which are widely used to assess cognitive and motor functions [
4]. These features have established EEG as the technological backbone of many modern BCI platforms, allowing for the decoding of neural signals to control external devices without requiring muscular input [
5]. Among the various paradigms, motor imagery (MI)—the mental rehearsal of movement without physical execution—has demonstrated remarkable clinical potential in post-stroke rehabilitation, neuroprosthetic control, and assistive technologies such as robotic wheelchairs and virtual spellers [
6].
Despite its versatility, EEG-based BCI systems face inherent structural limitations, primarily due to the physical and physiological nature of the recorded brain signals [
7]. These constraints compromise both signal quality and interpretability, directly affecting their clinical and functional applicability [
8]. One of the most critical challenges is inter-subject variability, which introduces pronounced inconsistency in neural activation patterns and undermines the generalization of classification models used in BCI [
9,
10]. This variability has been strongly linked to the phenomenon known as BCI illiteracy, where a substantial subset of users is unable to gain intentional control of the system, even after repeated training sessions [
11,
12]. Beyond its technical implications, this limitation poses a fundamental challenge to the inclusion and scalability of BCI technologies in real-world clinical settings, where system adaptability to diverse neurophysiological profiles is essential [
13]. Another key obstacle in the development of EEG-based BCI systems is their limited spatial resolution, which is inherently constrained by volume conduction effects. Unlike imaging modalities such as functional magnetic resonance imaging (fMRI) or magnetoencephalography (MEG), EEG suffers from spatial distortions because neural electrical signals must traverse multiple layers with varying conductivities—such as the skull and scalp—before being recorded at the surface electrodes [
14]. This biophysical phenomenon results in signal mixing across electrodes, making it difficult to accurately localize cortical activity [
14]. Consequently, spatial specificity is reduced in applications that require fine-grained identification of motor or sensory regions, ultimately limiting the system’s performance in rehabilitation, neurofeedback, and precision control tasks [
15].
In this context, various classical signal processing and machine learning methods have been proposed to improve signal quality and extract discriminative features from EEG recordings, particularly in MI paradigms. Among the earliest approaches, time–frequency domain techniques such as the short-time Fourier transform (STFT) [
16] and the wavelet transform [
17,
18] have proven effective in decomposing EEG signals into more informative components by capturing their non-stationary nature. These tools facilitate the identification of relevant brain rhythms, particularly the
(8–12 Hz) and
(13–30 Hz) bands, which have been widely linked to movement execution and motor imagery [
19]. Additionally, the filter bank common spatial patterns (FBCSP) approach extends this principle by dividing the EEG into multiple frequency sub-bands and applying the CSP algorithm to each of them, thereby enhancing class discrimination [
20]. While these strategies improve the signal-to-noise ratio (SNR), their effectiveness depends on fixed or heuristically defined frequency ranges, which limits their adaptability to inter-subject spectral variability [
21]. Furthermore, CSP variants—such as regularized CSP, discriminative CSP, and sparse CSP—attempt to mitigate overfitting and improve generalization, but remain noise-sensitive and often require subject-specific calibration [
22]. Crucially, these approaches provide limited robustness to the spatial distortions inherent in EEG, caused by volume conduction, which restricts their ability to resolve cortical sources with precision and thus limits their performance in tasks requiring spatial specificity [
23].
Conversely, recent approaches have leveraged deep learning, particularly convolutional neural networks (CNNs), to extract hierarchical representations from raw EEG signals. Architectures such as EEG network (EEGNet) [
24], shallow convolutional network (ShallowNet) and deep convolutional network (DeepConvNet) [
25], and temporal–channel fusion network (TCFusionNet) [
26] have shown potential in MI classification by learning spatial and temporal patterns without the need for manual feature engineering. These models exhibit increased robustness to noise and, in some cases, can implicitly compensate for spatial distortions through convolutional kernels. Building on this foundation, kernel-based regularized EEGNet (KREEGNet) [
27] introduces explicit spatial encodings and specialized convolutional kernels to enhance cortical sensitivity, offering a more targeted solution to the spatial resolution limitations of EEG. Nevertheless, the performance of these architectures remains highly dependent on large, high-quality datasets, and deeper networks are particularly prone to overfitting. Moreover, achieving robust model generalization remains a significant challenge, particularly due to high inter-subject variability in EEG patterns. As a result, most approaches still rely on subject-specific calibration or employ transfer learning strategies to adapt models across individuals [
28]. Beyond CNNs, cutting-edge research has begun to explore attention mechanisms, transformer architectures, and generative modeling [
29]. Transformer-based models have shown strong performance in capturing long-range temporal dependencies and improving generalization across subjects. For instance, convolutional Transformer network (CTNet) [
30] leverages multi-head self-attention to dynamically extract discriminative spatial-temporal features. Similarly, spatial–temporal transformer models have demonstrated robustness in multi-scale temporal feature extraction [
31]. Complementary generative approaches, including autoencoders and variational autoencoders (VAEs), have been employed for nonlinear denoising and unsupervised feature learning [
32,
33]; however, these models inadvertently discard class-discriminative information unless properly regularized [
33]. More recently, multimodal and diffusion-based transformer models have been proposed to integrate spatial, temporal, and topological EEG dynamics [
34], though their architectural complexity and computational demands may limit their application in real-time or clinical settings [
32].
In addition to approaches based on local or spatial features, EEG representation through connectivity models has been extensively explored, including spectral, structural, directed, and functional connectivity. Spectral connectivity, based on measures such as coherence and spectral entropy, allows for the capture of phase and power relationships between cortical regions, but may be sensitive to noise and the choice of frequency bands [
35]. Structural connectivity, typically derived from neuroimaging techniques such as MRI, is difficult to obtain from EEG and is rarely integrated directly into non-invasive applications [
36]. Directed connectivity aims to identify causal relationships between regions using metrics like partial directed coherence or dynamic causal modeling [
37]; although it enhances interpretability, it often entails significant computational complexity. Functional connectivity, by contrast, has been the most widely applied in EEG contexts due to its ability to model statistical dependencies. As a recent advancement, Gaussian functional connectivity has been proposed as a more robust representation for capturing nonlinear spectral-domain relationships. This formulation was employed in the kernel cross-spectral functional connectivity network (KCS-FCNet) model [
38], which integrates kernelized functional connectivity to enhance class discrimination in motor imagery paradigms. Overall, although various forms of connectivity have been explored in conjunction with deep learning, their direct application has yet to mature to a point that effectively addresses critical challenges such as inter-subject variability and limited spatial resolution [
36].
Here, we introduce EEG-GCIRNet—a Gaussian connectivity-driven EEG imaging representation Network. This framework is designed to transform functional connectivity patterns into a robust image-based representation for MI classification using a variational autoencoder. Unlike conventional approaches, our method creates a rich representation by generating topographic maps from Gaussian functional connectivity, which model nonlinear spatial–functional dependencies across brain regions. Our EEG-GCIRNet framework comprises three key stages:
- –
Image-based encoding: Gaussian connectivity-based image representations are encoded into a shared latent space that captures complementary spatio-temporal and frequency information, enabling more discriminative and interpretable feature representations.
- –
Adaptive Multi-objective training: The model is optimized through a composite loss that jointly enforces reconstruction fidelity, classification accuracy, and latent space regularization. Crucially, this enables an adaptive learning strategy where the model automatically prioritizes representation learning (reconstruction) over classification when faced with noisy signals, thereby enhancing robustness and mitigating inter-subject variability.
- –
Physiological Interpretability: The framework’s variational design allows for deep validation of the learned features. By integrating latent space visualization with layer-wise relevance analysis (Grad-CAM++) and functional connectivity patterns, we move beyond black-box predictions to confirm that the model decisions are driven by genuine neurophysiological mechanisms, such as fronto-parietal network integration.
Experimental evaluations on benchmark MI datasets demonstrate that EEG-GCIRNet consistently outperforms state-of-the-art baselines, achieving superior classification accuracy and effectively mitigating “BCI illiteracy” by eliminating low-performance groups. Crucially, this robustness extends to complex multi-class scenarios, where the framework demonstrates statistically significant superiority over advanced temporal-channel fusion architectures. Moreover, interpretability analyses reveal distinct functional connectivity structures that align with known motor cortical regions, offering neurophysiological validation of the model’s learned representations. As such, the proposed framework advances the development of robust and interpretable EEG-based BCIs, paving the way for adaptive neurotechnologies in rehabilitation, assistive communication, and motor recovery. By combining multimodal encoding, multi-objective learning, and explainable AI, EEG-GCIRNet contributes a reproducible and scalable paradigm for addressing two of the most persistent challenges in EEG-based BCI research: inter-subject variability and limited spatial specificity.
The agenda is as follows.
Section 2 describes the materials and methods.
Section 4 presents the experiments and results.
Section 5 provides the discussion. Finally,
Section 6 outlines the conclusions, limitations and future work.
2. Materials and Methods
2.1. GIGAScience Dataset for EEG-Based Motor Imagery
The Giga Motor Imagery–DBIII (GigaScience) dataset, publicly available at
http://gigadb.org/dataset/100295 (accessed on 1 July 2025), provides one of the most comprehensive EEG corpora for MI analysis. The dataset comprises recordings from 52 healthy participants (50 with usable data), each performing a single EEG-MI session. Every session consists of five to six experimental blocks, with each block containing approximately 100–120 trials per class. Each trial spans seven seconds and follows a fixed timeline: an initial blank screen (0–2 s), a visual cue indicating either left- or right-hand MI (2–5 s), and a concluding blank interval (5–7 s). Inter-trial intervals vary randomly between 0.1 s and 0.8 s to mitigate anticipatory bias, as illustrated in
Figure 1. EEG signals were recorded at a sampling frequency of 512 Hz using a 64-channel cap arranged according to the international 10–10 electrode placement system. In addition to the MI sessions, the dataset includes recordings of real motor execution and six auxiliary non-task-related events—eye blinks, vertical and horizontal eye movements, head motions, jaw clenching, and resting state—enabling a broader exploration of EEG noise sources and artifact correction. This multimodal composition makes the GigaScience particularly valuable for benchmarking advanced deep learning and connectivity-based EEG decoding frameworks, as it supports both intra-subject and inter-subject generalization studies.
2.2. EEG Motor Movement/Imagery Database
The EEG Motor Movement/Imagery Database (EEGMMIDB) [
39], available at
https://physionet.org/content/eegmmidb/1.0.0/ (accessed on 7 December 2025), was employed as a secondary benchmark to evaluate the framework’s versatility. This dataset comprises electroencephalographic recordings from 109 healthy participants performing a variety of real and imagined motor tasks. The EEG signals were acquired using 64 scalp electrodes positioned according to the international 10–10 system and sampled at 160 Hz. The recording protocol encompasses 14 sessions per subject, including resting-state conditions, motor execution tasks, and MI tasks. Data segmentation was performed based on event annotations to define discrete trials. Each trial corresponds to a 4.1 s window, yielding a raw input matrix of
. To standardize the classification task, the original PhysioNet labels (0–11) were reorganized into five interpretable MI categories: (1) Right Hand (labels 0–1); (2) Left Hand (labels 2–3); (3) Both Hands (labels 4–5); (4) Both Feet (labels 6–7); and (5) Rest (labels 8–9). Annotations 10 and 11, corresponding to non-motor tasks, were discarded. In alignment with the experimental design of the GigaScience dataset, we focused exclusively on the MI-related sessions, excluding motor execution trials. The experimental configuration is summarized in
Figure 2.
Each MI trial is structured for the proposed framework. Let denote a multichannel EEG and its associated MI target label, where represents the EEG recording with C spatial channels and temporal samples, and is a one-hot encoded vector indicating the MI class among Q possible categories.
2.3. Laplacian Filtering and Time Segmentation
To enhance the spatial resolution and mitigate the volume conduction effects inherent in EEG recordings, a Surface Laplacian filter is applied to each trial
. This filter acts as a spatial high-pass filter by estimating the second spatial derivative of the scalp potential at each electrode
with respect to its neighbors
, where
. Following the methodology in [
40], this is achieved by using spherical splines to project the electrode positions onto a unit sphere, which allows for the interpolation of scalp potentials via Legendre polynomials. The interaction between any pair of electrodes
is modeled as:
where
is the Legendre Polynomial of order
n,
is the highest polynomial order considered,
is a smoothness constant, and
are the 3D electrode positions normalized to a unit-radius sphere. The cosine distance is defined as
.
The Laplacian-filtered EEG data, denoted as
, is subsequently computed using the weighting matrices derived from the spline interpolation:
where
is a column vector of ones,
is the identity matrix, and
is a regularization parameter. The matrix
is a regularized (smoothed) version of
. The weighting matrices
hold the elements derived from Equation (
1), with their specific values determined by the parameter
as follows:
where
and
are the elements of matrices
and
, respectively. This filtering step produces a spatially enhanced representation
that serves as input for the subsequent feature extraction stages.
Further, to focus the analysis exclusively on the period of active motor imagery, the Laplacian-filtered signal
is temporally segmented. The time window corresponding to the MI task, specifically between
and
seconds of each trial, is retained. Let
and
be the start and end times of the MI segment, and
be the sampling frequency, the segmented signal
is obtained as:
where the slicing notation
indicates the selection of temporal samples from the start index to the end index. For brevity, this segmented signal will be denoted as
in the subsequent sections.
2.4. Kernel-Based Cross-Spectral Gaussian Connectivity for EEG Imaging
To model the mutual dependency between EEG channels, we consider any two channels
from a given trial
(where
). Their mutual dependency can be captured using a stationary kernel
, which maps both signals into a reproducing kernel Hilbert space (RKHS) via a nonlinear feature map
[
41]. Indeed, according to Bochner’s theorem, a sufficient condition for the kernel
to be stationary is that it admits a spectral representation [
42]:
where
is a frequency vector, and
is the cross-spectral density between
and
, derived from the spectral distribution
.
Building on this spectral representation, the cross-spectral power within a specific frequency band
can be computed via the Fourier transform of the kernel:
where
denotes the Fourier transform. This spectral formulation allows capturing both linear and nonlinear dependencies in the frequency domain, making it particularly useful for analyzing brain signals.
A widely used choice for
is the Gaussian kernel, which ensures smoothness, locality, and analytic tractability [
43]:
where
is a bandwidth hyper-parameter.
Inspired by the kernel-based spectral approaches introduced in [
38,
44], we compute a Gaussian kernel cross-spectral connectivity estimator to encode spatio–frequency interactions among pairwise EEG channels. Specifically, for each EEG channel
c, a band-limited spectral reconstruction is obtained as:
where
denotes a given frequency bandwidth (rhythm). Then, the Gaussian Function Connectivity (GFC) matrix
is derived to quantify the degree of similarity between the spectral representations of all channel pairs, as:
where
is a Gaussian kernel with scale parameter
. To ensure adaptive sensitivity across rhythms,
is estimated as the median of all pairwise Euclidean distances between spectral reconstructions
and
,
. This formulation provides a data-driven normalization of connectivity strength, enabling robust comparison across heterogeneous EEG rhythms and subjects.
Afterward, we propose to compute an EEG connectivity flow from
, preserving a direct one-to-one correspondence with the electrode spatial configuration. Specifically, the GFC flow vector
holds elements:
where each element
represents the mean functional coupling of channel
c with all other channels within the given frequency band
. The latter compresses the pairwise connectivity information into a compact, channel-wise flow representation while retaining the spatial and spectral data patterns.
To ensure a consistent feature scale for the imaging stage, the GFC flow vectors are normalized across the entire training dataset. Specifically, a channel-wise Min-Max normalization is applied, scaling the connectivity values of each channel to a uniform range of . This procedure preserves the relative topography of neural connectivity while standardizing the input scale, yielding the normalized flow vector for each trial.
2.5. Topographic Map Generation
The final feature engineering step transforms the one-dimensional GFC flow vectors into two-dimensional topographic images, creating a data representation suitable for convolutional neural network (CNN) architectures. For each trial, this process converts the normalized flow vector from each frequency band into a corresponding topographic image .
This transformation is accomplished via spatial interpolation guided by Delaunay triangulation. First, the set of 2D scalp coordinates of the electrodes, , is triangulated. This partitions the electrode layout into a mesh of non-overlapping triangles, where the circumcircle of each triangle contains no other electrode points. This triangulation provides a structured grid for interpolating the connectivity values, where each element is associated with its corresponding coordinate .
To generate the final image, the value
for each pixel is computed using barycentric interpolation within its enclosing triangle
:
where
is the connectivity value at vertex
, and
are the barycentric coordinates of
satisfying
and
. This procedure is applied across all pixels to render the smooth topographic map
. The resulting set of four maps (one for each frequency band) is then stacked to form a multi-channel image, which serves as the final input to the deep learning model.
2.6. EEG-GCIRNet: Multimodal Architecture
The proposed model is a variational autoencoder (VAE) designed to process topographic maps derived from functional connectivity representations. Its architecture relies on a single input stream that learns to extract and encode the most relevant spatial features from the maps, which are then projected into a shared latent space where a multivariate Gaussian distribution is modeled. From this latent space, the model simultaneously performs reconstruction of the topographic maps and classification of motor imagery tasks. These objectives are integrated within a composite loss function that balances reconstruction fidelity, classification accuracy, and latent space regularization. This approach enables the learning of robust and interpretable latent representations capable of capturing discriminative spatial relationships and adapting to the variability and noise inherent in EEG signals.
The core of our framework is a VAE based on the LeNet-5 architecture, which is partitioned into three functional blocks: an encoder, a decoder, and a classifier, all operating on the shared latent space. Let be the multi-channel input image for a given trial, formed by stacking the B topographic maps (one for each frequency band).
The encoder, defined as a function
parameterized by
, maps the input image
Y to the parameters of the posterior distribution
. This transformation is realized through a composition of functions, where each function represents a layer in the network:
Here,
and
are convolutional layers with SELU activation,
and
are average pooling layers, and
is a fully connected layer with SELU activation after flattening the feature maps. The resulting hidden representation
is then linearly transformed to produce the mean vector
and the log-variance vector
of the latent space:
The decoder, defined as a function
parameterized by
, reconstructs the original input image
from a latent vector
. Its architecture mirrors the encoder by composing functions that progressively up-sample the representation to the original image dimensions:
Here, is a fully connected layer followed by a reshape operation, is a transposed convolutional layer with SELU activation, and is a final transposed convolutional layer with a Sigmoid activation to ensure the output pixel values are in a normalized range.
Concurrently, the classifier, a function
parameterized by
, predicts the MI task label probabilities
from the same latent vector
z. It is implemented as a multi-layer perceptron:
Formally, the latent vector
for a given input sample
i is computed using the reparameterization trick:
where
and
denote the mean and variance of the approximate posterior distribution learned by the encoder for sample
i, and
is drawn from a standard normal distribution. The exponential term ensures the sampled standard deviation remains strictly positive.
The total objective function,
, is defined as a weighted sum of the three loss terms:
where
where
,
, and
are hyperparameters controlling the contribution of each term. The first component,
, is the normalized mean squared error (NMSE), which evaluates reconstruction accuracy by comparing the original topographic maps (
) and their reconstructions (
), normalized by the dataset’s variance (
is the mean image). The Frobenius norm,
, is used for the image-wise error. The second term,
, represents the normalized binary cross-entropy (NBCE). It penalizes misclassifications between the true one-hot labels (
) and predicted probabilities (
), while adjusting the loss based on the entropy over an ideal, non-informative prediction (e.g., a uniform distribution
). This maintains a balanced contribution from all classes. Finally, the third term,
, is the normalized Kullback–Leibler (KL) divergence between the approximate posterior
and a unit Gaussian prior
. This regularizes the latent space, promoting smoothness and disentanglement in the learned representations.
This KL divergence term encourages the latent representations to follow a standard normal distribution, promoting structure and generalization. The use of in the denominator prevents this term from dominating the loss in large batches. Taken together, these components ensure that the model jointly optimizes for faithful reconstruction, discriminative performance, and a well-regularized latent structure—crucial for interpretable and generalizable multimodal brain–computer interfaces.
The model is trained by solving the following optimization problem:
where
denotes the complete set of trainable parameters in the encoder, decoder, and classifier, respectively.
5. Discussion
The results obtained with the proposed EEG-GCIRNet architecture provide valuable insights into how model design and latent regularization can jointly address key challenges in motor imagery decoding. This study demonstrated that by transforming functional connectivity into an image-based representation and processing it with a variational autoencoder, it is possible to create a BCI framework that is not only highly accurate but also robust and interpretable. The model achieved remarkable performance across subjects, reaching the highest average accuracy () and the lowest inter-subject variability among all evaluated methods, confirming the efficacy of this unimodal, VAE-based approach.
A key contribution of this work lies in its direct response to the critical challenges of inter-subject variability and “BCI illiteracy”. The most compelling finding was the complete elimination of the “Bad” performance group (
Figure 6c), coupled with substantial accuracy gains of
and
for the “Bad” and “Mid” groups, respectively (
Table 4). This demonstrates the profound robustness of EEG-GCIRNet in handling noisy or low-separability EEG signals, where conventional architectures typically fail. By elevating the performance of these challenging subjects, the framework serves a corrective function, suggesting a promising path toward more inclusive and reliable BCI systems that can adapt to a wider range of users. Furthermore, the scalability of the proposed framework is evidenced by its superior performance in the complex 5-class EEGMMIDB scenario, where it achieved an average accuracy of
%, surpassing advanced temporal-channel fusion models by approximately
%. Unlike binary tasks, where hemispheric lateralization is often sufficient for discrimination, the 5-class problem requires the model to resolve finer topological differences between spatially overlapping classes, such as “Both Hands” versus single-hand imagery.
The mechanism underlying this robustness appears to be the model’s sophisticated, adaptive learning strategy. The analysis of the loss weight distributions (
Figure 9) revealed that EEG-GCIRNet is not a static “black box” but dynamically reorganizes its optimization priorities based on signal quality. In the bi-class scenario, for subjects with high-quality signals, it maintains a harmonious balance between reconstruction, classification, and regularization. For subjects with more challenging signals, it strategically prioritizes representation learning (REC loss) over immediate classification (CLA loss). This intelligent trade-off is visually confirmed by the qualitative analysis of the reconstructions (
Figure 10 and
Figure 11), which shows that the model consistently generates spatially coherent and physiologically plausible connectivity maps. This behavior supports the notion that a well-regularized and accurately reconstructed representation is a prerequisite for effective classification, especially in noisy conditions. Moreover, as observed in
Table 7, for the most challenging subjects (e.g., Subject 49) in multi-class scenario, the network implicitly shifts its role from a classifier to a non-linear denoiser (
), prioritizing the reconstruction of a clean latent representation before attempting to separate complex decision boundaries. This suggests that the connectivity-driven topographic maps provide a sufficiently rich feature space to support multi-class decoding, provided that the training objective is dynamically regularized to handle the increased cognitive load and signal ambiguity.
The above is further corroborated by the functional connectivity analysis, which reveals distinct network topologies associated with performance levels (see
Figure 12). For high-performing subjects, the observed integrated fronto–centro–parietal network supports established findings on the critical role of fronto–parietal and parieto–occipital connectivity in supporting MI performance [
63], while the specific beta-band synchronization over central and parietal regions aligns with literature identifying these features as informative markers for task discrimination [
64]. Conversely, the connectivity profile of mid-performing subjects suggests a redistribution toward fronto–central and centro–posterior networks; notably, the persistence of
/
-band connections over sensorimotor and parietal areas in these subjects remains consistent with studies highlighting the fundamental role of these networks in effective BCI control [
65,
66].
The ultimate outcome of this adaptive process is the creation of a well-structured and discriminative latent space. The t-SNE visualizations (
Figure 13) provide direct evidence that the encoder successfully learns to disentangle the features of different MI classes, creating clearly separated clusters for high-performing subjects and maintaining reasonable separation even for mid-performing subjects. This demonstrates that the variational formulation and structured latent regularization are the primary drivers of the model’s superior performance, allowing it to move beyond the limitations of conventional architectures that often struggle with noisy or overlapping feature distributions.
To verify that these learned latent structures correspond to genuine neural mechanisms rather than artifacts, we examined the spatial contribution patterns identified by the network. The layer-wise analysis of the relevance maps reveals consistent spatial structures that align with anatomically meaningful channel groups (see
Figure 14 and
Figure 15). High-contribution areas in the superior frontal region likely reflect higher-order cognitive control and attentional processes facilitating MI, consistent with reports of prefrontal involvement linked to imagery quality [
67,
68,
69]. Meanwhile, prominent bands of importance spanning fronto-central and parietal sites correspond to the well-known fronto-parietal MI network—encompassing premotor and parietal regions recruited during kinesthetic and visual MI [
70] and align with established evidence of alpha and beta modulation and information flow across these regions [
71]. Thus, the specific emphasis on parieto-occipital electrodes in the deepest layers suggests the network leverages visuo-spatial components alongside sensorimotor rhythms, a view supported by studies indicating occipital recruitment and connectivity during visual-motor imagery [
72]. Regarding the multi-class scenario, the extended Grad-CAM++ analysis confirms that EEG-GCIRNet addresses this complexity through hierarchical spatial refinement, evolving from generic spectral features in shallow layers to highly specialized, class-specific electrode clusters in deep layers. This capability is underpinned by the model’s adaptive optimization strategy, which becomes even more pronounced in this high-complexity setting. Overall, these results confirm that the proposed architecture converges towards neurophysiologically plausible channel patterns, emphasizing networks repeatedly implicated in motor simulation.
Finally, the statistical analysis confirms the performance advantage of EEG-GCIRNet across different levels of task complexity. While the model secured a leading average ranking of
(
) in the binary scenario (
Table 3), its dominance became even more pronounced in the multi-class evaluation (
Table 6). In this challenging 5-class setting, EEG-GCIRNet achieved a perfect average ranking of
and a highly significant average Wilcoxon
p-value of
, establishing a statistically clear superiority over advanced architectures like TCFusionNet and DeepConvNet. This indicates that the observed performance gains are not random fluctuations but a consistent outcome of the model’s design. By learning robust latent structures directly from connectivity-based topographic representations, EEG-GCIRNet emerges as a reliable, interpretable, and computationally efficient framework for motor imagery decoding. It effectively balances accuracy, generalization, and representational stability, allowing it to adapt to diverse subject profiles and maintain strong performance across varying signal quality conditions while preserving a physiologically consistent representational organization.
6. Concluding Remarks
In this work, we introduced EEG-GCIRNet, a novel unimodal framework based on a variational autoencoder designed to process topographic representations of functional connectivity for motor imagery classification. Through a comprehensive analysis, we demonstrated that this approach effectively addresses the persistent challenges of low spatial resolution, susceptibility to noise, and high inter-subject variability in EEG-based BCIs. By integrating Gaussian functional connectivity with a generative VAE architecture, our method creates a rich, spatially structured representation that allows for simultaneous classification and reconstruction. The findings confirm that our proposed method not only sets a new benchmark for performance but also provides a robust and transparent solution grounded in neurophysiology.
The primary contribution of this study is the demonstration of superior classification performance and the effective mitigation of “BCI illiteracy”. EEG-GCIRNet achieved the highest average accuracy () and, critically, the lowest inter-subject variability () among all evaluated state-of-the-art models in the binary GigaScience benchmark. Perhaps the most significant finding is the model’s corrective capability: it completely eliminated the “Bad” performance group (subjects with <60% accuracy) and provided substantial gains of ∼22% for these challenging users. This result demonstrates notable statistical significance () and suggests that the framework’s variational nature allows it to extract discriminative information even from noisy or low-separability signals. Furthermore, the framework exhibited remarkable scalability in the complex 5-class EEGMMIDB scenario, achieving a perfect average ranking of and a highly significant Wilcoxon p-value of against advanced baselines. This confirms that EEG-GCIRNet maintains its robustness even when resolving finer topological differences between spatially overlapping classes.
This robust performance is driven by the model’s sophisticated, adaptive learning strategy. The analysis of loss component weights revealed that EEG-GCIRNet is not a static “black box” but dynamically reorganizes its optimization priorities based on the quality of the input signal. For high-performing subjects, the model maintains a harmonious balance between reconstruction, classification, and regularization. However, for subjects with weaker signals (the “Mid” group), or in high-complexity multi-class tasks, the model automatically shifts its focus, prioritizing representation learning (reconstruction loss) over immediate classification accuracy. This intelligent trade-off allows the model to act as a non-linear denoiser, ensuring a stable feature encoding before attempting classification. The ultimate outcome of this process is a well-structured latent space where MI classes are clearly disentangled, as evidenced by the distinct clustering observed in the t-SNE projections.
Furthermore, our extended interpretability analysis provides direct physiological validation of the learned representations, confirming that the model’s decisions are based on genuine neural mechanisms. The layer-wise Grad-CAM++ visualization demonstrated a hierarchical learning process: the model progresses from capturing generic spectral features in shallow layers to extracting highly localized, discriminative patterns in deep layers, specifically targeting sensorimotor and parieto-occipital regions. Crucially, the functional connectivity analysis confirmed that these patterns are biologically meaningful. The model correctly identifies the integrated fronto–centro–parietal networks, characterized by long-range synchronization, which are typical of high-performing subjects. Conversely, for mid-performing subjects, it detects distinct compensatory mechanisms, characterized by localized, midline-focused connectivity patterns. This confirms that EEG-GCIRNet leverages genuine neurophysiological mechanisms—such as / rhythm modulation and network integration—rather than relying on artifacts or spurious correlations.
Despite the promising results, this study has several limitations that must be acknowledged. First, while the framework was validated on two distinct datasets (GigaScience and EEGMMIDB), the evaluation was restricted to motor imagery paradigms. This constrains the generalizability of our findings to other BCI paradigms (e.g., P300, SSVEP) with different temporal characteristics. Second, although the model demonstrated adaptive behavior, the hyperparameters weighting the loss components were static. This fixed weighting may not be optimal for all subjects, and no strategies for dynamic, performance-based adaptation were explored. Third, our framework’s reliance on topographic map generation means that it explicitly requires electrode position coordinates as an input. This requirement for spatial metadata is not shared by some benchmark models, such as those that process the EEG signal as a simple (channels × time) matrix. While this could be considered a limitation in a hypothetical scenario where such standard information is unavailable, we frame it as a deliberate methodological choice. Finally, our interpretability analysis, while insightful, remains indirect. The analysis of reconstructions and latent space provides a high-level understanding but does not offer the granular, feature-level attribution that techniques like attention-based visualization can provide.
The findings and limitations of this study open several avenues for future research. To address generalizability, the next logical step is to evaluate the EEG-GCIRNet framework on larger-scale benchmarks such as the BCI Competition IV-2a. This will allow us to assess the model’s performance and the clarity of its VAE-based visualizations when handling more complex decision boundaries on widely standardized data. Furthermore, we propose exploring extensions of the model that incorporate dynamic weighting mechanisms for the loss components, which could allow for even greater subject-specific adaptation. A further step will be to statistically validate the model’s internal dynamics; while our analysis of the loss component weights reveals a clear reorganization of priorities, future work will include formal statistical testing to quantify the significance of these shifts between performance groups. We also plan to investigate hybrid architectures, particularly those based on Transformers, to enhance the fusion of raw temporal EEG signals with our connectivity-derived topographic maps, potentially creating a more powerful, end-to-end model. Finally, integrating complementary interpretability techniques, such as feature attribution or attention-based visualization methods, will be crucial for providing a more precise understanding of the model’s decision-making process at the clinical level.