Decoding Cognitive States via Riemannian Geometry-Informed Channel Clustering for EEG Transformers

Feng, Luoyi; Yan, Gangxing

doi:10.3390/math14081327

Open Access Article

by Luoyi Feng and Gangxing Yan *

Faculty of Data Science, City University of Macau, Macau 999078, China

* Author to whom correspondence should be addressed.

Mathematics 2026, 14(8), 1327; https://doi.org/10.3390/math14081327

Submission received: 14 March 2026 / Revised: 14 April 2026 / Accepted: 14 April 2026 / Published: 15 April 2026

(This article belongs to the Special Issue Advanced Applications of Deep Learning Methods: Interdisciplinary Perspectives)

Abstract

Electroencephalography (EEG) provides a non-invasive and high-temporal-resolution modality for decoding cognitive states, but high-density recordings remain challenging for Transformer-based models because self-attention scales quadratically with the number of channels. In addition, conventional Euclidean representations do not fully capture the intrinsic geometry of EEG covariance features, which may limit robustness in cross-subject settings. To address these issues, we propose EEG-RCformer, a Riemannian geometry-informed channel clustering Transformer for EEG decoding. The model first computes per-channel symmetric positive definite (SPD) covariance matrices from windowed EEG features and uses the affine-invariant Riemannian metric (AIRM) to identify trial-specific functional hubs. These hubs are then integrated with capacity-constrained spatial clustering to generate anatomically plausible and computationally efficient channel groups, which are encoded as tokens for a Transformer classifier. We evaluated EEG-RCformer on the MODMA and SEED datasets under both subject-dependent and -independent paradigms, achieving area under the curve (AUC) values of 0.9802 and 0.7154 on MODMA and 0.8541 and 0.8011 on SEED, respectively. Paired statistical tests further showed significant gains for MODMA in both the subject-dependent and -independent settings and for SEED in the subject-dependent setting, while SEED still showed a positive but non-significant mean improvement in the subject-independent setting.

Keywords: electroencephalography; EEG decoding; Riemannian geometry; affine-invariant Riemannian metric; Transformer; channel clustering

MSC: 68T10

1. Introduction

Electroencephalography (EEG)-based brain–computer interfaces (BCIs) have become an important paradigm for cognitive state decoding, affective computing, and neurological assessment because EEG is non-invasive, relatively inexpensive, and offers millisecond-level temporal resolution [1,2]. Despite these advantages, reliable EEG decoding remains difficult in realistic settings due to the low signal-to-noise ratio of EEG, its non-stationary temporal dynamics, and the large inter-subject variability caused by anatomical, physiological, and behavioral differences across individuals [3,4,5]. These factors are particularly detrimental in subject-independent evaluation, where distribution shifts between training and test subjects often lead to marked performance degradation and limit the deployment of calibration-free BCI systems [6,7].

In the deep learning era, convolutional neural networks (CNNs), recurrent neural networks (RNNs), graph-based models, and Transformers have all been applied to EEG decoding. CNN-based methods such as EEGNet are effective at learning local spatial–temporal patterns but are inherently limited by local receptive fields [8,9]. RNN-based approaches can model temporal dependencies, yet their sequential processing constrains parallelization and often weakens long-range interaction modeling [10,11]. More recently, Transformers have emerged as competitive backbones because self-attention can capture global dependencies across channels and temporal segments [12,13]. However, for high-density EEG montages, directly treating all channels as tokens leads to substantial computational cost because the complexity of self-attention grows quadratically with the sequence length [14,15].

A second and less discussed limitation is geometric mismatch. Many existing EEG decoders compare channel features in Euclidean space after vectorization, even though covariance-based EEG representations are symmetric positive definite (SPD) matrices that naturally lie on a Riemannian manifold rather than in a flat vector space [16,17,18]. Ignoring this structure may distort pairwise relations between channels and weaken robustness, especially under cross-subject variability. Riemannian geometry provides a principled way to compare SPD matrices through geodesic distances such as the affine-invariant Riemannian metric (AIRM), which has shown strong effectiveness in BCI-related classification and transfer settings [19,20,21].

These characteristics of EEG motivate the use of a geometry-aware region-level Transformer. Emotion- and cognition-related EEG activity is usually distributed across multiple frequency bands and cortical regions rather than being confined to a single electrode, while many neighboring channels are partially redundant. Therefore, an effective model should jointly address channel redundancy, preserve spatial plausibility, and capture non-Euclidean covariance structure. This motivates us to combine channel clustering with the affine-invariant Riemannian metric in a Transformer framework.

Another practical consideration is that EEG channels are not independent sensors in the same sense as image pixels are. Nearby electrodes often measure partially overlapping cortical activity because of volume conduction and limited spatial resolution, whereas distant channels can still become functionally coupled during task execution. Consequently, using all channels as equally weighted and mutually independent Transformer tokens may be suboptimal from both statistical and neurophysiological perspectives. Region-level aggregation provides a useful compromise: it compresses redundant channels, stabilizes spatial organization across subjects, and exposes higher-level interaction patterns to self-attention. However, such aggregation should not be purely anatomical, because the most informative channel relations can vary across trials and mental states. This is precisely why we introduce a hybrid strategy that balances spatially stable clustering with trial-adaptive, geometry-aware centroid proposals.

To address the above challenges, we propose EEG-RCformer, a Riemannian geometry-informed channel clustering Transformer for EEG decoding. The model first extracts multi-band spatiotemporal features from sliding EEG windows and constructs channel-wise covariance descriptors. AIRM is then used to identify functionally diverse centroid proposals, which are integrated with stable spatial centers through a hybrid clustering procedure. The resulting cluster tokens are finally processed by a lightweight Transformer encoder for classification. Compared with directly operating on all channels, this design reduces token redundancy while preserving spatial plausibility and trial-adaptive functional information.

The contributions of this work are threefold. First, we introduce a hybrid channel clustering mechanism that combines stable spatial partitioning with AIRM-based functional centroid proposals. Second, we show that geometry-aware token construction improves EEG decoding, especially under subject-independent evaluation. Third, we provide ablation and sensitivity analyses to examine the effect of the clustering metric and the number of clusters.

From a methodological perspective, the present work also aims to bridge two lines of EEG research that are often studied separately. One line emphasizes carefully designed statistical descriptors such as covariance structure, PSD, and differential entropy; the other emphasizes representation learning with attention-based neural architectures. Rather than replacing one with the other, EEG-RCformer combines them. Handcrafted multi-band descriptors provide stable low-level cues for short-window EEG analysis, while the Transformer focuses on modeling interactions between learned regional tokens. This combination is especially suitable for relatively small EEG datasets, where purely end-to-end models may overfit and where prior geometric knowledge can substantially improve sample efficiency.

2. Materials and Methods

Figure 1 illustrates the architecture of the proposed EEG Riemannian geometry-informed channel clustering Transformer (EEG-RCformer). As shown in the top row, multi-channel EEG signals are first segmented and processed to extract power spectral density (PSD) and differential entropy (DE) features. These feature vectors are used to compute channel-wise covariance matrices, which are then fed into the channel cluster module. Finally, the pooled tokens are processed by a Transformer and MLP classifier to obtain the classification results. The bottom row details the channel cluster module: initial clustering is based purely on spatial locations, while functional centers are computed via farthest-point sampling on the feature manifold. These two sets of centers are combined via a smoothing coefficient $\alpha$ to produce the final centers. Channels are re-clustered based on these updated centers, and the pooled regions serve as tokens for the Transformer.

2.1. Spatiotemporal Feature Extraction

The first stage of the proposed method extracts handcrafted spatiotemporal descriptors from raw EEG signals. In our implementation, each trial is sampled at 250 Hz and segmented with a sliding window of 250 samples and a stride of 50 samples, corresponding to a 1.0 s window, a 0.2 s hop, and 80% overlap. Overlapping windows increase the temporal sampling density of short-lived affective or cognitive dynamics and reduce information loss at window boundaries; without overlap, the temporal sequence becomes coarser and substantially fewer segments are available for the same trial duration. This setting also improves the continuity of the token sequence seen by the Transformer, because adjacent windows share sufficient context while still capturing local temporal change. In other words, the overlap is not introduced merely for data augmentation; it also stabilizes the downstream covariance estimation and makes the extracted spectral descriptors less sensitive to abrupt segmentation boundaries.
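The segmentation arithmetic above can be checked with a short NumPy sketch. The function name `segment_trial`, the channel-by-sample array layout, and the random 10 s placeholder trial are illustrative assumptions; only the window length (250 samples), stride (50 samples), and sampling rate (250 Hz) come from the text.

```python
import numpy as np

def segment_trial(x, win=250, stride=50):
    """Cut a (channels, samples) array into overlapping windows.

    Returns an array of shape (channels, n_windows, win).
    """
    if x.shape[-1] < win:
        raise ValueError("trial is shorter than one window")
    # Build every window at stride 1, then keep every `stride`-th one.
    views = np.lib.stride_tricks.sliding_window_view(x, win, axis=-1)
    return views[:, ::stride, :]

fs = 250                                   # sampling rate in Hz
rng = np.random.default_rng(0)
trial = rng.standard_normal((4, 10 * fs))  # placeholder: 4 channels, 10 s
windows = segment_trial(trial)
overlap = 1.0 - 50 / 250                   # fraction shared by adjacent windows
print(windows.shape, overlap)
```

For a 10 s trial this gives (2500 − 250)/50 + 1 = 46 windows per channel, compared with 10 windows if no overlap were used.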

For each channel and window, a Hann window is first applied and the real-valued fast Fourier transform (rFFT) is then computed. Let $x_{c,w} \in \mathbb{R}^{L}$ denote the $w$-th window from channel $c$, where $L = 250$. The power spectrum is computed as

$$S_{c,w}(f) = \frac{\left| \mathcal{F}(x_{c,w} \odot h)(f) \right|^{2}}{L}, \tag{1}$$

where $h$ denotes the Hann window. The band-wise PSD feature for band $b$ is obtained by averaging spectral power over the corresponding frequency-bin set $I_{b}$:

$$\mathrm{PSD}_{c,w}^{(b)} = \frac{1}{|I_{b}|} \sum_{f \in I_{b}} S_{c,w}(f). \tag{2}$$

We consider seven frequency bands, namely delta (1–4 Hz), theta (4–8 Hz), alpha (8–12 Hz), beta₁ (12–16 Hz), beta₂ (16–20 Hz), gamma₁ (20–30 Hz), and gamma₂ (30–45 Hz). Differential entropy (DE) is then computed from the band power as

$$\mathrm{DE}_{c,w}^{(b)} = \frac{1}{2} \log\left( 2 \pi e \, \mathrm{PSD}_{c,w}^{(b)} + \epsilon \right), \tag{3}$$

where $\epsilon = 10^{-9}$ is used for numerical stability. Finally, PSD and DE are independently normalized by z-score over the window and band dimensions of each channel and concatenated to form the feature tensor $F \in \mathbb{R}^{C \times N_{w} \times D_{f}}$ for each EEG trial, where $C$ denotes the number of channels, $N_{w}$ the number of temporal windows, and $D_{f}$ the feature dimension. In the default PSD+DE setting, $D_{f} = 14$.
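Equations (1)–(3) and the per-channel normalization can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' code. Treating each band as the half-open interval [lo, hi), using `np.hanning` as the Hann taper, and the small constant in the z-score denominator are all assumptions made here, as are the function names and the random placeholder input.

```python
import numpy as np

# Band edges in Hz as listed in the text. Using half-open intervals [lo, hi)
# is an assumption so that a shared edge is not counted in two bands.
BANDS = [(1, 4), (4, 8), (8, 12), (12, 16), (16, 20), (20, 30), (30, 45)]

def zscore_per_channel(x, eps=1e-12):
    # Normalize over the window and band axes of each channel.
    mu = x.mean(axis=(1, 2), keepdims=True)
    sd = x.std(axis=(1, 2), keepdims=True)
    return (x - mu) / (sd + eps)

def psd_de_features(windows, fs=250.0, eps=1e-9):
    """windows: (C, N_w, L) array -> features of shape (C, N_w, 14)."""
    L = windows.shape[-1]
    h = np.hanning(L)                                            # Hann taper
    spec = np.abs(np.fft.rfft(windows * h, axis=-1)) ** 2 / L    # Eq. (1)
    freqs = np.fft.rfftfreq(L, d=1.0 / fs)
    psd = np.stack(
        [spec[..., (freqs >= lo) & (freqs < hi)].mean(axis=-1) for lo, hi in BANDS],
        axis=-1,
    )                                                            # Eq. (2)
    de = 0.5 * np.log(2.0 * np.pi * np.e * psd + eps)            # Eq. (3)
    return np.concatenate([zscore_per_channel(psd), zscore_per_channel(de)], axis=-1)

rng = np.random.default_rng(0)
demo_windows = rng.standard_normal((3, 46, 250))   # placeholder: 3 channels, 46 windows
features = psd_de_features(demo_windows)
print(features.shape)
```

With a 250-sample window at 250 Hz the frequency resolution is 1 Hz, so every band contains at least three bins and the last axis has 7 PSD values followed by 7 DE values.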

2.2. Hybrid Clustering with AIRM-Informed Centroid Proposals

Motivation. Existing geometry-aware channel grouping methods often overlook trial-level dynamics, whereas purely feature-driven clustering may lack spatial stability across trials and subjects. To address these limitations, we integrate a stable spatial prior with trial-adaptive functional information.

Let $\Sigma_{c} \in \mathcal{S}_{++}^{d}$ denote the symmetric positive definite (SPD) covariance matrix for channel $c$, computed from the windowed features. Channel dissimilarity is quantified by the geodesic distance induced by the affine-invariant Riemannian metric (AIRM):

$$\delta_{R}(\Sigma_{i}, \Sigma_{j}) = \left\| \log\left( \Sigma_{i}^{-1/2} \Sigma_{j} \Sigma_{i}^{-1/2} \right) \right\|_{F}. \tag{4}$$

The proposed hybrid strategy maintains $K$ stable spatial centers that are updated by an exponential moving average (EMA) using batch-level centroid proposals derived from the current input features. Importantly, in the implementation, the cluster capacity is not an independently tuned hyperparameter. Instead, the $C$ channels are evenly distributed across $K$ clusters, so that each cluster size is either $\lfloor C/K \rfloor$ or $\lceil C/K \rceil$. Accordingly, the previously used symbol $S_{\max}$ should be understood as a derived balancing constraint rather than an additional free parameter.
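The distance in Equation (4) can be sketched with NumPy eigendecompositions. The text does not specify how the per-channel covariance is estimated, so `channel_covariances` below is an assumption: it treats the windows as observations of the $D_{f}$-dimensional feature vector and adds a small ridge term to keep each matrix strictly positive definite. The function names and random placeholder features are likewise illustrative.

```python
import numpy as np

def channel_covariances(F, ridge=1e-6):
    """F: (C, N_w, D_f) features -> (C, D_f, D_f) SPD matrices.

    Windows are treated as observations; the ridge term is an assumption
    added here to guarantee strict positive definiteness.
    """
    X = F - F.mean(axis=1, keepdims=True)
    cov = np.einsum("cwi,cwj->cij", X, X) / (F.shape[1] - 1)
    return cov + ridge * np.eye(F.shape[2])

def airm_distance(A, B):
    """Affine-invariant Riemannian distance between two SPD matrices, Eq. (4)."""
    w, U = np.linalg.eigh(A)
    A_inv_sqrt = (U / np.sqrt(w)) @ U.T            # A^{-1/2} via eigendecomposition
    lam = np.linalg.eigvalsh(A_inv_sqrt @ B @ A_inv_sqrt)
    # Frobenius norm of the matrix logarithm = l2 norm of the log-eigenvalues.
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))

def pairwise_airm(covs):
    """Symmetric (C, C) matrix of AIRM distances between channels."""
    C = covs.shape[0]
    D = np.zeros((C, C))
    for i in range(C):
        for j in range(i + 1, C):
            D[i, j] = D[j, i] = airm_distance(covs[i], covs[j])
    return D

rng = np.random.default_rng(0)
F_demo = rng.standard_normal((5, 46, 14))          # placeholder features
covs = channel_covariances(F_demo)
D = pairwise_airm(covs)
```

The defining property is affine invariance: replacing every $\Sigma$ by $M \Sigma M^{\top}$ for an invertible $M$ leaves the distance unchanged, which is not true of the Euclidean distance between flattened matrices.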

Methodology. For each mini-batch, we extract per-channel windowed features and use known three-dimensional sensor coordinates $p_{c} \in \mathbb{R}^{3}$. The method maintains $K$ stable spatial centers $\{c_{k}\}_{k=1}^{K}$ through the following steps.

1. Stable center initialization. We initialize K stable spatial centers by farthest-point sampling in the electrode coordinate space so that the initial partition covers the scalp as evenly as possible.

2. Batch-specific centroid proposals. For each sample in the mini-batch, temporary centroids $\{c_{k}'\}_{k=1}^{K}$ are identified to capture transient functionally correlated activations. This is achieved by computing inter-channel distances using either the AIRM on the SPD covariance matrices of windowed features or, for comparison, Euclidean distances on flattened features, followed by farthest-point sampling.

3. Geometry-aware matching and update. The temporary assignments from all samples in the mini-batch are aggregated to obtain batch-averaged centroid proposals, which are then matched to the closest stable centers in the electrode coordinate space. The stable centers are updated by

$$c_{k} \leftarrow (1 - \alpha) c_{k} + \alpha \bar{c}_{k}', \tag{5}$$

where $\bar{c}_{k}'$ denotes the matched batch-averaged proposal and $\alpha$ is an EMA smoothing coefficient.

4. Balanced spatial assignment. All channels are assigned to the updated centers according to spatial proximity while enforcing balanced capacities

$$|C_{k}| \in \left\{ \left\lfloor \frac{C}{K} \right\rfloor, \left\lceil \frac{C}{K} \right\rceil \right\}, \qquad \sum_{k=1}^{K} |C_{k}| = C, \tag{6}$$

so that the resulting partition remains spatially contiguous and numerically balanced. The obtained channel groups define the tokens for the subsequent Transformer encoder. The overall hybrid clustering procedure is summarized in Algorithm 1.

Algorithm 1 EEG-RCformer Hybrid Clustering

Require: Full feature tensor $F \in \mathbb{R}^{B \times C \times N_{w} \times D_{f}}$, electrode 3D positions $P \in \mathbb{R}^{C \times 3}$, number of clusters $K$, smoothing coefficient $\alpha$.
Ensure: Balanced channel assignments $A$ for token construction.
1: Initialize stable centers $\mathcal{C} = \{c_{k}\}_{k=1}^{K}$ by farthest-point sampling on $P$.
2: for each mini-batch do
3:     Compute channel-wise SPD covariance matrices from the windowed features.
4:     For each sample, identify temporary functional centroids $\mathcal{C}' = \{c_{k}'\}_{k=1}^{K}$ using farthest-point sampling with AIRM or Euclidean distance.
5:     Aggregate temporary assignments across the mini-batch to obtain batch-averaged centroid proposals $\bar{\mathcal{C}}'$.
6:     for $k = 1$ to $K$ do
7:         Match the nearest proposal in $\bar{\mathcal{C}}'$ to stable center $c_{k}$.
8:         Update the stable center by EMA: $c_{k} \leftarrow (1 - \alpha) c_{k} + \alpha \bar{c}_{k}'$.
9:     end for
10:    Set target cluster sizes $\{s_{k}\}_{k=1}^{K}$ such that $s_{k} \in \{\lfloor C/K \rfloor, \lceil C/K \rceil\}$ and $\sum_{k=1}^{K} s_{k} = C$.
11:    Compute distances between all electrodes and the updated centers in 3D coordinate space.
12:    for each channel $i$ do
13:        Assign $i$ to the nearest cluster whose current size is smaller than its target size.
14:    end for
15: end for
16: return assignments $A$.
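The three building blocks of Algorithm 1 (farthest-point sampling, the EMA update, and balanced assignment) can be sketched as below. Several details are assumptions made for illustration, since the text leaves them open: a proposal's coordinate is taken to be the 3D position of the sampled channel, a single sample stands in for the batch average, channels are assigned in index order, the clusters that receive the extra slot are chosen arbitrarily, and $\alpha = 0.1$ is a placeholder value. The montage and channel descriptors are random stand-ins.

```python
import numpy as np

def farthest_point_sampling(D, k, start=0):
    """Pick k indices from a (C, C) distance matrix, greedily maximizing spread."""
    chosen = [start]
    min_dist = D[start].copy()            # distance of each point to the chosen set
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, D[nxt])
    return np.array(chosen)

def ema_update(centers, proposals, alpha):
    """Move each stable center toward its nearest proposal, Eq. (5)."""
    d = np.linalg.norm(centers[:, None, :] - proposals[None, :, :], axis=-1)
    matched = proposals[d.argmin(axis=1)]
    return (1.0 - alpha) * centers + alpha * matched

def balanced_assign(positions, centers):
    """Greedy nearest-center assignment with sizes floor(C/K) or ceil(C/K)."""
    C, K = len(positions), len(centers)
    target = np.full(K, C // K)
    target[: C % K] += 1                  # which clusters get the extra slot is arbitrary
    d = np.linalg.norm(positions[:, None, :] - centers[None, :, :], axis=-1)
    sizes = np.zeros(K, dtype=int)
    assign = np.empty(C, dtype=int)
    for i in range(C):
        for k in np.argsort(d[i]):        # nearest cluster that still has room
            if sizes[k] < target[k]:
                assign[i] = k
                sizes[k] += 1
                break
    return assign

rng = np.random.default_rng(0)
C, K, alpha = 62, 5, 0.1                  # alpha is a placeholder value
pos = rng.standard_normal((C, 3))
pos /= np.linalg.norm(pos, axis=1, keepdims=True)        # placeholder montage on a unit sphere
D_space = np.linalg.norm(pos[:, None] - pos[None], axis=-1)
centers = pos[farthest_point_sampling(D_space, K)]       # step 1
feats = rng.standard_normal((C, 8))                      # placeholder channel descriptors
D_func = np.linalg.norm(feats[:, None] - feats[None], axis=-1)  # Euclidean variant
proposals = pos[farthest_point_sampling(D_func, K)]      # steps 4-5, one sample only
centers = ema_update(centers, proposals, alpha)          # steps 6-9
assign = balanced_assign(pos, centers)                   # steps 10-14
print(np.bincount(assign, minlength=K))
```

Passing a pairwise AIRM distance matrix as `D_func` would give the geometry-informed variant; the rest of the procedure is unchanged. With 62 channels and 5 clusters, every cluster ends up with 12 or 13 channels. Note that nearest-proposal matching, as written in step 7, allows two stable centers to share one proposal.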

2.3. Cluster-Based Token Generation

After clustering, we construct the input tokens for the Transformer. The feature tensor $F$ is flattened per channel and projected into a $d_{model}$-dimensional embedding space using a linear layer followed by layer normalization and a ReLU activation:

$$P = \mathrm{ReLU}\left( \mathrm{LayerNorm}\left( \mathrm{Flatten}(F) W_{p} + b_{p} \right) \right), \tag{7}$$

where $W_{p} \in \mathbb{R}^{(N_{w} D_{f}) \times d_{model}}$ and $b_{p} \in \mathbb{R}^{d_{model}}$ are learnable parameters, yielding channel embeddings $P \in \mathbb{R}^{C \times d_{model}}$.

We then aggregate the embeddings within each cluster by average pooling:

$$T_{k} = \frac{1}{|C_{k}|} \sum_{i \in C_{k}} P_{i}, \tag{8}$$

where $T_{k} \in \mathbb{R}^{d_{model}}$ denotes the token of the $k$-th cluster, $P \in \mathbb{R}^{C \times d_{model}}$ denotes the channel embeddings, and $C_{k}$ is the set of channel indices assigned to that cluster. This yields a token sequence $T \in \mathbb{R}^{K \times d_{model}}$.

Average pooling inside each cluster is chosen because it provides a simple and stable summary of all channels assigned to the same region-level token. In contrast to selecting a single representative electrode, averaging preserves shared spectral trends within a region and reduces the influence of noisy or idiosyncratic channels. Since the preceding clustering stage already enforces balanced cluster sizes and spatial plausibility, the pooled token can be interpreted as a compact representation of a functionally coherent regional pattern rather than a purely anatomical average.
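Equations (7) and (8) amount to a per-channel projection followed by a grouped mean. The sketch below is illustrative only: the weights are random stand-ins for learned parameters, the layer normalization omits learnable gain and bias, and the round-robin assignment is a placeholder for the output of the clustering stage.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalization over the embedding axis, without learnable gain/bias.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def cluster_tokens(F, assign, W_p, b_p, K):
    """F: (C, N_w, D_f); assign: (C,) cluster ids in [0, K) -> tokens (K, d_model)."""
    flat = F.reshape(F.shape[0], -1)                      # (C, N_w * D_f)
    P = np.maximum(layer_norm(flat @ W_p + b_p), 0.0)     # Eq. (7)
    return np.stack([P[assign == k].mean(axis=0) for k in range(K)])  # Eq. (8)

rng = np.random.default_rng(0)
C, N_w, D_f, d_model, K = 10, 46, 14, 128, 5
F_demo = rng.standard_normal((C, N_w, D_f))
W_p = rng.standard_normal((N_w * D_f, d_model)) * 0.02    # random stand-in for learned weights
b_p = np.zeros(d_model)
assign_demo = np.arange(C) % K                            # placeholder balanced assignment
T = cluster_tokens(F_demo, assign_demo, W_p, b_p, K)
print(T.shape)
```

Because the balanced assignment guarantees that no cluster is empty, the mean in Equation (8) is always well defined.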

2.4. Cognitive State Classification with a Transformer Encoder

The final stage employs a standard Transformer encoder to model relationships among the cluster tokens. A learnable positional encoding $P_{pos} \in \mathbb{R}^{K \times d_{model}}$ is added to the token sequence:

$$Z^{(0)} = T + P_{pos}. \tag{9}$$

The embedded sequence is then processed by a stack of $N_{L}$ identical Transformer encoder layers. Each layer consists of a multi-head self-attention (MHSA) module and a position-wise feed-forward network (FFN), together with residual connections and layer normalization (LN):

$$Z^{\prime (l)} = \mathrm{LN}\left( Z^{(l-1)} + \mathrm{MHSA}(Z^{(l-1)}) \right), \tag{10}$$

$$Z^{(l)} = \mathrm{LN}\left( Z^{\prime (l)} + \mathrm{FFN}(Z^{\prime (l)}) \right). \tag{11}$$

After the final Transformer layer, global max pooling is applied across the encoded token sequence to obtain a trial-level representation:

$$z = \max_{k = 1, \dots, K} Z_{k}^{(N_{L})}, \tag{12}$$

which is then passed to the final classifier. In the default implementation, the Transformer uses $d_{model} = 128$, $N_{h} = 4$ attention heads, $N_{L} = 3$ encoder layers, a feed-forward dimension of 256, and dropout of 0.5. When clustering is enabled, the number of tokens equals the number of clusters $K$.
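The forward pass of Equations (9)–(12) can be written out in plain NumPy to make the tensor shapes explicit. This is a sketch under simplifying assumptions, not the trained model: weights are random stand-ins, dropout is omitted (as at inference), attention projections have no bias, layer normalization has no learnable gain or bias, and the FFN is assumed to use a ReLU activation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mhsa(Z, p, n_heads):
    K, d = Z.shape
    d_h = d // n_heads

    def split(X):                                   # (K, d) -> (heads, K, d_h)
        return X.reshape(K, n_heads, d_h).transpose(1, 0, 2)

    Q, Kh, V = split(Z @ p["Wq"]), split(Z @ p["Wk"]), split(Z @ p["Wv"])
    A = softmax(Q @ Kh.transpose(0, 2, 1) / np.sqrt(d_h))   # (heads, K, K)
    return (A @ V).transpose(1, 0, 2).reshape(K, d) @ p["Wo"]

def encoder_layer(Z, p, n_heads):
    Z_mid = layer_norm(Z + mhsa(Z, p, n_heads))                           # Eq. (10)
    ffn = np.maximum(Z_mid @ p["W1"] + p["b1"], 0.0) @ p["W2"] + p["b2"]
    return layer_norm(Z_mid + ffn)                                        # Eq. (11)

def encode_trial(T, P_pos, layers, n_heads=4):
    Z = T + P_pos                                                         # Eq. (9)
    for p in layers:
        Z = encoder_layer(Z, p, n_heads)
    return Z.max(axis=0)                            # element-wise max over tokens, Eq. (12)

def init_layer(rng, d=128, d_ff=256, scale=0.05):
    # Random stand-ins for trained weights.
    shapes = {"Wq": (d, d), "Wk": (d, d), "Wv": (d, d), "Wo": (d, d),
              "W1": (d, d_ff), "W2": (d_ff, d)}
    p = {name: rng.standard_normal(shape) * scale for name, shape in shapes.items()}
    p["b1"], p["b2"] = np.zeros(d_ff), np.zeros(d)
    return p

rng = np.random.default_rng(0)
K, d_model = 5, 128
layers = [init_layer(rng) for _ in range(3)]        # N_L = 3
T = rng.standard_normal((K, d_model))               # placeholder cluster tokens
P_pos = rng.standard_normal((K, d_model)) * 0.02    # stand-in positional encoding
z = encode_trial(T, P_pos, layers)
print(z.shape)
```

One consequence of this design is worth noting: without the positional encoding, the max-pooled output would be invariant to the order of the cluster tokens, so $P_{pos}$ is what lets the encoder distinguish one region from another.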

2.5. Datasets, Preprocessing, and Experimental Setup

The proposed model is evaluated on two public EEG datasets, MODMA [22] and SEED [23]. MODMA is used here as a binary classification task, where the two classes correspond to major depressive disorder (MDD) and normal control (NC), and it contains 128-channel resting-state EEG recordings from 53 subjects. SEED is used as a three-class emotion recognition task, where the labels are positive, neutral, and negative emotions; the dataset contains 62-channel EEG recordings collected from 15 subjects during film-clip-based emotion elicitation.

For both datasets, the EEG signals are band-pass filtered between 0.5 and 50 Hz. In the implementation, feature extraction is performed at a sampling rate of 250 Hz with a window length of 250 samples and a stride of 50 samples. Therefore, each window spans 1.0 s and adjacent windows overlap by 80%. PSD and DE features are then extracted from seven standard frequency bands as described in Section 2.1.

We considered two evaluation paradigms. In the subject-dependent setting, 5-fold cross-validation was performed within each subject; in the subject-independent setting, stratified cross-validation with five held-out subjects was used for MODMA, whereas a leave-one-subject-out (LOSO) protocol was adopted for SEED. Model performance was primarily evaluated using the area under the receiver operating characteristic curve (AUC). Unless otherwise stated, the default model configuration uses $K = 5$ clusters, $d_{model} = 128$, 4 attention heads, 3 Transformer encoder layers, and a feed-forward dimension of 256. All experiments were implemented on a server equipped with two NVIDIA GeForce RTX 3090 graphics processing units (GPUs) running Ubuntu 24.04.
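The LOSO protocol used for SEED can be sketched as an index generator. The function name and the number of trials per subject are illustrative assumptions; only the count of 15 subjects comes from the dataset description.

```python
import numpy as np

def loso_splits(subject_ids):
    """Yield (held_out_subject, train_idx, test_idx) for leave-one-subject-out."""
    subject_ids = np.asarray(subject_ids)
    for s in np.unique(subject_ids):
        test_idx = np.flatnonzero(subject_ids == s)
        train_idx = np.flatnonzero(subject_ids != s)
        yield s, train_idx, test_idx

# Placeholder labels: 15 subjects with 3 trials each (the trial count is
# illustrative only, not the SEED protocol).
subjects = np.repeat(np.arange(15), 3)
splits = list(loso_splits(subjects))
print(len(splits))
```

Each held-out subject contributes one fold, so no trial from the test subject is ever seen during training. This is what distinguishes the subject-independent setting from within-subject cross-validation.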

The two datasets are deliberately complementary. MODMA evaluates whether the model can distinguish a clinically relevant resting-state condition with substantial cross-subject heterogeneity, while SEED tests whether the same architecture can capture stimulus-induced emotional states with richer category structure. Therefore, using both datasets allows us to assess whether the proposed clustering-and-Transformer design is only beneficial for one specific application or whether it transfers across distinct EEG decoding scenarios.

3. Results

3.1. Main Results

As summarized in Table 1, EEG-RCformer achieved competitive performance on both datasets relative to the compared baselines. The observed improvements may be related to several aspects of the proposed design. Region-level tokenization reduces channel redundancy and may alleviate spatial misalignment across electrodes, which can be relevant in the subject-independent setting. In addition, the trial-adaptive centroid proposal may help preserve transient co-activation patterns during token construction. The use of a covariance-aware distance metric may also retain cross-band relations that are less explicitly modeled in standard per-channel baselines. Taken together, these results suggest that the hybrid clustering design can provide informative and compact tokens for subsequent Transformer modeling.

Table 1 shows that the numerical advantage of EEG-RCformer over the Euclidean-only variant is largest in the subject-independent setting on MODMA and in the subject-dependent setting on SEED. This pattern may reflect differences in the dominant sources of variation across the two datasets: MODMA places stronger demands on cross-subject robustness, whereas SEED may benefit more from region-level compression together with within-subject global token interaction. At the same time, these observations should be interpreted cautiously, because the evaluation is limited to two datasets and the statistical evidence is not uniform across all settings. Overall, the results suggest that the proposed framework can be useful across more than one EEG decoding scenario, but broader validation is still needed.

To examine the sensitivity to region granularity, we additionally evaluated $K = 3$ and $K = 9$ under the same AIRM-informed setting and report the results in Table 1. On MODMA, the performances of $K = 3$ and $K = 5$ were close, suggesting that the proposed model is relatively robust to coarse regional partitioning for this task. On SEED, $K = 5$ outperformed both $K = 3$ and $K = 9$, indicating that an intermediate number of clusters provides a better trade-off between preserving functional specialization and avoiding over-fragmentation. Overall, there is no universally optimal cluster number; the most suitable $K$ depends on the task characteristics, the degree of inter-subject variability, and the acquisition setup, especially the number and spatial density of EEG channels.

Statistical Significance of AIRM Versus Euclidean Metric

To assess whether the improvement brought by the Riemannian metric over the Euclidean metric is statistically significant, we compared the fold-wise AUC values of the full model and the Euclidean-only variant under identical data splits. Because the two methods were evaluated on the same folds or held-out subjects, paired statistical testing is appropriate. We therefore used a paired t-test, and the results are summarized in Table 2.

As shown in Table 2, the AIRM-informed model achieves higher mean AUC values than the Euclidean-only variant in all four evaluation settings. The paired t-test indicates statistically significant improvements for MODMA under both the subject-dependent ($p = 0.0008$) and -independent ($p = 0.0127$) settings, as well as for SEED under the subject-dependent setting ($p = 0.0426$), with large effect sizes in all three cases. By contrast, although AIRM still yields a higher mean AUC value for SEED under the subject-independent setting, the difference is not statistically significant ($p = 0.3833$).
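The paired comparison can be illustrated with a few lines of NumPy. The AUC values below are made-up placeholders chosen only to show the computation; they are not the values behind Table 2. Reporting Cohen's $d_{z}$ as the effect size is also an assumption, since the text does not name the measure used.

```python
import numpy as np

def paired_t(a, b):
    """Return (t statistic, degrees of freedom, Cohen's d_z) for paired samples."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    n = diff.size
    sd = diff.std(ddof=1)                 # sample standard deviation of the differences
    t = diff.mean() / (sd / np.sqrt(n))
    return t, n - 1, diff.mean() / sd

# Made-up fold-wise AUCs for illustration only; NOT the paper's results.
auc_airm = [0.81, 0.84, 0.86, 0.83, 0.85]
auc_eucl = [0.80, 0.82, 0.83, 0.81, 0.83]
t_stat, dof, d_z = paired_t(auc_airm, auc_eucl)
print(round(t_stat, 3), dof, round(d_z, 3))
```

The two-sided p-value follows from the Student t distribution with `dof` degrees of freedom; `scipy.stats.ttest_rel` computes the statistic and p-value together. Because the test operates on fold-wise differences, it is only valid when both models are evaluated on identical splits, as stated above.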

3.2. Computational Complexity Analysis

Let $C$ be the number of channels and $K$ the number of clusters, where $K \ll C$. The computational complexity of a Transformer encoder, measured in FLOPs, is dominated by two terms:

$$\mathrm{FLOPs}_{enc} \approx c_{1} L d^{2} + c_{2} L^{2} d,$$

where $L$ is the sequence length (the number of tokens, not the window length of Section 2.1), $d$ is the model dimension, and $c_{1}$ and $c_{2}$ are constants. Reducing the token length from $C$ to $K$ therefore shrinks the first term by a factor of $C/K$ and the second by a factor of $(C/K)^{2}$.

As quantified in Table 3, the choice of the Riemannian versus Euclidean metric for clustering has a negligible impact on the model’s parameter count and FLOPs. The primary computational benefit arises from the clustering itself, which drastically reduces FLOPs compared to the no-clustering baseline. This reduction is especially pronounced when processing high-density, 128-channel EEG data.
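As a rough worked example of the two dominant terms, the snippet below compares the no-clustering case (one token per channel) with the default $K = 5$. The constants $c_{1}$ and $c_{2}$ are unspecified in the text and are set to 1 here, so only the per-term ratios are meaningful; these are not the measured FLOPs of Table 3.

```python
def encoder_cost(n_tokens, d_model, c1=1.0, c2=1.0):
    """Dominant cost terms c1*L*d^2 and c2*L^2*d; c1 and c2 are placeholder constants."""
    linear = c1 * n_tokens * d_model ** 2
    quadratic = c2 * n_tokens ** 2 * d_model
    return linear, quadratic

ratios = {}
for n_channels in (128, 62):              # MODMA and SEED channel counts
    lin_c, quad_c = encoder_cost(n_channels, 128)
    lin_k, quad_k = encoder_cost(5, 128)
    ratios[n_channels] = (lin_c / lin_k, quad_c / quad_k)
    print(n_channels, ratios[n_channels])
```

For the 128-channel montage the linear term shrinks by $128/5 = 25.6$ and the quadratic term by $(128/5)^{2} = 655.36$; for 62 channels the factors are 12.4 and 153.76. The overall saving lies between the two factors, depending on the relative size of $c_{1}$ and $c_{2}$.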

3.3. Model Visualization and Interpretation

As illustrated in Figure 2, the proposed model attributes higher weight to the α (8–12 Hz) and β (13–30 Hz) bands. This observation is broadly consistent with prior EEG decoding and affective neuroscience studies showing that discriminative information is often distributed across alpha- and beta-related activity rather than concentrated in a single universal band. Instead of assuming a one-band-one-state mapping, our method preserves seven standard frequency bands and jointly models their PSD and DE representations, allowing the Transformer to learn task-specific cross-band interactions from the data.

The attention map between tokens, shown in Figure 3, reveals the presence of hub-like clusters that integrate information from multiple others. This pattern is consistent with a functional topography where certain nodes facilitate long-range integration while others specialize in more local processing. Figure 4 demonstrates the dynamic nature of the learned clusters over training.

For clarity, the post-clustering visualizations are not ordered by the original electrode index or by the rasterized 2D scalp layout. Instead, channels are grouped according to learned cluster membership. Consequently, the patterns after clustering may appear less regular visually, but this does not indicate weaker structure; rather, the display emphasizes functionally aggregated regional tokens instead of the acquisition order of electrodes.

3.4. Ablation Studies

We conducted ablation studies to deconstruct the performance contributions of the key components in our model.

3.4.1. Impact of the Clustering Module

We compared our full model against a “No Clustering” variant, where each channel’s feature vector was treated as an individual token for the Transformer. Removing the clustering module resulted in consistent performance degradation across the evaluated settings. This finding suggests that the clustering module plays an important role in organizing spatial information into a compact set of region-level tokens, which may help the Transformer model interactions among larger-scale patterns more efficiently [27].

3.4.2. Impact of the Clustering Metric: Riemannian vs. Euclidean

We evaluated our proposed AIRM against a standard Euclidean distance for guiding cluster updates. The AIRM-based variant achieved higher mean AUC values in all four evaluation settings. Statistical testing further indicates significant differences for MODMA under both the subject-dependent and -independent settings and for SEED under the subject-dependent setting, whereas the subject-independent SEED setting shows only a non-significant mean improvement. These results suggest that respecting the intrinsic non-Euclidean geometry of covariance matrices can provide a more informative signal for cluster formation in this framework, although the magnitude and statistical reliability of the benefit appear to remain dataset- and setting-dependent [21].

3.4.3. Sensitivity to the Number of Clusters

We further analyzed the sensitivity of EEG-RCformer to the number of clusters $K$. Figure 5 visualizes the initial spatial partitions for $K = 3$, 5, and 9, and the quantitative results are reported in Table 1. The results show that the best choice of $K$ is task-dependent rather than universal. For MODMA, $K = 3$ and $K = 5$ yielded comparable results, indicating that a relatively coarse spatial grouping may already be sufficient for resting-state MDD recognition. For SEED, $K = 5$ outperformed both $K = 3$ and $K = 9$, suggesting that emotion recognition benefits from an intermediate regional granularity. When $K$ is too small, different functional subregions may be overly merged; when $K$ is too large, the spatial partition may become fragmented and the statistical reliability of each token may decrease. Therefore, in practice, $K$ should be selected jointly according to the target task, the number of channels, and the electrode layout rather than assumed to have a single universally optimal value.

4. Discussion

4.1. Why Geometry-Aware Clustering Benefits EEG Decoding

The observed performance differences of EEG-RCformer may be interpreted from the perspectives of both signal geometry and model efficiency. EEG channel interactions are often redundant, while covariance descriptors computed from multi-band windowed features are SPD matrices that do not naturally reside in Euclidean vector space. Prior Riemannian EEG studies suggest that manifold-aware distances can preserve the intrinsic structure of covariance representations more faithfully than direct vectorization can [16,18]. In our framework, this property is used to propose functionally informative centroids, while the final balanced assignment remains spatially constrained. This hybrid design may be beneficial for EEG because a purely functional clustering strategy can be unstable across trials, whereas a purely anatomical partition may overlook transient co-activation patterns. From this perspective, EEG-RCformer can be viewed as producing tokens that are compact, spatially interpretable, and adaptive to the current input. This may be particularly relevant in subject-independent decoding, where direct channel-wise comparison across individuals is affected by inter-subject variability and distribution shift [6,7].

Another possible advantage of the proposed design is that it separates two roles that are often intertwined in EEG modeling. One role is to determine which channels should be summarized together so that the Transformer does not devote excessive capacity to redundant or noisy fine-grained tokens; the other is to evaluate similarity between channels in a way that respects feature geometry. In EEG-RCformer, these roles are handled by coordinated but distinct mechanisms: spatially balanced assignment controls token compactness and interpretability, whereas AIRM is used to propose functionally informative centroids. This separation may partly explain why the method remains computationally efficient while showing better mean performance than the Euclidean-only alternative.

4.2. Interpretation of Frequency-Band and Regional Effects

The band-importance and clustering results require task-specific interpretation. Existing EEG emotion and cognitive-state studies do not support a single universally dominant frequency band for all tasks, datasets, or label spaces. Instead, informative patterns are often distributed across alpha-, beta-, and gamma-related activities, with their relative importance depending on the experimental paradigm and the definition of categories. Accordingly, our model does not assume a one-band-one-state mapping. Rather, it retains seven frequency bands and jointly models their PSD and DE representations, allowing the Transformer to learn task-specific cross-band interactions from the data. At the regional level, the learned clusters should also be viewed as data-driven functional summaries rather than direct estimates of physiological connectivity between electrodes. They reflect how the model organizes channels to improve decoding performance under a balanced spatial prior.

A high importance assigned to a frequency band indicates its contribution to the current classification task, but it does not imply that the band is the sole neurophysiological substrate of the underlying mental state. Likewise, the learned clusters summarize how the network compresses and routes EEG information under the imposed spatial constraints, rather than constituting a direct parcellation of brain function. These visualization results provide an interpretable account of model behavior, but they do not replace dedicated neuroscience experiments.

4.3. Limitations and Future Work

This study has several limitations. First, the current evaluation is restricted to two public datasets, namely MODMA and SEED. Although these datasets cover both clinical-state decoding and emotion recognition, they do not fully represent the diversity of EEG paradigms, recording montages, and acquisition conditions encountered in practice. Second, the current visualizations reflect learned token relations and cluster evolution, but they should not be interpreted as explicit physiological connectivity analyses. A more direct connectivity study, for example one using coherence- or graph-based measures, would provide a stronger neuroscientific interpretation and will be investigated in future work. Third, the present framework relies on handcrafted PSD/DE features; an end-to-end variant that learns multiscale spectral-spatial representations directly from raw EEG may further improve generalization. Future work will therefore validate the framework on more diverse EEG tasks and datasets, strengthen visualization with explicit connectivity analysis, and explore end-to-end geometry-aware feature learning.

A further limitation is that the present experiments focus on classification performance and computational efficiency, whereas other practically relevant properties such as calibration time, robustness to missing channels, and sensitivity to electrode montage changes were not evaluated in this study. These issues are highly relevant for real-world BCI deployment and may interact with the proposed clustering strategy in meaningful ways. For example, region-level tokens may prove advantageous when some channels are noisy or unavailable, but this hypothesis requires dedicated testing. We leave these questions to future work.

5. Conclusions

We propose EEG-RCformer, a Transformer-based framework that incorporates Riemannian geometry for EEG state decoding. The proposed geometry-aware hybrid clustering module integrates stable spatial information with trial-specific functional patterns, with the aim of reducing spatial redundancy and better respecting the geometry of covariance representations. Experiments on MODMA and SEED indicate competitive performance, with AUC values of 0.9802 and 0.7154 on MODMA and 0.8541 and 0.8011 on SEED under the subject-dependent and subject-independent paradigms, respectively. The statistical analysis further indicates that AIRM-informed clustering yields statistically significant improvements over the Euclidean-only variant on MODMA in both evaluation paradigms and on SEED in the subject-dependent paradigm, whereas the subject-independent SEED comparison shows only a non-significant mean improvement. The model also offers a degree of interpretability through attention and feature-importance analyses. Nevertheless, the current evaluation is limited to two public datasets, so broader validation across additional EEG tasks and acquisition settings is still needed. Future work will replace handcrafted features with end-to-end representations and further examine manifold-based clustering strategies for broader BCI applications.

Author Contributions

Conceptualization, methodology, writing—original draft preparation, L.F.; writing—review and editing, G.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data used in this study are publicly available from the MODMA and SEED datasets cited in the manuscript. The implementation code for EEG-RCformer is available at https://github.com/floydfeng-coding/EEG-RCFormer (accessed on 1 September 2025).

Acknowledgments

The authors would like to thank the providers of the MODMA and SEED datasets for making their data publicly available.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Wolpaw, J.R.; Birbaumer, N.; McFarland, D.J.; Pfurtscheller, G.; Vaughan, T.M. Brain-computer interfaces for communication and control. Clin. Neurophysiol. 2002, 113, 767–791.
2. Nicolas-Alonso, L.F.; Gomez-Gil, J. Brain computer interfaces, a review. Sensors 2012, 12, 1211–1279.
3. Lotte, F.; Bougrain, L.; Cichocki, A.; Clerc, M.; Congedo, M.; Rakotomamonjy, A.; Yger, F. A review of classification algorithms for EEG-based brain–computer interfaces: A 10 year update. J. Neural Eng. 2018, 15, 031005.
4. Jin, J.; Miao, Y.; Daly, I.; Zuo, C.; Hu, D.; Cichocki, A. EEG-based brain–computer interfaces (BCIs): A survey of recent studies on signal sensing technologies and computational intelligence approaches and their applications. IEEE/ACM Trans. Comput. Biol. Bioinform. 2020, 18, 1645–1666.
5. Urigüen, J.A.; Garcia-Zapirain, B. EEG artifact removal-state-of-the-art and guidelines. J. Neural Eng. 2015, 12, 031001.
6. Jayaram, V.; Alamgir, M.; Altun, Y.; Schölkopf, B.; Grosse-Wentrup, M. Transfer learning in brain–computer interfaces. IEEE Comput. Intell. Mag. 2016, 11, 20–31.
7. Hang, W.; Feng, W.; Du, R.; Liang, S.; Chen, Y.; Wang, Q.; Liu, X. Cross-Subject EEG Signal Recognition Using Deep Domain Adaptation Network. IEEE Access 2019, 7, 128273–128282.
8. Lawhern, V.J.; Solon, A.J.; Waytowich, N.R.; Gordon, S.M.; Hung, C.P.; Lance, B.J. EEGNet: A compact convolutional neural network for EEG-based brain–computer interfaces. J. Neural Eng. 2018, 15, 056013.
9. Schirrmeister, R.T.; Springenberg, J.T.; Fiederer, L.D.J.; Glasstetter, M.; Eggensperger, K.; Tangermann, M.; Hutter, F.; Burgard, W.; Ball, T. Deep learning with convolutional neural networks for EEG decoding and visualization. Hum. Brain Mapp. 2017, 38, 5391–5420.
10. Wang, P.; Jiang, A.; Liu, X.; Shang, J.; Zhang, L. LSTM-based EEG classification in motor imagery tasks. IEEE Trans. Neural Syst. Rehabil. Eng. 2018, 26, 2086–2095.
11. Zhang, D.; Yao, L.; Zhang, X.; Wang, S.; Chen, W.; Boots, R.; Benatallah, B. Cascade and parallel convolutional recurrent neural networks on EEG-based intention recognition for brain computer interface. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI: Washington, DC, USA, 2018; Volume 32.
12. Song, Y.; Zheng, Q.; Liu, B.; Gao, X. EEG conformer: Convolutional transformer for EEG decoding and visualization. IEEE Trans. Neural Syst. Rehabil. Eng. 2022, 31, 710–719.
13. Pan, Y.T.; Chou, J.L.; Wei, C.S. MAtt: A Manifold Attention Network for EEG Decoding. In Proceedings of the Advances in Neural Information Processing Systems; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 31116–31129.
14. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2017; Volume 30.
15. Tay, Y.; Dehghani, M.; Bahri, D.; Metzler, D. Efficient transformers: A survey. ACM Comput. Surv. 2020, 55, 109.
16. Barachant, A.; Bonnet, S.; Congedo, M.; Jutten, C. Riemannian geometry applied to BCI classification. Lect. Notes Comput. Sci. 2010, 6503, 629–636.
17. Barachant, A.; Bonnet, S.; Congedo, M.; Jutten, C. Classification of covariance matrices using a Riemannian-based kernel for BCI applications. Neurocomputing 2013, 112, 172–178.
18. Congedo, M.; Barachant, A.; Bhatia, R. Riemannian geometry for EEG-based brain-computer interfaces: A primer and a review. Brain-Comput. Interfaces 2017, 4, 155–174.
19. Pennec, X.; Fillard, P.; Ayache, N. A Riemannian framework for tensor computing. Int. J. Comput. Vis. 2006, 66, 41–66.
20. Arsigny, V.; Fillard, P.; Pennec, X.; Ayache, N. Log-Euclidean metrics for fast and simple calculus on diffusion tensors. Magn. Reson. Med. 2006, 56, 411–421.
21. Rodrigues, P.L.C.; Jutten, C.; Congedo, M. Riemannian procrustes analysis: Transfer learning for brain–computer interfaces. IEEE Trans. Biomed. Eng. 2018, 66, 2390–2401.
22. Cai, H.; Gao, Y.; Sun, S.; Li, N.; Tian, F.; Xiao, H.; Li, J.; Yang, Z.; Li, X.; Zhao, Q.; et al. MODMA dataset: A multi-modal open dataset for mental-disorder analysis. arXiv 2020, arXiv:2002.09283.
23. Zheng, W.L.; Lu, B.L. Investigating Critical Frequency Bands and Channels for EEG-based Emotion Recognition with Deep Neural Networks. IEEE Trans. Auton. Ment. Dev. 2015, 7, 162–175.
24. Ingolfsson, T.M.; Hersche, M.; Wang, X.; Kobayashi, N.; Cavigelli, L.; Benini, L. EEG-TCNet: An accurate temporal convolutional network for embedded motor-imagery brain–machine interfaces. In 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC); IEEE: Piscataway, NJ, USA, 2020; pp. 2958–2965.
25. Ding, Y.; Tong, C.; Zhang, S.; Jiang, M.; Li, Y.; Lim, K.J.; Guan, C. EmT: A novel transformer for generalized cross-subject EEG emotion recognition. IEEE Trans. Neural Netw. Learn. Syst. 2025, 36, 10381–10393.
26. Wu, H.; Hu, T.; Liu, Y.; Zhou, H.; Wang, J.; Long, M. TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023.
27. Tibermacine, I.E.; Russo, S.; Tibermacine, A.; Rabehi, A.; Nail, B.; Kadri, K.; Napoli, C. Riemannian geometry-based EEG approaches: A literature review. arXiv 2024, arXiv:2407.20250.

Figure 1. Overview of the proposed EEG-RCformer framework. (Top) Raw multi-channel EEG signals are segmented and converted into multi-band PSD/DE features to construct channel-wise SPD covariance matrices. These are passed to the channel cluster module and subsequently to the Transformer and MLP classifier. (Bottom) The channel cluster module initializes stable spatial centers and computes functional centers on the feature manifold. These are combined via $\alpha$ to form the final centers for re-clustering. The pooled clusters are then used as input tokens for the Transformer. Each colored dot represents one EEG channel, different colors indicate different cluster assignments, and different circles correspond to different clusters.

Figure 2. Band-wise feature importance weights extracted from the model, illustrating the relative contributions of PSD and DE features across seven frequency bands for the classification task.

Figure 3. Visualization of the inter-token attention weights among the five learned clusters, highlighting how regional tokens dynamically integrate information from other clusters during the decoding process.

Figure 4. Evolution of the channel clustering on the scalp, comparing the initial spatial partition with the final functionally adapted regions learned after the model’s training process.

Figure 5. Illustration of the initial balanced spatial partition for $K = 3$, 5, and 9. A smaller $K$ yields coarser regional tokens, whereas a larger $K$ produces finer-grained partitions.

Table 1. Comparison of AUC values for baseline models and ablation variants on the MODMA and SEED datasets. Bold values indicate the best result under each evaluation setting.

Model	MODMA Sub-Dependent	MODMA Sub-Independent	SEED Sub-Dependent	SEED Sub-Independent
EEGNet [8]	0.9091 ± 0.0039	0.6496 ± 0.0999	0.6809 ± 0.0030	0.6878 ± 0.0959
Conformer [12]	0.9347 ± 0.0051	0.6328 ± 0.1145	0.6124 ± 0.0522	0.6753 ± 0.1218
TCN [24]	0.9725 ± 0.0078	0.6952 ± 0.0724	0.7184 ± 0.0971	0.7653 ± 0.1401
EmT [25]	0.9731 ± 0.0029	0.7069 ± 0.0831	0.8378 ± 0.0022	0.8020 ± 0.1150
TimesNet [26]	0.9828 ± 0.0032	0.6671 ± 0.1581	0.6304 ± 0.1351	0.6134 ± 0.1276
Ours (AIRM-informed, $K = 5$)	0.9802 ± 0.0037	0.7154 ± 0.0608	0.8541 ± 0.0047	0.8011 ± 0.0507
Ours (AIRM-informed, $K = 3$)	0.9793 ± 0.0032	0.7083 ± 0.0552	0.8032 ± 0.0163	0.7772 ± 0.0822
Ours (AIRM-informed, $K = 9$)	0.9012 ± 0.0043	0.6845 ± 0.0773	0.8413 ± 0.0184	0.7941 ± 0.0851
Ours (Euclidean-only)	0.9492 ± 0.0046	0.6249 ± 0.1050	0.8424 ± 0.0049	0.7850 ± 0.0279
Ours (No clustering)	0.8966 ± 0.0160	0.6337 ± 0.0896	0.7559 ± 0.0095	0.7567 ± 0.0545

Table 2. Statistical significance analysis between the AIRM-informed model and the Euclidean-only variant.

Dataset	Setting	AIRM AUC	Euclidean AUC	p-Value	Effect Size
MODMA	Sub-dependent	0.9802 ± 0.0037	0.9492 ± 0.0046	0.0008	4.0197
MODMA	Sub-independent	0.7154 ± 0.0608	0.6249 ± 0.1050	0.0127	1.9191
SEED	Sub-dependent	0.8541 ± 0.0047	0.8424 ± 0.0049	0.0426	1.3129
SEED	Sub-independent	0.8011 ± 0.0507	0.7850 ± 0.0279	0.3833	0.4375
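
The entries of Table 2 can be summarized with a few lines of arithmetic. The sketch below copies the mean AUC values and p-values from the table and derives the mean AUC gain of the AIRM-informed model over the Euclidean-only variant in each setting. The significance threshold `ALPHA = 0.05` is an assumption that is consistent with the significance statements in the text, and the variable names are illustrative.

```python
# (dataset, setting, AIRM AUC, Euclidean AUC, p-value), copied from Table 2.
rows = [
    ("MODMA", "sub-dependent", 0.9802, 0.9492, 0.0008),
    ("MODMA", "sub-independent", 0.7154, 0.6249, 0.0127),
    ("SEED", "sub-dependent", 0.8541, 0.8424, 0.0426),
    ("SEED", "sub-independent", 0.8011, 0.7850, 0.3833),
]
ALPHA = 0.05  # assumed significance level

summary = []
for dataset, setting, auc_airm, auc_euc, p_value in rows:
    gain = round(auc_airm - auc_euc, 4)  # mean AUC gain of AIRM over Euclidean
    significant = p_value < ALPHA
    summary.append((dataset, setting, gain, significant))
    print(f"{dataset:5s} {setting:16s} gain={gain:.4f} significant={significant}")
```

Under this threshold, three of the four comparisons are significant, and the largest mean gain appears in the subject-independent MODMA setting.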

Table 3. Comparison of parameter counts and FLOPs across settings.

Setting	Params	FLOPs
Ours (full, AIRM-informed)	427,136	$7.1315 \times 10^{7}$
Ours (Euclidean-only)	427,136	$7.1315 \times 10^{7}$
Ours (no clustering)	426,624	$3.8046 \times 10^{10}$
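
The efficiency gap in Table 3 is easier to read as a ratio. The short calculation below uses only the values reported in the table; the variable names are illustrative.

```python
# Values copied from Table 3.
params_clustered = 427_136
params_unclustered = 426_624
flops_clustered = 7.1315e7
flops_unclustered = 3.8046e10

# Additional parameters introduced by the clustered variants.
extra_params = params_clustered - params_unclustered
# Factor by which clustering reduces the FLOP count.
flops_ratio = flops_unclustered / flops_clustered

print(f"extra parameters with clustering: {extra_params}")
print(f"FLOPs reduction factor: {flops_ratio:.1f}x")
```

In other words, the clustered variants add 512 parameters relative to the no-clustering model while requiring roughly 500 times fewer FLOPs, and the AIRM-informed and Euclidean-only variants have identical reported costs.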

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Feng, L.; Yan, G. Decoding Cognitive States via Riemannian Geometry-Informed Channel Clustering for EEG Transformers. Mathematics 2026, 14, 1327. https://doi.org/10.3390/math14081327

