1. Introduction
Electroencephalography (EEG)-based brain–computer interfaces (BCIs) have become an important paradigm for cognitive state decoding, affective computing, and neurological assessment because EEG is non-invasive, relatively inexpensive, and offers millisecond-level temporal resolution [1, 2]. Despite these advantages, reliable EEG decoding remains difficult in realistic settings due to the low signal-to-noise ratio of EEG, its non-stationary temporal dynamics, and the large inter-subject variability caused by anatomical, physiological, and behavioral differences across individuals [3, 4, 5]. These factors are particularly detrimental in subject-independent evaluation, where distribution shifts between training and test subjects often lead to marked performance degradation and limit the deployment of calibration-free BCI systems [6, 7].
In the deep learning era, convolutional neural networks (CNNs), recurrent neural networks (RNNs), graph-based models, and Transformers have all been applied to EEG decoding. CNN-based methods such as EEGNet are effective at learning local spatial–temporal patterns but are inherently limited by local receptive fields [8, 9]. RNN-based approaches can model temporal dependencies, yet their sequential processing constrains parallelization and often weakens long-range interaction modeling [10, 11]. More recently, Transformers have emerged as competitive backbones because self-attention can capture global dependencies across channels and temporal segments [12, 13]. However, for high-density EEG montages, directly treating all channels as tokens leads to substantial computational cost because the complexity of self-attention grows quadratically with the sequence length [14, 15].
A second and less discussed limitation is geometric mismatch. Many existing EEG decoders compare channel features in Euclidean space after vectorization, even though covariance-based EEG representations are symmetric positive definite (SPD) matrices that naturally lie on a Riemannian manifold rather than in a flat vector space [16, 17, 18]. Ignoring this structure may distort pairwise relations between channels and weaken robustness, especially under cross-subject variability. Riemannian geometry provides a principled way to compare SPD matrices through geodesic distances such as the affine-invariant Riemannian metric (AIRM), which has shown strong effectiveness in BCI-related classification and transfer settings [19, 20, 21].
These characteristics of EEG motivate the use of a geometry-aware region-level Transformer. Emotion- and cognition-related EEG activity is usually distributed across multiple frequency bands and cortical regions rather than being confined to a single electrode, while many neighboring channels are partially redundant. Therefore, an effective model should jointly address channel redundancy, preserve spatial plausibility, and capture non-Euclidean covariance structure. This motivates us to combine channel clustering with the affine-invariant Riemannian metric in a Transformer framework.
Another practical consideration is that EEG channels are not independent sensors in the same sense as image pixels are. Nearby electrodes often measure partially overlapping cortical activity because of volume conduction and limited spatial resolution, whereas distant channels can still become functionally coupled during task execution. Consequently, using all channels as equally weighted and mutually independent Transformer tokens may be suboptimal from both statistical and neurophysiological perspectives. Region-level aggregation provides a useful compromise: it compresses redundant channels, stabilizes spatial organization across subjects, and exposes higher-level interaction patterns to self-attention. However, such aggregation should not be purely anatomical, because the most informative channel relations can vary across trials and mental states. This is precisely why we introduce a hybrid strategy that balances spatially stable clustering with trial-adaptive, geometry-aware centroid proposals.
To address the above challenges, we propose EEG-RCformer, a Riemannian geometry-informed channel clustering Transformer for EEG decoding. The model first extracts multi-band spatiotemporal features from sliding EEG windows and constructs channel-wise covariance descriptors. AIRM is then used to identify functionally diverse centroid proposals, which are integrated with stable spatial centers through a hybrid clustering procedure. The resulting cluster tokens are finally processed by a lightweight Transformer encoder for classification. Compared with directly operating on all channels, this design reduces token redundancy while preserving spatial plausibility and trial-adaptive functional information.
The contributions of this work are threefold. First, we introduce a hybrid channel clustering mechanism that combines stable spatial partitioning with AIRM-based functional centroid proposals. Second, we show that geometry-aware token construction improves EEG decoding, especially under subject-independent evaluation. Third, we provide ablation and sensitivity analyses to examine the effect of the clustering metric and the number of clusters.
From a methodological perspective, the present work also aims to bridge two lines of EEG research that are often studied separately. One line emphasizes carefully designed statistical descriptors such as covariance structure, PSD, and differential entropy; the other emphasizes representation learning with attention-based neural architectures. Rather than replacing one with the other, EEG-RCformer combines them. Handcrafted multi-band descriptors provide stable low-level cues for short-window EEG analysis, while the Transformer focuses on modeling interactions between learned regional tokens. This combination is especially suitable for relatively small EEG datasets, where purely end-to-end models may overfit and where prior geometric knowledge can substantially improve sample efficiency.
2. Materials and Methods
Figure 1 illustrates the architecture of the proposed EEG Riemannian geometry-informed channel clustering Transformer (EEG-RCformer). As shown in the top row, multi-channel EEG signals are first segmented and processed to extract PSD and DE features. These feature vectors are used to compute channel-wise covariance matrices, which are then fed into the channel cluster module. Finally, the pooled tokens are processed by a Transformer and MLP classifier to obtain the classification results. The bottom row details the channel cluster module: initial clustering is based purely on spatial locations, while functional centers are computed via farthest-point sampling on the feature manifold. These two sets of centers are combined via a smoothing coefficient to produce final centers. Channels are re-clustered based on these updated centers, and the pooled regions serve as tokens for the Transformer.
2.1. Spatiotemporal Feature Extraction
The first stage of the proposed method extracts handcrafted spatiotemporal descriptors from raw EEG signals. In our implementation, each trial is sampled at 250 Hz and segmented by a sliding window of 250 samples with a stride of 50 samples, corresponding to a 1.0 s window, a 0.2 s hop size, and an 80% overlap. Overlapping windows increase the temporal sampling density of short-lived affective or cognitive dynamics and reduce information loss at window boundaries; without overlap, the temporal sequence becomes coarser and substantially fewer segments are available for the same trial duration. This setting also improves the continuity of the token sequence seen by the Transformer, because adjacent windows share sufficient context while still capturing local temporal change. In other words, the overlap is not introduced merely for data augmentation; it also stabilizes the downstream covariance estimation and makes the extracted spectral descriptors less sensitive to abrupt segmentation boundaries.
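The windowing arithmetic above can be checked with a short sketch. This is only an illustration under stated assumptions: the helper name `sliding_windows`, the 10 s trial length, and the choice to drop an incomplete tail window are hypothetical and not taken from the authors' implementation.

```python
import numpy as np

def sliding_windows(signal, win_len=250, stride=50):
    """Split a 1-D signal into overlapping windows; an incomplete tail is dropped."""
    n_windows = 1 + (len(signal) - win_len) // stride
    starts = stride * np.arange(n_windows)
    idx = starts[:, None] + np.arange(win_len)[None, :]
    return signal[idx]                      # shape: (n_windows, win_len)

fs = 250                                    # sampling rate in Hz (from the text)
trial = np.arange(10 * fs, dtype=float)     # hypothetical 10 s single-channel trial
windows = sliding_windows(trial)
overlap = 1 - 50 / 250                      # fraction of samples shared by adjacent windows
n_without_overlap = len(trial) // 250       # window count if stride equalled window length
print(windows.shape, overlap, n_without_overlap)
```

For this hypothetical 10 s trial, the 80% overlap yields 46 windows instead of the 10 obtained with non-overlapping segmentation, which is the denser temporal sampling the paragraph refers to.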
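For each channel and window, a Hann window is first applied and the real-valued fast Fourier transform (rFFT) is then computed. Let $x_{c,w}$ denote the $w$-th window from channel $c$, where $x_{c,w} \in \mathbb{R}^{250}$. The power spectrum is computed as
$$S_{c,w}(f) = \left| \mathrm{rFFT}\left(h \odot x_{c,w}\right)(f) \right|^{2},$$
where $h$ denotes the Hann window. The band-wise PSD feature for band $b$ is obtained by averaging spectral power over the corresponding frequency-bin set $\mathcal{F}_b$:
$$\mathrm{PSD}_{c,w,b} = \frac{1}{|\mathcal{F}_b|} \sum_{f \in \mathcal{F}_b} S_{c,w}(f).$$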
We consider seven frequency bands, namely delta (1–4 Hz), theta (4–8 Hz), alpha (8–12 Hz), beta$_1$ (12–16 Hz), beta$_2$ (16–20 Hz), gamma$_1$ (20–30 Hz), and gamma$_2$ (30–45 Hz). Differential entropy (DE) is then computed from the band power as
$$\mathrm{DE}_{c,w,b} = \log\left(\mathrm{PSD}_{c,w,b} + \varepsilon\right),$$
where $\varepsilon$ is used for numerical stability. Finally, PSD and DE are independently normalized by z-score over the window and band dimensions of each channel and concatenated to form the feature tensor $X \in \mathbb{R}^{C \times N_w \times F}$ for each EEG trial, where $C$ denotes the number of channels, $N_w$ the number of temporal windows, and $F$ the feature dimension. In the default PSD+DE setting, $F = 2 \times 7 = 14$.
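The feature pipeline of this subsection can be sketched for a single channel as follows. This is a minimal illustration, not the authors' code: the function name `band_features`, the half-open band edges, the value of `eps`, the absence of any spectral normalization constant, and the use of the log of band power as the DE estimate are all assumptions made for the example.

```python
import numpy as np

# Band edges in Hz as listed in the text; half-open intervals [lo, hi) are an assumption.
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 12), "beta1": (12, 16),
         "beta2": (16, 20), "gamma1": (20, 30), "gamma2": (30, 45)}

def band_features(windows, fs=250, eps=1e-8):
    """windows: (n_windows, win_len) for one channel -> (n_windows, 14) PSD+DE features."""
    win_len = windows.shape[1]
    spectrum = np.fft.rfft(windows * np.hanning(win_len), axis=1)  # Hann, then rFFT
    power = np.abs(spectrum) ** 2
    freqs = np.fft.rfftfreq(win_len, d=1.0 / fs)
    # Average spectral power over the bins belonging to each band.
    psd = np.stack([power[:, (freqs >= lo) & (freqs < hi)].mean(axis=1)
                    for lo, hi in BANDS.values()], axis=1)
    de = np.log(psd + eps)                  # assumed DE form; eps (hypothetical value) avoids log(0)
    # z-score jointly over the window and band dimensions of this channel.
    zscore = lambda a: (a - a.mean()) / (a.std() + eps)
    return np.concatenate([zscore(psd), zscore(de)], axis=1)

rng = np.random.default_rng(0)
windows = rng.standard_normal((46, 250))    # stand-in for one channel's windows
feats = band_features(windows)
print(feats.shape)
```

With a 250-sample window at 250 Hz the frequency resolution is 1 Hz, so every band contains at least three bins; a shorter window would need a check for empty bands.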
2.2. Hybrid Clustering with AIRM-Informed Centroid Proposals
Motivation. Existing geometry-aware channel grouping methods often overlook trial-level dynamics, whereas purely feature-driven clustering may lack spatial stability across trials and subjects. To address these limitations, we integrate a stable spatial prior with trial-adaptive functional information.
Let $\Sigma_c$ denote the symmetric positive definite (SPD) covariance matrix for channel $c$, computed from the windowed features. Channel similarity is quantified using the affine-invariant Riemannian metric (AIRM):
$$d_{\mathrm{AIRM}}(\Sigma_i, \Sigma_j) = \left\| \log\left( \Sigma_i^{-1/2} \, \Sigma_j \, \Sigma_i^{-1/2} \right) \right\|_F,$$
where $\log(\cdot)$ is the matrix logarithm and $\|\cdot\|_F$ is the Frobenius norm.
The proposed hybrid strategy maintains $K$ stable spatial centers that are updated by an exponential moving average (EMA) using batch-level centroid proposals derived from the current input features. Importantly, in the implementation, the cluster capacity is not an independently tuned hyperparameter. Instead, the $C$ channels are evenly distributed across $K$ clusters, so that each cluster size is either $\lfloor C/K \rfloor$ or $\lceil C/K \rceil$. Accordingly, the cluster capacity should be understood as a derived balancing constraint rather than an additional free parameter.
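The AIRM distance between two channel covariance descriptors can be computed with eigendecompositions alone. The sketch below is illustrative: the helper names, the ridge term `shrink` that keeps the estimated covariance strictly positive definite, and the random stand-in features are assumptions, not details from the paper.

```python
import numpy as np

def channel_covariance(feats, shrink=1e-3):
    """feats: (n_windows, F) for one channel -> (F, F) SPD covariance.
    The ridge `shrink` is a hypothetical regularizer, not taken from the paper."""
    cov = np.cov(feats, rowvar=False)
    return cov + shrink * np.eye(cov.shape[0])

def airm_distance(A, B):
    """Affine-invariant Riemannian distance between SPD matrices A and B."""
    w, V = np.linalg.eigh(A)
    A_inv_sqrt = (V / np.sqrt(w)) @ V.T          # A^{-1/2} from the eigendecomposition
    M = A_inv_sqrt @ B @ A_inv_sqrt
    M = 0.5 * (M + M.T)                          # remove round-off asymmetry
    lam = np.linalg.eigvalsh(M)
    # Frobenius norm of log(M) equals the l2 norm of the log-eigenvalues.
    return np.sqrt(np.sum(np.log(lam) ** 2))

rng = np.random.default_rng(0)
cov_a = channel_covariance(rng.standard_normal((46, 14)))
cov_b = channel_covariance(rng.standard_normal((46, 14)))
print(airm_distance(cov_a, cov_b))
```

Because the distance depends only on the eigenvalues of $\Sigma_i^{-1/2} \Sigma_j \Sigma_i^{-1/2}$, it is symmetric in its arguments and unchanged when both matrices undergo the same invertible linear transformation, which is the property that motivates its use under cross-subject variability.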
Methodology. For each mini-batch, we extract per-channel windowed features and use known three-dimensional sensor coordinates $P \in \mathbb{R}^{C \times 3}$. The method maintains $K$ stable spatial centers through the following steps.
1. Stable center initialization. We initialize K stable spatial centers by farthest-point sampling in the electrode coordinate space so that the initial partition covers the scalp as evenly as possible.
2. Batch-specific centroid proposals. For each sample in the mini-batch, temporary centroids are identified to capture transient functionally correlated activations. This is achieved by computing inter-channel distances using either the AIRM on the SPD covariance matrices of windowed features or, for comparison, Euclidean distances on flattened features, followed by farthest-point sampling.
3. Geometry-aware matching and update. The temporary assignments from all samples in the mini-batch are aggregated to obtain batch-averaged centroid proposals, which are then matched to the closest stable centers in the electrode coordinate space. The stable centers are updated by
$$\mu_k \leftarrow (1 - \alpha)\, \mu_k + \alpha\, \hat{\mu}_k,$$
where $\hat{\mu}_k$ denotes the matched batch-averaged proposal and $\alpha$ is an EMA smoothing coefficient.
4. Balanced spatial assignment. All channels are assigned to the updated centers according to spatial proximity while enforcing balanced capacities $n_k \in \{\lfloor C/K \rfloor, \lceil C/K \rceil\}$ so that the resulting partition remains spatially contiguous and numerically balanced. The obtained channel groups define the tokens for the subsequent Transformer encoder. The overall hybrid clustering procedure is summarized in Algorithm 1.
Algorithm 1 EEG-RCformer Hybrid Clustering
Require: Full feature tensor $X$, electrode 3D positions $P$, number of clusters $K$, smoothing coefficient $\alpha$.
Ensure: Balanced channel assignments for token construction.
1: Initialize stable centers $\{\mu_k\}_{k=1}^{K}$ by farthest-point sampling on $P$.
2: for each mini-batch do
3:     Compute channel-wise SPD covariance matrices $\Sigma_c$ from the windowed features.
4:     For each sample, identify temporary functional centroids using farthest-point sampling with AIRM or Euclidean distance.
5:     Aggregate temporary assignments across the mini-batch to obtain batch-averaged centroid proposals $\{\hat{\mu}_j\}$.
6:     for $k = 1$ to $K$ do
7:         Match the nearest proposal in $\{\hat{\mu}_j\}$ to stable center $\mu_k$.
8:         Update the stable center by EMA: $\mu_k \leftarrow (1 - \alpha)\, \mu_k + \alpha\, \hat{\mu}_k$.
9:     end for
10:    Set target cluster sizes $n_k$ such that $n_k \in \{\lfloor C/K \rfloor, \lceil C/K \rceil\}$ and $\sum_{k=1}^{K} n_k = C$.
11:    Compute distances between all electrodes and the updated centers in 3D coordinate space.
12:    for each channel $i$ do
13:        Assign $i$ to the nearest cluster whose current size is smaller than its target size.
14:    end for
15: end for
16: return assignments $\{\mathcal{G}_k\}_{k=1}^{K}$.
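The main building blocks of Algorithm 1 (farthest-point sampling, EMA center update, and capacity-constrained assignment) can be sketched as below. This is a simplified single-sample illustration under several assumptions: the function names, `alpha = 0.1`, `K = 8`, the random electrode layout, the random stand-in for the functional (AIRM) distance matrix, the convention that `alpha` weights the new proposal, and the processing of channels in index order are all hypothetical.

```python
import numpy as np

def farthest_point_sampling(dist, k, start=0):
    """Pick k indices from an (n, n) distance matrix, each farthest from those already chosen."""
    chosen = [start]
    min_dist = dist[start].copy()
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, dist[nxt])
    return chosen

def ema_update(centers, proposals, alpha=0.1):
    """Match each stable center to its nearest proposal in 3D, then blend them.
    Here alpha weights the proposal (an assumed convention)."""
    d = np.linalg.norm(centers[:, None] - proposals[None], axis=-1)
    matched = proposals[d.argmin(axis=1)]
    return (1 - alpha) * centers + alpha * matched

def balanced_assign(pos, centers):
    """Assign each channel to the nearest cluster that still has spare capacity."""
    C, K = len(pos), len(centers)
    target = np.full(K, C // K)
    target[: C % K] += 1                       # sizes are floor(C/K) or ceil(C/K)
    d = np.linalg.norm(pos[:, None] - centers[None], axis=-1)
    sizes = np.zeros(K, dtype=int)
    labels = np.empty(C, dtype=int)
    for i in range(C):                         # channel order is an assumption
        for k in np.argsort(d[i]):
            if sizes[k] < target[k]:
                labels[i] = k
                sizes[k] += 1
                break
    return labels

rng = np.random.default_rng(0)
pos = rng.standard_normal((62, 3))
pos /= np.linalg.norm(pos, axis=1, keepdims=True)     # hypothetical electrodes on a unit sphere
K = 8                                                 # hypothetical number of clusters
spatial_dist = np.linalg.norm(pos[:, None] - pos[None], axis=-1)
centers = pos[farthest_point_sampling(spatial_dist, K)]
R = rng.random((62, 62))
func_dist = (R + R.T) / 2                             # random stand-in for an AIRM distance matrix
np.fill_diagonal(func_dist, 0.0)
proposals = pos[farthest_point_sampling(func_dist, K)]
centers = ema_update(centers, proposals)
labels = balanced_assign(pos, centers)
print(np.bincount(labels, minlength=K))
```

With 62 channels and 8 clusters, the capacity rule gives six clusters of 8 channels and two of 7. A greedy pass of this kind depends on the order in which channels are visited; an optimal-transport or Hungarian-style assignment would remove that dependence at higher cost.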
2.3. Cluster-Based Token Generation
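After clustering, we construct the input tokens for the Transformer. For each channel $c$, the corresponding slice $X_c \in \mathbb{R}^{N_w \times F}$ of the feature tensor $X$ is flattened and projected into a $d_{\text{model}}$-dimensional embedding space using a linear layer:
$$e_c = W_e \, \mathrm{vec}(X_c) + b_e,$$
where $W_e \in \mathbb{R}^{d_{\text{model}} \times N_w F}$ and $b_e \in \mathbb{R}^{d_{\text{model}}}$ are learnable parameters, yielding channel embeddings $E = [e_1, \dots, e_C]^{\top} \in \mathbb{R}^{C \times d_{\text{model}}}$.
We then aggregate the embeddings within each cluster by average pooling:
$$t_k = \frac{1}{|\mathcal{G}_k|} \sum_{c \in \mathcal{G}_k} e_c,$$
where $t_k$ denotes the token of the $k$-th cluster, $e_c$ denotes the channel embeddings, and $\mathcal{G}_k$ is the set of channel indices assigned to that cluster. This yields a token sequence $T = [t_1, \dots, t_K]^{\top} \in \mathbb{R}^{K \times d_{\text{model}}}$.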
Average pooling inside each cluster is chosen because it provides a simple and stable summary of all channels assigned to the same region-level token. In contrast to selecting a single representative electrode, averaging preserves shared spectral trends within a region and reduces the influence of noisy or idiosyncratic channels. Since the preceding clustering stage already enforces balanced cluster sizes and spatial plausibility, the pooled token can be interpreted as a compact representation of a functionally coherent regional pattern rather than a purely anatomical average.
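Token construction reduces to a linear projection followed by a per-cluster mean. The sketch below illustrates this with random data; the function name `cluster_tokens`, the shapes, the round-robin cluster labels, and the values `K = 8` and `d_model = 64` are hypothetical choices for the example only.

```python
import numpy as np

def cluster_tokens(X, labels, W_e, b_e, K):
    """X: (C, N_w, F) features; labels: (C,) cluster index per channel -> (K, d_model) tokens."""
    C = X.shape[0]
    E = X.reshape(C, -1) @ W_e.T + b_e        # flatten each channel, then linear projection
    # Average the embeddings of all channels assigned to the same cluster.
    return np.stack([E[labels == k].mean(axis=0) for k in range(K)])

rng = np.random.default_rng(0)
C, N_w, F, K, d_model = 62, 46, 14, 8, 64     # K and d_model are hypothetical values
X = rng.standard_normal((C, N_w, F))
labels = np.arange(C) % K                     # stand-in for the balanced assignment
W_e = rng.standard_normal((d_model, N_w * F)) * 0.01
b_e = np.zeros(d_model)
tokens = cluster_tokens(X, labels, W_e, b_e, K)
print(tokens.shape)
```

The mean is well defined here because balanced assignment leaves no cluster empty whenever the number of channels is at least the number of clusters.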
2.4. Cognitive State Classification with a Transformer Encoder
The final stage employs a standard Transformer encoder to model relationships among the cluster tokens. A learnable positional encoding $E_{\text{pos}} \in \mathbb{R}^{K \times d_{\text{model}}}$ is added to the token sequence:
$$Z^{(0)} = T + E_{\text{pos}}.$$
The embedded sequence is then processed by a stack of $L$ identical Transformer encoder layers. Each layer consists of a multi-head self-attention (MHSA) module and a position-wise feed-forward network (FFN), together with residual connections and layer normalization (LN):
$$\tilde{Z}^{(l)} = \mathrm{LN}\left(Z^{(l-1)} + \mathrm{MHSA}\left(Z^{(l-1)}\right)\right), \qquad Z^{(l)} = \mathrm{LN}\left(\tilde{Z}^{(l)} + \mathrm{FFN}\left(\tilde{Z}^{(l)}\right)\right), \qquad l = 1, \dots, L.$$
After the final Transformer layer, global max pooling is applied across the encoded token sequence to obtain a trial-level representation:
$$z = \max_{1 \le k \le K} Z^{(L)}_{k},$$
which is then passed to the final classifier. In the default implementation, the Transformer uses an embedding dimension of $d_{\text{model}}$, 4 attention heads, $L = 3$ encoder layers, a feed-forward dimension of 256, and dropout of 0.5. When clustering is enabled, the number of tokens equals the number of clusters $K$.
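The forward pass of the encoder over cluster tokens can be sketched in plain NumPy. This is a didactic sketch, not the authors' model: it assumes a post-normalization layer layout, omits biases, learnable LN parameters, and dropout, uses random weights, and takes `n_tok = 8` and `d = 64` as hypothetical values (the 4 heads, 3 layers, and feed-forward width of 256 follow the text).

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)      # no learnable gain/bias in this sketch

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def mhsa(Z, Wq, Wk, Wv, Wo, n_heads):
    n_tok, d = Z.shape
    dh = d // n_heads
    split = lambda M: (Z @ M).reshape(n_tok, n_heads, dh).transpose(1, 0, 2)  # (h, n_tok, dh)
    Q, Kh, V = split(Wq), split(Wk), split(Wv)
    attn = softmax(Q @ Kh.transpose(0, 2, 1) / np.sqrt(dh))                   # (h, n_tok, n_tok)
    out = (attn @ V).transpose(1, 0, 2).reshape(n_tok, d)                     # concatenate heads
    return out @ Wo

def encoder_layer(Z, p, n_heads=4):
    Z = layer_norm(Z + mhsa(Z, p["Wq"], p["Wk"], p["Wv"], p["Wo"], n_heads))  # residual + LN
    ffn = np.maximum(Z @ p["W1"], 0.0) @ p["W2"]                              # ReLU feed-forward
    return layer_norm(Z + ffn)

rng = np.random.default_rng(0)
n_tok, d, d_ff, n_layers = 8, 64, 256, 3      # n_tok and d are hypothetical values
init = lambda shape: rng.standard_normal(shape) / np.sqrt(shape[0])
layers = [{"Wq": init((d, d)), "Wk": init((d, d)), "Wv": init((d, d)), "Wo": init((d, d)),
           "W1": init((d, d_ff)), "W2": init((d_ff, d))} for _ in range(n_layers)]
T = rng.standard_normal((n_tok, d))           # stand-in cluster tokens
E_pos = rng.standard_normal((n_tok, d)) * 0.02
Z = T + E_pos                                 # add positional encoding
for p in layers:
    Z = encoder_layer(Z, p, n_heads=4)
z = Z.max(axis=0)                             # global max pooling over tokens
print(z.shape)
```

Because attention operates on the cluster tokens rather than on all channels, each attention map has as many rows and columns as there are clusters, independent of the montage size.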
2.5. Datasets, Preprocessing, and Experimental Setup
The proposed model is evaluated on two public EEG datasets, MODMA [22] and SEED [23]. MODMA is used here as a binary classification task, where the two classes correspond to major depressive disorder (MDD) and normal control (NC), and it contains 128-channel resting-state EEG recordings from 53 subjects. SEED is used as a three-class emotion recognition task, where the labels are positive, neutral, and negative emotions; the dataset contains 62-channel EEG recordings collected from 15 subjects during film-clip-based emotion elicitation.
For both datasets, the EEG signals are band-pass filtered between 0.5 and 50 Hz. In the implementation, feature extraction is performed at a sampling rate of 250 Hz with a window length of 250 samples and a stride of 50 samples. Therefore, each window spans 1.0 s and adjacent windows overlap by 80%. PSD and DE features are then extracted from seven standard frequency bands as described in Section 2.1.
We considered two evaluation paradigms. In the subject-dependent setting, 5-fold cross-validation was performed within each subject; in the subject-independent setting, stratified cross-validation with five held-out subjects was used for MODMA, whereas a leave-one-subject-out (LOSO) protocol was adopted for SEED. Model performance was primarily evaluated using the area under the receiver operating characteristic curve (AUC). Unless otherwise stated, the default model configuration uses $K$ clusters, embedding dimension $d_{\text{model}}$, 4 attention heads, 3 Transformer encoder layers, and a feed-forward dimension of 256. All experiments were implemented on a server equipped with two NVIDIA GeForce RTX 3090 graphics processing units (GPUs) running Ubuntu 24.04.
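The LOSO protocol and the AUC metric can be illustrated with a small sketch. The helper names, the toy subject layout (15 subjects with 4 trials each), and the toy scores are assumptions for the example; the AUC helper covers only the binary case, so a three-class task such as SEED would additionally need a one-vs-rest or similar averaging scheme that is not shown here.

```python
import numpy as np

def loso_splits(subject_ids):
    """Yield (held-out subject, train indices, test indices) for leave-one-subject-out."""
    subject_ids = np.asarray(subject_ids)
    for s in np.unique(subject_ids):
        test = subject_ids == s
        yield s, np.where(~test)[0], np.where(test)[0]

def binary_auc(y_true, scores):
    """AUC as P(score_pos > score_neg) + 0.5 * P(tie) over all positive/negative pairs."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

subject_ids = np.repeat(np.arange(15), 4)     # hypothetical: 15 subjects, 4 trials each
folds = list(loso_splits(subject_ids))
print(len(folds), len(folds[0][1]), len(folds[0][2]))

auc = binary_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])   # toy labels and scores
print(auc)
```

Splitting by subject rather than by trial ensures that no recording from the held-out subject contributes to training, which is what makes the evaluation subject-independent.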
The two datasets are deliberately complementary. MODMA evaluates whether the model can distinguish a clinically relevant resting-state condition with substantial cross-subject heterogeneity, while SEED tests whether the same architecture can capture stimulus-induced emotional states with richer category structure. Therefore, using both datasets allows us to assess whether the proposed clustering-and-Transformer design is only beneficial for one specific application or whether it transfers across distinct EEG decoding scenarios.
4. Discussion
4.1. Why Geometry-Aware Clustering Benefits EEG Decoding
The observed performance differences of EEG-RCformer may be interpreted from the perspectives of both signal geometry and model efficiency. EEG channel interactions are often redundant, while covariance descriptors computed from multi-band windowed features are SPD matrices that do not naturally reside in Euclidean vector space. Prior Riemannian EEG studies suggest that manifold-aware distances can preserve the intrinsic structure of covariance representations more faithfully than direct vectorization can [16, 18]. In our framework, this property is used to propose functionally informative centroids, while the final balanced assignment remains spatially constrained. This hybrid design may be beneficial for EEG because a purely functional clustering strategy can be unstable across trials, whereas a purely anatomical partition may overlook transient co-activation patterns. From this perspective, EEG-RCformer can be viewed as producing tokens that are compact, spatially interpretable, and adaptive to the current input. This may be particularly relevant in subject-independent decoding, where direct channel-wise comparison across individuals is affected by inter-subject variability and distribution shift [6, 7].
Another possible advantage of the proposed design is that it separates two roles that are often intertwined in EEG modeling. One role is to determine which channels should be summarized together so that the Transformer does not devote excessive capacity to redundant or noisy fine-grained tokens; the other is to evaluate similarity between channels in a way that respects feature geometry. In EEG-RCformer, these roles are handled by coordinated but distinct mechanisms: spatially balanced assignment controls token compactness and interpretability, whereas AIRM is used to propose functionally informative centroids. This separation may partly explain why the method remains computationally efficient while showing better mean performance than the Euclidean-only alternative.
4.2. Interpretation of Frequency-Band and Regional Effects
The band-importance and clustering results require task-specific interpretation. Existing EEG emotion and cognitive-state studies do not support a single universally dominant frequency band for all tasks, datasets, or label spaces. Instead, informative patterns are often distributed across alpha-, beta-, and gamma-related activities, with their relative importance depending on the experimental paradigm and the definition of categories. Accordingly, our model does not assume a one-band-one-state mapping. Instead, it retains seven frequency bands and jointly models their PSD and DE representations, allowing the Transformer to learn task-specific cross-band interactions from the data. At the regional level, the learned clusters should also be viewed as data-driven functional summaries rather than direct physiological connectivity estimates between electrodes. They reflect how the model organizes channels to improve decoding performance under a balanced spatial prior.
A high importance assigned to a frequency band indicates its contribution to the current classification task, but it does not imply that the band is the sole neurophysiological substrate of the underlying mental state. Likewise, the learned clusters summarize how the network compresses and routes EEG information under the imposed spatial constraints, rather than constituting a direct parcellation of brain function. These visualization results provide an interpretable account of model behavior, but they do not replace dedicated neuroscience experiments.
4.3. Limitations and Future Work
This study still has several limitations. First, the current evaluation is restricted to two public datasets, namely MODMA and SEED. Although these datasets cover both clinical-state decoding and emotion recognition, they do not fully represent the diversity of EEG paradigms, recording montages, and acquisition conditions encountered in practice. Second, the current visualizations reflect learned token relations and cluster evolution, but they should not be interpreted as explicit physiological connectivity analyses. A more direct connectivity study, for example, using coherence- or graph-based measures, would provide stronger neuroscientific interpretation and will be investigated in future work. Third, the present framework relies on handcrafted PSD/DE features; an end-to-end variant that learns multiscale spectral-spatial representations directly from raw EEG may further improve generalization. Future work will therefore validate the framework on more diverse EEG tasks and datasets, strengthen visualization with explicit connectivity analysis, and explore end-to-end geometry-aware feature learning.
A further limitation is that the present experiments focus on classification performance and computational efficiency, whereas other practically relevant properties such as calibration time, robustness to missing channels, and sensitivity to electrode montage changes were not evaluated in this study. These issues are highly relevant for real-world BCI deployment and may interact with the proposed clustering strategy in meaningful ways. For example, region-level tokens may prove advantageous when some channels are noisy or unavailable, but this hypothesis requires dedicated testing. We leave these questions to future work.