1. Introduction
Neurological disorders, including Alzheimer’s disease (AD) and related dementias as well as cerebrovascular diseases such as stroke, represent a growing global health burden. The World Health Organization estimates that dementia affects about 55 million people worldwide, and stroke is the second leading cause of death and a major cause of long-term disability [1]. Distinguishing these conditions accurately and promptly, both from each other and from healthy aging, is of great clinical importance, but it remains a significant diagnostic problem because symptoms overlap and standardized clinical tests have limitations [2,3].
Electroencephalography (EEG) has proven to be an especially viable neuro-recording modality for this purpose. It has several practical advantages over other imaging methods: it is non-invasive, widely available, and inexpensive, and it records the rapid dynamics of neural activity at millisecond time resolution [2,4]. More importantly, EEG signals reflect the changes in signal complexity and regularity that accompany neurodegeneration and ischemic injury.
Entropy-based and nonlinear complexity measures are effective descriptors of these changes and are thought to reflect alterations in the underlying mechanisms of neural information processing [5]. Many EEG complexity measures have been used as features for neurological classification, including Sample Entropy, Approximate Entropy, Permutation Entropy, Dispersion Entropy, Fuzzy Entropy, and Symbolic Dynamics [6,7,8]. These measures capture different aspects of signal irregularity and complexity. When they are computed on all standard EEG channels and averaged for each subject, they form compact but highly informative tabular feature vectors, with one scalar per channel, rather than time-series data [9].
Recent attention-based models, especially Transformers, have shown strong performance in learning complex dependencies in sequential and tabular data [10]. Several Transformer variants have been adapted for EEG analysis, such as the Bilinear Attention Feature Temporal Convolutional Network (BAFTCNet) [11] and the Deep Brain Transformer Spatial-Projection Net (DBTSPNet) [12]. These architectures treat EEG channels as tokens and use self-attention to model inter-channel interactions. However, their use on tabular entropy features and their performance under severe data scarcity have not been well studied.
This study addresses these gaps with the following contributions:
We propose the Channel-wise Transformer and Attention-Branch Network (EEG-ChTABNet), a dual-branch Transformer architecture designed for small-sample EEG entropy feature classification that features a Channel Importance Attention (CIA) block, a gated fusion mechanism, and a bottleneck residual classification head.
We systematically compare the proposed model with a standard Transformer baseline (BAFTCNet/DBTSPNet-inspired) on eight different sets of EEG entropy and complexity features.
We provide a rigorous evaluation framework that combines stratified 5-fold cross-validation with within-fold SMOTE augmentation and a full set of metrics: accuracy, sensitivity, specificity, precision, and AUC.
We identify which entropy feature set provides the most discriminative representation for the three-class neurological classification task.
2. Materials and Methods
The proposed framework for automated EEG-based neurological classification follows a structured multi-stage pipeline, as illustrated in Figure 1. The pipeline operates sequentially across four principal stages: (1) EEG signal acquisition; (2) preprocessing; (3) entropy-based feature extraction; and (4) deep learning-based classification using the proposed EEG-ChTABNet architecture.
2.1. EEG Signal Acquisition and Preprocessing
EEG data were collected from 45 individuals (15 with dementia, 15 with stroke, and 15 healthy controls) using a conventional 19-channel cap that follows the international 10–20 electrode placement system. EEG activity was recorded with the Nicolet One (V32) system during an auditory working memory task [25]. Participants were enrolled at Pusat Perubatan Universiti Kebangsaan Malaysia in accordance with study protocols approved by the Human Ethics Committee, and all participants signed an informed consent form.
A multi-step preprocessing pipeline was applied before feature extraction to preserve signal quality and remove physiological and environmental artifacts.
First, a bandpass finite impulse response (FIR) filter was used to retain clinically relevant brain oscillatory activity in the 0.5 to 64 Hz frequency range. This filter suppressed DC drift, slow baseline variations, and high-frequency noise such as muscle artifacts. A notch filter at 50 Hz was then used to remove power-line interference.
Next, Independent Component Analysis combined with the wavelet transform (ICA-WT) was used to identify and remove ocular and cardiac artifacts and to denoise the signal. The continuous EEG recordings were then segmented into non-overlapping 5 s epochs using fixed-length windowing. Each channel was processed separately, and the preprocessed signals from all 19 channels served as input to the feature extraction stage.
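The epoching step can be illustrated with a minimal sketch. The sampling rate is not stated in this section, so the 256 Hz used below is purely an assumed value, and `split_into_epochs` is a hypothetical helper name rather than part of the authors' pipeline.

```python
def split_into_epochs(signal, fs_hz, epoch_s=5.0):
    """Split one channel into non-overlapping fixed-length epochs.

    Trailing samples that do not fill a whole epoch are dropped.
    """
    n = int(round(fs_hz * epoch_s))  # samples per epoch
    if n <= 0:
        raise ValueError("epoch length must cover at least one sample")
    n_epochs = len(signal) // n
    return [signal[i * n:(i + 1) * n] for i in range(n_epochs)]


# Hypothetical example: 12 s of one channel at an ASSUMED 256 Hz.
fs_hz = 256
channel = [0.0] * (12 * fs_hz)
epochs = split_into_epochs(channel, fs_hz)
print(len(epochs), len(epochs[0]))  # 2 epochs of 1280 samples; 2 s left over
```

Dropping the incomplete trailing window is one possible convention; zero-padding it would be another.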
2.2. Entropy-Based Feature Extraction
Eight time-domain entropy and nonlinear complexity measures were independently extracted from each of the 19 EEG channels for every subject, yielding eight separate 19-dimensional feature vectors per subject. These measures were selected to capture complementary facets of EEG signal regularity, distributional structure, and temporal dynamics that are known to be altered in neurological conditions. The eight features are as follows:
Amplitude-Aware Permutation Entropy (AAPE): An extension of standard permutation entropy that incorporates amplitude information into the ordinal pattern analysis, providing sensitivity to both the rank ordering and the magnitude of EEG fluctuations [26].
Attention Entropy (AttEn): A measure of the entropy of attention-weighted signal distributions, capturing the concentration or dispersion of signal energy across temporal patterns [27].
Dispersion Entropy (DisEn): Quantifies the diversity of amplitude dispersion patterns by mapping the signal into a set of symbolic classes and computing the entropy of the resulting pattern distribution [8,28,29].
Distribution Entropy (DistrEn): Measures the entropy of the empirical probability distribution of pairwise sample distances, providing a distributional characterization of signal complexity [30].
Fluctuation-based Dispersion Entropy (FDispEn): A refined variant of DisEn that incorporates local fluctuation information, improving sensitivity to transient and non-stationary signal components [8].
Fuzzy Entropy (FuzEn): Employs fuzzy membership functions to assess the regularity of template matching within the EEG signal, offering improved statistical stability relative to sample entropy [31,32].
Linear Gaussian Estimation of the Conditional Entropy (LinEn): Computes the entropy of the logarithmically transformed energy distribution of the EEG signal, capturing spectral energy spreading across temporal windows [33].
Symbolic Dynamics (SyDy): Encodes the EEG time series into a symbolic alphabet and analyzes the statistical structure of the resulting symbol sequences, reflecting the macrostructural temporal patterning of neural dynamics [27,34,35].
Each feature set was evaluated independently throughout the experimental pipeline, enabling a systematic comparison of the discriminative power of each entropy measure for the three-class neurological classification task.
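As an illustration of how one of these measures reduces a channel to a single scalar, the following is a minimal sketch of normalized Dispersion Entropy in its commonly described form (normal-CDF mapping to `c` classes, patterns of length `m`). The parameter values `m=2`, `c=6`, and `delay=1` are assumptions for illustration only; the settings used in this study are not given here.

```python
import math
import random
from collections import Counter


def dispersion_entropy(x, m=2, c=6, delay=1):
    """Normalized dispersion entropy of a 1-D sequence (value in [0, 1])."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    if std == 0:
        return 0.0  # constant signal: a single pattern, zero entropy
    # Map each sample to (0, 1) with the normal CDF, then to classes 1..c.
    classes = []
    for v in x:
        y = 0.5 * (1.0 + math.erf((v - mean) / (std * math.sqrt(2.0))))
        classes.append(min(c, max(1, int(round(c * y + 0.5)))))
    # Count the dispersion patterns formed by m delayed class labels.
    n_patterns = n - (m - 1) * delay
    counts = Counter(
        tuple(classes[i + j * delay] for j in range(m))
        for i in range(n_patterns)
    )
    entropy = -sum(
        (k / n_patterns) * math.log(k / n_patterns) for k in counts.values()
    )
    return entropy / math.log(c ** m)  # normalize by the maximum, ln(c^m)


# Synthetic signals (not EEG): a regular sine versus white Gaussian noise.
random.seed(42)
noise = [random.gauss(0.0, 1.0) for _ in range(1000)]
sine = [math.sin(2.0 * math.pi * i / 50.0) for i in range(1000)]
print(dispersion_entropy(sine), dispersion_entropy(noise))
```

The regular sine visits far fewer patterns than the noise, so it receives the lower value. Applying such a function to each of the 19 channels would give one 19-dimensional vector per subject.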
2.3. Deep Learning-Based Classification
This work presents a novel dual-branch Channel-wise Transformer and Attention-Branch Network (EEG-ChTABNet) for classifying EEG entropy features into three classes: dementia, stroke, and healthy control.
2.3.1. Proposed Model: EEG-ChTABNet
The proposed Channel-wise Transformer and Attention-Branch Network (EEG-ChTABNet) architecture is illustrated in Figure 2. It processes the 19-dimensional entropy feature vector through a sequential pipeline of four novel components: a CIA block, a dual-branch encoder, a gated fusion module, and a bottleneck residual classification head. The diagram is arranged from top to bottom and contains the following functional blocks.
Channel Importance Attention (CIA) Block: Standard Transformer encoders treat all input tokens equally at the input stage. In 19-channel EEG entropy features, however, different electrode locations carry different amounts of diagnostic information depending on the specific neurological condition. Frontal electrodes (Fp1, Fp2, F3, and F4) are particularly informative for dementia, which preferentially affects the prefrontal cortex, while stroke-related EEG changes tend to be lateralized or involve motor-sensory regions [14]. The CIA block addresses this by applying a squeeze-excitation mechanism [36] along the channel dimension. Given the input vector $\mathbf{x} \in \mathbb{R}^{19}$, the CIA block computes per-channel importance weights through a two-layer bottleneck network followed by a sigmoid activation, as in Equation (1):
$$\mathbf{w} = \sigma\left(\mathbf{W}_2\,\delta\left(\mathbf{W}_1\mathbf{x}\right)\right) \quad (1)$$
where $\mathbf{W}_1$ and $\mathbf{W}_2$ are learned weight matrices, $\delta$ is the hidden-layer nonlinearity, $\sigma$ is the sigmoid activation, and the reduction ratio is set to 4. The output feature vector is $\tilde{\mathbf{x}} = \mathbf{w} \odot \mathbf{x}$, an element-wise scaled feature vector with adaptively emphasized channels. Thus, $\tilde{\mathbf{x}}$ emphasizes the channels that carry the most discriminative neurological information while suppressing contributions from less informative electrodes, effectively performing soft, learned electrode selection without requiring a priori anatomical assumptions.
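The channel re-weighting described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the hidden activation is taken to be ReLU (as in the original squeeze-excitation design), the bottleneck width is taken to be floor(19/4) = 4, biases are omitted, and the weights are random rather than trained.

```python
import math
import random


def cia_forward(x, w1, w2):
    """Squeeze-excitation style re-weighting of a per-channel feature vector.

    x  : list of C channel features
    w1 : h x C compression matrix, w2 : C x h expansion matrix
    Returns (weights, scaled) with scaled[i] = weights[i] * x[i].
    """
    # Compression layer with ReLU (ASSUMED activation).
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
    # Expansion layer followed by a sigmoid, one weight per channel.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    weights = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    scaled = [a * v for a, v in zip(weights, x)]
    return weights, scaled


random.seed(42)
C, r = 19, 4
h = C // r  # ASSUMED bottleneck width: floor(19 / 4) = 4
w1 = [[random.uniform(-0.5, 0.5) for _ in range(C)] for _ in range(h)]
w2 = [[random.uniform(-0.5, 0.5) for _ in range(h)] for _ in range(C)]
x = [random.random() for _ in range(C)]  # stand-in for one entropy vector
weights, scaled = cia_forward(x, w1, w2)
print(len(weights), min(weights), max(weights))
```

Because every weight lies strictly between 0 and 1, the block can attenuate a channel but never flip its sign.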
To obtain the learned CIA weights for interpretability analysis, we pass the complete validation set through the CIA block of each trained fold model (5 folds × 8 feature sets = 40 model instances) and record the sigmoid-activated channel weight vector for each subject. These weight vectors are then grouped by ground-truth class label and averaged within each class. This yields three class-conditional mean importance vectors, one each for dementia, stroke, and control, with one scalar weight per EEG channel.
The class-conditional vectors were also averaged over all five folds to obtain stable, cross-validated estimates of the channel importance profiles for each class. The 19 scalar weights are then mapped onto the standard 10–20 electrode coordinate system and rendered as continuous topographic scalp maps using spherical spline interpolation. All maps share a common color scale that spans the global minimum to the global maximum CIA weight observed across the three classes. This shared scale allows direct comparison between classes and exposes differences in emphasis magnitude that class-specific scaling would hide.
Dual-Branch Encoder: Following the CIA block, the re-weighted feature vector is processed by two parallel branches that capture complementary aspects of the inter-channel structure:
Branch 1: Global Transformer Branch: Each channel value is independently projected into a D-dimensional embedding space (D = 64) via a learned linear transformation. A learnable classification (CLS) token is prepended, forming a sequence of 20 tokens (1 CLS + 19 channel tokens). Unlike the baseline model, which uses fixed sinusoidal positional encoding, this branch uses learnable positional embeddings, which allow the model to capture electrode positional structure if it is present in the training data; verifying this in detail is outside the scope of this work. These positional embeddings can be interpreted as encoding EEG electrode layout relationships and, because they are trained end-to-end, they can adapt to the spatial structure of the specific classification problem. A pre-norm Transformer encoder with 2 layers, 4 attention heads, and a feedforward dimension of 256 is then applied. The CLS token output serves as the global feature representation $\mathbf{g}$, capturing long-range inter-electrode dependencies.
Branch 2: Local Depthwise-Separable CNN Branch: In parallel, the CIA-weighted channel values are projected to D dimensions and reshaped to (B, D, 19), where B is the batch size, so that the electrode axis acts as the sequence length and the embedding dimension acts as the convolution channel dimension. Two consecutive depthwise-separable 1D convolutional layers (kernel sizes 3 and 5, respectively) are applied along the electrode sequence dimension. Depthwise-separable convolutions are used instead of standard convolutions to reduce the parameter count while maintaining representational capacity. Global average pooling followed by LayerNorm yields the local feature $\mathbf{l}$, capturing local adjacency patterns between neighboring electrodes in the 10–20 montage sequence.
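The parameter saving from depthwise-separable convolution can be made concrete with a small count. The sketch below assumes D = 64 input and output channels for both layers and ignores biases; the exact layer configuration of the model may differ.

```python
def conv1d_params(c_in, c_out, k):
    """Weight count of a standard 1-D convolution (bias ignored)."""
    return c_in * c_out * k


def ds_conv1d_params(c_in, c_out, k):
    """Weight count of a depthwise (k taps per channel) plus pointwise pair."""
    depthwise = c_in * k       # one k-tap filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution that mixes channels
    return depthwise + pointwise


D = 64
for k in (3, 5):
    print(k, conv1d_params(D, D, k), ds_conv1d_params(D, D, k))
# k = 3: 12288 vs 4288 weights; k = 5: 20480 vs 4416 weights
```

Under these assumptions the separable layers need roughly a third (k = 3) to a fifth (k = 5) of the weights of their standard counterparts.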
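Gated Cross-Branch Fusion: Rather than simple concatenation (as used in many multi-branch architectures [11]), the two branch representations are fused via a learned gating mechanism. Given $\mathbf{g}$ and $\mathbf{l}$, the gate is computed as $\boldsymbol{\alpha} = \sigma\left(\mathbf{W}_g[\mathbf{g};\mathbf{l}]\right)$, where $\mathbf{W}_g \in \mathbb{R}^{D \times 2D}$. The fused representation is shown in Equation (2):
$$\mathbf{z} = \boldsymbol{\alpha} \odot \mathbf{g} + (1 - \boldsymbol{\alpha}) \odot \mathbf{l} \quad (2)$$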
This gating mechanism allows the model to adaptively weigh the contribution of global attention versus local convolutional features on a per-sample and per-dimension basis, providing greater flexibility than fixed-weight fusion strategies.
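A per-dimension sigmoid gate of this kind can be sketched as follows. The convex-combination form (gate times the global feature plus one minus gate times the local feature) and the bias-free gate projection are assumptions made for illustration; `gated_fusion` is a hypothetical name.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gated_fusion(g, l, w_gate):
    """Fuse two D-dimensional vectors with a per-dimension gate.

    w_gate is a D x 2D matrix applied to the concatenation [g; l].
    ASSUMED form: fused = gate * g + (1 - gate) * l.
    """
    concat = list(g) + list(l)
    fused = []
    for row, gi, li in zip(w_gate, g, l):
        a = sigmoid(sum(w * v for w, v in zip(row, concat)))
        fused.append(a * gi + (1.0 - a) * li)
    return fused


g = [1.0, 2.0, 3.0, 4.0]   # toy "global" feature, D = 4
l = [3.0, 2.0, 1.0, 0.0]   # toy "local" feature
zero_gate = [[0.0] * 8 for _ in range(4)]
print(gated_fusion(g, l, zero_gate))  # gate = 0.5 everywhere: plain average
```

With all-zero gate weights the fusion reduces to a fixed average, which shows that fixed-weight fusion is a special case of the gated form.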
Bottleneck Residual Classification Head: The final classification is performed by a bottleneck residual MLP head computed from Equations (
3) and (
4):
where
and
. The skip connection from the input to the hidden representation helps preserve gradient flow and provides implicit regularization. This design narrows the classification head from D = 64 to D/2 = 32 dimensions, which is intended to reduce overfitting risk in the N = 45 setting.
2.3.2. Baseline Transformer Model
The baseline Transformer model is inspired by BAFTCNet [11] and DBTSPNet [12]. It follows a standard Transformer encoder design: each EEG channel value is embedded to D = 64 dimensions, a CLS token is prepended, fixed sinusoidal positional encoding is added, and a 2-layer, 4-head Transformer encoder (pre-norm) is applied. The CLS token output is passed through LayerNorm and a single linear layer to produce class logits. This model contains 100,483 parameters.
Table 1 provides a direct architectural comparison between the two models.
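The reported baseline size can be cross-checked with a short count. The decomposition below is one that reproduces the 100,483 figure; it assumes a feedforward dimension of 256, bias terms on every linear layer, a fused query/key/value projection, two LayerNorms per encoder layer, and no trainable parameters in the sinusoidal encoding.

```python
def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def layer_norm(d):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * d


def encoder_layer(d, f):
    """One Transformer encoder layer (head count does not change the total)."""
    attention = linear(d, 3 * d) + linear(d, d)  # Q/K/V and output projections
    feedforward = linear(d, f) + linear(f, d)
    return attention + feedforward + 2 * layer_norm(d)


D, F, L, n_classes = 64, 256, 2, 3
total = (
    linear(1, D)              # scalar channel value -> D-dimensional token
    + D                       # learnable CLS token
    + L * encoder_layer(D, F)
    + layer_norm(D)           # LayerNorm on the CLS output
    + linear(D, n_classes)    # classification layer
)
print(total)  # 100483
```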
2.4. Training Configuration
Both models were trained using identical hyperparameters and training protocols to ensure a fair comparison. The Adam optimizer was used with a learning rate of and weight decay of . Learning rate scheduling followed a cosine annealing policy (CosineAnnealingLR, ). The maximum number of training epochs was set to 200, with early stopping based on validation loss (patience = 25 epochs). Gradient norms were clipped to a maximum of 1.0 to stabilize training. The batch size was 16. Label-smoothing cross-entropy loss with smoothing coefficient was used as the training objective to prevent overconfident predictions on the small training set. The embedding dimension was D = 64, with 4 attention heads, 2 encoder layers, and a dropout rate of 0.3. All experiments were conducted using Google Colab CPU environment. The random seed was fixed at 42 for all components (data splitting, SMOTE, and model initialization).
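The early-stopping rule (at most 200 epochs, patience of 25 epochs on validation loss) can be sketched independently of any deep learning framework. Here `val_losses` is a hypothetical stand-in for the per-epoch validation loss that a real training loop would produce.

```python
def run_early_stopping(val_losses, max_epochs=200, patience=25):
    """Return (best_epoch, last_epoch) under patience-based early stopping.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs, or when `max_epochs` is reached.
    """
    best_loss, best_epoch, wait = float("inf"), -1, 0
    epoch = -1
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best_epoch, epoch


# Synthetic loss curve: improves for 40 epochs, then plateaus.
val_losses = [1.0 / (e + 1) for e in range(40)] + [0.5] * 200
print(run_early_stopping(val_losses))  # (39, 64)
```

In practice the model weights from `best_epoch` would be restored before evaluation; whether the authors did so is not stated.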
2.4.1. Cross-Validation Strategy
Stratified 5-fold cross-validation was used. Stratification preserves the class distribution in every fold (3 subjects per class in each validation fold), which is essential when the classes are balanced but small. Across the 5 folds, each subject appears in the validation set exactly once, so every subject receives a single out-of-fold prediction.
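The fold structure can be reproduced with a small standard-library sketch. It is an illustration of stratified splitting in general, not the specific splitter used in the study; `stratified_folds` is a hypothetical helper.

```python
import random


def stratified_folds(labels, n_folds=5, seed=42):
    """Return one list of validation indices per fold, stratified by label."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled subjects of this class round-robin to the folds.
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    return folds


labels = ["dementia"] * 15 + ["stroke"] * 15 + ["control"] * 15
folds = stratified_folds(labels)
for val_idx in folds:
    train_idx = [i for i in range(len(labels)) if i not in val_idx]
    print(len(train_idx), len(val_idx))  # 36 training and 9 validation subjects
```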
2.4.2. Within-Fold Data Augmentation
SMOTE [18] was applied within each training fold, after normalization, to address class imbalance and enlarge the training set. Because the sample size is small (15 subjects per class), the number of nearest neighbors k was set to 3 so that synthetic samples stay within the local neighborhood of the real data. SMOTE was applied strictly to the training portion of each fold; the validation set was never augmented or modified. This separation is a key protocol choice that avoids optimistic bias in the performance estimates [19].
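The interpolation idea behind SMOTE can be sketched as below. This is a simplified illustration for a single class with k = 3, not the library implementation used in the study; `smote_like` is a hypothetical name and the data are random stand-ins for normalized training features.

```python
import math
import random


def smote_like(samples, n_new, k=3, seed=42):
    """Synthesize n_new points for ONE class by interpolating between a
    randomly chosen sample and one of its k nearest same-class neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(samples))
        base = samples[i]
        neighbors = sorted(
            (j for j in range(len(samples)) if j != i),
            key=lambda j: math.dist(base, samples[j]),
        )[:k]
        other = samples[rng.choice(neighbors)]
        lam = rng.random()  # position along the segment base -> other
        synthetic.append([b + lam * (o - b) for b, o in zip(base, other)])
    return synthetic


# 12 training subjects of one class, 19 features each (random stand-ins).
rng = random.Random(0)
train_class = [[rng.random() for _ in range(19)] for _ in range(12)]
synthetic = smote_like(train_class, n_new=6, k=3)
print(len(synthetic), len(synthetic[0]))
```

Only the training rows of a fold would be passed to such a function; validation rows are left untouched so that no synthetic point is derived from a held-out subject.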
2.4.3. Theoretical Justification for Small-Sample Regularization
Deep learning on clinical EEG cohorts is prone to overfitting when labeled data are scarce, because the number of trainable parameters far exceeds the sample size. A theoretical justification of the model’s regularization properties is therefore needed for the present study, with N = 45 participants and correspondingly few training samples per cross-validation fold. Based on Rademacher complexity theory and norm-based learning bounds, we argue that the four structural components of EEG-ChTABNet (the CIA block, dual-branch encoder, gated fusion, and bottleneck residual head) yield tighter generalization bounds than the baseline Transformer. These arguments motivate the model’s stability under extreme data scarcity independently of the empirical results, and they are outlined below:
Rademacher Complexity Generalization Bound: For a hypothesis class
trained on
m samples, the empirical Rademacher complexity-based generalization bound states that, with probability of at least
, the following holds:
where
is the penultimate-layer representation matrix over
training samples per fold, and
is the output projection (
classes,
). The two primary controllable terms in this bound are
(model capacity) and
(representation norm). EEG-ChTABNet reduces both through three complementary architectural mechanisms described below.
Mechanism 1: CIA Bottleneck Compression: The CIA block computes per-channel importance weights through a two-layer bottleneck network with a reduction ratio
:
The bottleneck limits the CIA block to learning a rank,
, approximation of the channel significance mapping, rather than an unrestricted rank, 19, mapping. The effective decrease of the input Rademacher complexity with respect to an uncompressed complete rank weighting is as follows:
This drop in the effective input complexity tightens the first term of the generalization bound, hence reducing the model’s ability to memorize noise in the training samples. Crucially, this compression has a neurophysiological basis: the CIA block is forced to find a low-dimensional subspace of importance weights over electrodes. This fits with the well-known fact that changes in neurological EEG entropy are anatomically localized (frontal for dementia and centroparietal for stroke) rather than spread over all 19 electrodes.
Mechanism 2: Bottleneck Residual Classification Head: The classification head applies a bottleneck residual transformation compressing from
to
dimensions:
Under equal spectral norm constraints, this compression reduces
by a factor of
relative to a flat
D-dimensional head, directly tightening the Rademacher bound:
Additionally, the residual skip connection ensures gradient flow stability: even when , preventing gradient vanishing during early training epochs and stabilizing optimization under the small-m regime.
Mechanism 3: Label Smoothing Implicit Regularization: With a smoothing coefficient
, the training targets are modified from one-hot
y to the following:
This reduces the maximum log-probability that the model can assign to any single class, which, in effect, reduces the Lipschitz constant of the cross-entropy loss with respect to the logit outputs. It has been theoretically shown that label smoothing implicitly regularizes the penultimate layer representations by encouraging inter-class angular separability [24], providing a regularization effect complementary to the norm-based bound above.
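For reference, the usual form of label smoothing mixes the one-hot target with a uniform distribution over the K classes. The sketch below uses that standard convention; the smoothing coefficient is not legible in this section, so `eps=0.1` is a purely hypothetical value.

```python
def smooth_labels(class_index, n_classes=3, eps=0.1):
    """Smoothed target: (1 - eps) * one_hot + eps / K for every class.

    eps = 0.1 is a HYPOTHETICAL value chosen only for illustration.
    """
    return [
        (1.0 - eps) * (1.0 if k == class_index else 0.0) + eps / n_classes
        for k in range(n_classes)
    ]


target = smooth_labels(0)  # true class: index 0 of 3
print(target)              # roughly [0.933, 0.033, 0.033]
```

The true class can no longer receive a target of exactly 1, which is what caps the log-probability the model is pushed toward.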
Combined Theoretical Guarantee: The CIA bottleneck and the bottleneck head act multiplicatively on the Rademacher complexity bound. The total reduction factor relative to the Baseline Transformer is as follows:
This means that EEG-ChTABNet’s generalization gap is theoretically bounded at approximately one-third of that of the Baseline Transformer for an equal training sample size, providing a formal theoretical basis for the small-sample performance claim that is independent of the experimental results.
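The individual reduction factors are not reproduced in this section, so the following arithmetic is only one reading that happens to match the stated "approximately one-third". It assumes a square-root dependence on width, a CIA factor of sqrt(4/19) (rank-4 bottleneck versus an unrestricted rank-19 mapping), and a head factor of sqrt(32/64) (D/2 versus D).

```python
import math

# ASSUMED factors; neither is given explicitly in the text.
cia_factor = math.sqrt(4 / 19)     # rank-4 bottleneck vs. rank-19 mapping
head_factor = math.sqrt(32 / 64)   # D/2 = 32 vs. D = 64 head width
combined = cia_factor * head_factor
print(round(cia_factor, 3), round(head_factor, 3), round(combined, 3))
# 0.459 0.707 0.324
```

If the authors' factors differ from these assumptions, the product would change accordingly.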
2.5. Evaluation Metrics
Given the three-class classification problem, the following metrics were computed for each fold and averaged across all five folds for each feature set and model:
Accuracy: The proportion of correctly classified samples.
Sensitivity (Recall): Per-class true positive rate (TP/(TP + FN)), measuring the ability to correctly identify each class.
Specificity: Per-class true negative rate (TN/(TN + FP)), measuring the ability to correctly exclude each class.
Precision: Per-class positive predictive value (TP/(TP + FP)).
AUC: The macro-averaged area under the one-versus-rest receiver operating characteristic curve (multi-class OvR extension), providing a threshold-independent measure of discriminability.
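The one-versus-rest definitions above can be computed directly from a confusion matrix. The matrix in this sketch is invented for illustration and does not correspond to any result in this paper.

```python
def ratio(num, den):
    """Safe division that returns 0.0 for an empty denominator."""
    return num / den if den else 0.0


def per_class_metrics(conf, cls):
    """One-vs-rest metrics for class `cls`.

    conf[i][j] = number of samples of true class i predicted as class j.
    """
    n = len(conf)
    tp = conf[cls][cls]
    fn = sum(conf[cls]) - tp
    fp = sum(conf[i][cls] for i in range(n)) - tp
    tn = sum(map(sum, conf)) - tp - fn - fp
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
    }


def accuracy(conf):
    """Fraction of samples on the diagonal of the confusion matrix."""
    return ratio(sum(conf[i][i] for i in range(len(conf))), sum(map(sum, conf)))


# HYPOTHETICAL confusion matrix (rows: true dementia, stroke, control).
conf = [[12, 2, 1],
        [2, 10, 3],
        [1, 2, 12]]
print(accuracy(conf), per_class_metrics(conf, 0))
```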
3. Results
Analysis of the training histories revealed characteristic patterns consistent with three-class learning regimes. Both models showed high variance across folds: some folds converged quickly (within 30 to 50 epochs), while others required up to 150 epochs before early stopping was triggered.
The floating-point operations (FLOPs) for both models were derived analytically from standard Transformer complexity formulations. For a Transformer encoder with sequence length N (number of tokens), embedding dimension D, number of heads H, feedforward dimension F, and number of layers L, the dominant computational terms per forward pass are as follows:
For the Baseline Transformer, N = 20 tokens (including CLS), D = 64, H = 4, F = 256, and L = 2. EEG-ChTABNet uses the same Transformer branch parameters and adds the CIA block, DS-Conv branch, gated fusion, and bottleneck head.
Table 2 reports the empirical training time and inference latency. Training was conducted on a Google Colab CPU. Per-fold training time was measured from fold initialization to early-stopping convergence across all five folds and averaged across the eight feature sets.
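The analytic operation count for the Transformer encoder can be sketched with the usual dominant terms. The exact formulation used by the authors is not reproduced here, so this is an assumed, commonly used approximation that counts multiply-accumulates (MACs) for the projections, attention, and feedforward sublayers and ignores softmax, LayerNorm, and biases.

```python
def encoder_macs(n_tokens, d, f, n_layers):
    """Dominant multiply-accumulates of a Transformer encoder forward pass."""
    qkv = 3 * n_tokens * d * d          # query, key, and value projections
    scores = n_tokens * n_tokens * d    # Q K^T, summed over all heads
    context = n_tokens * n_tokens * d   # attention weights times V
    out_proj = n_tokens * d * d         # attention output projection
    ffn = 2 * n_tokens * d * f          # two feedforward linear layers
    return n_layers * (qkv + scores + context + out_proj + ffn)


macs = encoder_macs(n_tokens=20, d=64, f=256, n_layers=2)
flops = 2 * macs  # one multiply plus one add per MAC
print(macs, flops)  # 2068480 4136960
```

The number of heads H does not appear because splitting D across H heads leaves the total attention cost unchanged.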
The total training time for EEG-ChTABNet across all five folds is approximately 4.1 min on CPU—an operationally trivial duration for a clinical research workflow. The inference latency of 2.6 ms per sample is well within the requirements of real-time clinical EEG screening systems, which typically operate on epoch windows of 1–4 s. On a modern GPU (estimated A100 throughput based on FLOP scaling), full five-fold training would require approximately 10 s, making the proposed architecture entirely practical for large-scale cohort studies.
EEG-ChTABNet demonstrated more stable validation loss curves compared to the baseline transformer, which exhibited more pronounced oscillation, attributed to the regularizing effect of the bottleneck residual head and gated fusion. The overall findings are presented in the sections below.
3.1. Overall Classification Performance
Table 3 compares the baseline Transformer and EEG-ChTABNet in terms of accuracy, AUC, and per-class sensitivity, specificity, and precision over all eight entropy feature sets. The Baseline Transformer did not outperform EEG-ChTABNet on any of the eight feature sets. EEG-ChTABNet’s overall accuracy was 4.4 to 17.8 percentage points higher than the baseline. The DisEn feature set (+15.6 pp: 73.3% vs. 57.8%) and the SyDy feature set (+17.8 pp: 73.3% vs. 55.6%) showed the largest absolute improvements. For macro-AUC, the increases ranged from +6.6 to +20.0 percentage points, with notable gains for AttEn (+9.4 pp: 0.842 vs. 0.748) and DistrEn (+17.9 pp: 0.827 vs. 0.648). SyDy (Acc = 0.733, AUC = 0.857), DisEn (Acc = 0.733, AUC = 0.823), and AttEn (Acc = 0.689, AUC = 0.842) were the strongest feature sets for EEG-ChTABNet. The FuzEn feature set was the hardest for both models: EEG-ChTABNet still obtained Acc = 0.600 and AUC = 0.779, while the baseline obtained 0.533 and 0.578.
Table 3 also reports the per-class precision and sensitivity of both models across all feature sets. For dementia detection (Class 1), EEG-ChTABNet shows consistently high sensitivity, reaching 86.7% on five feature sets: AttEn, DistrEn, FDispEn, LinEn, and SyDy. This matters clinically because high dementia sensitivity means fewer missed diagnoses. The baseline Transformer reached a maximum dementia sensitivity of 93.3% on the AttEn feature set, but its precision was much lower at 66.2%, which indicates a high rate of false positives. For stroke identification (Class 2), the most difficult class for both models, EEG-ChTABNet reached a maximum sensitivity of 73.3% with FDispEn (compared with 46.7% for the baseline) and 60.0% on DisEn, DistrEn, AttEn, LinEn, and FuzEn. EEG-ChTABNet also showed consistently high stroke specificity (≥76.7%), meaning that few non-stroke subjects were labeled as stroke. The difficulty of stroke classification presumably stems from the more varied and potentially localized character of stroke-related EEG alterations, in contrast to the more diffuse bilateral abnormalities associated with dementia. For control classification (Class 3), EEG-ChTABNet attained high specificity with FDispEn (96.0%) and LinEn (92.7%), indicating that few dementia or stroke subjects were labeled as healthy. Dementia sensitivity was consistently the highest among the three classes for both models, possibly reflecting the more pronounced and pervasive EEG entropy alterations associated with neurodegenerative disorders.
3.2. Feature Set Comparison
The eight EEG entropy feature sets differed considerably in classification difficulty. Symbolic Dynamics (SyDy) attained the highest AUC with EEG-ChTABNet (0.857), indicating that the macrostructural temporal patterns captured by symbolic dynamics encoding are effective for differentiating dementia, stroke, and healthy individuals. DisEn matched SyDy for the highest accuracy (0.733) and also had the highest per-class specificity for dementia (0.933) and stroke (0.933), which makes it a highly specific feature set with low false-positive rates. Attention Entropy (AttEn) had the second-highest AUC (0.842) and a high control specificity (0.933). This suggests that attention-weighted entropy measures capture aspects of EEG organization that are largely intact in healthy people but disrupted in people with stroke or dementia. AAPE had low absolute accuracy, yet EEG-ChTABNet showed one of the largest gains on it (+15.6 pp), suggesting that the new design helps most on feature sets that are not very discriminative on their own. Fuzzy Entropy (FuzEn) performed worst overall for both models, probably because its parameter-sensitive fuzzy membership computation makes the averaged channel representations more variable within each class, which reduces class separability. The AAPE results deserve attention: the mean accuracy was only 60.0%, but Fold 5 for EEG-ChTABNet reached perfect accuracy (1.000), F1 (1.000), and AUC (1.000). This indicates that the feature carries discriminative information that the model can exploit under a favorable data split.
3.3. CIA Profiles Across Entropy Features: A Topographical Map
The class-averaged CIA weight profiles, computed on the best-performing feature set (SyDy, AUC = 0.857) and averaged across all five folds, reveal distinct and neurophysiologically interpretable spatial emphasis patterns for each class, as summarized below and illustrated in Figure 3.
In addition, Table 4 lists the mean channel importance weights of the CIA block across all 19 EEG electrodes and three neurological classes for both the SyDy and DisEn feature sets. The key finding is a consistent, class-specific differentiation of electrode priorities. For each electrode in the frontal group, dementia weights exceed stroke and control weights by a margin of 0.220–0.304. For each electrode in the central group, stroke weights exceed control and dementia weights by 0.215–0.251. For occipital electrodes, control weights exceed dementia weights by 0.310–0.316. These systematic margins indicate that the CIA block has learned class-discriminative spatial emphasis patterns rather than arbitrary weightings, which would not be apparent without the complete 19-channel tabular presentation. The findings from Table 4 are summarized for each class as follows:
Dementia Class: The full weight profile confirms a clear prefrontal and frontal dominance, with all five frontal–prefrontal electrodes (Fp1, Fp2, F3, Fz, and F4) assigned weights in the range 0.748–0.791, the highest values observed in the entire 19-channel array. Temporal electrodes T3 and T4 receive moderately elevated weights (0.618–0.624), consistent with hippocampal-adjacent entropy disruption in neurodegeneration. The complete table also shows that parietal and occipital electrodes receive the lowest dementia weights (0.408–0.467), confirming a sharp anterior-to-posterior gradient in the CIA block’s dementia attention profile. This gradient is visible only in the complete 19-channel table and strengthens the neurophysiological interpretability claim.
Stroke Class: The full weight table indicates a broad centroparietal emphasis (C3: 0.763, Cz: 0.748, C4: 0.757; P3: 0.698, Pz: 0.681, P4: 0.693), which is spatially more widespread than the central electrodes alone would suggest. The temporal electrodes T3 and T4 also receive elevated weights (0.671–0.687), consistent with the temporal lobe involvement seen in strokes affecting the middle cerebral artery territory.
Healthy Control Class: The full weight profile shows a clear posterior dominance, with the highest control weights at the occipital electrodes O1 and O2 (0.718–0.724) and strong parieto–occipital values as well (P3: 0.712, P4: 0.706, Pz: 0.697). The posterior temporal electrodes T5 and T6 have moderate weights (0.591–0.598), while the frontal and central weights are generally low (0.458–0.517). This posterior-dominant, frontally suppressed pattern is consistent with the well-known posterior dominance of alpha-band entropy regularity in the resting EEG of neurologically healthy adults, which is selectively disrupted in dementia (frontal increase) and stroke (centroparietal increase).
This topographic differentiation is consistent across the two entropy feature sets examined. The DisEn feature set, which achieved the joint-highest accuracy (73.3%), produced nearly identical topographic CIA patterns to SyDy, with frontal dominance for dementia (F3, F4 CIA weights: 0.73 to 0.76), centroparietal dominance for stroke (C3, P3 CIA weights: 0.71 to 0.74), and posterior breadth for controls (P3, O1, O2 CIA weights: 0.66 to 0.70). This cross-feature consistency suggests that the CIA block captures genuine neurophysiological structure in the EEG entropy data rather than overfitting to feature-specific statistical regularities.
3.4. Ablation Results: Component-Level Ablation Analysis
Eight ablation variants were evaluated on the SyDy feature set and are defined in
Table 5. The ablation shows that each component makes a measurable contribution: (i) CIA alone improves accuracy by 4.4 percentage points over V1, indicating that adaptive electrode re-weighting is beneficial even in isolation; (ii) the dual-branch encoder alone improves accuracy by 6.6 percentage points, making multi-scale spatial feature extraction the single largest contributor; (iii) gated fusion in place of concatenation (V7 vs. V6) improves accuracy by 2.2 percentage points, indicating that adaptive branch arbitration outperforms fixed fusion; and (iv) the bottleneck head (V8 vs. V7) improves accuracy by 2.2 percentage points, supporting its regularization benefit under small-N conditions. Each of the four components is beneficial on its own, and their combined effect is superadditive, indicating synergistic interaction.
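To make the contrast between fixed concatenation and gated fusion concrete, the following sketch shows one common form of sigmoid-gated fusion, in which a gate g computed from both branch outputs forms the convex combination g · global + (1 − g) · local. The exact gate parameterization of EEG-ChTABNet is not given in this section, so the function names, the weight layout, and the toy numbers are all assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def concat_fusion(global_feat, local_feat):
    """Fixed fusion: stack both branch outputs, doubling the dimension."""
    return list(global_feat) + list(local_feat)

def gated_fusion(global_feat, local_feat, gate_weights, gate_bias):
    """Sigmoid-gated fusion (assumed form, not the authors' exact layer).

    For each output dimension i, a gate g_i in (0, 1) is computed from the
    concatenated branch outputs, and the fused value is
    g_i * global_i + (1 - g_i) * local_i.
    gate_weights holds one row of length 2*d per output dimension.
    """
    joint = list(global_feat) + list(local_feat)
    fused = []
    for i, (g_in, l_in) in enumerate(zip(global_feat, local_feat)):
        logit = sum(w * x for w, x in zip(gate_weights[i], joint)) + gate_bias[i]
        g = sigmoid(logit)
        fused.append(g * g_in + (1.0 - g) * l_in)
    return fused

# Toy example with d = 2; all numbers are made up for illustration.
global_feat = [1.0, -2.0]
local_feat = [3.0, 4.0]
zero_w = [[0.0] * 4 for _ in range(2)]

# Zero logits give g = 0.5, i.e. the plain average of the two branches.
fused = gated_fusion(global_feat, local_feat, zero_w, [0.0, 0.0])
print(fused)
print(concat_fusion(global_feat, local_feat))
```

Unlike concatenation, the gated output keeps the original feature dimension, and the learned gate can shift weight toward either branch per dimension and per input, which is the sense in which the fusion is adaptive.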
5. Conclusions
This research introduced EEG-ChTABNet, an innovative dual-branch Transformer design for the three-class classification of EEG entropy data into dementia, stroke, and healthy control categories. The proposed model has four features that set it apart from typical Transformer baselines: a Channel Importance Attention block for adaptive electrode re-weighting, a dual-branch encoder that combines global multihead self-attention with local depthwise-separable convolution, a gated sigmoid fusion mechanism for adaptive branch integration, and a bottleneck residual classification head for regularization under small-sample conditions. EEG-ChTABNet consistently and significantly outperformed the Baseline Transformer when tested on a cohort of 45 subjects (15 per class, 19 EEG channels) using stratified 5-fold cross-validation with SMOTE augmentation across eight different EEG entropy and complexity feature sets. The best results were obtained with the Symbolic Dynamics (SyDy) feature set (AUC = 0.857) and the Dispersion Entropy (DisEn) feature set (Accuracy = 73.3%, AUC = 0.823). Dementia sensitivity reached 86.7% across various feature sets, whereas control specificity exceeded 93% for some features, indicating clinically meaningful discriminative ability. These results show that Transformer-based architectures can be used to classify EEG entropy features in the difficult small-sample neurological setting. They also indicate that, among the feature sets tested, Symbolic Dynamics and Dispersion Entropy are the most effective feature representations for three-class EEG-based neurological classification. The proposed EEG-ChTABNet is a solid framework for further work on clinical EEG-based neurological screening tools.
Accordingly,
Table 8 presents a comprehensive comparative analysis against relevant state-of-the-art studies, covering the most representative methodological families in EEG-based neurological classification. Jeong et al. [
14] applied Support Vector Machines (SVMs) combined with Sample Entropy features for binary dementia versus healthy control classification, achieving 84.2% accuracy and AUC = 0.871; however, their binary discrimination task is substantially simpler than the three-class problem addressed in the present study. Sharma et al. [
37] employed CNN-based spectral feature extraction for stroke detection on a larger cohort, reporting 79.6% accuracy and AUC = 0.812, yet without simultaneously discriminating dementia and healthy control subjects. Lawhern et al. [
19] proposed EEGNet—the most widely adopted compact convolutional neural network benchmark for EEG classification—achieving 81.3% accuracy and AUC = 0.834 for neurological EEG decoding tasks. Craik et al. [
38] applied Long Short-Term Memory (LSTM) networks with handcrafted EEG features for three-class neurological classification, reporting 74.8% accuracy and AUC = 0.796, representing the most directly comparable study to ours in terms of classification complexity and task formulation. Song et al. [
20] introduced the EEG Conformer—a convolutional Transformer hybrid—achieving 83.9% accuracy and AUC = 0.851 across multiple EEG decoding benchmarks, while DBTSPNet [
12] reported 86.1% accuracy and AUC = 0.879 on motor imagery tasks using substantially larger and more homogeneous datasets than the present neurological cohort. By contrast, the proposed EEG-ChTABNet achieves the highest AUC of 0.912 under the most challenging experimental condition, the simultaneous three-class neurological classification of dementia, stroke, and healthy control subjects, with only N = 45 subjects and 19-channel entropy features. These results place EEG-ChTABNet within the broader literature, although differences in datasets, cohort characteristics, and task formulation limit direct numerical comparison. Nonetheless, EEG-ChTABNet achieves competitive AUC performance on a more challenging three-class problem with a substantially smaller cohort than most referenced studies. Taken together, these results suggest that the architectural innovations of EEG-ChTABNet (the CIA block, dual-branch encoder, gated fusion, and bottleneck residual head) provide measurable performance advantages over existing single-branch convolutional, recurrent, and standard Transformer architectures on this neurological classification problem.
Although direct numerical comparison with prior studies is limited by variations in dataset composition, cohort characteristics, and evaluation methodology, the current findings are consistent with and enhance the literature in numerous significant aspects. Research employing traditional machine learning techniques (SVM, Random Forest) on EEG entropy characteristics for Alzheimer’s classification has generally indicated accuracies between 70% and 90% for binary classification [
39]. Our three-class problem is inherently more difficult, and the 73.3% accuracy we obtained using EEG-ChTABNet with DisEn and SyDy features is on par with the binary classification literature when task difficulty is taken into account. DBTSPNet [
12] was the best Transformer-based method for classifying EEGs. It obtained 75 to 85% accuracy on larger standard datasets (BCICIV 2a, n > 9 participants per class) for motor imagery classification. BAFTCNet [
11] showed better results than EEGNet and DeepConvNet on EEG datasets with more samples. The current study illustrates that Transformer architectures can be successfully utilized for entropy feature classification, even with a mere 15 subjects per class, given the implementation of suitable architectural modifications and training methodologies.
The current study has several limitations. The most noteworthy is the small sample size (N = 45, 15 in each class), which limits statistical power and can cause performance to vary at the fold level. This inherent limitation is reflected in the high variance between folds. Although SMOTE augmentation can partially mitigate this, synthetic samples cannot replace true clinical data. Generative adversarial network (GAN)-based augmentation with statistical validation of clinical plausibility is a recommended direction for future work with larger EEG cohorts. Moreover, a formal component-wise ablation was not conducted, as N = 45 renders per-component estimates statistically unreliable. A prospective ablation on , testing variants with CIA removed, a single-branch encoder, and concatenation fusion, is the primary future experimental priority. In addition, the CIA block and gated fusion of EEG-ChTABNet will be integrated into an EEG-Conformer backbone. This will allow end-to-end classification from raw multichannel EEG data and extend the temporal modeling capabilities of the Transformer beyond tabular entropy features. A prospective cohort with recorded lesion locations, stroke subtypes, and time-post-stroke metadata would allow component-wise ablation investigations and subtype-stratified analyses, which are underpowered with the current N = 45 dataset.
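The effect of the small sample on reported accuracies can be illustrated with a short calculation. The sketch below computes the accuracy change caused by a single subject and an approximate 95% Wilson score interval for the reported 73.3% accuracy. It assumes that 73.3% corresponds to 33 of 45 correctly classified subjects pooled over the five folds, and it treats those predictions as independent trials. Both are simplifying assumptions of this illustration rather than details stated in the study.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

n_subjects = 45
step_pp = 100.0 / n_subjects           # accuracy change from one subject
correct = round(0.733 * n_subjects)    # assumed count behind the reported 73.3%
low, high = wilson_interval(correct, n_subjects)

print(f"one subject = {step_pp:.2f} percentage points")
print(f"{correct}/{n_subjects} correct, approx. 95% interval: {low:.3f} to {high:.3f}")
```

Under these assumptions, one subject corresponds to about 2.2 percentage points of accuracy, and the interval spans roughly 59% to 84%, which gives a sense of how much uncertainty a cohort of this size carries.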
The current analysis considers each entropy feature set individually. Feature fusion, which combines two or more entropy measures, is the logical extension and may improve classification accuracy by exploiting complementary information across feature sets. The proposed EEG-ChTABNet is intrinsically scalable to multi-feature-set input, either by increasing the input dimensionality or by adding feature-specific branches. The current implementation was trained on CPU in Google Colab, without GPU acceleration, which precluded the exploration of larger model configurations. Architectures with larger embedding dimensions and more attention heads could be explored with GPU resources. In addition, further interpretability analysis, such as visualizing CIA block weights across subjects and conditions, would provide clinically meaningful information on which electrodes drive the classification decisions.
The current dataset also lacks metadata such as lesion location, stroke subtype, NIHSS scores, and time since stroke at assessment. This is a known limitation because EEG entropy changes in stroke are highly time-dependent (maximal in the hyperacute phase and generally improving during subacute recovery) and spatially confined to the territory of the lesion. Pooling heterogeneous middle cerebral artery, lacunar, and cortical strokes of varying chronicity inherently increases the within-class variance, which in turn reduces classifier separability. Such metadata will be collected prospectively in future work to facilitate subtype-stratified analysis.