Evaluating EEG-Based Seizure Classification Using Foundation and Classical Ensemble Models

Obaido, George; Esenogho, Ebenezer

doi:10.3390/app16073120

by George Obaido * and Ebenezer Esenogho

Center for Artificial Intelligence and Multidisciplinary Innovations, Department of Auditing, College of Accounting Sciences, University of South Africa, Pretoria 0002, South Africa

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(7), 3120; https://doi.org/10.3390/app16073120

Submission received: 6 January 2026 / Revised: 15 February 2026 / Accepted: 24 February 2026 / Published: 24 March 2026

(This article belongs to the Special Issue AI-Driven Healthcare)

Abstract

Electroencephalogram (EEG)-based seizure classification remains challenging due to inter-subject variability and heterogeneous signal characteristics. Foundation models offer a promising alternative to dataset-specific training by leveraging pretrained priors. In this study, we evaluate a tabular foundation model, the Tabular Prior-Data Fitted Network (TabPFN), against classical ensemble baselines (gradient boosting, random forests, AdaBoost, and XGBoost) for EEG seizure segment classification. We use subject-independent GroupKFold cross-validation with out-of-fold evaluation to assess generalization to unseen individuals. Experiments on the Bangalore EEG Epilepsy Dataset (BEED) and the University of Bonn (Bonn) dataset show that TabPFN achieves higher accuracy than classical ensembles, reaching 99.7% on BEED and 99.6% on Bonn. These results suggest that pretrained tabular priors can be effective in feature-based EEG pipelines where subject-level generalization is required.

Keywords: EEG; classical ensemble methods; seizure classification; foundation models; TabPFN; subject-independent evaluation; machine learning

1. Introduction

Epilepsy is a chronic neurological disorder in which recurrent seizures can disrupt daily life and, in some cases, lead to severe injury or death [1,2,3,4]. Electroencephalography (EEG) remains a core modality for clinical assessment because it measures brain electrical activity with high temporal resolution and is relatively accessible compared with many imaging alternatives [5,6,7]. However, manual EEG review is time-consuming, and seizure patterns can be subtle, heterogeneous, and easily confounded by artifacts. As a result, automated EEG-based seizure classification has become an active area of research, spanning signal processing, classical machine learning, and deep learning methods. Figure 1 illustrates example EEG signals during seizure and non-seizure activity.

A major barrier to reliable deployment is inter-subject variability. EEG characteristics differ substantially between individuals due to physiology, electrode placement, recording conditions, medication, and comorbidities. Models that perform well when training and testing share subjects often degrade when evaluated on unseen individuals (a subject-independent setting). Recent work continues to emphasize the importance of evaluation protocols that reflect this real-world requirement and highlights patient-independent learning as a recurring gap between benchmark performance and clinical generalization [8,9]. In parallel, state-of-the-art seizure modeling has increasingly explored architectures designed to capture complex spatial and temporal dependencies, including graph-based formulations that encode relationships across EEG channels and attention mechanisms for long-range dynamics [10,11,12,13,14]. While these approaches can improve performance, they often require substantial training, careful tuning, and non-trivial engineering choices that may not transfer cleanly between datasets.

Figure 1. Comparison of EEG signals during epileptic seizure and normal brain activity [15].

In many practical pipelines, EEG is converted into a tabular representation for classification: signals are segmented into windows, and each window is represented by features such as time-domain statistics, bandpower, entropy measures, or compact learned embeddings [16,17,18,19]. Tabular formulations are attractive because they are fast to train, are comparatively interpretable relative to end-to-end sequence models, and are compatible with widely used ensemble classifiers such as gradient boosting machines (GBM), Random Forests, AdaBoost, and XGBoost. However, classical tabular learners can still struggle with generalization under distribution shift, and their performance can be sensitive to feature design and hyperparameter tuning [20].

Recently, tabular foundation models have emerged as an alternative approach. Instead of training a new model for each dataset from scratch, these methods pretrain on large collections of synthetic or curated tasks and then perform prediction on a new dataset with minimal or no dataset-specific optimization. Tabular Prior-Data Fitted Network (TabPFN) is a representative example that uses a transformer-style architecture and in-context learning to approximate a strong Bayesian predictor for small-to-medium tabular datasets [21]. In the broader tabular modeling literature, there is growing interest in adapting or developing pretrained transformer-based methods for structured data, with surveys documenting a rapid expansion of techniques and evaluation practices [22,23,24,25]. Despite this momentum, comparatively little is known about how tabular foundation models behave in EEG seizure classification settings, particularly under subject-independent evaluation where generalization to unseen individuals is central. Although EEG signals are inherently temporal and multi-channel, many clinical and research pipelines convert EEG segments into fixed-length vectors (e.g., handcrafted time- and frequency-domain features or compact learned embeddings) for efficient classification [26,27]. In this tabular setting, TabPFN is relevant because it encodes a pretrained prior over tabular prediction tasks and performs inference via in-context conditioning [28]. Under subject-aware evaluation, where fold isolation restricts training data and introduces distribution shift across individuals, such pretrained priors may provide improved data efficiency and robustness compared to boosting methods that learn decision boundaries solely from the available training folds. We explicitly acknowledge the limitations of reducing multichannel temporal EEG to fixed-length vectors and treat event-level detection and temporal modeling as important directions for future work.

In this study, we provide an empirical comparison between a tabular foundation model and commonly used classical ensemble methods for EEG-based seizure classification under a subject-independent cross-validation protocol. We evaluate performance using standard classification metrics to better approximate the realistic setting in which a classifier is deployed on individuals not observed during training. Our results show that TabPFN achieves consistently better performance than the classical models across the datasets evaluated, suggesting that pretrained tabular priors can be beneficial for EEG seizure classification when subject-level generalization is emphasized. The main contributions of this work are as follows. We:

- evaluate the TabPFN model for EEG seizure segment classification under a subject-independent protocol, emphasizing generalization to unseen individuals.
- provide a controlled comparison against widely used ensemble baselines with consistent fold isolation, out-of-fold evaluation, and explicitly described hyperparameter selection.
- analyze when and why pretrained tabular priors may help in small-to-medium EEG tabular settings, and provide practical guidance and limitations for deploying tabular foundation models in EEG pipelines.

The remainder of this paper is organized as follows. Section 2 reviews related work on EEG-based seizure classification. Section 3 describes the datasets, models, and evaluation methodology. Section 4 presents the experimental results and discusses their implications for subject-independent EEG classification. Finally, Section 5 concludes the paper.

2. Related Work

Automated EEG-based seizure classification has been widely studied using machine learning and deep learning techniques, driven by the need to reduce reliance on manual EEG interpretation and improve consistency across clinical settings [29,30]. Both early and recent approaches differ substantially in how EEG signals are represented and how models are designed to handle variability across recordings and individuals.

Classical machine learning methods remain a common choice for EEG seizure classification, particularly when EEG signals are transformed into feature-based or tabular representations. In this setting, handcrafted features derived from time-domain statistics, frequency-band power, entropy measures, and nonlinear dynamics are used as inputs to conventional classifiers. For example, Kapoor et al. [27] demonstrated that a hybrid ensemble combining random forest, AdaBoost, and decision trees achieved competitive performance on benchmark EEG seizure datasets. Similarly, Abirami et al. [3] reported strong results using XGBoost for multi-class EEG seizure recognition. Wavelet-based feature extraction paired with classical classifiers has also been widely explored; Subasi and Gursoy [26] showed that such approaches effectively capture discriminative frequency-domain patterns on the Bonn dataset. Bhattacharyya et al. [31] further demonstrated that boosted ensemble classifiers consistently outperform single models in multi-class seizure recognition tasks on the same benchmark. More recent studies continue to confirm the robustness of classical and ensemble methods: Abhishek et al. [32] evaluated random forests and gradient boosting on feature-based EEG representations, while Paneru [33] introduced interpretable ensemble models with LIME explanations and reported strong performance on the BEED dataset under nested cross-validation. Additionally, Najmusseher et al. [34] showed that combining multi-domain feature extraction with classifiers such as AdaBoost and Gaussian Naïve Bayes yields high accuracy on both the BEED and Bonn datasets, albeit under sample-level evaluation protocols.

Deep learning techniques have also been extensively applied to EEG seizure classification, particularly to learn representations directly from raw EEG signals or time–frequency transformations. Aslam et al. [35] transformed 1D EEG signals into 2D spectrograms and employed deep convolutional neural networks (CNNs) to achieve high accuracy on the Bonn dataset. Recurrent and hybrid architectures have likewise been explored: Tsiouris et al. [36] proposed an LSTM-based framework to model temporal dependencies in EEG recordings, while Daoud and Bayoumi [37] introduced an efficient RNN-based approach emphasizing reduced computational complexity and real-time applicability. More recent deep architectures, including 1D-CNN and BiLSTM hybrids, continue to report near-ceiling performance on the Bonn benchmark [38]. In parallel, recent BEED-focused studies have combined deep learning and stacking strategies, such as SeqBoostNet [39], and tabular deep learning approaches, such as PaperNet [40], demonstrating strong macro-F1 and accuracy scores under various evaluation protocols.

In contrast to the approaches described above, tabular foundation models represent a relatively recent direction in machine learning. Rather than training models independently for each dataset, these methods leverage pretrained priors learned across a wide range of tasks and synthetic data distributions. TabPFN is a transformer-based model pretrained on synthetic tabular datasets that approximates Bayesian inference through in-context learning, enabling rapid prediction without dataset-specific optimization [28,41,42]. While tabular foundation models have been evaluated extensively on various real-world tabular benchmarks, their application to EEG-based seizure classification remains limited.

In this work, we address this gap by comparing a tabular foundation model with widely used classical ensemble methods for EEG seizure classification, with a particular focus on subject-independent evaluation and consistency across datasets.

3. Materials and Methods

This section describes the datasets used in this study and the machine learning models used for EEG-based seizure classification. Emphasis is placed on subject-independent evaluation to reflect realistic deployment scenarios, where models are required to generalize to previously unseen individuals.

3.1. Datasets

This study employs two publicly available EEG datasets that have been widely used for epileptic seizure classification and benchmarking. Both datasets provide EEG recordings transformed into fixed-length tabular representations, making them suitable for evaluating classical machine learning models as well as tabular foundation models.

3.1.1. Bangalore EEG Epilepsy Dataset (BEED)

The primary dataset used in this study is the Bangalore EEG Epilepsy Dataset (BEED) [34,43]. BEED consists of labeled EEG signal segments collected from multiple subjects, representing seizure and non-seizure states (binary classification). Each EEG segment is represented as a fixed-length vector, enabling a tabular formulation of the classification task. Importantly, the dataset includes subject identifiers, which are leveraged in this study to enforce subject-independent evaluation via GroupKFold cross-validation. This characteristic makes BEED particularly suitable for assessing model generalization across unseen individuals, a key requirement for realistic clinical deployment.

3.1.2. University of Bonn EEG Dataset

The second dataset used in this study is the University of Bonn EEG dataset [44,45], a well-known benchmark in EEG seizure analysis. The dataset comprises EEG recordings segmented into fixed-length samples and organized into five distinct classes representing seizure activity and various non-seizure conditions. Each EEG segment is represented by a vector of signal values, resulting in a tabular data structure suitable for classical machine learning and tabular foundation models. The Bonn dataset has been extensively used in prior studies for evaluating seizure classification algorithms and provides a standardized reference point for comparative analysis.

Table 1 shows a summary of the BEED and Bonn EEG datasets, including key characteristics such as subjects, channels, sampling rates, segment lengths, class distributions, and data representation used to ensure reproducibility.

3.2. Models Used

This study compares classical ensemble learning methods with a tabular foundation model for EEG-based seizure classification. All models operate on tabular representations of EEG segments and are evaluated under identical subject-independent cross-validation settings.

3.2.1. Random Forest

Random Forest (RF) is an ensemble learning method based on bootstrap aggregation (bagging) of decision trees [46,47]. Given a training dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$, Random Forest constructs $T$ decision trees, each trained on a bootstrap sample of $D$ and using a random subset of features at each split.

For a classification task, the final prediction is obtained via majority voting:

$$\hat{y} = \arg\max_{c \in C} \sum_{t=1}^{T} I(h_t(x) = c), \tag{1}$$

where $h_t(x)$ denotes the prediction of the $t$-th decision tree and $I(\cdot)$ is the indicator function.

Random Forests are known for their robustness to noise, reduced overfitting, and strong performance on high-dimensional tabular data, making them a common baseline for EEG seizure classification.
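
As a small illustration of the voting rule in Equation (1), the following sketch counts hard votes from a handful of trees. The function name `majority_vote` and the example votes are hypothetical and are not taken from the study's code.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class receiving the most votes, as in Eq. (1).

    Ties are resolved in favour of the class that appears first in the
    list, which is an arbitrary choice made for this sketch.
    """
    counts = Counter(tree_predictions)
    return counts.most_common(1)[0][0]


# Hypothetical hard votes from T = 5 trees for a single EEG segment.
votes = ["seizure", "non-seizure", "seizure", "seizure", "non-seizure"]
prediction = majority_vote(votes)
```

Note that scikit-learn's `RandomForestClassifier` averages the per-tree class probabilities instead of counting hard votes, so its output can differ from Equation (1) when the vote is close.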

3.2.2. Gradient Boosting Machine

Gradient Boosting Machine (GBM) is a boosting-based ensemble method that builds decision trees sequentially, with each new tree trained to correct the errors of the current ensemble [48,49]. Let $F_m(x)$ denote the model after $m$ iterations. The model is updated as:

$$F_m(x) = F_{m-1}(x) + \nu h_m(x), \tag{2}$$

where $\nu \in (0, 1]$ is the learning rate and $h_m(x)$ is the newly fitted base learner.

Each base learner is trained by minimizing the loss function $L(y, F(x))$ using the negative gradient:

$$h_m = \arg\min_{h} \sum_{i=1}^{N} \left( -\frac{\partial L(y_i, F_{m-1}(x_i))}{\partial F_{m-1}(x_i)} - h(x_i) \right)^2. \tag{3}$$

GBMs are effective at modeling complex non-linear relationships and interactions among EEG-derived features.
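
To make Equations (2) and (3) concrete, the sketch below runs gradient boosting on a one-dimensional toy problem. It assumes the squared-error loss $L = \frac{1}{2}(y - F)^2$, for which the negative gradient is simply the residual $y - F_{m-1}(x)$, and it uses one-split regression stumps as base learners. The helper names and the toy data are illustrative assumptions, not the configuration used in the experiments.

```python
def fit_stump(x, targets):
    """Least-squares fit of a one-split regression stump (the inner arg min of Eq. 3).

    Assumes x contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(x))[:-1]:  # keep the right side non-empty
        left = [t for xi, t in zip(x, targets) if xi <= threshold]
        right = [t for xi, t in zip(x, targets) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def gradient_boost(x, y, n_rounds=20, learning_rate=0.5):
    """Boost stumps under squared-error loss and return the fitted predictor."""
    f0 = sum(y) / len(y)  # constant initial model F_0
    learners = []
    preds = [f0] * len(y)
    for _ in range(n_rounds):
        # Negative gradient of 0.5 * (y - F)^2 with respect to F is the residual.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        h = fit_stump(x, residuals)
        learners.append(h)
        # Eq. (2): F_m(x) = F_{m-1}(x) + nu * h_m(x)
        preds = [pi + learning_rate * h(xi) for xi, pi in zip(x, preds)]
    return lambda xi: f0 + learning_rate * sum(h(xi) for h in learners)


# Toy data: one feature, with the target switching from 0 to 1 after x = 3.
x_toy = [1, 2, 3, 4, 5, 6]
y_toy = [0, 0, 0, 1, 1, 1]
model = gradient_boost(x_toy, y_toy)
```

With a learning rate below 1, each round removes only a fraction of the remaining residual, which is why several rounds are needed even on this separable toy example.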

3.2.3. Extreme Gradient Boosting

Extreme Gradient Boosting (XGBoost) is an optimized extension of gradient boosting that incorporates regularization and efficient tree construction [50]. The objective function optimized at iteration $m$ is:

$$L_m = \sum_{i=1}^{N} L(y_i, \hat{y}_i^{(m)}) + \sum_{k=1}^{m} \Omega(h_k), \tag{4}$$

where $\Omega(h_k)$ is a regularization term that penalizes model complexity:

$$\Omega(h) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2, \tag{5}$$

with $T$ denoting the number of leaves in a tree and $w_j$ the leaf weights.

By explicitly controlling model complexity, XGBoost often achieves improved generalization and has become a strong baseline for tabular EEG classification tasks.
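
The penalty in Equation (5) is easy to evaluate by hand. The short sketch below does so for a single hypothetical tree with three leaves; the function name `tree_penalty` and the numeric values are chosen only for illustration.

```python
def tree_penalty(leaf_weights, gamma, lam):
    """Evaluate Eq. (5): gamma * T + 0.5 * lambda * sum_j w_j^2 for one tree."""
    n_leaves = len(leaf_weights)  # T in Eq. (5)
    return gamma * n_leaves + 0.5 * lam * sum(w * w for w in leaf_weights)


# Hypothetical tree with three leaves.
penalty = tree_penalty([0.5, -1.0, 2.0], gamma=1.0, lam=2.0)
# 1.0 * 3 + 0.5 * 2.0 * (0.25 + 1.0 + 4.0) = 3.0 + 5.25 = 8.25
```

Larger $\gamma$ discourages additional leaves, while larger $\lambda$ shrinks the leaf weights, so the two terms control tree size and output magnitude separately.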

3.2.4. AdaBoost

Adaptive Boosting (AdaBoost) is an ensemble method that iteratively reweights training samples to focus learning on difficult cases [51]. At iteration $m$, sample weights are updated as:

$$w_i^{(m+1)} = w_i^{(m)} \exp\left(\alpha_m I(y_i \neq h_m(x_i))\right), \tag{6}$$

where $h_m$ is the base classifier and $\alpha_m$ is its weight:

$$\alpha_m = \frac{1}{2} \ln\left(\frac{1 - \epsilon_m}{\epsilon_m}\right), \tag{7}$$

with $\epsilon_m$ representing the weighted classification error.

The final prediction is obtained as a weighted majority vote:

$$\hat{y} = \arg\max_{c \in C} \sum_{m=1}^{M} \alpha_m I(h_m(x) = c). \tag{8}$$

AdaBoost is particularly effective when weak learners are combined to improve classification performance.
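
A single reweighting round following Equations (6) and (7) can be sketched as below. The renormalization of the weights to sum to one is an added assumption of this sketch (a common convention that is not written out in Equation (6)), and the function name and the five-sample example are hypothetical.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """Apply Eqs. (6) and (7) once and return (alpha_m, renormalized weights).

    Assumes the weighted error lies strictly between 0 and 1.
    """
    missed = [yt != yp for yt, yp in zip(y_true, y_pred)]
    total = sum(weights)
    # Weighted classification error epsilon_m.
    eps = sum(w for w, m in zip(weights, missed) if m) / total
    # Eq. (7): weight of the base classifier.
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    # Eq. (6): only misclassified samples are scaled by exp(alpha_m).
    updated = [w * math.exp(alpha) if m else w for w, m in zip(weights, missed)]
    norm = sum(updated)  # renormalization is an assumption of this sketch
    return alpha, [w / norm for w in updated]


# Five equally weighted samples; the base classifier misses only the last one.
w0 = [0.2] * 5
labels = [1, 0, 1, 0, 1]
guesses = [1, 0, 1, 0, 0]
alpha_1, w1 = adaboost_round(w0, labels, guesses)
```

Here the weighted error is 0.2, so $\alpha_1 = \frac{1}{2}\ln 4 = \ln 2$ and the misclassified sample's weight doubles before renormalization, from 1/5 to 1/3 of the total.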

3.2.5. Tabular Prior-Data Fitted Network

The Tabular Prior-Data Fitted Network (TabPFN) represents a fundamentally different paradigm from classical ensemble models [28,41]. TabPFN is a transformer-based tabular foundation model pretrained on large collections of synthetic tabular datasets. Rather than learning task-specific parameters, TabPFN performs inference through in-context learning. Figure 2 shows an illustration of the TabPFN technique. The process starts by representing each feature in the input tabular dataset as a token, effectively transforming the table into a sequence of feature tokens.

These tokens are augmented with additional learnable and randomly initialized tokens and passed through a self-attention mechanism, which enables the model to capture complex dependencies and interactions across features and samples. Through successive attention layers, the model aggregates relevant contextual information into a summary representation. This final representation is then fed into a multilayer perceptron (MLP) to produce the output logits, which are used to generate the final classification prediction.

Given a context set $C = \{(x_i, y_i)\}_{i=1}^{K}$ and a query input $x_q$, TabPFN estimates the predictive distribution as:

$$p(y_q \mid x_q, C) \approx p_\theta(y_q \mid x_q, X_C, y_C), \tag{9}$$

where $p_\theta$ is a pretrained transformer that approximates Bayesian inference by conditioning on the observed context.

This formulation enables rapid prediction without dataset-specific training or optimization, allowing TabPFN to leverage learned priors across tasks. In this study, TabPFN is applied directly to EEG tabular representations to assess its effectiveness relative to classical ensemble learning methods under subject-independent evaluation.

3.3. Performance Metrics

To evaluate model performance for EEG-based seizure classification, we employ a set of standard classification metrics that capture both overall predictive accuracy and class-wise discrimination performance. EEG seizure datasets often exhibit substantial inter-class overlap and variability across subjects, making it essential to assess models beyond accuracy alone. Accordingly, we report accuracy, precision, recall, F1-score, and the receiver operating characteristic (ROC) curve with the area under the ROC curve (AUC).

Accuracy measures the proportion of EEG segments that are correctly classified across all classes [52,53]. While accuracy provides a global measure of performance, it may mask poor performance on individual seizure or non-seizure classes in multi-class settings. To address this limitation, we additionally report precision and recall, which provide class-sensitive evaluations of prediction quality. Precision measures the proportion of EEG segments predicted for a given class that are correctly labeled, reflecting the model’s ability to limit false positive detections [54]. Recall or sensitivity measures the proportion of true EEG segments for a class that are correctly identified, capturing the model’s ability to detect seizure-related activity [55].

The F1-score is the harmonic mean of precision and recall. It balances the trade-off between false positives and false negatives and provides a single summary measure of class-wise performance [56,57,58]. For multi-class seizure classification, we report the macro-averaged F1-score, which assigns equal importance to each class and is particularly suitable when classes represent clinically distinct EEG conditions. In addition, model discrimination capability is evaluated using the receiver operating characteristic (ROC) curve and the corresponding area under the curve (AUC). For multi-class classification, ROC–AUC is computed using a one-vs-rest (OvR) strategy and macro-averaged across classes. ROC–AUC measures the ability of the model to distinguish between classes across varying decision thresholds and provides a threshold-independent performance indicator. These metrics are formally defined as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{10}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{11}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{12}$$

$$\text{F1-score} = 2 \times \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{13}$$

where $TP$ denotes true positives (correctly classified EEG segments for a given class), $TN$ denotes true negatives, $FP$ denotes false positives (EEG segments incorrectly assigned to a class), and $FN$ denotes false negatives (EEG segments belonging to a class but not detected by the model).
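
The macro-averaged variants of Equations (11) to (13) can be computed by treating each class in turn as the positive class and averaging the per-class scores. The sketch below does this in plain Python on a small hypothetical three-class example; the helper names are illustrative, and undefined ratios (zero denominators) are set to 0 as a convention of this sketch.

```python
def class_scores(y_true, y_pred, cls):
    """Precision, recall and F1 for one class treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores."""
    classes = sorted(set(y_true))
    return sum(class_scores(y_true, y_pred, c)[2] for c in classes) / len(classes)


def accuracy(y_true, y_pred):
    """Fraction of segments whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical labels for six segments and three classes.
truth = [0, 0, 1, 1, 2, 2]
preds = [0, 1, 1, 1, 2, 0]
acc = accuracy(truth, preds)
f1_macro = macro_f1(truth, preds)
```

In this example the per-class F1-scores are 0.5, 0.8, and 2/3, so the macro-F1 is about 0.656 even though the accuracy is 4/6; the two summaries diverge whenever errors are unevenly distributed across classes.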

3.4. Evaluation Framework

This study proposes an evaluation framework for assessing the applicability of foundation-model-based and classical ensemble approaches to EEG-based seizure classification across two datasets. The methodology is designed to ensure fair comparison with classical machine learning baselines while enforcing realistic evaluation conditions and consistent hyperparameter optimization. For each dataset, EEG recordings are represented as fixed-length tabular vectors, as provided by the dataset. To prevent subject leakage and overly optimistic performance estimates, a subject-aware cross-validation strategy is adopted independently for each dataset, ensuring that EEG segments from the same subject do not appear in both training and test sets within a fold.

3.4.1. Hyperparameter Tuning

Within each cross-validation fold, hyperparameter tuning is performed using the training data only. Classical ensemble models are optimized through grid-based or randomized hyperparameter search [59,60], while the foundation model, TabPFN, is configured using tunable inference-related parameters [28,42]. Following hyperparameter selection, optimized models are evaluated on the held-out fold. Out-of-fold predictions are aggregated across folds to compute performance metrics for each dataset separately. The overall evaluation procedure described above is summarized in Algorithm 1. The algorithm formalizes the application of the proposed framework across multiple EEG datasets, incorporating subject-aware cross-validation, hyperparameter optimization within training folds, and out-of-fold performance estimation. By explicitly enforcing subject-level separation and isolating model selection from test data, the algorithm provides a reproducible and fair evaluation pipeline for both classical ensemble methods and the foundation-model-based approach.

This study performs window-level EEG segment classification using the segment units provided by each dataset. The objective is to assign a class label to each fixed-length segment (e.g., seizure type or non-seizure state), rather than to detect seizure events with onset/offset timing in continuous recordings. Event-level seizure detection and latency-sensitive deployment considerations are discussed as future work. Although TabPFN does not require gradient-based retraining for each dataset, inference-related parameters, such as ensemble size and temperature, were specified before evaluation and applied consistently across folds. All performance metrics were computed exclusively from out-of-fold predictions, and no information from held-out test folds influenced model configuration.

Algorithm 1: EEG Seizure Classification with TabPFN and Classical Ensembles

3.4.2. Leakage Controls and Fold Isolation

To minimize optimistic bias, we enforce fold isolation at the subject level using GroupKFold whenever subject identifiers are available (BEED). All segments belonging to a subject are assigned to a single fold, ensuring that no subject appears in both training and test partitions within a fold. For datasets without subject metadata at the segment level (Bonn), we apply stratified K-fold cross-validation at the segment level. The Bonn dataset consists of pre-segmented, non-overlapping EEG recordings; however, explicit subject identifiers are not provided for individual segments. Consequently, subject-level isolation cannot be enforced for this dataset, and this limitation is acknowledged when interpreting results.

All preprocessing steps that could introduce information leakage (e.g., scaling, normalization, feature selection, or dimensionality reduction) are fit only on the training portion of each fold and then applied to the held-out test portion. Model selection and hyperparameter tuning are performed exclusively within the training data of each fold. Final reported metrics are computed from aggregated out-of-fold predictions across all held-out test folds.
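
The two leakage controls described above, subject-level fold assignment and fitting preprocessing on the training portion only, can be sketched in plain Python as follows. This is a simplified stand-in for the idea behind scikit-learn's `GroupKFold`, not its implementation, and the helper names, subject labels, and feature values are hypothetical.

```python
from collections import defaultdict


def group_k_fold(groups, n_splits):
    """Yield (train_idx, test_idx) pairs in which no group spans both sides.

    Groups are assigned greedily, largest first, to the currently smallest
    fold. Assumes n_splits does not exceed the number of distinct groups.
    """
    by_group = defaultdict(list)
    for idx, group in enumerate(groups):
        by_group[group].append(idx)
    folds = [[] for _ in range(n_splits)]
    for group in sorted(by_group, key=lambda g: -len(by_group[g])):
        min(folds, key=len).extend(by_group[group])
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield sorted(train_idx), sorted(test_idx)


def fit_standardizer(values):
    """Return (mean, std) estimated from the given values only."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return mean, (std if std > 0 else 1.0)


# Hypothetical data: one feature per segment, two segments per subject.
subjects = ["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4"]
feature = [0.9, 1.1, 2.0, 2.2, 0.4, 0.6, 1.5, 1.7]

splits = list(group_k_fold(subjects, n_splits=2))
scaled_test_folds = []
for train_idx, test_idx in splits:
    # Scaling statistics come from the training subjects only.
    mean, std = fit_standardizer([feature[i] for i in train_idx])
    scaled_test_folds.append([(feature[i] - mean) / std for i in test_idx])
```

Fitting the standardizer inside the loop is the important design choice: computing the mean and standard deviation once over all segments would let held-out subjects influence the training representation.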

4. Results and Discussion

The evaluation was conducted on two publicly available EEG seizure classification datasets: the Bangalore EEG Epilepsy Dataset (BEED) and the University of Bonn EEG dataset. These datasets differ in recording conditions, class structure, and signal characteristics, allowing for a comprehensive assessment of model behavior across heterogeneous EEG data sources. All preprocessing, modeling, and evaluation procedures were implemented in Python 3.14.3 using scikit-learn for classical ensemble models, XGBoost for gradient-boosted trees, and the TabPFN library for foundation-model-based inference.

For each dataset, EEG recordings were represented as fixed-length tabular feature vectors as provided by the datasets. Model evaluation was performed independently for each dataset. For BEED, subject-aware cross-validation using GroupKFold was applied to ensure that all segments from a given subject were confined to a single fold. For the Bonn dataset, where explicit subject identifiers are not provided at the segment level, stratified K-fold cross-validation was used to preserve class balance while preventing sample leakage across folds.

Hyperparameter optimization was conducted within each training fold only. The hyperparameter search spaces and selected configurations for all evaluated models are summarized in Table 2 and Table 3 for the BEED and Bonn datasets, respectively. Classical ensemble models, including Random Forest, Gradient Boosting, XGBoost, and AdaBoost, were tuned using cross-validation on the training data, while the TabPFN model was configured using inference-related parameters without dataset-specific training. Out-of-fold predictions were collected across all folds and used to compute segment-level performance metrics, including accuracy and macro-averaged F1-score. Discrimination performance across thresholds is summarized using one-vs-rest ROC–AUC curves in Figure 3 and Figure 4. The following subsections present quantitative results for each dataset and discuss comparative performance trends between classical ensemble methods and the foundation-model-based approach.

4.1. Experimental Results

To assess whether the observed improvements are statistically meaningful, we conducted paired, fold-wise significance testing using the Wilcoxon signed-rank test. For each of the 10 outer cross-validation folds, we computed fold-level macro-F1 scores for TabPFN and for each classical ensemble model under the same subject-aware GroupKFold evaluation protocol. A one-sided Wilcoxon signed-rank test with the alternative hypothesis that TabPFN outperforms the corresponding classical model was applied to the paired fold-level scores. The results indicate that TabPFN achieves statistically significant improvements over the classical ensemble baselines (p < 0.001).
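
For intuition about what p < 0.001 means with 10 paired folds, the sketch below computes the exact one-sided Wilcoxon signed-rank p-value by enumerating all sign patterns. It assumes no zero differences and no tied absolute differences; the fold-wise differences shown are hypothetical values chosen for illustration and are not the study's results.

```python
from itertools import product


def wilcoxon_one_sided_exact(diffs):
    """Exact one-sided p-value for H1: the paired differences tend to be positive.

    Assumes every difference is non-zero and all absolute values are distinct.
    """
    abs_sorted = sorted(abs(d) for d in diffs)
    ranks = [abs_sorted.index(abs(d)) + 1 for d in diffs]
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    n = len(diffs)
    # Under H0 every sign pattern is equally likely; count those whose
    # positive-rank sum is at least as large as the observed one.
    at_least = sum(
        1
        for signs in product((0, 1), repeat=n)
        if sum(r for r, s in zip(ranks, signs) if s) >= w_plus
    )
    return at_least / 2 ** n


# Hypothetical macro-F1 differences (TabPFN minus baseline) over 10 folds.
fold_diffs = [0.012, 0.020, 0.008, 0.015, 0.031, 0.005, 0.018, 0.027, 0.010, 0.022]
p_value = wilcoxon_one_sided_exact(fold_diffs)
```

Under these assumptions, the smallest attainable p-value with 10 folds is 1/1024 (about 0.00098), which occurs only when the first model wins on every fold; a single fold going the other way already raises the exact p-value to at least 2/1024.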

Table 4 and Table 5 summarize the classification performance on the BEED and University of Bonn EEG datasets, respectively. Across both datasets, classical ensemble methods, including Gradient Boosting, Random Forest, and XGBoost, achieve strong classification performance, with XGBoost consistently yielding the highest accuracy and F1-scores among the classical baselines. In contrast, AdaBoost exhibits noticeably lower performance, particularly on the BEED dataset, indicating limited robustness to increased signal variability.

TabPFN outperforms all classical models on both datasets. On the BEED dataset, TabPFN achieves near-perfect classification performance, substantially exceeding the best-performing classical ensemble. On the Bonn dataset, where all models perform strongly, TabPFN still attains the highest accuracy and F1-score, demonstrating consistent performance gains across datasets with differing EEG characteristics.

Figure 3 and Figure 4 illustrate the corresponding ROC curves. In both cases, TabPFN achieves the highest area under the ROC curve, indicating superior discriminative ability across classification thresholds. The ROC curves of XGBoost and Random Forest follow closely, while AdaBoost shows weaker separation. Overall, these results indicate that the foundation-model-based approach provides improved and consistent performance compared to classical ensemble methods for EEG-based seizure classification.

4.2. Discussion

Segment-level seizure classification is a common component in automated EEG pipelines, supporting downstream aggregation, triage, and clinical decision support. While event-level detection with precise onset and offset timing remains clinically critical, reliable segment classification under subject-independent evaluation is an essential building block toward robust and deployable monitoring systems.

This study compared a tabular foundation model, TabPFN, against classical ensemble baselines for EEG-based seizure classification on two benchmark datasets: BEED and the Bonn EEG dataset. Across both datasets, TabPFN achieved the strongest overall performance. This outcome is consistent with the design objective of TabPFN, namely, rapid in-context probabilistic prediction on small-to-medium tabular datasets without task-specific weight optimization, enabled by prior-data fitted network pretraining [28].

One plausible explanation is that pretraining on a large collection of synthetic tabular tasks induces a strong inductive bias tailored to small-to-medium structured problems. In contrast, classical ensembles must learn decision boundaries solely from the available training folds within each dataset. Under subject-aware cross-validation, where training data are constrained by fold isolation, such pretrained priors may offer advantages in data efficiency and robustness to subject-level heterogeneity. We emphasize that the advantage of TabPFN in this study does not imply superiority over temporally structured deep architectures, but rather demonstrates that pretrained tabular priors can be competitive when EEG is represented in fixed-length feature form.

Empirically, recent studies on the BEED dataset report strong performance using engineered features and deep or ensemble-based architectures, but typically under different evaluation protocols (Table 6). SeqBoostNet achieves competitive accuracy and macro-F1 on BEED [39], and PaperNet reports strong macro-F1 under a subject-independent protocol [40]. Other studies employing interpretable ensembles [33] or multi-domain feature extraction combined with boosting classifiers [34] also demonstrate high accuracy, though predominantly under sample-level evaluation settings. In this context, the consistently strong performance of TabPFN under a strict subject-aware GroupKFold protocol suggests that tabular foundation models can serve as a competitive alternative for EEG seizure classification when signals are represented in fixed-length tabular form.

The high performance observed on BEED likely reflects strong separability under the provided segment-level feature representation and controlled recording conditions. Although subject-aware fold isolation was enforced to prevent leakage, the dataset may exhibit relatively clean signal characteristics and limited inter-subject variability compared to larger, heterogeneous clinical EEG corpora. These results should therefore be interpreted within the context of benchmark-level evaluation rather than as definitive evidence of performance in fully unconstrained clinical environments.
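
The subject-aware fold isolation discussed above can be illustrated with a minimal sketch. It assumes scikit-learn and uses synthetic data with hypothetical subject IDs in place of BEED; the random forest, the helper name `groupkfold_oof`, and the data shapes are illustrative choices, not the implementation used in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GroupKFold


def groupkfold_oof(model_factory, X, y, groups, n_splits=10):
    """Return out-of-fold predictions where no subject spans train and test."""
    oof = np.empty_like(y)
    for train_idx, test_idx in GroupKFold(n_splits=n_splits).split(X, y, groups):
        # Fold isolation check: test subjects never appear in the training fold.
        assert set(groups[train_idx]).isdisjoint(groups[test_idx])
        model = model_factory()  # fresh, unfitted estimator for every fold
        model.fit(X[train_idx], y[train_idx])
        oof[test_idx] = model.predict(X[test_idx])
    return oof


# Synthetic stand-in data: 10 hypothetical subjects, 40 segments each, 16 features.
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(10), 40)
y = rng.integers(0, 2, size=groups.size)
X = rng.normal(size=(groups.size, 16)) + y[:, None]  # class-dependent mean shift

oof_pred = groupkfold_oof(
    lambda: RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, groups, n_splits=10,
)
print("OOF accuracy:", accuracy_score(y, oof_pred))
print("OOF macro-F1:", f1_score(y, oof_pred, average="macro"))
```

With 10 subjects and 10 folds, each fold holds out exactly one subject, so metrics computed on the aggregated out-of-fold predictions reflect only predictions made for unseen individuals.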

On the University of Bonn benchmark, numerous classical, ensemble, and deep learning approaches have reported high performance. Early work combining wavelet-based feature extraction with SVM or kNN classifiers achieved solid accuracy under binary and multi-class settings [26]. Boosted ensemble classifiers further improved performance [31], while later deep learning models, including LSTM-based temporal architectures [36] and RNN-based frameworks [37], reported near-ceiling accuracy on segment-level evaluations. Hybrid signal-processing pipelines [61] and convolutional models operating on time–frequency representations [35] likewise achieved strong results. More recent 1D-CNN and BiLSTM hybrids further confirm the strong separability of the Bonn dataset under sample-level protocols [38].

Regarding interpretability, classical ensemble baselines admit established post-hoc analyses (e.g., permutation importance) that can highlight which input dimensions most strongly influence predictions under the chosen representation. In contrast, TabPFN performs in-context probabilistic inference through a pretrained transformer and does not expose feature attributions in the same manner as tree-based ensembles. Faithful explanations for tabular foundation models remain an active research area; we therefore treat systematic explanation of TabPFN decisions as future work, while noting that the controlled comparison here is conducted under identical tabular representations and strict leakage controls.
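
As an illustration of the post-hoc analysis mentioned above, the following sketch applies permutation importance to a tree-based ensemble. It assumes scikit-learn and a synthetic dataset in which only one feature (index 2, a hypothetical choice) carries label information; it is not an analysis of the BEED or Bonn features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: five features, of which only feature 2 determines the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = (X[:, 2] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Importance = mean drop in held-out accuracy when one column is shuffled.
result = permutation_importance(
    model, X_te, y_te, n_repeats=10, random_state=0, scoring="accuracy"
)
ranking = np.argsort(result.importances_mean)[::-1]
for j in ranking:
    print(f"feature {j}: {result.importances_mean[j]:.3f} "
          f"+/- {result.importances_std[j]:.3f}")
```

In a subject-aware setting, the importance would be computed on the held-out fold of each split, so that the attribution reflects behaviour on unseen subjects rather than on training data.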

Within this broader literature, the TabPFN results reported here are comparable to the strongest published results on the Bonn benchmark (Table 7), while relying on a simpler modeling pipeline that operates directly on tabular segment representations. Performance was evaluated using strict fold isolation and out-of-fold aggregation. However, because the Bonn dataset does not provide segment-level subject identifiers, evaluation on Bonn is conducted with stratified folds at the segment level rather than under strict subject isolation. The Bonn results should therefore be interpreted as segment-level performance, and the subject-aware claims made in this study apply to BEED only.

In practical EEG monitoring systems, segment-level classification is typically embedded within a larger pipeline that includes artifact handling, temporal smoothing, and event-level aggregation. A key operational advantage of TabPFN in this setting is that it does not require dataset-specific gradient-based retraining, which can simplify deployment when labeled data are limited or when rapid adaptation is needed. However, clinical deployment also depends on latency, robustness to artifacts, and calibration of predictive uncertainty; these aspects warrant targeted evaluation on continuous, multi-channel clinical EEG corpora in future work.
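
To make the temporal smoothing and event-level aggregation steps concrete, the sketch below post-processes a sequence of per-segment seizure probabilities. The moving-average window, the threshold, the minimum event length, and the function names are all hypothetical choices for illustration; this study does not evaluate such a pipeline.

```python
import numpy as np


def smooth_probabilities(probs, window=5):
    """Centered moving average over per-segment seizure probabilities."""
    kernel = np.ones(window) / window
    return np.convolve(probs, kernel, mode="same")


def segments_to_events(probs, threshold=0.5, min_len=3, window=5):
    """Return (start, end) index pairs, end exclusive, of detected events."""
    active = smooth_probabilities(probs, window) >= threshold
    events, start = [], None
    for i, flag in enumerate(active):
        if flag and start is None:
            start = i                      # a run of positive segments begins
        elif not flag and start is not None:
            if i - start >= min_len:       # keep only sufficiently long runs
                events.append((start, i))
            start = None
    if start is not None and len(active) - start >= min_len:
        events.append((start, len(active)))
    return events


# One isolated spike (index 2) followed by a sustained run (indices 6 to 10).
probs = np.array([0.1, 0.1, 0.9, 0.1, 0.1, 0.1,
                  0.8, 0.9, 0.95, 0.9, 0.85, 0.1, 0.1, 0.1])
events = segments_to_events(probs, threshold=0.5, min_len=3, window=3)
print(events)  # prints [(6, 11)]
```

The isolated high-probability segment is suppressed by the smoothing step, while the sustained run is reported as a single event. With 1-s segments, as in BEED, the indices correspond directly to seconds.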

4.3. Limitations

Several limitations should be acknowledged. First, cross-study comparisons are inherently imperfect due to differences in preprocessing pipelines, feature construction, label definitions, and evaluation protocols. Reported performance metrics across papers may therefore not be strictly comparable. Second, TabPFN operates on fixed-length tabular vectors and does not explicitly model temporal dynamics or inter-channel spatial relationships, which are often exploited by convolutional, recurrent, transformer-based, or graph neural architectures. While tabular representations are computationally efficient and widely used, they may not fully capture structured dependencies present in raw multichannel EEG signals.

Third, conclusions are bounded by the selected benchmark datasets. Although BEED and Bonn are widely used in seizure classification research, further validation on larger-scale, multi-center, and clinically heterogeneous datasets would be necessary to establish robustness in real-world deployment scenarios.

Finally, this study does not provide feature-attribution explanations for TabPFN predictions. Although SHAP or LIME-based explanations are straightforward for tree-based ensemble models, adapting them faithfully to in-context tabular foundation models is non-trivial and may require dedicated methodological development. Therefore, we focus on subject-independent evaluation and leave foundation-model interpretability in EEG settings as an important direction for future work.

5. Conclusions and Future Work

This study demonstrates that a tabular foundation model can serve as an effective and practical approach for EEG-based seizure classification when EEG signals are represented as fixed-length feature vectors. Across two benchmark datasets, TabPFN consistently achieved stronger and more stable performance than classical ensemble methods under out-of-fold evaluation (subject-independent on BEED, segment-level on Bonn), suggesting that pretrained tabular priors can transfer effectively to EEG classification tasks without dataset-specific weight optimization.

These findings indicate that foundation models represent a promising direction for tabular EEG analysis, particularly in scenarios requiring rapid deployment, limited training data, or robust generalization across heterogeneous subjects. More broadly, the results highlight the potential of in-context learning as an alternative paradigm to conventional retraining-based machine learning workflows in EEG pipelines.

Future work will investigate extensions to richer EEG representations, including multichannel and temporally structured inputs, and evaluate generalization across additional clinical datasets and recording environments. Integrating explainability techniques to support clinical interpretability, as well as assessing computational efficiency and latency in real-time seizure monitoring systems, are important directions for further study.

Author Contributions

Conceptualization, G.O. and E.E.; methodology, G.O.; validation, G.O. and E.E.; investigation, G.O. and E.E.; resources, G.O.; writing—original draft preparation, G.O. and E.E.; writing—review and editing, G.O. and E.E.; visualization, G.O.; supervision, G.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets analyzed in this study are publicly available. The Bonn EEG dataset is available via the Universitat Pompeu Fabra (UPF) Nonlinear Time Series Analysis Group downloads page at https://www.upf.edu/web/ntsa/downloads (accessed on 28 June 2025). The BEED (Bangalore EEG Epilepsy Dataset) is available from the UCI Machine Learning Repository at https://archive.ics.uci.edu/dataset/1134/beed-bangalore-eeg-epilepsy-dataset (accessed on 28 June 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AdaBoost	Adaptive Boosting
AUC	Area Under the Curve
BEED	Bangalore EEG Epilepsy Dataset
BiLSTM	Bidirectional Long Short-Term Memory
CNN	Convolutional Neural Network
CV	Cross-Validation
DWT	Discrete Wavelet Transform
EEG	Electroencephalogram
EWT	Empirical Wavelet Transform
FN	False Negative
FP	False Positive
GBM	Gradient Boosting Machine
kNN	k-Nearest Neighbors
LSTM	Long Short-Term Memory
MLP	Multilayer Perceptron
OOF	Out-of-Fold
OvR	One-vs-Rest
RF	Random Forest
RNN	Recurrent Neural Network
ROC–AUC	Receiver Operating Characteristic–Area Under the Curve
SVM	Support Vector Machine
TabPFN	Tabular Prior-Data Fitted Network
TN	True Negative
TP	True Positive
UCI	University of California, Irvine
UPF	Universitat Pompeu Fabra
XGBoost	Extreme Gradient Boosting

References

1. Raghu, S.; Sriraam, N.; Temel, Y.; Rao, S.V.; Kubben, P.L. EEG based multi-class seizure type classification using convolutional neural network and transfer learning. Neural Netw. 2020, 124, 202–212.
2. Wang, Z.; Li, S.; Luo, J.; Liu, J.; Wu, D. Channel reflection: Knowledge-driven data augmentation for EEG-based brain–computer interfaces. Neural Netw. 2024, 176, 106351.
3. Abirami, S.; Tikaram; Kathiravan, M.; Yuvaraj, R.; Menon, R.N.; Thomas, J.; Karthick, P.; Prince, A.A.; Ronickom, J.F.A. Automated multi-class seizure-type classification system using EEG signals and machine learning algorithms. IEEE Access 2024, 12, 136524–136541.
4. Jain, S.; Srivastava, R. Electroencephalogram (EEG) based fuzzy logic and spiking neural networks (FLSNN) for advanced multiple neurological disorder diagnosis. Brain Topogr. 2025, 38, 33.
5. Zhang, Y.; Chen, Z.S. Harnessing electroencephalography connectomes for cognitive and clinical neuroscience. Nat. Biomed. Eng. 2025, 9, 1186–1201.
6. Zhang, H.; Zhou, Q.Q.; Chen, H.; Hu, X.Q.; Li, W.G.; Bai, Y.; Han, J.X.; Wang, Y.; Liang, Z.H.; Chen, D.; et al. The applied principles of EEG analysis methods in neuroscience and clinical neurology. Mil. Med. Res. 2023, 10, 67.
7. Wu, Y.; Su, E.L.M.; Wu, M.; Ooi, C.Y.; Holderbaum, W. A Review of Machine Learning and Deep Learning Trends in EEG-Based Epileptic Seizure Prediction. IEEE Access 2025, 13, 159812–159842.
8. Zhang, X.; Zhang, X.; Huang, Q.; Chen, F. A Review of Epilepsy Detection and Prediction Methods Based on EEG Signal Processing and Deep Learning. Front. Neurosci. 2024, 18, 1468967.
9. Lahiri, J.B.; Agarwal, P.; Kushwaha, S.; Singh, M.; Panwar, S. Evaluating the clinical readiness of artificial intelligence in EEG-Based epilepsy diagnosis. J. Neural Eng. 2025, 22, 1–12.
10. Hu, Y.; Liu, J.; Sun, R.; Yu, Y.; Sui, Y. Classification of Epileptic Seizures in EEG Data Based on Iterative Gated Graph Convolution Network. Front. Comput. Neurosci. 2024, 18, 1454529.
11. Tu, S.; Cao, L.; Zhang, D.; Chen, J.; Ma, L.; Zhang, Y.; Yang, Y. DMNet: Self-Comparison Driven Model for Subject-Independent Seizure Detection. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 9–15 December 2024.
12. D’Amora, M.; Galgani, A.; Marchese, M.; Tantussi, F.; Faraguna, U.; De Angelis, F.; Giorgi, F.S. Zebrafish as an innovative tool for epilepsy modeling: State of the art and potential future directions. Int. J. Mol. Sci. 2023, 24, 7702.
13. Larsen, P.M.; Beniczky, S. Non-electroencephalogram-based seizure detection devices: State of the art and future perspectives. Epilepsy Behav. 2023, 148, 109486.
14. Manole, A.M.; Sirbu, C.A.; Mititelu, M.R.; Vasiliu, O.; Lorusso, L.; Sirbu, O.M.; Ionita Radu, F. State of the art and challenges in epilepsy—A narrative review. J. Pers. Med. 2023, 13, 623.
15. Espinosa, N.; Amorim, A.; Huebner, R. Feedforward Neural Network with Backpropagation for Epilepsy Seizure Detection. In Proceedings of the IEEE EMBS International Student Conference 2020 in Latin America, Quito, Ecuador, 27–28 November 2020.
16. Madhavan, S.; Tripathy, R.K.; Pachori, R.B. Time-frequency domain deep convolutional neural network for the classification of focal and non-focal EEG signals. IEEE Sens. J. 2019, 20, 3078–3086.
17. Iscan, Z.; Dokur, Z.; Demiralp, T. Classification of electroencephalogram signals with combined time and frequency features. Expert Syst. Appl. 2011, 38, 10499–10505.
18. Chaovalitwongse, W.A.; Prokopyev, O.A.; Pardalos, P.M. Electroencephalogram (EEG) time series classification: Applications in epilepsy. Ann. Oper. Res. 2006, 148, 227–250.
19. Tzallas, A.T.; Tsipouras, M.G.; Fotiadis, D.I. Epileptic seizure detection in EEGs using time–frequency analysis. IEEE Trans. Inf. Technol. Biomed. 2009, 13, 703–710.
20. Al-jumaili, S.; Duru, A.D.; Ibrahim, A.A.; Osman, N.U. Investigation of epileptic seizure signatures classification in EEG using supervised machine learning algorithms. Trait. Du Signal 2023, 40, 43.
21. Hollmann, N.; Müller, S.; Purucker, L.; Krishnakumar, A.; Körfer, M.; Hoo, S.B.; Schirrmeister, R.T.; Hutter, F. Accurate Predictions on Small Data with a Tabular Foundation Model. Nature 2024, 637, 319–326.
22. Ruan, Y.; Lan, X.; Ma, J.; Dong, Y.; He, K.; Feng, M. Language modeling on tabular data: A survey of foundations, techniques and evolution. arXiv 2024, arXiv:2408.10548.
23. Gorishniy, Y.; Rubachev, I.; Khrulkov, V.; Babenko, A. Revisiting deep learning models for tabular data. Adv. Neural Inf. Process. Syst. 2021, 34, 18932–18943.
24. Huang, X.; Khetan, A.; Cvitkovic, M.; Karnin, Z. Tabtransformer: Tabular data modeling using contextual embeddings. arXiv 2020, arXiv:2012.06678.
25. Borisov, V.; Leemann, T.; Seßler, K.; Haug, J.; Pawelczyk, M.; Kasneci, G. Deep neural networks and tabular data: A survey. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 7499–7519.
26. Subasi, A.; Gursoy, M.I. EEG signal classification using PCA, ICA, LDA and support vector machines. Expert Syst. Appl. 2010, 37, 8659–8666.
27. Kapoor, B.; Nagpal, B.; Jain, P.K.; Abraham, A.; Gabralla, L.A. Epileptic seizure prediction based on hybrid seek optimization tuned ensemble classifier using EEG signals. Sensors 2022, 23, 423.
28. Hollmann, N.; Müller, S.; Eggensperger, K.; Hutter, F. Tabpfn: A transformer that solves small tabular classification problems in a second. arXiv 2022, arXiv:2207.01848.
29. Subasi, A. Epileptic seizure detection using dynamic wavelet network. Expert Syst. Appl. 2005, 29, 343–355.
30. Bose, S.; Rama, V.; Rao, C.R. EEG signal analysis for seizure detection using discrete wavelet transform and random forest. In 2017 International Conference on Computer and Applications (ICCA); IEEE: New York, NY, USA, 2017; pp. 369–378.
31. Bhattacharyya, S.; Konar, A.; Tibarewala, D.; Khasnobish, A.; Janarthanan, R. Performance analysis of ensemble methods for multi-class classification of motor imagery EEG signal. In 2014 International Conference on Control, Instrumentation, Energy and Communication (CIEC); IEEE: New York, NY, USA, 2014; pp. 712–716.
32. Abhishek, S.; Kumar, S.; Mohan, N.; Soman, K. EEG based automated detection of seizure using machine learning approach and traditional features. Expert Syst. Appl. 2024, 251, 123991.
33. Paneru, B. Interpretable Machine Learning for Epileptic Seizure Detection on the BEED Using LIME with an Ensemble Network. medRxiv 2025.
34. Najmusseher; Banu, P.K.N.; Azar, A.T.; Kamal, N.A. Impact of Multi-domain Features for EEG Based Epileptic Seizures Classification. In International Conference on Advanced Intelligent Systems and Informatics; Springer: Berlin/Heidelberg, Germany, 2024; pp. 317–329.
35. Alshaya, H.; Hussain, M. EEG-based classification of epileptic seizure types using deep network model. Mathematics 2023, 11, 2286.
36. Tsiouris, K.M.; Pezoulas, V.C.; Zervakis, M.; Konitsiotis, S.; Koutsouris, D.D.; Fotiadis, D.I. A long short-term memory deep learning network for the prediction of epileptic seizures using EEG signals. Comput. Biol. Med. 2018, 99, 24–37.
37. Daoud, H.; Bayoumi, M.A. Efficient epileptic seizure prediction based on deep learning. IEEE Trans. Biomed. Circuits Syst. 2019, 13, 804–813.
38. Mallick, S.; Baths, V. Novel deep learning framework for detection of epileptic seizures using EEG signals. Front. Comput. Neurosci. 2024, 18, 1340251.
39. Najmusseher; Banu, P.K.N. Feature Engineering for Epileptic Seizure Classification Using SeqBoostNet. Int. J. Comput. Digit. Syst. 2024, 16, 1–14.
40. Sajid, M.S.; Ghosh, A.K.; Nusrat, F. PaperNet: Efficient Temporal Convolutions and Channel Residual Attention for EEG Epilepsy Detection. arXiv 2025, arXiv:2512.22172.
41. Ye, H.J.; Liu, S.Y.; Chao, W.L. A closer look at tabpfn v2: Strength, limitation, and extension. arXiv 2025, arXiv:2502.17361.
42. Grinsztajn, L.; Flöge, K.; Key, O.; Birkel, F.; Jund, P.; Roof, B.; Jäger, B.; Safaric, D.; Alessi, S.; Hayler, A.; et al. TabPFN-2.5: Advancing the State of the Art in Tabular Foundation Models. arXiv 2025, arXiv:2511.08667.
43. Najmusseher; Banu, P.K.N. BEED: Bangalore EEG Epilepsy Dataset. UCI Machine Learning Repository. 2024. Available online: https://archive.ics.uci.edu/dataset/1134/beed:+bangalore+eeg+epilepsy+dataset (accessed on 28 June 2025).
44. Wang, C.; Liu, L.; Zhuo, W.; Xie, Y. An epileptic EEG detection method based on data augmentation and lightweight neural network. IEEE J. Transl. Eng. Health Med. 2023, 12, 22–31.
45. Andrzejak, R.G.; Lehnertz, K.; Mormann, F.; Rieke, C.; David, P.; Elger, C.E. Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: Dependence on recording region and brain state. Phys. Rev. E 2001, 64, 061907.
46. Parmar, A.; Katariya, R.; Patel, V. A review on random forest: An ensemble classifier. In International Conference on Intelligent Data Communication Technologies and Internet of Things; Springer: Berlin/Heidelberg, Germany, 2018; pp. 758–763.
47. Mienye, I.D.; Obaido, G.; Aruleba, K.; Dada, O.A. Enhanced prediction of chronic kidney disease using feature selection and boosted classifiers. In International Conference on Intelligent Systems Design and Applications; Springer: Berlin/Heidelberg, Germany, 2021; pp. 527–537.
48. Natekin, A.; Knoll, A. Gradient boosting machines, a tutorial. Front. Neurorobot. 2013, 7, 21.
49. Mienye, I.D.; Jere, N. Optimized ensemble learning approach with explainable AI for improved heart disease prediction. Information 2024, 15, 394.
50. Chen, T.; He, T.; Benesty, M.; Khotilovich, V.; Tang, Y.; Cho, H.; Chen, K.; Mitchell, R.; Cano, I.; Zhou, T.; et al. Xgboost: Extreme Gradient Boosting. R Package Version 0.4-2; 2015; Volume 1, pp. 1–4. Available online: http://r.meteo.uni.wroc.pl/web/packages/xgboost/vignettes/xgboost.pdf (accessed on 28 June 2025).
51. Schapire, R.E. Explaining adaboost. In Empirical Inference: Festschrift in Honor of Vladimir N. Vapnik; Springer: Berlin/Heidelberg, Germany, 2013; pp. 37–52.
52. Yin, M.; Wortman Vaughan, J.; Wallach, H. Understanding the effect of accuracy on trust in machine learning models. In Proceedings of the 2019 Chi Conference on Human Factors in Computing Systems, Glasgow, UK, 4–9 May 2019; pp. 1–12.
53. Amrouche, S.; Basara, L.; Calafiura, P.; Estrade, V.; Farrell, S.; Ferreira, D.R.; Finnie, L.; Finnie, N.; Germain, C.; Gligorov, V.V.; et al. The tracking machine learning challenge: Accuracy phase. In The NeurIPS’18 Competition: From Machine Learning to Intelligent Conversations; Springer: Berlin/Heidelberg, Germany, 2019; pp. 231–264.
54. Craik, A.; He, Y.; Contreras-Vidal, J.L. Deep learning for electroencephalogram (EEG) classification tasks: A review. J. Neural Eng. 2019, 16, 031001.
55. Yankaskas, B.C.; Cleveland, R.J.; Schell, M.J.; Kozar, R. Association of recall rates with sensitivity and positive predictive values of screening mammography. Am. J. Roentgenol. 2001, 177, 543–549.
56. Chicco, D.; Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 2020, 21, 6.
57. Obaido, G.; Ogbuokiri, B.; Mienye, I.D.; Kasongo, S.M. A voting classifier for mortality prediction post-thoracic surgery. In International Conference on Intelligent Systems Design and Applications; Springer: Berlin/Heidelberg, Germany, 2022; pp. 263–272.
58. Obaido, G.; Ogbuokiri, B.; Swart, T.G.; Ayawei, N.; Kasongo, S.M.; Aruleba, K.; Mienye, I.D.; Aruleba, I.; Chukwu, W.; Osaye, F.; et al. An interpretable machine learning approach for hepatitis b diagnosis. Appl. Sci. 2022, 12, 11127.
59. Gorriz, J.M.; Clemente, R.M.; Segovia, F.; Ramirez, J.; Ortiz, A.; Suckling, J. Is K-fold cross validation the best model selection method for Machine Learning? arXiv 2024, arXiv:2401.16407.
60. Kumar, D.S.; Sriram, S.; Kumar, P. Enhancing Greenhouse Farming Through Machine Learning: A Focus on Feature Importance. In 2025 International Conference on Computing Technologies (ICOCT); IEEE: New York, NY, USA, 2025; pp. 1–6.
61. Akbari, H.; Esmaili, S.S.; Zadeh, S.F. Classification of seizure and seizure-free EEG signals based on empirical wavelet transform and phase space reconstruction. arXiv 2019, arXiv:1903.09728.

Figure 2. The TabPFN process. Reprinted from Ye et al. [41].

Figure 3. ROC-AUC of the models on the BEED dataset.

Figure 4. ROC-AUC of the models on the Bonn dataset.

Table 1. Dataset summary and reproducibility details.

Item	BEED	Bonn
Subjects	10 subjects	5 subject groups (A–E); segment-level subject IDs not provided
Channels	19 EEG channels (10–20 system)	1 channel (single-channel segments)
Sampling rate	256 Hz	173.61 Hz
Segment length	1-s windows (256 samples per segment)	4097 samples per segment (∼23.6 s)
Classes	2 classes (Seizure, Non-Seizure)	5 classes (A–E)
Class balance	3750 seizure, 3750 non-seizure segments	100 segments per class (500 total)
Representation used	Provided tabular vectors; subject ID used for GroupKFold	Raw 4097-point segments treated as tabular feature vectors

Table 2. Hyperparameter values for the evaluated classifiers on the BEED dataset.

Classifier	Hyperparameter	Search Space	Optimal Value
RF	n_estimators	200, 500	200
	max_depth	None, 10, 25	10
	min_samples_split	2, 5, 10	5
XGBoost	n_estimators	300, 600	300
	max_depth	3, 5, 7	5
	learning_rate	0.03, 0.1	0.1
	subsample	0.7, 0.9, 1.0	0.9
	colsample_bytree	0.7, 0.9, 1.0	0.9
	min_child_weight	1, 5	1
	reg_lambda	1.0, 5.0, 10.0	1.0
GBM	n_estimators	100, 200, 400	200
	learning_rate	0.01, 0.05, 0.1	0.05
	max_depth	2, 3, 5	3
	subsample	0.7, 0.9, 1.0	0.9
AdaBoost	n_estimators	50, 100, 200, 400	100
	learning_rate	0.01, 0.05, 0.1, 0.5, 1.0	0.1
TabPFN	N_ensemble_configurations	8, 16, 32	16
	softmax_temperature	0.8, 1.0, 1.2	1.0
	balance_probabilities	False, True	True

Table 3. Hyperparameter values for the evaluated classifiers on the University of Bonn dataset.

Classifier	Hyperparameter	Search Space	Optimal Value
RF	n_estimators	200, 500	500
	max_depth	None, 10, 25	25
	min_samples_split	2, 5, 10	10
XGBoost	n_estimators	300, 600	300
	max_depth	3, 5, 7	7
	learning_rate	0.03, 0.1	0.1
	subsample	0.7, 0.9, 1.0	1.0
	colsample_bytree	0.7, 0.9, 1.0	1.0
	min_child_weight	1, 5	5
	reg_lambda	1.0, 5.0, 10.0	5.0
GBM	n_estimators	100, 200, 400	400
	learning_rate	0.01, 0.05, 0.1	0.05
	max_depth	2, 3, 5	3
	subsample	0.7, 0.9, 1.0	0.9
AdaBoost	n_estimators	50, 100, 200, 400	400
	learning_rate	0.01, 0.05, 0.1, 0.5, 1.0	0.5
TabPFN	N_ensemble_configurations	8, 16, 32	32
	softmax_temperature	0.8, 1.0, 1.2	1.0
	balance_probabilities	False, True	True
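
The search spaces in Tables 2 and 3 can be explored without leaking test subjects into model selection. The sketch below assumes scikit-learn, synthetic data with hypothetical subject IDs, and a trimmed subset of the random forest grid so that it runs quickly; the inner three-fold GroupKFold and the macro-F1 selection criterion are illustrative assumptions rather than the tuning procedure used in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

# Subset of the RF search space in Tables 2 and 3, trimmed for speed.
param_grid = {
    "n_estimators": [200],
    "max_depth": [None, 10],
    "min_samples_split": [2, 5],
}

# Synthetic stand-in data: 8 hypothetical subjects, 25 segments each.
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(8), 25)
y = rng.integers(0, 2, size=groups.size)
X = rng.normal(size=(groups.size, 10)) + 0.8 * y[:, None]

# Take one subject-disjoint outer fold; tune only on the remaining subjects.
outer = GroupKFold(n_splits=4)
train_idx, test_idx = next(outer.split(X, y, groups))

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=GroupKFold(n_splits=3),   # inner folds are also subject-disjoint
    scoring="f1_macro",
)
search.fit(X[train_idx], y[train_idx], groups=groups[train_idx])

print("selected:", search.best_params_)
print("held-out macro-F1:", search.score(X[test_idx], y[test_idx]))
```

Repeating this for every outer fold yields a nested procedure in which the held-out subjects influence neither the fitted model nor the selected hyperparameters.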

Table 4. Performance evaluation on the BEED dataset.

Model	Accuracy	Precision	Recall	F1-Score
XGBoost	0.888	0.889	0.888	0.886
Random Forest	0.875	0.875	0.875	0.872
Gradient Boosting	0.840	0.839	0.834	0.838
AdaBoost	0.648	0.682	0.648	0.620
TabPFN	0.997	0.992	0.997	0.995

Table 5. Performance evaluation on the Bonn dataset.

Model	Accuracy	Precision	Recall	F1-Score
XGBoost	0.989	0.991	0.990	0.990
Random Forest	0.988	0.989	0.989	0.989
Gradient Boosting	0.986	0.989	0.988	0.988
AdaBoost	0.931	0.928	0.940	0.934
TabPFN	0.996	0.997	0.997	0.997
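
For readers reproducing metrics of the kind shown in Tables 4 and 5, the following sketch computes accuracy together with macro-averaged precision, recall, and F1 from a small hand-made set of predictions. It assumes scikit-learn and macro averaging; the labels are toy values, not results from this study.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy three-class example with two misclassified samples out of ten.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 2, 0])

acc = accuracy_score(y_true, y_pred)
# Macro averaging: compute each metric per class, then take the unweighted mean.
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)

print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
# prints accuracy=0.800 precision=0.850 recall=0.750 f1=0.769
```

Macro-F1 is the mean of the per-class F1 scores, not the harmonic mean of macro precision and macro recall, so the three macro-averaged columns need not satisfy the usual single-class F1 identity.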

Table 6. Recent studies reporting results on the BEED dataset, compared with our TabPFN results.

Study (Year)	Approach	Evaluation Notes	Performance
Najmusseher and Banu [39]	SeqBoostNet	BEED & BONN; Acc and other metrics	Acc. ≈ 96.71%, Acc. ≈ 97.11%
Sajid et al. [40]	PaperNet	BEED; subject-indep. protocol	Macro-F1 ≈ 0.96
Paneru [33]	Ensemble + LIME	BEED; nested CV	Acc. ≈ 97.06%
Najmusseher et al. [34]	Multi-domain features + AdaBoost/GNB	BEED & BONN; sample-level evaluation; Acc, ROC, confusion matrix	Acc. ≈ 99%, Acc. ≈ 98%
This work (2026)	TabPFN	10-fold GroupKFold OOF evaluation (subject-aware)	Acc. ≈ 99.7%; Macro-F1 ≈ 99.5%; AUC ≈ 99.9%

Table 7. Recent studies reporting results on the Bonn dataset, compared with our TabPFN results.

Study (Year)	Approach	Evaluation Notes	Performance
Subasi and Gursoy [26]	DWT + SVM/kNN	BONN; binary and multi-class; CV	Acc. ≈ 95.0%
Bhattacharyya et al. [31]	Boosted ensemble classifiers	BONN; sample-level CV	Acc. ≈ 97.0%
Tsiouris et al. [36]	LSTM-based deep learning	BONN; segment-level evaluation	Acc. ≈ 99.0%
Daoud and Bayoumi [37]	RNN-based seizure prediction	BONN; binary classification	Acc. ≈ 98.7%
Akbari et al. [61]	EWT + kNN	BONN; binary seizure detection	Acc. ≈ 98.0%
Alshaya and Hussain [35]	CNN on time–frequency EEG	BONN; spectrogram-based CV	Acc. ≈ 99.2%
Mallick and Baths [38]	1D-CNN + BiLSTM	BONN; multi-class evaluation	Acc. ≈ 99.0%
This work (2026)	TabPFN	10-fold stratified OOF evaluation (segment-level)	Acc. ≈ 99.6%; Macro-F1 ≈ 99.7%; AUC ≈ 99.9%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Obaido, G.; Esenogho, E. Evaluating EEG-Based Seizure Classification Using Foundation and Classical Ensemble Models. Appl. Sci. 2026, 16, 3120. https://doi.org/10.3390/app16073120

