The evaluation was conducted on two publicly available EEG seizure classification datasets: the Bangalore EEG Epilepsy Dataset (BEED) and the University of Bonn EEG dataset. These datasets differ in recording conditions, class structure, and signal characteristics, which allows model behavior to be assessed across heterogeneous EEG data sources. All preprocessing, modeling, and evaluation procedures were implemented in Python 3.14.3 using scikit-learn for classical ensemble models, XGBoost for gradient-boosted trees, and the TabPFN library for foundation-model-based inference.
For each dataset, EEG recordings were represented as fixed-length tabular feature vectors as provided by the datasets. Model evaluation was performed independently for each dataset. For BEED, subject-aware cross-validation using GroupKFold was applied to ensure that all segments from a given subject were confined to a single fold. For the Bonn dataset, where explicit subject identifiers are not provided at the segment level, stratified K-fold cross-validation was used to preserve class balance while preventing sample leakage across folds.
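The two splitting protocols described above can be sketched as follows. This is a minimal illustration on synthetic data, not the study's code: the array shapes, the number of folds, the four-class labels, and the `groups` array of subject identifiers are all assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold

rng = np.random.default_rng(0)

# Synthetic stand-in for a tabular EEG feature matrix (not BEED or Bonn data).
n_subjects, segments_per_subject, n_features = 20, 15, 16
X = rng.normal(size=(n_subjects * segments_per_subject, n_features))
y = rng.integers(0, 4, size=X.shape[0])  # hypothetical 4-class labels
groups = np.repeat(np.arange(n_subjects), segments_per_subject)  # subject IDs

# Subject-aware splitting: all segments of a subject fall in a single fold.
group_cv = GroupKFold(n_splits=5)
subject_leaks = 0
for train_idx, test_idx in group_cv.split(X, y, groups=groups):
    shared = np.intersect1d(groups[train_idx], groups[test_idx])
    subject_leaks += shared.size

# Stratified splitting: class proportions are preserved in every fold and
# each sample is assigned to exactly one test fold.
strat_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
test_counts = np.zeros(X.shape[0], dtype=int)
for train_idx, test_idx in strat_cv.split(X, y):
    test_counts[test_idx] += 1

print("subjects shared between train and test:", subject_leaks)
print("each sample tested exactly once:", bool(np.all(test_counts == 1)))
```

Note that stratified splitting keeps individual samples from appearing in both training and test sets, but it cannot rule out segments from the same subject landing on both sides of a split.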
Hyperparameter optimization was conducted within each training fold only. The hyperparameter search spaces and selected configurations for all evaluated models are summarized in Table 2 and Table 3 for the BEED and Bonn datasets, respectively. Classical ensemble models, including Random Forest, Gradient Boosting, XGBoost, and AdaBoost, were tuned using cross-validation on the training data, while the TabPFN model was configured using inference-related parameters without dataset-specific training. Out-of-fold predictions were collected across all folds and used to compute segment-level performance metrics, including accuracy and macro-averaged F1-score. Discrimination performance across thresholds is summarized using one-vs-rest ROC curves in Figure 3 and Figure 4. The following subsections present quantitative results for each dataset and discuss comparative performance trends between classical ensemble methods and the foundation-model-based approach.
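A minimal sketch of fold-internal tuning with out-of-fold aggregation is given below. It assumes a Random Forest baseline, a deliberately small illustrative parameter grid, synthetic three-class data, and hypothetical subject identifiers; the actual search spaces are those listed in the tables referenced above, not the ones shown here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, GroupKFold

# Synthetic 3-class tabular data with hypothetical subject IDs (30 subjects).
X, y = make_classification(n_samples=300, n_features=16, n_informative=8,
                           n_classes=3, random_state=0)
groups = np.repeat(np.arange(30), 10)

outer_cv = GroupKFold(n_splits=5)
param_grid = {"n_estimators": [25, 50], "max_depth": [None, 5]}  # illustrative only

oof_proba = np.zeros((X.shape[0], 3))
for train_idx, test_idx in outer_cv.split(X, y, groups=groups):
    # Tuning sees only the training portion of the current outer fold.
    search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                          cv=GroupKFold(n_splits=3), scoring="f1_macro")
    search.fit(X[train_idx], y[train_idx], groups=groups[train_idx])
    # The refitted best model predicts the held-out subjects.
    oof_proba[test_idx] = search.predict_proba(X[test_idx])

# Metrics are computed once over the pooled out-of-fold predictions.
oof_pred = oof_proba.argmax(axis=1)
acc = accuracy_score(y, oof_pred)
macro_f1 = f1_score(y, oof_pred, average="macro")
auc_ovr = roc_auc_score(y, oof_proba, multi_class="ovr", average="macro")
print(f"accuracy={acc:.3f}  macro-F1={macro_f1:.3f}  OvR AUC={auc_ovr:.3f}")
```

Because every prediction in `oof_proba` comes from a model that never saw the corresponding subject during tuning or fitting, the pooled metrics are not inflated by hyperparameter selection on test data.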
4.1. Experimental Results
To assess whether the observed improvements are statistically meaningful, we conducted paired, fold-wise significance testing using the Wilcoxon signed-rank test. For each of the 10 outer cross-validation folds, we computed fold-level macro-F1 scores for TabPFN and for each classical ensemble model under the same subject-aware GroupKFold evaluation protocol. A one-sided Wilcoxon signed-rank test with the alternative hypothesis that TabPFN outperforms the corresponding classical model was applied to the paired fold-level scores. The results indicate that TabPFN achieves statistically significant improvements over the classical ensemble baselines (p < 0.001).
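The paired test described above can be sketched with SciPy as follows. The fold-level scores are hypothetical values chosen for illustration and are not the results of this study; the variable names are likewise assumptions.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical fold-level macro-F1 scores for 10 outer folds
# (illustrative values, not the results reported in this study).
f1_tabpfn = np.array([0.970, 0.982, 0.965, 0.991, 0.974,
                      0.988, 0.969, 0.993, 0.979, 0.961])
f1_baseline = np.array([0.941, 0.960, 0.930, 0.975, 0.951,
                        0.947, 0.957, 0.938, 0.972, 0.910])

# Paired differences: fold i of one model is matched with fold i of the other.
diffs = f1_tabpfn - f1_baseline

# One-sided alternative: the first sample tends to be larger than the second.
stat, p_value = wilcoxon(f1_tabpfn, f1_baseline, alternative="greater")
print(f"W={stat:.1f}, one-sided p={p_value:.6f}")
```

With 10 paired folds and no tied or zero differences, the smallest exact one-sided p-value attainable is 1/2^10 ≈ 0.00098, which is reached only when one model wins on every fold.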
Table 4 and Table 5 summarize the classification performance on the BEED and University of Bonn EEG datasets, respectively. Across both datasets, classical ensemble methods, including Gradient Boosting, Random Forest, and XGBoost, achieve strong classification performance, with XGBoost consistently yielding the highest accuracy and F1-scores among the classical baselines. In contrast, AdaBoost exhibits noticeably lower performance, particularly on the BEED dataset, indicating limited robustness to increased signal variability.
TabPFN outperforms all classical models on both datasets. On the BEED dataset, TabPFN achieves near-perfect classification performance, substantially exceeding the best-performing classical ensemble. On the Bonn dataset, where all models perform strongly, TabPFN still attains the highest accuracy and F1-score, demonstrating consistent performance gains across datasets with differing EEG characteristics.
Figure 3 and Figure 4 illustrate the corresponding ROC curves. In both cases, TabPFN achieves the highest area under the ROC curve, indicating superior discriminative ability across classification thresholds. The ROC curves of XGBoost and Random Forest follow closely, while AdaBoost shows weaker separation. Overall, these results indicate that the foundation-model-based approach provides improved and consistent performance compared to classical ensemble methods for EEG-based seizure classification.
4.2. Discussion
Segment-level seizure classification is a common component in automated EEG pipelines, supporting downstream aggregation, triage, and clinical decision support. While event-level detection with precise onset and offset timing remains clinically critical, reliable segment classification under subject-independent evaluation is an essential building block toward robust and deployable monitoring systems.
This study compared a tabular foundation model, TabPFN, against classical ensemble baselines for EEG-based seizure classification on two benchmark datasets: BEED and the Bonn EEG dataset. Across both datasets, TabPFN achieved the strongest overall performance. This outcome is consistent with the design objective of TabPFN, namely, rapid in-context probabilistic prediction on small-to-medium tabular datasets without task-specific weight optimization, enabled by prior-data fitted network pretraining [28].
One plausible explanation is that pretraining on a large collection of synthetic tabular tasks induces a strong inductive bias tailored to small-to-medium structured problems. In contrast, classical ensembles must learn decision boundaries solely from the available training folds within each dataset. Under subject-aware cross-validation, where training data are constrained by fold isolation, such pretrained priors may offer advantages in data efficiency and robustness to subject-level heterogeneity. We emphasize that the advantage of TabPFN in this study does not imply superiority over temporally structured deep architectures, but rather demonstrates that pretrained tabular priors can be competitive when EEG is represented in fixed-length feature form.
Empirically, recent studies on the BEED dataset report strong performance using engineered features and deep or ensemble-based architectures, but typically under different evaluation protocols (Table 6). SeqBoostNet achieves competitive accuracy and macro-F1 on BEED [39], and PaperNet reports strong macro-F1 under a subject-independent protocol [40]. Other studies employing interpretable ensembles [33] or multi-domain feature extraction combined with boosting classifiers [34] also demonstrate high accuracy, though predominantly under sample-level evaluation settings. In this context, the consistently strong performance of TabPFN under a strict subject-aware GroupKFold protocol suggests that tabular foundation models can serve as a competitive alternative for EEG seizure classification when signals are represented in fixed-length tabular form.
The high performance observed on BEED likely reflects strong separability under the provided segment-level feature representation and controlled recording conditions. Although subject-aware fold isolation was enforced to prevent leakage, the dataset may exhibit relatively clean signal characteristics and limited inter-subject variability compared to larger, heterogeneous clinical EEG corpora. These results should therefore be interpreted within the context of benchmark-level evaluation rather than as definitive evidence of performance in fully unconstrained clinical environments.
On the University of Bonn benchmark, numerous classical, ensemble, and deep learning approaches have reported high performance. Early work combining wavelet-based feature extraction with SVM or kNN classifiers achieved solid accuracy under binary and multi-class settings [26]. Boosted ensemble classifiers further improved performance [31], while later deep learning models, including LSTM-based temporal architectures [36] and RNN-based frameworks [37], reported near-ceiling accuracy on segment-level evaluations. Hybrid signal-processing pipelines [61] and convolutional models operating on time–frequency representations [35] likewise achieved strong results. More recent 1D-CNN and BiLSTM hybrids further confirm the strong separability of the Bonn dataset under sample-level protocols [38].
Regarding interpretability, classical ensemble baselines admit established post-hoc analyses (e.g., permutation importance) that can highlight which input dimensions most strongly influence predictions under the chosen representation. In contrast, TabPFN performs in-context probabilistic inference through a pretrained transformer and does not expose feature attributions in the same manner as tree-based ensembles. Faithful explanations for tabular foundation models remain an active research area; we therefore treat systematic explanation of TabPFN decisions as future work, while noting that the controlled comparison here is conducted under identical tabular representations and strict leakage controls.
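As an illustration of the post-hoc analysis mentioned above for the classical baselines, the following sketch computes permutation importance for a Random Forest on synthetic data. The dataset, the model settings, and the choice of macro-F1 as the scoring function are assumptions made for the example, not the configuration used in this study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: with shuffle=False the first 3 columns are the informative ones.
X, y = make_classification(n_samples=400, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle one feature at a time on held-out data and record the score drop.
result = permutation_importance(model, X_test, y_test, scoring="f1_macro",
                                n_repeats=10, random_state=0)

ranking = np.argsort(result.importances_mean)[::-1]
for idx in ranking[:3]:
    print(f"feature {idx}: mean drop in macro-F1 = "
          f"{result.importances_mean[idx]:.3f}")
```

Permutation importance only requires a fitted estimator with a prediction interface, so it is model-agnostic in principle; whether the resulting scores are faithful for an in-context model such as TabPFN is the open question raised above.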
Within this broader literature, the TabPFN results reported here are comparable to the strongest published results on the Bonn benchmark (Table 7), while relying on a simpler modeling pipeline that operates directly on tabular segment representations. Performance was evaluated using strict fold isolation and out-of-fold aggregation. However, because the Bonn dataset provides no segment-level subject identifiers, this evaluation uses stratified K-fold splitting at the segment level rather than subject-aware isolation, and the Bonn results should be interpreted as sample-level rather than subject-independent estimates.
In practical EEG monitoring systems, segment-level classification is typically embedded within a larger pipeline that includes artifact handling, temporal smoothing, and event-level aggregation. A key operational advantage of TabPFN in this setting is that it does not require dataset-specific gradient-based retraining, which can simplify deployment when labeled data are limited or when rapid adaptation is needed. However, clinical deployment also depends on latency, robustness to artifacts, and calibration of predictive uncertainty; these aspects warrant targeted evaluation on continuous, multi-channel clinical EEG corpora in future work.
4.3. Limitations
Several limitations should be acknowledged. First, cross-study comparisons are inherently imperfect due to differences in preprocessing pipelines, feature construction, label definitions, and evaluation protocols. Reported performance metrics across papers may therefore not be strictly comparable. Second, TabPFN operates on fixed-length tabular vectors and does not explicitly model temporal dynamics or inter-channel spatial relationships, which are often exploited by convolutional, recurrent, transformer-based, or graph neural architectures. While tabular representations are computationally efficient and widely used, they may not fully capture structured dependencies present in raw multichannel EEG signals.
Third, conclusions are bounded by the selected benchmark datasets. Although BEED and Bonn are widely used in seizure classification research, further validation on larger-scale, multi-center, and clinically heterogeneous datasets would be necessary to establish robustness in real-world deployment scenarios.
Finally, this study does not provide feature-attribution explanations for TabPFN predictions. Although SHAP- or LIME-based explanations are straightforward for tree-based ensemble models, adapting them faithfully to in-context tabular foundation models is non-trivial and may require dedicated methodological development. We therefore focus on subject-independent evaluation and leave foundation-model interpretability in EEG settings as an important direction for future work.